Notation


The notation format proposed to represent lexical descriptions consists of linear strings of characters representing the morphosyntactic information to be associated with word-forms. The string is constructed following the philosophy of the Intermediate Format proposed in the EAGLES Corpus proposal (Leech and Wilson, 1994), i.e. of having agreed symbols in predefined and fixed positions. The positions of a string of characters are numbered 0, 1, 2, etc. in the following way:

a. the agreed character at position 0 encodes part-of-speech;
b. each character at position 1, 2, ..., n encodes the value of one attribute (person, gender, number, etc.);
c. if an attribute does not apply, the corresponding position in the string contains a special marker, in our case `-' (hyphen).

Example: Ncms- (noun,common,masculine,singular,nocase)

This notation adopts the EAGLES Intermediate Format with a small revision: the Intermediate Format encodes information by means of digits, while in MULTEXT characters of a mnemonic nature are preferred.

It is worth noting here that this representation is proposed for word-form lists which will be used for a specific application, i.e. corpus annotation. We have foreseen these lexical descriptions as containing a full description of lexical items. As noted above, the sets of tags, to be used properly for automatic corpus annotation tools, are expected to contain less information.

These lexical descriptions can be seen as notational variants of the feature-based notation in the form of attribute-value pairs. In fact, the string notation proposed, e.g.

Ex.: Ncms- (noun,common,masculine,singular,nocase)

is fully equivalent to a feature-structure representation, in which the non-applicable attribute is either given an explicit empty value or simply left out:

Ex.: {cat=noun, type=common, gender=masculine, number=singular, case=none}

         {cat=noun, type=common, gender=masculine, number=singular}

The above feature structures are often also represented as follows:

                    +-                   -+
                    | Cat:    Noun        |
                    | Type:   common      |
                    | Gender: masculine   |
                    | Number: singular    |
                    +-                   -+

The formal characteristics relevant for our applications have been kept. Using position in the string to encode attributes places no restrictions on the set of characters that can be used as values: the same character may stand for different values at different positions. Conversely, if we wanted to keep the order-independence of attribute-value notation, we would have to make sure that the characters representing attribute values were unambiguous across attributes. Since attributes and values are linked by position, a special marker for void attribute-value pairs is needed to keep descriptions coherent. Thus, the ``Ncms-" style can be viewed as a shorthand notation, convenient for some users and straightforwardly mappable to the information used in unification-based attribute-value formalisms.
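The mapping between the string notation and attribute-value pairs can be sketched in a few lines of Python. This is only an illustration and not part of the MULTEXT tools: the names NOUN_ATTRIBUTES and decode_noun are hypothetical, and the table covers only the noun attributes and codes listed in the attribute/value tables of this document.

```python
# Hypothetical sketch: decode a positional noun description such as "Ncms-"
# into attribute-value pairs. The codes follow the noun table of this document.
NOUN_ATTRIBUTES = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive",
              "d": "dative", "a": "accusative"}),
]

NOT_APPLICABLE = "-"


def decode_noun(description):
    """Return a dict of attribute-value pairs for a noun description."""
    if not description or description[0] != "N":
        raise ValueError("not a noun description: %r" % description)
    codes = description[1:]
    if len(codes) != len(NOUN_ATTRIBUTES):
        raise ValueError("expected %d attribute positions" % len(NOUN_ATTRIBUTES))
    features = {"cat": "noun"}
    for (attribute, values), code in zip(NOUN_ATTRIBUTES, codes):
        if code == NOT_APPLICABLE:
            features[attribute] = "none"        # position kept, no value applies
        else:
            features[attribute] = values[code]  # KeyError on an unknown code
    return features


print(decode_noun("Ncms-"))
```

Because each position has its own code table, the character `p` can mean "proper" at position 1 and "plural" at position 3 without ambiguity.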

When comparing MULTEXT lexical description representation format with other notations one must keep in mind that they are intended to describe word-forms, and are used in very large lexical lists which contain word-forms. It seems to us relevant to comment on this point because, although it can be justified (and we will do so below) that the same formal operations can be declared in both styles, there is little evidence for justifying the need of operations such as negation and disjunction of features and values when applying them to tagged word-forms as a result of corpus annotation.

The use of `-' (`not-applicable')

We call this marker `not-applicable', and, as stated above, its function is just to preserve the positional relationship between attributes and values. It has also been proposed to use the `not-applicable' marker for a feature that does not apply to a particular language; however, this decision is still under discussion due to the facts reported in the section ``Comparison of attributes/values used by languages". The marker might be used in the following cases:

a. not applicable given a particular combination of attributes/values, i.e. although the attribute applies to the category in a given language, it does not apply to a particular subclass of the category.

b. not applicable to a particular lexical item, although the attribute applies to the rest of its paradigm.

Example: in the description of pronouns, for personal pronouns the grammatical person is to be encoded, but for demonstrative pronouns it is avoided; in this case '-' is applied following (a). On the other hand, gender cannot be informative for some personal pronouns, but it is still relevant for other personal pronouns; the application of `-' follows (b):

Pd-ms     "Este"     Pronoun, demonstrative, masculine, singular.
Pp1-s     "Yo"       Pronoun, personal, first, singular.
Pp1mp     "Nosotros" Pronoun, personal, first, masculine, plural.
Pp1fp     "Nosotras" Pronoun, personal, first, feminine, plural.

Their uses are clearly not equivalent, but meaningful differences would only occur in highly typed theories of lexical description. To illustrate this point, let us take the following type system for pronouns:

TYPES           SUBTYPES             ATTRIBUTES     VALUES

Pronoun                               gender        masculine
                                                    feminine

                                      number        singular
                                                    plural
               Demonstrative

               Personal                person          1
                                                       2
                                                       3

For this system, gender and number attributes belong to the set of features which describe all pronouns. Person will only belong to the set of features which describe personal pronouns - in addition to gender and to number. Applied to this type system, case (a) would mean that the attribute-value pair does not belong to the set of features which describe a subtype, while (b) would mean indeterminacy of a given word-form (which could be expressed as a disjunction of all the values for the particular attribute or leaving a void for the value, being open to unification; this choice mainly depends on the purpose of the description, e.g. syntactic parsing).
Different representations will result: `este' description corresponds to case (a), and `yo' description corresponds to case (b) (subtypes are represented `dem' and `pers'):

              |phon                este|
              |cat     |gender    masc||
              |   'dem'|number    sing||

              |phon                yo   |
              |cat      |gender    []  ||
              |   'pers'|number    sing||
              |         |person    1   ||

In simpler flat type systems where distinctions are made only for the generic type ``pronoun", both cases a. and b. will be treated by unification mechanisms in the same way.

From the conversion point of view, we have to be concerned with the output of the MULTEXT morphological tool, as it will be the source of word-form lexical lists. The Mmorph tool does not incorporate a highly hierarchical typing system and thus no problems are expected in converting Mmorph output into lexical descriptions of the proposed format, if desired. The results from applying the Mmorph tool will probably (it strongly depends on implementation strategies) be the following:

1. a non present attribute in the description attached to the word-form;
2. a disjunction expression, i.e. {gender=masc|fem};
3. encoded as a third possible value, i.e. {gender=none}.

The simplest case to convert would be the third one, since automatic non-intelligent conversion is then possible. In the first two cases the conversion routine will have to make some inferences based on type declarations. It is also expected that special conversion routines will have to be used when converting from other lexical sources. As seen above, the conversion from the ``Ncms-" lexical description notation into another unification-based format will only be difficult if the target formalism is a highly typed system. If this is not the case, the ``not-applicable" marker will have to be converted into a special value or into nothing, leaving the attribute open. For conversion into a highly typed system it might be useful to have cases (a) and (b) marked by different characters, in order to guide an intelligent conversion routine to the desired results.
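As an illustration of the three shapes listed above, the following Python sketch encodes a feature set as a positional pronoun description. It is an assumption-laden toy and not the Mmorph output format: the dictionary layout, the value names and the function encode_pronoun are hypothetical, and the attribute order is taken from the examples ``Pd-ms" and ``Pp1-s" given earlier. The sketch maps all three shapes blindly onto `-'; as noted above, a real routine would need type declarations to treat the first two properly.

```python
# Hypothetical sketch: encode a feature set as a positional pronoun description.
# Attribute order follows the examples "Pd-ms" and "Pp1-s" in this section.
PRONOUN_ORDER = ["type", "person", "gender", "number"]
CODES = {
    "type": {"personal": "p", "demonstrative": "d"},
    "person": {"1": "1", "2": "2", "3": "3"},
    "gender": {"masc": "m", "fem": "f"},
    "number": {"sing": "s", "plur": "p"},
}


def encode_pronoun(features):
    """Map a dict of attribute values onto the positional string notation."""
    out = ["P"]
    for attribute in PRONOUN_ORDER:
        value = features.get(attribute)
        if value is None:            # shape 1: attribute not present
            out.append("-")
        elif "|" in value:           # shape 2: disjunction such as "masc|fem"
            out.append("-")
        elif value == "none":        # shape 3: explicit third value
            out.append("-")
        else:
            out.append(CODES[attribute][value])
    return "".join(out)


este = {"type": "demonstrative", "gender": "masc", "number": "sing"}
yo_disjunction = {"type": "personal", "person": "1",
                  "gender": "masc|fem", "number": "sing"}
yo_none = {"type": "personal", "person": "1",
           "gender": "none", "number": "sing"}

print(encode_pronoun(este))            # Pd-ms
print(encode_pronoun(yo_disjunction))  # Pp1-s
print(encode_pronoun(yo_none))         # Pp1-s
```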

Mapping of lexical descriptions onto corpus tags

The tags (see the examples below) used to exemplify the issues and problems in the mapping between lexical descriptions and corpus tags come from the tagsets proposed in the language-specific applications of four of the MULTEXT partners. These tagsets differ from one another, because they were constructed on the basis of tagging practices already used by the partners; they should be considered a preliminary proposal, to be discussed for harmonization and refined after experimentation with the MULTEXT tagger.

Mapping of these lexical descriptions into corpus tags has also been taken into account. It is also considered desirable to see whether under-informative corpus tags can be directly mapped to the lexical descriptions each one subsumes.

Decisions about corpus tags are language dependent. The information to be encoded depends on the ability of a given tool to disambiguate between different potential lexical descriptions for a given word-form. We have already mentioned the key concepts to be applied for defining sets of corpus tags in the preceding sections. Therefore one can first assume that the mapping from lexical descriptions onto corpus tags can be done with conversion tables which relate two different items: corpus tags and lexical descriptions. These tables are likely to be modified many times in the course of the project, based on experimentation with the disambiguation tool.

An example of such mappings is:

Lex.spec. TAG   Definition

Pp1msa-   P1S   Personal pronoun, first person, masc. sing. accusative
Px1msa-   P1S   Reflexive pronoun, first person, masc. sing. accusative
Pp1fsa-   P1S   Personal pronoun, first person, fem. sing. accusative
Px1fsa-   P1S   Reflexive pronoun, first person, fem. sing. accusative
Pp1msd-   P1S   Personal pronoun, first person, masc. sing. dative
Px1msd-   P1S   Reflexive pronoun, first person, masc. sing. dative
Pp1fsd-   P1S   Personal pronoun, first person, fem. sing. dative
Px1fsd-   P1S   Reflexive pronoun, first person, fem. sing. dative

All these lexical descriptions correspond to the Spanish form ``me". For this word-form the tag P1S, which conflates all the possible lexical descriptions, has been chosen on the assumption that an automatic tool would have problems selecting the correct analysis among all the lexical descriptions. The correct analysis of this word-form would require syntactic analysis.

The mapping from the lexical descriptions to the corpus tags should be applicative, that is, ``each lexical description should map to one and only one corpus tag, while it is not possible to do the reverse" due to the limitations of current tagging techniques. The situation where corpus tags are more precise than a lexical description (i.e. one lexical tag corresponds to more than one corpus tag) should be, in principle, avoided.
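The applicative requirement can be checked mechanically when a conversion table is built. The Python sketch below is a hypothetical illustration (the names CONVERSION_PAIRS and build_table are not part of any MULTEXT tool): it stores the table for ``me" as pairs and refuses any lexical description that would be assigned two different tags.

```python
# Hypothetical sketch: a conversion table as (lexical description, tag) pairs,
# plus a check that no lexical description is assigned two different tags.
CONVERSION_PAIRS = [
    ("Pp1msa-", "P1S"), ("Px1msa-", "P1S"),
    ("Pp1fsa-", "P1S"), ("Px1fsa-", "P1S"),
    ("Pp1msd-", "P1S"), ("Px1msd-", "P1S"),
    ("Pp1fsd-", "P1S"), ("Px1fsd-", "P1S"),
]


def build_table(pairs):
    """Return a dict description -> tag; raise if a description gets two tags."""
    table = {}
    for description, tag in pairs:
        previous = table.setdefault(description, tag)
        if previous != tag:
            raise ValueError("%s maps to both %s and %s"
                             % (description, previous, tag))
    return table


table = build_table(CONVERSION_PAIRS)
print(table["Px1fsd-"])             # P1S
print(sorted(set(table.values())))  # ['P1S']
```

The reverse direction is deliberately not a function: the single tag P1S corresponds to eight lexical descriptions.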

In order to avoid redundancy in the conversion tables and to make tag optimization work easier, it has been proposed to study the possibility of having intermediate representations which prepare the conflation of information and which facilitate automatic mapping from lexical descriptions onto tags. This intermediate internal notation makes use of ``regular expressions" which incorporate operators in order to sum up the information referred to by different lexical descriptions and conflated in a given tag. For the example given above, the resulting regular expression may incorporate two operators, ``match any" (.) and ``list" ([]); other operators proposed are ``disjunction" (|) and negation.

P[px]1.s[ad]-     P1S
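Read as a regular expression in the syntax of Python's standard re module (an assumption; the document does not fix a concrete regular expression dialect), this pattern covers exactly the eight lexical descriptions listed for ``me". A minimal check:

```python
import re

# The pattern from the text, read as a Python regular expression. "." matches
# any single character and "[px]" lists the characters allowed at one position.
P1S_PATTERN = re.compile(r"P[px]1.s[ad]-")

descriptions = ["Pp1msa-", "Px1msa-", "Pp1fsa-", "Px1fsa-",
                "Pp1msd-", "Px1msd-", "Pp1fsd-", "Px1fsd-"]

matched = [d for d in descriptions if P1S_PATTERN.fullmatch(d)]
print(len(matched))                            # 8
print(bool(P1S_PATTERN.fullmatch("Pp2msa-")))  # False: second person
```

Note that ``match any" accepts every character at the gender position, including values that no lexical description actually uses.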

However, the application of such regular expressions is still being studied, as their use imposes some requirements on the conflation of lexical descriptions and on the construction of corpus tags. An example will illustrate the issues to be taken into account. For Spanish, the first and third person singular of some tenses are homographs. This can be taken into account when conflating information:

Verbal paradigm                  regular exp.       TAG

cantaba, comía, venía            Vmii[13]s-          VMIIS
cantaría, comería, vendría       Vmcs[13]s-          VMCSS
cante, coma, venga               Vmsp[13]s-          VMSPS
cantara, comiera, viniera        Vmsi[13]s-          VMSIS

For Italian, conflating information on homographs in the verbal paradigm may conflict with the applicative principle mentioned above:

Verbal paradigm lex.descr.    regular exp.           TAG

premiate        Vmip2p-       Vm([ims]p2p-)|(ps-pf)  VMP2IMCPP
                Vmmp2p-
                Vmsp2p-
                Vmps-pf

leggete         Vmip2p-       Vm[im]p2p-             VMP2IMP
                Vmmp2p-
leggiate        Vmsp2p-       Vmsp2p-                VP2CP
lette           Vmps-pf       Vmps-pf                VFPPR

As can be seen, if we use tags such as the ones above which are based on the principle ``one graphical form - one tag", there is a violation of the applicative principle, i.e. the same lexical description will correspond to two different tags, because of different conflation clusters.
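The violation can be made explicit by testing every lexical description against every tag pattern. The Python sketch below is illustrative only: the names TAG_PATTERNS and tags_for are hypothetical, and the patterns are rewritten for Python's re module. In particular, the expression given for ``premiate" is written with a single group so that the prefix ``Vm" is shared by both alternatives, and the pattern for ``lette" is taken from its lexical description.

```python
import re

# Hypothetical sketch: each tag is paired with a pattern rewritten for Python.
TAG_PATTERNS = {
    "VMP2IMCPP": r"Vm(?:[ims]p2p-|ps-pf)",  # premiate
    "VMP2IMP": r"Vm[im]p2p-",               # leggete
    "VP2CP": r"Vmsp2p-",                    # leggiate
    "VFPPR": r"Vmps-pf",                    # lette
}

DESCRIPTIONS = ["Vmip2p-", "Vmmp2p-", "Vmsp2p-", "Vmps-pf"]


def tags_for(description):
    """Return, in sorted order, every tag whose pattern covers the description."""
    return sorted(tag for tag, pattern in TAG_PATTERNS.items()
                  if re.fullmatch(pattern, description))


for description in DESCRIPTIONS:
    print(description, tags_for(description))
```

Each of the four lexical descriptions is covered by two tags, so a table keyed on the lexical description alone cannot decide which tag to assign.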

In general, it is observed that the use of operators in regular expressions results in a form of marking the information which is not going to be expressed in the corpus tag. Thus, tags would have to contain less information than the regular expression and hence than the lexical description.

Another issue to be considered is the following. Having tags with little lexical information, as in the following French example, may lead to another problematic issue in cases where such regular expressions are also used in helping to recover all possible lexical information from a given ``under-specified" corpus tag. The mapping from the regular expression onto lexical descriptions will also have to take into account the word-form in order to reject possible descriptions which do not correspond to the tagged word-forms. Below are some examples from the proposed verbal tags and regular expressions:

TAG       Regular expression  Lexical descriptions  Possible word-forms

VM1P      Vm[iscm][pifs]1p--  Vmip1p--            venons
                              Vmii1p--            venions
                              Vmif1p--            viendrons
                               ...                  ....

Let us suppose that the word ``venons" is tagged as ``VM1P". If we want to know which lexical descriptions the tag can refer to, expanding the information contained in the regular expression will also yield lexical descriptions which do not correspond to the word ``venons", but to other words. Regular expressions can only map a given tag for a word-form onto all possible lexical descriptions for that word-form if the information conflated only reflects ambiguities due to homography. Only with this criterion for defining tags will all the possible lexical descriptions subsumed by the corpus tag and expressed in the regular expression be true of a given tagged word.
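The effect can be reproduced with a short Python sketch. It is a toy under stated assumptions: the function expand handles only literal characters and ``list" operators, and LEXICON is a hypothetical three-entry lexicon built from the word-forms in the table above.

```python
import itertools
import re


def expand(pattern):
    """Expand a pattern of literals and [...] lists into all covered strings."""
    parts = re.findall(r"\[([^\]]+)\]|(.)", pattern)
    choices = [list(group) if group else [literal] for group, literal in parts]
    return ["".join(combo) for combo in itertools.product(*choices)]


# Toy lexicon built from the three word-forms listed in the table above.
LEXICON = {
    "venons": {"Vmip1p--"},
    "venions": {"Vmii1p--"},
    "viendrons": {"Vmif1p--"},
}

candidates = expand("Vm[iscm][pifs]1p--")
print(len(candidates))                                    # 16
print([d for d in candidates if d in LEXICON["venons"]])  # ['Vmip1p--']
```

Of the sixteen descriptions produced by the expansion, only one is true of ``venons"; the others can be rejected only by consulting the word-form itself.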

If the criterion for conflating information is limited to homograph ambiguities, we see - as in the following example - that all possible lexical descriptions expanded from the regular expression are true of a given word-form.

TAG       Regular expression  Lexical descriptions  Possible word-forms

VSXICP    Vm(sp.s)|(ip2s)-         Vmip2s-             ami
                                   Vmsp1s-             ami
                                   Vmsp2s-             ami
                                   Vmsp3s-             ami

As mentioned in the section ``Comparison of Attributes/values used by languages", the application of the proposed operators in regular expressions for avoiding redundancy, in some cases, is not needed if lexical expressions already encode the possibility of having, for a given word-form, more than one possible lexical description. This is the case with the proposed values ``common" for gender, ``invariant" for number (in Italian), or ``object" for case (in French pronouns).

Almost all the languages treated in MULTEXT have nouns, adjectives and determiners (among others) which have the same word-form for both feminine and masculine agreement. The Italian group has proposed a value for gender named ``common" which avoids having to write two different entries with the same word-form but with different lexical descriptions. In fact, this use of a special value anticipates the possible use of the proposed operators in the regular expression.

word-form      lexical description regular expression  TAG
insegnante     Nccs-               Nccs-               NNS

could also be expressed as:

word-form      lexical description regular expression  TAG
insegnante     Ncms-               Nc[mf]s- or Nc.s-   NNS
               Ncfs-               or Nc(m|f)s-
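The three alternative expressions are not strictly equivalent when read as Python regular expressions (an assumption about the dialect; the disjunction variant is written here with the trailing not-applicable position spelled out). The following sketch compares them on four noun descriptions, two of which, ``Ncns-" and ``Nccs-", are added only for the comparison:

```python
import re

PATTERNS = [r"Nc[mf]s-", r"Nc(m|f)s-", r"Nc.s-"]
DESCRIPTIONS = ["Ncms-", "Ncfs-", "Ncns-", "Nccs-"]

# For each pattern, collect the descriptions it covers in full.
matches = {
    pattern: [d for d in DESCRIPTIONS if re.fullmatch(pattern, d)]
    for pattern in PATTERNS
}

for pattern in PATTERNS:
    print(pattern, matches[pattern])
```

The ``list" and ``disjunction" forms cover exactly the masculine and feminine descriptions, whereas ``match any" also covers neuter and the ``common" value itself.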

The need for regular expressions in the mapping between lexical descriptions and corpus tags, as well as their consequences, must still be settled. It should be noted that regular expressions can be regarded as a convenient way to map the lexical descriptions to the corpus tags since, in many cases, the information in the lexicon is more precise than the information we can or want to have in the corpus tag set. Such a mapping still seems very interesting because there are many corpus tag systems, even for the same language, which makes it extremely difficult to relate one to another. Regular expressions could act as a common reference for the different systems and make comparison easy. Besides, regular expressions could make translation between lexical descriptions and corpus tags easier and enable the automatic generation of conversion tables.

Attribute/value tables

The categories listed below with the relevant attributes and values are based on EAGLES documents and are the result of a first test based on a proposal made by Veronis et al. (1994) for lexical specifications in MULTEXT.

As already mentioned in the section ``Background considerations", proposing features for describing lexical items of different languages, with the aim of defining a set which can be said to be ``common" to all of them, is a complex task. The underlying philosophy for this task has therefore been to lead the different groups towards a pragmatic solution in which a ``harmonized" set of features could be reached.

The groups first worked out their lexical descriptions taking as input the EAGLES and Veronis et al. (1994) documents. The very general criterion was to encode those proposed features which were considered relevant for the language in question. MULTEXT therefore also followed the EAGLES bottom-up methodology in trying to define extensively the features ``used" in the lexical descriptions for each group's language, as this procedure makes evident the features commonly used. After this phase, whose result can now be seen in the section ``Comparison of attribute/values used by the groups", a new phase is envisaged to accommodate language-specific considerations within a general model to be used by MULTEXT. This accommodation must take into account extensibility to other languages and application-motivated arguments, as well as internal coherence. For this new phase more specific criteria would be desirable with respect to the addition of new features to the EAGLES Level-1 set. The intended result is a ``harmonized" set of features which properly describe lexical items of the different languages.

Following the general aim of the project, these harmonized specifications, and the related resources, will contribute to the standardization of corpus annotation work. They are meant to serve as a user-oriented additional characteristic of our tool package, in the sense that end-users will have a common ground for inspecting and understanding the resources and tool results, independently to a large extent of the language. This common set of features will also provide a common ground for comparing the results of different annotation tools because, as mentioned in the previous section, the existence of many lexical description systems currently makes such comparisons difficult.

Therefore the categories and features listed below are the common reference for the work done by the different groups. Further discussion on this first proposal is to be found in the section ``Comparison of the attributes/values used by the groups" which is in turn to define criteria for changing this first proposal.

Tables of categories

=============== ====
Part-of-Speech  Code
=============== ====
Noun            N
Verb            V
Adjective       A
Pronoun         P
Determiner      D    (for those who do not have a separate category
Article         T     for Articles, these are included in Determiner)
Adverb          R
Adposition      S
Conjunction     C
Numeral         M
Interjection    I
Unique          U
Residual        X
Abbreviation    Y
=============== ====

Each character at positions 1, 2, etc. encodes the value of one attribute (person, gender, number, etc.), according to the tables given below.
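For programmatic use, the part-of-speech codes can be held in a simple lookup table. The sketch below copies the codes from the table above; the names POS_CODES and part_of_speech are hypothetical.

```python
# Part-of-speech codes copied from the table above (position 0 of a description).
POS_CODES = {
    "N": "Noun", "V": "Verb", "A": "Adjective", "P": "Pronoun",
    "D": "Determiner", "T": "Article", "R": "Adverb", "S": "Adposition",
    "C": "Conjunction", "M": "Numeral", "I": "Interjection",
    "U": "Unique", "X": "Residual", "Y": "Abbreviation",
}


def part_of_speech(description):
    """Return the category name encoded at position 0 of a description."""
    return POS_CODES[description[0]]


print(part_of_speech("Ncms-"))  # Noun
print(part_of_speech("Pp1-s"))  # Pronoun
```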


2.2.2 Attribute/value tables
----------------------------


Abbreviations used:
  P   Position (starts with 0 for encoding PoS values)
  ATT Attribute name
  VAL Value
  C   Code


1. Nouns (N)

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           common         c
                 proper         p
- -------------- -------------- -
2 Gender         masculine      m
                 feminine       f
                 neuter         n
- -------------- -------------- -
3 Number         singular       s
                 plural         p
- -------------- -------------- -
4 Case           nominative     n
                 genitive       g
                 dative         d
                 accusative     a
= ============== ============== =

2. Verbs (V)

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           main           m
                 auxiliary      a
                 modal          o
- -------------- -------------- -
2 Mood/VForm     indicative     i
                 subjunctive    s
                 imperative     m
                 conditional    c
                 infinitive     n
                 participle     p
                 gerund         g
                 supine         s
                 base           b
- -------------- -------------- -
3 Tense          present        p
                 imperfect      i
                 future         f
                 past           s
- -------------- -------------- -
4 Person         first          1
                 second         2
                 third          3
- -------------- -------------- -
5 Number         singular       s
                 plural         p
- -------------- -------------- -
6 Gender         masculine      m
                 feminine       f
                 neuter         n
= ============== ============== =


3. Adjectives (A)

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           qualificative  f
                 ordinal        o
                 cardinal       c
                 indefinite     i
                 possessive     s
- -------------- -------------- -
2 Degree         positive       p
                 comparative    c
                 superlative    s
- -------------- -------------- -
3 Gender         masculine      m
                 feminine       f
                 neuter         n
- -------------- -------------- -
4 Number         singular       s
                 plural         p
- -------------- -------------- -
5 Case           nominative     n
                 genitive       g
                 dative         d
                 accusative     a
= ============== ============== =


4. Pronouns (P)

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           personal       p
                 demonstrative  d
                 indefinite     i
                 possessive     s
                 interrogative  t
                 relative       r
                 exclamative    e
                 reflexive      x
                 reciprocal     l
- -------------- -------------- -
2 Person         first          1
                 second         2
                 third          3
- -------------- -------------- -
3 Gender         masculine      m
                 feminine       f
                 neuter         n
- -------------- -------------- -
4 Number         singular       s
                 plural         p
- -------------- -------------- -
5 Case           nominative     n
                 genitive       g
                 dative         d
                 accusative     a
                 oblique        o
                 object         j
- -------------- -------------- -
6 Possessor      singular       s
                 plural         p
= ============== ============== =


5. Determiners (D)

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           demonstrative  d
                 indefinite     i
                 possessive     s
                 interrogative  t
- -------------- -------------- -
2 Person         first          1
                 second         2
                 third          3
- -------------- -------------- -
3 Gender         masculine      m
                 feminine       f
                 neuter         n
- -------------- -------------- -
4 Number         singular       s
                 plural         p
- -------------- -------------- -
5 Case           nominative     n
                 genitive       g
                 dative         d
                 accusative     a
                 oblique        o
- -------------- -------------- -
6 Possessor      singular       s
                 plural         p
= ============== ============== =


6. Articles (T)

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           definite       d
                 indefinite     i
- -------------- -------------- -
2 Gender         masculine      m
                 feminine       f
                 neuter         n
- -------------- -------------- -
3 Number         singular       s
                 plural         p
- -------------- -------------- -
4 Case           nominative     n
                 genitive       g
                 dative         d
                 accusative     a
= ============== ============== =


7. Adverbs (R)

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           general        g
                 particle       p
- -------------- -------------- -
2 Degree         positive       p
                 comparative    c
                 superlative    s
= ============== ============== =


8. Adpositions (S)

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           preposition    p
                 postposition   t
                 circumposition c
- -------------- -------------- -
2 Formation      simple         s
                 compound       c
= ============== ============== =


9. Conjunctions (C)

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           coordinating   c
                 subordinating  s
= ============== ============== =


10. Numerals (M)

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           cardinal       c
                 ordinal        o
- -------------- -------------- -
2 Gender         masculine      m
                 feminine       f
                 neuter         n
- -------------- -------------- -
3 Number         singular       s
                 plural         p
- -------------- -------------- -
4 Case           nominative     n
                 genitive       g
                 dative         d
                 accusative     a
= ============== ============== =


11. Interjections (I)


12. Unique membership class (U)


13. Residual (X)


14. Abbreviations (Y)

Comparison of Attribute/value features used by the groups

The following tables reflect the attributes and values used for lexical description in MULTEXT. They take into account input supplied by the different groups, which is detailed further in the language-specific application annexes (see section 5). It is worth noting that the tables reflect both the features used by the five groups using lexical descriptions as proposed in previous versions of this document and the features for Dutch resulting from morphological generation with the ``mmorph" tool. The comparison certainly helps to give a clear picture of the level of consensus reached with respect to harmonization when elaborating lexical lists. It has already been mentioned in previous sections that the project is working towards defining criteria for the application of EAGLES guidelines on standardization of lexical resources for easy re-usability.

Looking at the tables below, some very general issues arise with respect to the criteria for applying the general tables to the particular languages and the interpretation of the general guidelines. Until now the groups have been working on the assumption that they must encode the recommended Level-1 EAGLES features if they are relevant to their languages. The possibility of adding new values and new attribute/value pairs was also foreseen if the recommended features were not enough to describe lexical items at a fine level of granularity. This was also found useful for supplying lexical material to tools other than the MULTEXT ones. This openness has led to a number of inconsistencies with respect to application criteria, which we summarize in the points below. A decision on general criteria for application must be reached in the next phase. The harmonization issues which arise from comparing the application sections are as follows.

There is an unbalanced treatment of features considered as ``general" for the different categories. The presence of a particular attribute seems to be mainly justified for two reasons:

- representative in most of the studied languages;
- linguistic tradition.

We see in the comparative tables that a particular language is allowed to add a new attribute because of its relevance for the lexical description of a given category (i.e. when the language items belonging to that category are inflected or marked with respect to it). The most evident case is the proposal made in order to encode Possessor-gender (among other features) for Pronouns and Determiners. It is obvious that this feature cannot be used by languages which do not have different forms regarding this particular distinction.

On the other hand, note that ``case" as a feature recommended as ``general" for describing Nouns, Adjectives, Pronouns, Determiners, Articles and Numerals, is in fact used only for Pronouns by most of the languages, and only German can apply it for the rest of the categories.

By "unbalanced treatment" we mean that features used by just a few languages, or even by only one, are in some cases classified as "general" and in others as "language specific".

The possibility of adding language-specific attributes and values where relevant for a given description has a further consequence: the procedure followed in this task has shown that it is not easy to reach a consensus on harmonizing the specific features and values proposed for a given language. One case of such proliferation of features is seen in the comparative table for Determiners. One of the groups suggests language-specific types for "definite article" and "indefinite article", while other groups prefer a general type "article" together with another attribute, i.e. Quantification or Definiteness, to encode this distinction at a lower level. The particular features suggested by the groups which, in our opinion, could be adapted to the EAGLES model will be discussed during the next phase.

Given this openness to adding attributes and values, we would like to point out the case of the values "common" and "invariant", which the Italian descriptions add to all inflected nominal categories for the attributes "gender" and "number" respectively (where a disjunction of values could be used instead). Most of the languages could easily adopt these values for forms that are identical across masculine/feminine or singular/plural agreement features, but the issue certainly needs to be clarified and discussed further. A decision on the "fine granularity" of lexical descriptions is probably needed. There is in fact another example, in another category, of the same strategy: conflating into a new value of a given attribute a homography that would otherwise cause an explosion of entries. The French group has suggested conflating the accusative and dative values of pronominal case into "object" as a generic value. The proposed division would also apply to other Romance languages, but it might compromise the "fine granularity" that the project aimed at for lexical descriptions.
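
The trade-off between a conflated value and a disjunction of values can be illustrated with a small sketch. It is not part of the MULTEXT or EAGLES proposals: the table CONFLATED and the function expand are hypothetical names, the codes c (common gender) and n (invariant number) are those of the Italian rows in the noun table below, and the string Nccn- is constructed here purely for illustration.

```python
from itertools import product

# Hypothetical expansion table: position -> {conflated code: values it stands for}.
# Positions follow the noun table: 2 = gender, 3 = number.
CONFLATED = {
    2: {"c": "mf"},  # "common" gender = masculine or feminine
    3: {"n": "sp"},  # "invariant" number = singular or plural
}

def expand(code):
    """Return every fully specified string covered by a conflated one."""
    choices = [CONFLATED.get(pos, {}).get(ch, ch) for pos, ch in enumerate(code)]
    return ["".join(combo) for combo in product(*choices)]

print(expand("Nccn-"))  # one conflated entry stands for four specified ones
```

Under these assumptions a single conflated entry replaces four fully specified entries, which is the saving in entries the strategy aims at; the cost is that the individual gender and number values are no longer stated explicitly in the lexicon.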

Two different traditions can also be seen to conflict when some groups propose adding a new attribute to characterise an element, while others propose adding a new value that labels a new class under a general, already available attribute such as "type". Adding a new attribute corresponds to unification-based grammar practice, whereas a label for a class corresponds to the so-called "taxonomic" theories. An example of this conflict is the proposal of an attribute/value "wh" for marking relative particles in different categories: pronouns, determiners, adverbs. EAGLES level-1 seems to prefer separating relatives by means of a distinct value of the attribute "type" for pronouns and determiners. Marking the relative character of a given pronominal as an additional feature would help, for instance, to characterise specifically items such as English "whose" or Spanish "cuyo, cuya, cuyos, cuyas", which are normally described as possessive relative pronouns. Under the current classification a decision must be made to put them under either the Possessive or the Relative value of the attribute "type". It must also be mentioned that no special treatment is possible for relative adverbs, which are not taken as a separate class under Adverb type in the EAGLES proposal. The comparison thus suggests that either a new attribute "wh" for adverbs or, as suggested by the German group, a new value for interrogative (and also relative) adverbs should be devised.
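
The difference between the two encodings of "whose" can be made concrete with attribute-value pairs. The following is only a sketch: the dictionaries and the two helper functions are hypothetical, and the attribute names cat, type and wh simply mirror the feature-structure notation and the tables in this document.

```python
# Taxonomic style: a single "type" slot, so a possessive relative pronoun
# must be filed under one class only ("relative" here; "possessive" is the
# other option, but not both).
whose_taxonomic = {"cat": "pronoun", "type": "relative"}

# Attribute style: "wh" is an independent attribute, so both properties
# can be stated at once.
whose_with_wh = {"cat": "pronoun", "type": "possessive", "wh": "relative"}

def is_relative(entry):
    """An entry counts as relative under either encoding."""
    return entry.get("type") == "relative" or entry.get("wh") == "relative"

def is_possessive(entry):
    """An entry counts as possessive only if its type says so."""
    return entry.get("type") == "possessive"

for entry in (whose_taxonomic, whose_with_wh):
    print(is_relative(entry), is_possessive(entry))
```

With the single "type" slot one of the two properties is necessarily lost; with the additional attribute both can be recovered.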

As we have seen, the EAGLES recommendations in some cases lack the fine-grained distinctions that the groups working in MULTEXT consider desirable for our applications. Another example is raised by German and English. The groups dealing with these languages have suggested a specific value for comparative conjunctions, which could also be applied to the rest of the languages. This addition seems reasonable, since the feature is important with respect to distributional criteria and can be of great importance for tagging purposes. Again, guidelines must be defined for considering the addition of features not contained in EAGLES level-1, but it is worth noting that several of the features added for language-specific reasons could be considered applicable to the rest of the languages.

We recommend a new round of discussions on the new features suggested in the language-specific applications, to see whether they are of use in our concrete application and applicable to the rest of the languages. Once this discussion has led to conclusions, the approved features must be included in the general model. Besides linguistic considerations, having an agreed set of general features matters greatly for the notation style chosen for lexical descriptions. There must be rules on how the other groups encode language-specific attributes, or on how these are differentiated from those of the general model. This is especially relevant if general conversion routines are to be developed. For the sake of theoretical coherence, the treatment given to these features must also take into account the "unbalanced treatment of features" mentioned above. Some other doubts remain in connection with theoretical coherence and the applied nature of the lexica to be supplied. We mention only one of them, to illustrate the kind of issue that must be taken into account in the next phase. It concerns the agreement features of person, number and gender: are they to be encoded with respect to grammatical agreement or with respect to semantic distinctions? As it stands, following the EAGLES recommendations, it seems that only semantic considerations are taken into account; for example, Possessive-person of determiners is taken as the "possessor person" for most of the languages, which in fact does not trigger agreement.
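
The remark above on general conversion routines can be illustrated as follows. This is a minimal sketch under stated assumptions, not a MULTEXT tool: the names GENERAL_NOUN, ITALIAN_NOUN and convert_noun are hypothetical, and the positions and codes are those of the noun table below. Language-specific values are kept in a separate table so that the routine can report which attributes fall outside the general model.

```python
# General (agreed) noun values by position; codes follow the noun table.
GENERAL_NOUN = {
    1: ("type",   {"c": "common", "p": "proper"}),
    2: ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    3: ("number", {"s": "singular", "p": "plural"}),
    4: ("case",   {"n": "nominative", "g": "genitive",
                   "d": "dative", "a": "accusative"}),
}

# Language-specific additions, kept in a separate table (here: Italian).
ITALIAN_NOUN = {2: {"c": "common"}, 3: {"n": "invariant"}}

def convert_noun(code, specific=None):
    """Convert a positional noun string to attribute-value pairs.

    Returns (features, flagged); flagged lists the attributes whose value
    came from the language-specific table rather than the general one.
    """
    if not code.startswith("N"):
        raise ValueError("not a noun description: %r" % code)
    specific = specific or {}
    features, flagged = {"cat": "noun"}, []
    for pos, ch in enumerate(code[1:], start=1):
        if ch == "-":                      # attribute does not apply
            continue
        if pos not in GENERAL_NOUN:
            raise ValueError("no attribute defined at position %d" % pos)
        name, values = GENERAL_NOUN[pos]
        if ch in values:
            features[name] = values[ch]
        elif ch in specific.get(pos, {}):
            features[name] = specific[pos][ch]
            flagged.append(name)
        else:
            raise ValueError("unknown code %r at position %d" % (ch, pos))
    return features, flagged

print(convert_noun("Ncms-"))
print(convert_noun("Nccn-", ITALIAN_NOUN))
```

Without the separate table, the second string would be rejected at position 2, which is the situation a general routine faces when language-specific codes are not declared.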

A decision must be taken on these cases, and more specific guidelines must be established for the further development of lexical descriptions. The comparison suggests that the general criterion "relevant for your language" is not enough. New guidelines must also take the application side into account.

Comparison tables

Abbreviations used:
    P    = Position (starts with 0 for encoding PoS values)
    ATT  = Attribute name
    VAL  = Value
    C    = Code
    l-s. = language-specific attribute or value (also written l-spc.)

    x    = value marked by a given group. Any character other than x
           means that the 'language group' codes the relevant value
           with that character in its application, instead of using
           the agreed one.
           The code column is left empty for the language-specific
           attributes and values of Dutch: they are attested among the
           attributes and values of the Dutch implementation of Mmorph,
           where they are not represented by single codes.
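
The convention for reading the group columns can be sketched as a small routine. The function read_marks and the list GROUPS are hypothetical names introduced only for this illustration; the sketch assumes that each group occupies a three-character cell in the order IT DE ES FR NL EN, as in the table headers.

```python
GROUPS = ["IT", "DE", "ES", "FR", "NL", "EN"]

def read_marks(marks, agreed_code):
    """Map each group to the code it uses for a value.

    `marks` is the fixed-width group part of a table row, three characters
    per group. "x" means the group uses the agreed code; any other
    character is the group's own code; an empty cell means the group does
    not use the value and is left out of the result.
    """
    used = {}
    for i, group in enumerate(GROUPS):
        cell = marks[3 * i: 3 * i + 3].strip()
        if cell:
            used[group] = agreed_code if cell == "x" else cell
    return used

# Row "Type / main / m" of the verb table: English codes the value as "v".
print(read_marks("x  x  x  x  x  v", "m"))
# Row "Gender / masculine / m" of the noun table: no mark for Dutch.
print(read_marks("x  x  x  x     x", "m"))
```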


1. Nouns (N)
                                Features used by the groups
                                        IT DE ES FR NL EN

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           common         c        x  x  x  x  x  x
                 proper         p        x  x  x  x  x  x
- -------------- -------------- -
2 Gender         masculine      m        x  x  x  x     x
                 feminine       f        x  x  x  x     x
                 neuter         n           x           x
        l-s.     common         c        x
        l-s.     De                                  x
        l-s.     Het                                 x
        l-s.     None                                x
- -------------- -------------- -
3 Number         singular       s        x  x  x  x  x  x
                 plural         p        x  x  x  x  x  x
        l-s.     invariant      n        x
- -------------- -------------- -
4 Case           nominative     n           x
                 genitive       g           x
                 dative         d           x
                 accusative     a           x
= ============== ============== =
5 Sem-gender     M                                   x
                 F                                   x
                 N                                   x
- -------------- -------------- -



2. Verbs (V)
                                Features used by the groups
                                        IT DE ES FR NL EN
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           main           m        x  x  x  x  x  v
                 auxiliary      a        x  x  x  x  x  x
                 modal          o           x  x        m
        l-s.     copula                              x
        l-s.     impersonal                          x
- -------------- -------------- -
2 Mood/VForm     indicative     i        x  x  x  x
                 subjunctive    s        x  x  x  x
                 imperative     m        x  x  x  x
                 conditional    c        x     x  x
                 infinitive     n        x  x  x  x  x
                 participle     p        x  x  x  x
                 gerund         g        x     x
                 supine         s
                 base           b                       x
        l-s. inf. + particle    u           x
        l-s. ImPart                                  x
        l-s. Past participle
        l-s. Present participle
        l-s. PerfPart                                x
        l-s. Fin                                     x
- -------------- -------------- -
3 Tense          present        p        x  x  x  x  x  x
                 imperfect      i        x  x  x  x
                 future         f        x     x  x
                 past           s        x     x  x  x  x
- -------------- -------------- -
4 Person         first          1        x  x  x  x  x  x
                 second         2        x  x  x  x  x  x
                 third          3        x  x  x  x  x  x
- -------------- -------------- -
5 Number         singular       s        x  x  x  x  x  x
                 plural         p        x  x  x  x  x  x
- -------------- -------------- -
6 Gender         masculine      m        x     x  x
                 feminine       f        x     x  x
                 neuter         n
        l-s.     common         c        x
= ============== ============== =
7 Clitic l-s.    no             n        x  x
                 yes            y        x  x
- -------------- -------------- -
8 Clitic  l-s.   both           t              x
                 accusative     a              x
                 dative         d              x
- -------------- -------------- -


3. Adjectives (A)

                                Features used by the groups
                                        IT DE ES FR NL EN
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           qualificative  f           x  x  x
                 ordinal        o           x     x
                 cardinal       c           x     x
                 indefinite     i                 x
                 possessive     s           x  x  x
        l-s.     part1          1           x
                 part2          2           x
- -------------- -------------- -
2 Degree         positive       p        x  x  x  x  x  x
                 comparative    c        x  x  x  x  x  x
                 superlative    s        x  x  x     x  x
- -------------- -------------- -
3 Gender         masculine      m        x  x  x  x
                 feminine       f        x  x  x  x
                 neuter         n           x
        l-s.     common         c        x
- -------------- -------------- -
4 Number         singular       s        x  x  x  x
                 plural         p        x  x  x  x
        l-spc.   invariant      n        x
- -------------- -------------- -
5 Case           nominative     n           x
                 genitive       g           x
                 dative         d           x
                 accusative     a           x
= ============== ============== =
6 Position l-spc. attributive   a                       x
                  predicative   p                       x
- -------------- -------------- -


4. Pronouns (P)

                                Features used by the groups

                                        IT DE ES FR NL EN
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           personal       p        x  x  x  x  x  x
                 demonstrative  d        x  x  x  x  x
                 indefinite     i        x  x  x  x
                 possessive     s        x     x  x     x
                 interrogative  t        x  x  x  x  x
                 relative       r        x  x  x  x  x
                 exclamative    e        x
                 reflexive      x           x        x  x
                 reciprocal     l                    x
      l-s.       general        g                       x
      l-s.       quantificational                    x
- -------------- -------------- -
2 Person         first          1        x  x  x  x  x  x
                 second         2        x  x  x  x  x  x
                 third          3        x  x  x  x  x  x
- -------------- -------------- -
3 Gender         masculine      m        x  x  x  x
                 feminine       f        x  x  x  x
                 neuter         n           x  x  x
        l-s.     common         c        x
- -------------- -------------- -
4 Number         singular       s        x  x  x  x  x  x
                 plural         p        x  x  x  x  x  x
        l-s.     invariant      n        x
- -------------- -------------- -
5 Case           nominative     n           x  x  x
                 genitive       g           x
                 dative         d           x  x
                 accusative     a           x  x
                 oblique        o              x  x
                 object         j                 x
        l-s.     1                                   x
        l-s.     4                                   x
- -------------- -------------- -
6 Possessor      singular       s              x  x     x
                 plural         p              x  x     x
= ============== ============== =
7 Wh             Not-wh         n                       x
                 Relative       r                       x
                 Int            q                       x
- -------------- -------------- -
8 Poss-person    First          1                       x
                 Second         2                       x
                 Third          3                       x
- -------------- -------------- -
9 Poss-gender    Masculine      m                       x
                 Feminine       f                       x
                 Neuter         n                       x
- -------------- -------------- -
10 Sem-gender    M                                   x
                 F                                   x
                 N                                   x
- -------------- -------------- -



5. Determiners (D)
                                Features used by the groups
                                        IT DE ES FR NL EN

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           demonstrative  d        x  x  x  x  x
                 indefinite     i        x  x  x  x
                 possessive     s        x  x  x  x  x  x
                 interrogative  t        x  x  x  x
                 exclamative    e        x
                 relative       r        x
                 article        a                 x  x
        l-s.     Def-article    t                       x
        l-s.     Indef-article  a                       x
        l-s.     General        g                       x
        l-s.     quantificational                    x
- -------------- -------------- -
2  Person        first          1        x  x  x  x
                 second         2        x  x  x  x
                 third          3        x  x  x  x
- -------------- -------------- -
3  Gender        masculine      m        x  x  x  x  x
                 feminine       f        x  x  x  x  x
                 neuter         n           x  x     x
        l-s      common         c        x
- -------------- -------------- -
4  Number        singular       s        x  x  x  x  x  x
                 plural         p        x  x  x  x  x  x
        l-s      invariant      n        x
- -------------- -------------- -
5  Case          nominative     n           x
                 genitive       g           x
                 dative         d           x
                 accusative     a           x
                 oblique        o
- -------------- -------------- -
6  Possessor     singular       s              x  x
                 plural         p              x  x
= ============== ============== =
7  Quantif./or   definite       d                 x  x
   Defness       indefinite     i                 x  x
- -------------- -------------- -
8  Wh            Not-wh         n                       x
                 Relative       r                       x
                 Int/Excl       q                       x
- -------------- -------------- -
9  Poss-person   First          1                       x
                 Second         2                       x
                 Third          3                       x
- -------------- -------------- -
10 Poss-gender   Masculine      m                       x
                 Feminine       f                       x
                 Neuter         n                       x
- -------------- -------------- -


6. Articles (T)
                                Features used by the groups
                                        IT DE ES FR NL EN

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           definite       d        x  x  x
                 indefinite     i        x  x  x
- -------------- -------------- -
2 Gender         masculine      m        x  x  x
                 feminine       f        x  x  x
                 neuter         n        x  x  x
        l-s.     common         c        x
- -------------- -------------- -
3 Number         singular       s        x  x  x
                 plural         p        x  x  x
- -------------- -------------- -
4 Case           nominative     n           x
                 genitive       g           x
                 dative         d           x
                 accusative     a           x
= ============== ============== =


7. Adverbs (R)
                                Features used by the groups
                                        IT DE ES FR NL EN

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           general        g              x  x
                 particle       p              x  x
        l-s.     degree         d           x
        l-s.     interrogative  i           x
        l-s.     conjunction    c           x
        l-s.     modal          m           x
        l-s.     pronom         p           x
        l-s.     temporal       t           x
        l-s.     place          l           x
- -------------- -------------- -
2 Degree         positive       p        x  x  x  x     x
                 comparative    c           x  x  x     x
                 superlative    s        x  x  x        x
        l-s.     negative       n                 x
= ============== ============== =
3 Function       mod                                    x
                 spe                                    x
- -------------- -------------- -
4 Wh-ness        interrogative  q                       x
                 relative       r                       x
                 no             n                       x
- -------------- -------------- -




8. Adpositions (S)
                                Features used by the groups
                                        IT DE ES FR NL EN

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1  Type          preposition    p        x  x  x  x  x  x
                 postposition   t           x        x  x
                 circumposition c           x
        l-s.     part1          a           x
        l-s.     part2          z           x
- -------------- -------------- -
2 Formation      simple         s        x  x  x
                 compound       c        x  x
= ============== ============== =
3 Gender         masculine      m        x
                 feminine       f        x
                 common         c        x
- -------------- -------------- -
4 Number         singular       s        x
                 plural         p        x
- -------------- -------------- -


9. Conjunctions (C)
                                Features used by the groups
                                        IT DE ES FR NL EN

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           coordinating   c        x  x  x  x  x  x
                 subordinating  s        x  x  x  x  x  x
      l-spc.     compar         v           x           x
      l-spc.     infinitive     i           x
      l-spc.     part1          a           x
      l-spc.     part2          z           x
= ============== ============== =
2 ctype          finite         f                       x
                 that           t                       x
                 subjunctive    s                       x
- -------------- -------------- -
3 coord-posit.   initial        i                       x
                 non-initial    n                       x
- -------------- -------------- -

10. Numerals (M)
                                Features used by the groups
                                        IT DE ES FR NL EN

= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           cardinal       c        x     x  x     x
                 ordinal        o        x     x        x
- -------------- -------------- -
2 Gender         masculine      m        x     x  x
                 feminine       f        x     x  x
                 neuter         n
- -------------- -------------- -
3 Number         singular       s        x     x  x
                 plural         p        x     x  x
- -------------- -------------- -
4 Case           nominative     n
                 genitive       g
                 dative         d
                 accusative     a
= ============== ============== =

                                Categories used by the groups
                                        IT DE ES FR NL EN
11. Interjections (I)                    x  x  x  x  x


12. Unique membership class (U)


13. Residual (X)                         x  x     x

14. Particle (Q)                            x

15. Punctuation (F)                      x  x

16. Abbreviations (Y)                       x
