
Lexical descriptions and corpus tags

After discussion on several issues, the MULTEXT partners agreed on the necessity of differentiating corpus tags used for the PoS disambiguator, or tagger, from the information which a lexicon can offer. This is because the former is an application-oriented representation of the information described by the latter and depends very much on the tool used. This decision was also in accordance with the orientations given by the EAGLES Lexicon and Corpus working groups (see Monachini and Calzolari, 1994).
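Thus the terminology adopted in MULTEXT reflects this separation: lexical descriptions for the information held in the lexicon, and corpus tags for the labels used by the tagger.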

Hence, it was agreed that two different objects will be produced for each language:

a. a lexicon where morphosyntactic features for each word form are encoded with fine granularity, as close as possible to the recommended EAGLES level-1.

b. a set of tags for the purpose of automatic disambiguation. In practical terms these tags are to reflect broader categories, because of the limitations of a statistical tool. This set will be defined and refined through experimentation with the tagger tool.

In Task 1.6, it was decided, in accordance with the Technical Annex, to take the EAGLES recommendations as the starting point for deciding on the basic morphosyntactic information to be associated with the word-forms contained in the lexicon. The MULTEXT application of the EAGLES recommendations needed to ensure that the information contained in the so-called Level-1 was significant for most of the languages to be treated. Thus the work under this task may also be considered a concrete validation of the EAGLES work on electronic lexica. The underlying aim of EAGLES is the re-usability of electronic lexica, and, following this general tendency, MULTEXT lexical descriptions also had to be (as far as possible) independent of the application, aiming at a general description of each language and containing a basic set of shared information. Also, for the sake of "re-usability" of the lexical material supplied, it was judged that the lexical information to be encoded should be as detailed as possible. Fine granularity of the information would thus allow other users to rearrange categories, when necessary, without much difficulty.

The actual corpus tags we will be using will depend on at least the following:

  1. the lexical features, and
  2. the capabilities of the MULTEXT tagger to disambiguate between different lexical descriptions, or between the different types of homography typical of different language types.

We can fix (1), but (2) is highly dependent on the tool. That is why we concentrated on (1) in the first phase.

The corpus tags will be developed for each language with a specific application in mind, i.e. that of producing a corpus tagged for part-of-speech (and possibly other morphosyntactic information) by means of automatic disambiguation. The set of corpus tags will, very likely, be revised many times during the course of the project, in order to find an optimal set for each language.

It would be ideal to tag a corpus with the full lexical description of each word. However, it is well known that this is well beyond the capabilities of state-of-the-art tagging techniques.

Corpus tags are, therefore, to be seen as kinds of underspecified lexical tags. There are two reasons why we may want underspecified corpus tags:

1. Experience shows that some distinctions are difficult to get right with a high accuracy.

For example, in some languages, the disambiguation between indicative present and subjunctive present in a corpus is extremely difficult to achieve by automatic means. While some verbs have different forms for the indicative and the subjunctive (e.g. Fr. venir: indic. = viens, subj. = vienne; It. venire: indic. = vieni, subj. = venga), many have the same form (e.g. Fr. manger: indic., subj. and imper. = mange; It. amare: indic. and subj. = ami). In this latter case, disambiguation can only be achieved with very complex parsing of sentences.

Therefore, lexical entries will contain the following detailed and granular information associated with the word-forms:

    mange (manger) Main verb Indicative present, 1st person sing.
    mange (manger) Main verb Indicative present, 3rd person sing.
    mange (manger) Main verb Subjunctive present, 1st person sing.
    mange (manger) Main verb Subjunctive present, 3rd person sing.
    mange (manger) Main verb Imperative present, 2nd person sing.

    ami   (amare)  Main verb Indicative present,  2nd person sing.
    ami   (amare)  Main verb Subjunctive present, 1st person sing.
    ami   (amare)  Main verb Subjunctive present, 2nd person sing.
    ami   (amare)  Main verb Subjunctive present, 3rd person sing.

whereas corpus tags will provide broader categories, collapsing several lexical descriptions.

2. In order to train the tagger, we need statistical tables (based on co-occurrences of tags). If we have a large tagset, we need a very large corpus to train the disambiguator, in order to observe rare co-occurrences. For example, in the proposal for French (see below), there are 249 different lexical descriptions, but only 74 collapsed corpus tags. Experience (Church, Penn Treebank, IBM France, etc.) shows that the tagset should contain fewer than 100 tags. Indeed, the Penn Treebank collapsed many tags compared to the original Brown corpus, and got better results.
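
The two reasons above can be illustrated with a minimal Python sketch. It is not the MULTEXT lexicon format or tagger: the field names, the collapsing rule (dropping mood and person) and the resulting tag name are hypothetical, and the table-size figure assumes that "co-occurrences of tags" means tag pairs (a bigram table), which the text does not state.

```python
from collections import namedtuple

# Field names are illustrative only, not the MULTEXT lexicon format.
LexicalDescription = namedtuple(
    "LexicalDescription", "form lemma category mood tense person number"
)

# The five lexical descriptions of French "mange" listed in the text.
MANGE = [
    LexicalDescription("mange", "manger", "main verb", "indicative", "present", 1, "sing"),
    LexicalDescription("mange", "manger", "main verb", "indicative", "present", 3, "sing"),
    LexicalDescription("mange", "manger", "main verb", "subjunctive", "present", 1, "sing"),
    LexicalDescription("mange", "manger", "main verb", "subjunctive", "present", 3, "sing"),
    LexicalDescription("mange", "manger", "main verb", "imperative", "present", 2, "sing"),
]


def corpus_tag(ld):
    """Collapse a lexical description into a broader, hypothetical corpus tag.

    Mood and person are dropped, since these are the distinctions described
    above as hard to disambiguate automatically (reason 1).
    """
    label = "-".join([ld.category, ld.tense, ld.number])
    return label.upper().replace(" ", "_")


def table_cells(tagset_size, n=2):
    """Number of possible tag n-grams, i.e. cells in a co-occurrence table."""
    return tagset_size ** n


collapsed = {corpus_tag(ld) for ld in MANGE}
print(len(MANGE), "lexical descriptions ->", len(collapsed), "corpus tag:", sorted(collapsed))

# Tagset sizes quoted for the French proposal (reason 2).
for size in (249, 74):
    print(size, "tags ->", table_cells(size), "possible tag pairs")
```

Under the bigram assumption, 249 lexical descriptions give 62001 possible pairs against 5476 for the 74 corpus tags, roughly eleven times as many cells to estimate from the same training corpus.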

Two other observations are of relevance as regards the relation between lexical specifications and corpus tags.

(a) Sometimes tagging classes are in reality different from lexical descriptions. For example, classes for punctuation are needed, and certain types of semantic, pragmatic or lexical information can be present in the tags (e.g. the days of the week).

(b) Furthermore, the "collapsing" decisions in corpus tags are language dependent, so it is not possible to have completely identical tagsets across languages. The differences in how persons are distinguished in verbal morphology provide an illustration.

In Spanish, first and third person of different tenses have the same spelling:

     Yo/Él cantaba (Imperfect)
     Yo/Él cantaría (Conditional)
     Yo/Él cante (Present subjunctive)

Taking into account that the subject in Spanish is not obligatory, and that the tagger cannot know whether the preceding NP is in fact the subject of the verb, there is no way to discriminate between the two forms. Hence a conflating tag is recommended, marked for instance as "non-second-singular" or as "first-third-singular". French also has homographs for different verbal persons, but these are the first and the second person of some tenses:

     Je/Tu viens
     Je/Tu étais

The French tag cannot be the same as the Spanish one, but it could be "non-third-singular" or "first-second-singular". Moreover, having two different tags in French for the homograph could be justified by the obligatory presence of a lexical subject: the tagger will be able to disambiguate between them thanks to the presence of a pronoun in the near context of most of their occurrences.
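
The language dependence of these conflations can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and the generated tag names are hypothetical, and the second-person Spanish form (cantabas) and third-person French form (vient) are added here to complete the singular paradigms; they are not taken from the text.

```python
from collections import defaultdict

ORDINALS = {1: "first", 2: "second", 3: "third"}


def conflated_person_tags(singular_forms):
    """Group singular persons that share a spelling into one tag per form.

    `singular_forms` maps a person (1, 2 or 3) to its written form.
    Returns a dict mapping each written form to a hypothetical tag name.
    """
    persons_by_form = defaultdict(list)
    for person in sorted(singular_forms):
        persons_by_form[singular_forms[person]].append(person)
    return {
        form: "-".join(ORDINALS[p] for p in persons) + "-singular"
        for form, persons in persons_by_form.items()
    }


# Spanish imperfect of "cantar" and French present of "venir" (singular only).
spanish_imperfect = {1: "cantaba", 2: "cantabas", 3: "cantaba"}
french_present = {1: "viens", 2: "viens", 3: "vient"}

print(conflated_person_tags(spanish_imperfect))
print(conflated_person_tags(french_present))
```

The same procedure yields a "first-third-singular" tag for Spanish cantaba and a "first-second-singular" tag for French viens, which is why a single shared tagset cannot simply be copied from one language to the other.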

For some languages (e.g. French, English and Italian) there is a lot of past experience and empirical evidence, which can be used to choose a reasonable initial tagset. This tagset should be seen as preliminary and can be refined later on in the project. For example, for English, the Penn tagset or the BNC tagset are very good candidates. For French, the IBM tagset is a very good start (the French proposal presented in the following is very close to it). For Italian, the tagset based on the DMI (Calzolari et al. 1983) is also a good starting point. These tagsets are the result of years of trial-and-error adjustments, and it seems reasonable not to ignore them. All of these tagsets are, moreover, compatible with the EAGLES proposal, i.e. mappable to it.



Multext