After discussion on several issues, the MULTEXT partners agreed on the
necessity of differentiating corpus tags used for the PoS
disambiguator, or tagger, from the information which a lexicon can
offer. This is because
the former is an application-oriented representation of the
information described by the latter and depends very much on the tool
used. This decision was also in accordance with the orientations
given by the EAGLES Lexicon and Corpus working groups (see Monachini
and Calzolari, 1994).
Thus the terminology adopted in MULTEXT reflects this separation.
It was agreed that two different objects will be produced for each
language:
a. a lexicon where morphosyntactic features for each word form are
encoded with fine granularity, as close as possible to the
recommended EAGLES level-1.
b. a set of tags for the purpose of automatic disambiguation. In
practical terms, these tags reflect broader categories, dictated by
the limitations of a statistical tool. This set will be defined
and refined through experimentation with the tagger.
In Task 1.6, it was decided - in accordance with the Technical Annex -
to begin by taking the EAGLES recommendations as
input for deciding on the basic morphosyntactic information to be
associated with the word-forms contained in the lexicon.
The MULTEXT application of the EAGLES
recommendations needed to ensure that the information contained in the
so-called
Level-1 was significant for most of the languages to be
treated. Thus the work under this task may also be considered as
a concrete
validation of EAGLES work on electronic lexica. The
underlying aim of EAGLES is the re-usability of
electronic lexica and, following this general tendency, MULTEXT lexical
descriptions
also had to be (as far as possible) independent of the
application, aiming at a general description of each language and
containing a basic set of shared information.
Also, for
the sake of "re-usability" of the lexical material supplied, it was
judged that the lexical information to be encoded should be as detailed
as possible. Fine granularity of the information would thus allow
other users to rearrange categories, when necessary, without much
difficulty.
The actual corpus tags we will be using will depend on at least the following:
We can fix (1), but
(2) is highly dependent on the tool.
That is why we concentrated on (1) in the first phase.
The corpus tags will be developed for each language with a specific
application in mind, i.e. that of producing a corpus tagged for
part-of-speech
(and possibly other morphosyntactic information) by means of
automatic disambiguation. The set of corpus tags will, very likely, be
revised many times during the course of the project, in order to find
an optimal set for each language.
It would be ideal to tag a corpus with the lexical descriptions
themselves for each word. However, it is well known that this is far
beyond the capabilities of state-of-the-art tagging techniques.
Corpus tags are, therefore, to be seen as kinds of underspecified
lexical tags. There are two reasons why we may want underspecified
corpus tags:
1. Experience shows that some distinctions are difficult to get right
with high accuracy.
For example, in some languages, disambiguating between indicative
present and subjunctive present in a corpus is extremely difficult
to achieve by
automatic means. While some verbs have different forms for the indicative
and the subjunctive (e.g. Fr. venir: indic. = viens, subj. = vienne;
It. venire: indic. = vieni, subj. = venga), many have the same form
(e.g. Fr. manger: indic., subj. and imper. = mange; It. amare: indic.
and subj. = ami).
In this latter case, disambiguation can only be achieved with very
complex
parsing of sentences.
Therefore, lexical entries will contain the following detailed and granular information associated with the word-forms:

   mange (manger)   Main verb   Indicative present, 1st person sing.
   mange (manger)   Main verb   Indicative present, 3rd person sing.
   mange (manger)   Main verb   Subjunctive present, 1st person sing.
   mange (manger)   Main verb   Subjunctive present, 3rd person sing.
   mange (manger)   Main verb   Imperative present, 2nd person sing.
   ami (amare)      Main verb   Indicative present, 2nd person sing.
   ami (amare)      Main verb   Subjunctive present, 1st person sing.
   ami (amare)      Main verb   Subjunctive present, 2nd person sing.
   ami (amare)      Main verb   Subjunctive present, 3rd person sing.

whereas corpus tags will provide broader categories, collapsing several
lexical descriptions.
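
The collapsing of several lexical descriptions into one corpus tag can
be sketched as follows. This is only an illustration: the tuple layout,
the function names and the tag string format are hypothetical, and the
choice to drop mood and person is an assumption made for the example,
not the collapsing rule actually adopted for French.

```python
from collections import defaultdict

# Fine-grained lexical descriptions for the French form "mange", as listed above.
# Hypothetical layout: (word form, lemma, category, mood, tense, person, number).
LEXICAL_ENTRIES = [
    ("mange", "manger", "verb-main", "indicative", "present", 1, "sing"),
    ("mange", "manger", "verb-main", "indicative", "present", 3, "sing"),
    ("mange", "manger", "verb-main", "subjunctive", "present", 1, "sing"),
    ("mange", "manger", "verb-main", "subjunctive", "present", 3, "sing"),
    ("mange", "manger", "verb-main", "imperative", "present", 2, "sing"),
]


def corpus_tag(entry):
    """Map one lexical description to a broader, hypothetical corpus tag.

    Assumption for this sketch: mood and person are dropped, so only
    category, tense and number survive in the tag.
    """
    _form, _lemma, category, _mood, tense, _person, number = entry
    return f"{category}/{tense}/{number}"


def collapse(entries):
    """Group lexical descriptions by (word form, corpus tag)."""
    groups = defaultdict(list)
    for entry in entries:
        groups[(entry[0], corpus_tag(entry))].append(entry)
    return dict(groups)


collapsed = collapse(LEXICAL_ENTRIES)
for (form, tag), descriptions in collapsed.items():
    print(form, tag, len(descriptions))
```

Under this assumed rule the five descriptions of "mange" fall under a
single underspecified tag, so the tagger no longer has to choose between
them.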
2. In order to train the tagger, we need statistical tables (based on
co-occurrences of tags). If we have a large tagset, we need a very
large corpus to train the disambiguator, in order to observe rare
co-occurrences. For example, in the proposal for French (see below),
there are 249 different lexical descriptions, but only 74 collapsed
corpus tags. Experience (Church, Penn Treebank, IBM France, etc.)
shows that the tagset should contain fewer than 100 tags. Indeed, the
Penn Treebank collapsed many tags compared to the original Brown corpus,
and got better
results.
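
To give a rough idea of the effect of tagset size on the statistical
tables, the sketch below assumes that the table records co-occurrences
of adjacent tag pairs (a bigram table); the text does not specify the
order of the model, so this is an assumption. The function names and the
tag names in the usage example are hypothetical.

```python
from collections import Counter


def bigram_table_size(tagset_size):
    """Number of cells in a tag-pair table (assumed bigram model)."""
    return tagset_size ** 2


def count_tag_bigrams(tag_sequence):
    """Count co-occurrences of adjacent tags in a tagged training sequence."""
    return Counter(zip(tag_sequence, tag_sequence[1:]))


lexical_cells = bigram_table_size(249)  # 249 lexical descriptions (French)
corpus_cells = bigram_table_size(74)    # 74 collapsed corpus tags (French)

print(lexical_cells)  # 62001
print(corpus_cells)   # 5476

# Hypothetical tag names, for illustration only.
counts = count_tag_bigrams(["DET", "NOUN", "VERB", "DET", "NOUN"])
print(counts[("DET", "NOUN")])  # 2
```

Under this assumption the table for the full set of lexical descriptions
has more than eleven times as many cells to estimate as the table for
the collapsed tagset, which is why a much larger training corpus would
be needed to observe the rare pairs.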
Two other observations are of relevance as regards the relation
between lexical specifications and corpus tags.
(a) Sometimes tagging classes are in reality different from lexical
descriptions. For example,
classes for punctuation are needed, and certain types of
semantic, pragmatic or lexical information can be present in the
tags (e.g. the days of the week).
(b) Furthermore, the "collapsing" decisions for corpus tags are language
dependent; it is therefore not possible to have completely identical
tagsets across languages. To illustrate, consider the
differences related to person distinctions in verbal morphology.
In Spanish, first and third person of different tenses have the same spelling:

   Yo/Él cantaba (Imperfect)
   Yo/Él cantaría (Conditional)
   Yo/Él cante (Present subjunctive)

Taking into account that the subject in Spanish is not obligatory, and that the tagger cannot know whether the preceding NP is in fact the subject of the verb, there is no way to discriminate between the two forms. Hence a conflating tag is recommended, marked for instance as a "non-second-singular" form or as "first-third-singular". French also has homographs for different verbal persons, but these are the first and the second person of some tenses:

   Je/Tu viens
   Je/Tu étais

The French tag cannot be the same as the Spanish one, but it could be "non-third-singular" or "first-second-singular". Moreover, having two different tags in French for the homographs could be justified by the obligatory presence of a lexical subject: the tagger will be able to disambiguate between them because a pronoun occurs in the near context of most of their occurrences.
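
The language dependence of these conflations can be made concrete with a
small sketch that finds, for a given paradigm, which persons share a
spelling. The function name and the dictionary layout are hypothetical;
the second-person form "cantabas" and the third-person form "vient" do
not appear in the examples above and are added here to complete the
singular paradigms.

```python
from collections import defaultdict


def conflated_persons(paradigm):
    """Return the forms shared by more than one person.

    `paradigm` maps a person number (1, 2, 3) to its written form.
    """
    groups = defaultdict(list)
    for person, form in sorted(paradigm.items()):
        groups[form].append(person)
    return {form: persons for form, persons in groups.items() if len(persons) > 1}


# Singular forms only; number is left implicit in this sketch.
SPANISH_IMPERFECT_SING = {1: "cantaba", 2: "cantabas", 3: "cantaba"}
FRENCH_PRESENT_SING = {1: "viens", 2: "viens", 3: "vient"}

print(conflated_persons(SPANISH_IMPERFECT_SING))  # {'cantaba': [1, 3]}
print(conflated_persons(FRENCH_PRESENT_SING))     # {'viens': [1, 2]}
```

The two languages yield different groupings (first/third in Spanish,
first/second in French), so a single conflated person tag cannot be
shared between them.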
For some languages (e.g. French, English and Italian) a lot of past experience and empirical evidence exist, which can be used to choose a reasonable initial tagset; this tagset is to be seen as preliminary and can be refined later on in the project. For example, for English, the Penn tagset and the BNC tagset are very good candidates. For French, the IBM tagset is a very good start (the French proposal presented in the following is very close to it). For Italian, the tagset based on the DMI (Calzolari et al. 1983) is also a good starting point. These tagsets are the result of years of trial-and-error adjustments, and it seems reasonable not to ignore them. All of these tagsets are, moreover, compatible with the EAGLES proposal, i.e. mappable to it.