As stated in the Introduction, MULTEXT will supply a morphological tool that needs some information on lemmas in order to produce the entire set of associated word-forms. A list of word-forms will constitute another lexicon which, as noted in the Technical Annex, is of value in itself. Hence, the lexica supplied by MULTEXT are of two types:
(a) word-forms, containing Word-form, morphosyntactic information, lemma, TAG
(b) lemmas, containing Lemma, morphosyntactic information, inflection information
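As an informal illustration of the two lexicon types, the records could be modelled as follows. This is only a sketch: the class names, field names, and sample values (including the tag strings and the inflection label) are hypothetical and are not taken from the MULTEXT specification.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WordFormEntry:
    """One entry of the word-form lexicon (type a). Field names are assumed."""
    word_form: str
    morphosyntax: Dict[str, str]  # morphosyntactic information as attribute/value pairs
    lemma: str
    tag: str                      # corpus tag; the values used below are illustrative only


@dataclass
class LemmaEntry:
    """One entry of the lemma lexicon (type b). Field names are assumed."""
    lemma: str
    morphosyntax: Dict[str, str]
    inflection: str               # hypothetical label for an inflection class


def forms_of(lemma: str, lexicon: List[WordFormEntry]) -> List[str]:
    """Return every word-form in the lexicon that is associated with the given lemma."""
    return [entry.word_form for entry in lexicon if entry.lemma == lemma]


# Illustrative data only.
word_forms = [
    WordFormEntry("house", {"cat": "noun", "number": "singular"}, "house", "N-sg"),
    WordFormEntry("houses", {"cat": "noun", "number": "plural"}, "house", "N-pl"),
    WordFormEntry("mice", {"cat": "noun", "number": "plural"}, "mouse", "N-pl"),
]
house = LemmaEntry("house", {"cat": "noun"}, "regular-s")

house_forms = forms_of(house.lemma, word_forms)
```

Here each word-form entry carries its own lemma, so the word-form list can be used without the morphological tool, while the lemma entry holds only what a tool would need to generate the forms.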
The information and notation of the lemma dictionary are closely tied to the morphological tool used and to the rules implemented within that tool. As mentioned in the Introduction, the availability of word-form lists was considered a priority for the development of corpus annotation tools, so we first concentrated on defining the word-form lists, following EAGLES recommendations for the morphosyntactic annotation to be encoded, as explained in the preceding section. It was possible to define a representation of morphosyntactic information for these word-form lists that is independent of any morphological tool, so as to ensure that lemma dictionaries and the output of morphological modules (those produced for MULTEXT or others) are compatible with, and easily mappable to, such lists. Following current practice in NLP, the notation should represent information in an attribute/value formalism (as was also done in EAGLES) and should be self-explanatory for human inspection and understanding. The desirability of descriptions that can convey language-specific characteristics was also taken into account. Following these ideas, a notation format was suggested whose main characteristics are:
These characteristics make the proposed lexical description notation (see section 3.1 for more details) equivalent to the attribute/value pairs used in current unification formalisms. The next sections introduce this formalism and the information to be encoded.
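To make the link with unification formalisms concrete, the following sketch unifies two flat attribute/value descriptions. It assumes descriptions are simple, non-nested sets of attribute/value pairs; the function name `unify` and the attribute names are hypothetical, and real unification formalisms also handle nested structures and variables, which are omitted here.

```python
from typing import Dict, Optional


def unify(a: Dict[str, str], b: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Unify two flat attribute/value descriptions.

    Returns the merged description, or None when the two descriptions
    assign different values to the same attribute.
    """
    merged = dict(a)
    for attribute, value in b.items():
        if attribute in merged and merged[attribute] != value:
            return None  # conflicting values: unification fails
        merged[attribute] = value
    return merged


# Illustrative descriptions only.
lemma_info = {"cat": "noun"}
form_info = {"cat": "noun", "number": "plural"}
clash_info = {"cat": "verb"}

compatible = unify(lemma_info, form_info)
incompatible = unify(lemma_info, clash_info)
```

Under these assumptions, a lemma description and a word-form description are compatible when they unify, which is one way of checking that a lemma dictionary maps cleanly onto a word-form list.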