In Multext, the notation follows current NLP practice, in which information is represented in attribute-value formalisms, and it is also meant to be self-explanatory for human readers. At the same time, a relatively compact encoding has been maintained. The suggested notation format has the following main characteristics:
These characteristics make the proposed lexical notation similar to the attribute-value pairs used in unification-based formalisms (see the Multext D1-6-1B Deliverable for further details).
The linear strings of characters representing the morphosyntactic descriptions are constructed following the philosophy of the Intermediate Format proposed in the Eagles Corpus proposal (Leech and Wilson, 1994), i.e. of having agreed symbols in predefined and fixed positions: the positions of a string of characters are numbered 0, 1, 2, etc. in the following way:
Ncms- (Noun, common, masculine, singular, nocase)
This notation adopts the Eagles Intermediate Format with a small revision: the Intermediate Format encodes information by means of digits, while in Multext characters of a mnemonic nature are preferred.
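The positional reading of a code such as Ncms- can be sketched in a few lines of Python. This is only an illustration under stated assumptions: position 0 is taken to hold the category, the attribute order (Type, Gender, Number, Case) is inferred from the gloss of the example, and the tables and the function name `decode_msd` are hypothetical, covering just the symbols of that one example rather than the actual Multext tables.

```python
# Hypothetical tables covering only the noun example from the text;
# the real Multext tables are larger and language-specific.
NOUN_ATTRIBUTES = ["Type", "Gender", "Number", "Case"]
NOUN_VALUES = {
    "Type": {"c": "common"},
    "Gender": {"m": "masculine"},
    "Number": {"s": "singular"},
    "Case": {},  # no case values are needed for the example
}
# Assumption: the symbol at position 0 selects the category.
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES, NOUN_VALUES)}

NOT_APPLICABLE = "-"


def decode_msd(code):
    """Map a positional code to a dictionary of attribute-value pairs."""
    category_name, attributes, values = CATEGORIES[code[0]]
    decoded = {"Category": category_name}
    # Attribute i of the category is read from position i of the string.
    for position, attribute in enumerate(attributes, start=1):
        # A position beyond the end of the string is treated like '-'.
        symbol = code[position] if position < len(code) else NOT_APPLICABLE
        if symbol == NOT_APPLICABLE:
            decoded[attribute] = None
        else:
            decoded[attribute] = values[attribute][symbol]
    return decoded


print(decode_msd("Ncms-"))
```

Because each attribute is identified only by its position, the lookup needs no attribute names inside the code string itself, which is what keeps the encoding compact.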
The marker '-' has special semantics: it means 'not applicable'. As stated above, its function is simply to preserve the positional correspondence between attributes and values. It may be used in the following cases:
Finally, it should be noted that in the lexica, trailing hyphens have
been omitted, as this loses no information and leads to a more
compact encoding. Hence codes like
Ncms- are written as