MULTEXT-East Morphosyntactic Specifications

1.3. Notation

In MULTEXT, the notation follows current NLP practice, in which information is represented in attribute-value formalisms, and is also meant to be self-explanatory for human readers. At the same time, a relatively compact encoding was maintained. The notation format has the following main characteristics:

These characteristics make the proposed lexical notation similar to the attribute-value pairs used in unification-based formalisms (see the MULTEXT D1-6-1B Deliverable [mt:D161B] for further details).

The linear strings of characters representing the morphosyntactic descriptions are constructed following the philosophy of the Intermediate Format proposed in the Eagles Corpus proposal [eagles:morphana], that is, agreed symbols occupy predefined, fixed positions. The positions of a string of characters are numbered 0, 1, 2, etc. in the following way:

Example: Ncms- (Noun, common, masculine, singular, nocase)
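The positional reading of the example can be sketched in code. The following is a minimal illustration in Python, not part of the specification: the function name decode_msd and the lookup tables are hypothetical, and the tables cover only the codes that appear in the example Ncms- (category at position 0, then Type, Gender, Number, Case). A real decoder would need the full per-language attribute tables.

```python
# Hypothetical, minimal tables: they contain only the codes used in the
# example "Ncms-" and are NOT the complete specification tables.
CATEGORIES = {"N": "Noun"}

# Attribute order assumed from the example: positions 1..4 for nouns.
NOUN_ATTRIBUTES = ["Type", "Gender", "Number", "Case"]

VALUES = {
    "Type": {"c": "common"},
    "Gender": {"m": "masculine"},
    "Number": {"s": "singular"},
    "Case": {},  # no case values are needed for this example
}


def decode_msd(msd):
    """Decode a noun description string into (category, features).

    Position 0 holds the category; each later position holds the value
    code of one attribute. A hyphen (or a missing trailing position)
    is reported as None, i.e. 'not applicable'.
    """
    category = CATEGORIES[msd[0]]
    features = {}
    for position, attribute in enumerate(NOUN_ATTRIBUTES, start=1):
        code = msd[position] if position < len(msd) else "-"
        features[attribute] = None if code == "-" else VALUES[attribute][code]
    return category, features


category, features = decode_msd("Ncms-")
print(category, features)
```

Because every attribute has a fixed position, decoding is a plain index lookup; no separators or attribute names need to be stored in the string itself.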

This notation adopts the Eagles Intermediate Format with a small revision: the Intermediate Format encodes information by means of digits, while in MULTEXT characters of a mnemonic nature are preferred.

The marker ‘-’ has special semantics: it means ‘not applicable’. It is used in the following cases:

Finally, note that trailing hyphens are omitted in the lexica, as this often leads to a more compact encoding. Hence codes like Ncms- are written as Ncms.
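As a rough sketch of this convention (the helper names compact_msd and pad_msd are hypothetical, and the full length of a description is assumed to be known from the category's attribute table), the compact and full forms can be converted into each other as follows. Only trailing hyphens are removed; a hyphen in the middle of a string still marks a fixed position and must be kept.

```python
def compact_msd(msd):
    """Drop trailing hyphens only; internal hyphens keep their positions."""
    return msd.rstrip("-")


def pad_msd(msd, length):
    """Restore the full-length form by re-adding trailing hyphens.

    `length` is the total number of positions for the category; it is an
    assumption supplied by the caller, not derived from the string.
    """
    return msd.ljust(length, "-")


print(compact_msd("Ncms-"))   # compact form used in the lexica
print(pad_msd("Ncms", 5))     # full form with the trailing hyphen restored
```

Padding is lossless as long as the expected length is known, which is why the shorter form can safely be stored in the lexica.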

Date: 2022-06-24
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International.