MULTEXT-East Morphosyntactic Specifications, Version 5 (draft)

1.3. Notation


In MULTEXT, the notation follows current practice in NLP, where information is represented in attribute-value formalisms, and the idea that it should also be self-explanatory to human readers. At the same time, a relatively compact encoding was maintained. The notation format has the following main characteristics:

attributes are marked by positions;
values are represented by a single character;
a special marker reflects the non-applicability of a given attribute.

These characteristics make the proposed lexical notation similar to the attribute-value pairs used in unification-based formalisms (see the MULTEXT D1-6-1B Deliverable [mt:D161B] for further details).

The linear strings of characters representing the morphosyntactic descriptions are constructed following the philosophy of the Intermediate Format proposed in the Eagles Corpus proposal [eagles:morphana], i.e. of having agreed symbols in predefined and fixed positions. The positions of a string of characters are numbered 0, 1, 2, etc., and are used in the following way:

the agreed character at position 0 encodes part-of-speech;
each character at positions 1, 2, ..., n encodes the value of one attribute (person, gender, number, etc.);
if an attribute does not apply, the corresponding position in the string contains a special marker, the hyphen (‘-’).

Example: Ncms- (Noun, common, masculine, singular, no case)
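
The positional scheme can be illustrated with a minimal Python sketch that decodes such a string into attribute-value pairs. The tables CATEGORIES and ATTRIBUTES and the function decode_msd are hypothetical names introduced here for illustration; they contain only what is needed to decode the example Ncms-, and the attribute order for Noun (Type, Gender, Number, Case) is assumed from that example rather than taken from the category tables of the specifications.

```python
# Illustrative tables only: just enough to decode the example "Ncms-".
# The attribute order and value codes below are assumptions based on
# that single example, not the full specification tables.
CATEGORIES = {"N": "Noun"}
ATTRIBUTES = {
    "N": [
        ("Type", {"c": "common"}),
        ("Gender", {"m": "masculine"}),
        ("Number", {"s": "singular"}),
        ("Case", {}),  # no value codes listed in this sketch
    ],
}
NOT_APPLICABLE = "-"


def decode_msd(msd):
    """Decode a positional string into (category, {attribute: value})."""
    if not msd:
        raise ValueError("empty description")
    pos = msd[0]  # position 0 encodes the part of speech
    if pos not in CATEGORIES:
        raise ValueError(f"unknown category code: {pos!r}")
    slots = ATTRIBUTES[pos]
    if len(msd) - 1 > len(slots):
        raise ValueError(f"too many positions for category {pos!r}")
    features = {}
    # Positions 1, 2, ..., n each encode the value of one attribute.
    for (name, values), code in zip(slots, msd[1:]):
        if code == NOT_APPLICABLE:
            continue  # the attribute does not apply; leave it out
        if code not in values:
            raise ValueError(f"unknown value {code!r} for attribute {name}")
        features[name] = values[code]
    return CATEGORIES[pos], features


category, features = decode_msd("Ncms-")
print(category, features)
```

In this sketch a hyphen simply leaves the attribute out of the result, so the decoded form contains only the attributes that apply.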

This notation adopts the Eagles Intermediate Format with a small revision: the Intermediate Format encodes information by means of digits, while in MULTEXT characters of a mnemonic nature are preferred.

The marker ‘-’ has special semantics: it means ‘not applicable’. It is used when an attribute is:

not relevant to a particular language, e.g. Gender to Estonian;
not applicable to a particular combination of attribute-values, i.e. although the attribute is used by a category in a given language it does not apply to a particular subclass of the category; e.g., Person applies to Pronouns, but not to the Type demonstrative;
not applicable to a particular lexical item, although the attribute applies to the rest of its paradigm; e.g., Gender in the paradigm of English Personal Pronouns applies only to the 3rd person (I, you vs. she, he).

Finally, it should be noted that in the lexica trailing hyphens have been omitted, as this often leads to a more compact encoding. Hence codes like Ncms- are written as Ncms.
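
The omission of trailing hyphens can be sketched in Python as follows. The function names compact and expand are hypothetical, and the full length passed to expand is assumed to be known from the category's attribute list; hyphens in non-final positions are left untouched, since they still mark a position.

```python
def compact(msd):
    """Drop trailing hyphens; position 0 (part of speech) is always kept."""
    if not msd:
        return msd
    return msd[0] + msd[1:].rstrip("-")


def expand(msd, length):
    """Pad a compact code with hyphens back to `length` characters.

    `length` is assumed to be 1 + the number of attributes defined
    for the category; codes already that long are returned unchanged.
    """
    return msd.ljust(length, "-")


print(compact("Ncms-"))    # Ncms
print(compact("Nc-s-"))    # Nc-s
print(expand("Ncms", 5))   # Ncms-
```

Because only trailing hyphens are removed, the compact and full forms identify the same description, and comparisons are safest after bringing both codes to the same form.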


Date: 2016-06-20
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.