
Notation

In Multext, the notation follows current practice in NLP, where information is represented in attribute-value formalisms, and it is also meant to be self-explanatory to human readers. At the same time, a relatively compact encoding has been maintained. The suggested notation format has the following main characteristics:

attributes are marked by positions;
values are represented by a single character;
a special marker reflects the non-applicability of a given attribute.

These characteristics make the proposed lexical notation similar to the attribute-value pairs used in unification-based formalisms (see the Multext D1-6-1B Deliverable for further details).

The linear strings of characters representing the morphosyntactic descriptions are constructed following the philosophy of the Intermediate Format proposed in the Eagles Corpus proposal (Leech and Wilson, 1994), i.e. agreed symbols in predefined, fixed positions. The positions of a string of characters are numbered 0, 1, 2, etc., in the following way:

a.: the agreed character at position 0 encodes part-of-speech;
b.: each character at positions 1, 2, ..., n encodes the value of one attribute (person, gender, number, etc.);
c.: if an attribute does not apply, the corresponding position in the string contains a special marker, the hyphen ('-').

Example: Ncms- (Noun, common, masculine, singular, no case)
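The positional scheme above can be illustrated with a minimal Python sketch that decodes such a string back into attribute-value pairs. The attribute order for nouns (Type, Gender, Number, Case) and the value codes are inferred from the single example Ncms-; the tables and the function name decode_msd are hypothetical and cover only that example, not the full Multext specification.

```python
# Illustrative tables only: inferred from the example "Ncms-",
# not taken from the full Multext tables.
POS_NAMES = {"N": "Noun"}
NOUN_ATTRIBUTES = ["Type", "Gender", "Number", "Case"]
VALUE_NAMES = {
    "Type": {"c": "common"},
    "Gender": {"m": "masculine"},
    "Number": {"s": "singular"},
    "Case": {},
}


def decode_msd(msd):
    """Decode a positional description into (part of speech, features)."""
    # Position 0 encodes the part of speech.
    pos = POS_NAMES[msd[0]]
    features = {}
    # Positions 1..n each encode the value of one attribute.
    for attribute, code in zip(NOUN_ATTRIBUTES, msd[1:]):
        if code == "-":
            # The hyphen marks the attribute as not applicable.
            features[attribute] = None
        else:
            # Fall back to the raw code for values missing from the table.
            features[attribute] = VALUE_NAMES[attribute].get(code, code)
    return pos, features


pos, features = decode_msd("Ncms-")
```

Here a not-applicable attribute is represented as None, which keeps the attribute visible in the result while recording that it carries no value.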

This notation adopts the Eagles Intermediate Format with a small revision: the Intermediate Format encodes information by means of digits, while in Multext characters of a mnemonic nature are preferred.

The marker '-' has special semantics: it means 'not applicable'. As stated above, its only function is to preserve the positional correspondence between attributes and values. It may be used in the following cases:

(a): not relevant to a particular language, e.g. Gender in Estonian.
(b): not applicable to a particular combination of attribute-values, i.e. although the attribute is used by a category in a given language, it does not apply to a particular subclass of that category; e.g. Person applies to Pronouns, but not to those of Type demonstrative.
(c): not applicable to a particular lexical item, i.e. the attribute applies to the rest of the item's paradigm but not to the item itself; e.g. Gender in the paradigm of English Personal Pronouns applies only to the 3rd person (I, you vs. she, he).

Finally, it should be noted that trailing hyphens have been omitted in the lexica, as this loses no information and leads to a more compact encoding. Hence codes like Ncms- are written as Ncms.
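The omission of trailing hyphens is reversible as long as the number of attributes for the category is known. The following Python sketch shows one way to do both directions; the function names compact and expand are hypothetical, and the attribute count of 4 for nouns is an assumption inferred from the example Ncms-.

```python
def compact(msd):
    """Drop trailing hyphens, leaving position 0 and inner hyphens untouched."""
    return msd[:1] + msd[1:].rstrip("-")


def expand(msd, n_attributes):
    """Pad with hyphens up to position 0 plus n_attributes positions."""
    return msd.ljust(n_attributes + 1, "-")


short_form = compact("Ncms-")       # trailing hyphen removed
full_form = expand(short_form, 4)   # assumed: 4 noun attributes
```

Only hyphens at the end of the string are removed: a hyphen in an inner position must be kept, since dropping it would shift every following value to the wrong attribute.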


Multext-East