
Notation

In MULTEXT, the notation follows current practice in NLP, where information is represented in attribute-value formalisms, and it is also meant to be self-explanatory for human readers. The need for these descriptions to convey language-specific characteristics has also been taken into account. To sum up, the suggested notation format has the following main characteristics:

These characteristics make the proposed lexical notation equivalent to the attribute-value pairs used in current unification formalisms (see the D1-6-1B Deliverable for further details).

The linear strings of characters that represent the morphosyntactic information associated with word-forms follow the philosophy of the Intermediate Format proposed in the EAGLES Corpus proposal (Leech and Wilson, 1994), i.e. agreed symbols in predefined, fixed positions. The positions of a string of characters are numbered 0, 1, 2, etc., as follows:

a. the agreed character at position 0 encodes part-of-speech;
b. each character at position 1, 2, ..., n encodes the value of one attribute (person, gender, number, etc.);
c. if an attribute does not apply, the corresponding position in the string contains a special marker, in our case '-' (hyphen).

Example: Ncms- (noun, common, masculine, singular, nocase)
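
The positional scheme can be illustrated with a minimal Python sketch that expands such a string into attribute-value pairs. It is only an illustration under stated assumptions: the function name decode, the attribute names (type, gender, number, case) and their order are hypothetical and inferred from the single example above, and the tables list only the codes that appear in that example, not the full MULTEXT inventory.

```python
# Illustrative sketch only: these tables are assumptions reconstructed from
# the single example "Ncms-"; they are not the full MULTEXT code tables.

NOT_APPLICABLE = "-"  # marker for an attribute that does not apply

# Position 0: part-of-speech code.
POS_CODES = {"N": "noun"}

# Positions 1..n: attribute order per part of speech (assumed order).
ATTRIBUTES = {"N": ["type", "gender", "number", "case"]}

# Value codes per (part of speech, attribute); only codes from the example.
VALUE_CODES = {
    ("N", "type"): {"c": "common"},
    ("N", "gender"): {"m": "masculine"},
    ("N", "number"): {"s": "singular"},
    ("N", "case"): {},
}


def decode(tag):
    """Expand a positional tag into a dict of attribute-value pairs."""
    if not tag:
        raise ValueError("empty tag")
    pos_code = tag[0]
    if pos_code not in POS_CODES:
        raise ValueError(f"unknown part-of-speech code: {pos_code!r}")
    names = ATTRIBUTES[pos_code]
    if len(tag) - 1 > len(names):
        raise ValueError(f"too many positions in tag: {tag!r}")
    result = {"pos": POS_CODES[pos_code]}
    # Each following position carries the value of exactly one attribute.
    for name, code in zip(names, tag[1:]):
        if code == NOT_APPLICABLE:
            result[name] = None  # attribute does not apply
            continue
        values = VALUE_CODES[(pos_code, name)]
        if code not in values:
            raise ValueError(f"unknown code {code!r} for attribute {name!r}")
        result[name] = values[code]
    return result


print(decode("Ncms-"))
```

Representing a non-applicable attribute as None keeps every position visible in the result, which mirrors the fixed-position design of the string itself.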

This notation adopts the EAGLES Intermediate Format with a small revision: the Intermediate Format encodes information by means of digits, while in MULTEXT characters of a mnemonic nature are preferred.

Note that this representation is proposed for word-form lists intended for a specific application, namely corpus annotation.





Tomaz Erjavec
Wed Oct 16 12:08:36 MDT 1996