Romanian Language Technology

Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical Processing

That is, the positions in a string of characters are numbered 0, 1, 2, etc., and are used in the following way:

the character at position 0 encodes part-of-speech;
each character at position 1, 2,..., n, encodes the value of one attribute (person, gender, number, etc.), using the one-character code;
if an attribute does not apply, the corresponding position in the string contains the special marker '-' (hyphen).

By convention, trailing hyphens are not included in the lexical MSDs. Such specifications provide a simple and relatively compact encoding, and are similar in intent to the feature-structure encodings used in unification-based grammar formalisms. When the word-form is identical to the lemma, an equals sign ('=') is written in the lemma field of the entry.
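The positional convention can be illustrated with a minimal Python sketch. The function names and the attribute list for nouns are assumptions made for this illustration: the actual attribute order is defined by the MULTEXT-EAST tables, which are not reproduced in this section.

```python
# Hypothetical attribute order for nouns (an assumption for illustration);
# the authoritative order is given by the MULTEXT-EAST tables.
NOUN_ATTRIBUTES = ["Type", "Gender", "Number", "Case", "Definiteness", "Clitic"]


def decode_msd(msd, attributes):
    """Return (part-of-speech, {attribute: one-character code}) for an MSD."""
    pos, codes = msd[0], msd[1:]
    # Restore the trailing hyphens that the lexicon omits by convention.
    codes = codes.ljust(len(attributes), "-")
    return pos, dict(zip(attributes, codes))


def parse_entry(line):
    """Split a 'word-form lemma MSD' line; '=' means lemma == word-form."""
    word_form, lemma, msd = line.split()
    if lemma == "=":
        lemma = word_form
    return word_form, lemma, msd


word_form, lemma, msd = parse_entry("munte = Ncmsnnn")
print(lemma, decode_msd(msd, NOUN_ATTRIBUTES))
```

Padding with hyphens on the right makes a shortened MSD and its full-length form decode to the same attribute map.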

The attributes, and especially their values, were chosen with only word-level encoding in mind. This is why some values described in grammar textbooks that presuppose compounding (such as compound tenses) were not considered in the MULTEXT-EAST encoding.

Concerning the "applicability" of a certain feature/value, one has to distinguish at least two cases:

a specific attribute is not applicable at all to a given language (as is the case in Romanian for the Referent_type and Owner_gender attributes of pronouns);
a specific combination of feature-values applying to a word-form of a given category makes some attributes of that category irrelevant in the case considered (for instance, tense for some non-predicative verbs, or gender for first- and second-person personal pronouns).

Both cases were encoded with a special attribute value, namely '-'.

A special case was raised by token-indeterminable values. This is the case when, without considering the context, a given word-form cannot be assigned a unique value for a given attribute, but only a subset of, or even all of, the permissible values of that attribute.

A typical example of such an attribute is grammatical case. For the indefinite forms of nouns and adjectives, nothing precise can be said about case unless the preceding word(s) are considered. In such a situation, one can only say that all possible values of the attribute in question (in this example, case) are applicable. Since the MULTEXT-EAST encoding schema does not use a special value "any" (which in some encoding proposals is denoted by a dot '.'), two possible solutions could be envisaged:

a) to explicitly encode in the word-form lexicon all the combinations resulting from expanding the would-be 'any' values. This solution is 'clean' but introduces a large degree of redundancy and ambiguity. Consider the 'dotted' encoding of the word 'munte':

munte munte Ncms.n.

Expanding the two dots (see tables on the next pages), this entry would have to be replaced by the following 10 fully expanded entries (five values at the first dotted position times two at the second):

munte munte Ncmsnnn munte munte Ncmsnny
munte munte Ncmsgnn munte munte Ncmsgny
munte munte Ncmsdnn munte munte Ncmsdny
munte munte Ncmsann munte munte Ncmsany
munte munte Ncmsvnn munte munte Ncmsvny
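The expansion can be reproduced with a short Python sketch. The value inventories below are read off the ten entries above rather than taken from the MULTEXT-EAST tables, and the names `VALUES_AT` and `expand_dotted` are hypothetical.

```python
from itertools import product

# Permissible values per MSD position, as read off the ten entries above
# (an assumption for this sketch). Positions count from 0, the
# part-of-speech character.
VALUES_AT = {4: "ngdav", 6: "ny"}


def expand_dotted(msd, values_at=VALUES_AT):
    """Replace every '.' in a dotted MSD by each permissible value."""
    dots = [i for i, ch in enumerate(msd) if ch == "."]
    expanded = []
    for combo in product(*(values_at[i] for i in dots)):
        chars = list(msd)
        for i, value in zip(dots, combo):
            chars[i] = value
        expanded.append("".join(chars))
    return expanded


for full_msd in expand_dotted("Ncms.n."):
    print("munte", "munte", full_msd)
```

The number of entries grows as the product of the inventory sizes of the dotted positions, which is the source of the redundancy mentioned under solution a).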

In the initial phase of the project we took this approach, but after a while, for computational reasons (both space requirements and lookup performance), we opted for a second solution, namely:

b) to further overload the significance of the special value '-' with the 'any' interpretation.

A very simple script checked the word-forms of each lemma for attributes instantiated with all the possible values (as in the example above). Then, after indexing these attributes for later recovery, the corresponding positions in the MSD were replaced by '-'. For similar reasons, we decided to use two special cases (direct and oblique) to deal with the nominative-accusative and genitive-dative syncretism, and to eliminate the neuter gender from the lexicon encoding. With duplicates eliminated, the word-form lexicon size decreased more than fourfold (from 1895837 to 440363).
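The original script is not given, so the following Python sketch only illustrates the collapsing step under stated assumptions: it receives the full-length MSDs of a single word-form/lemma pair, the value inventories are those of the 'munte' example, and the function name `collapse_any` is hypothetical. The indexing for later recovery and the direct/oblique merging are not covered.

```python
from collections import defaultdict


def collapse_any(msds, values_at):
    """Replace a position by '-' when MSDs that differ only at that
    position cover every permissible value of it."""
    current = set(msds)
    for pos, all_values in values_at.items():
        # Group MSDs that are identical once position `pos` is masked.
        groups = defaultdict(set)
        for msd in current:
            groups[msd[:pos] + "?" + msd[pos + 1:]].add(msd[pos])
        collapsed = set()
        for masked, seen in groups.items():
            if seen >= set(all_values):
                collapsed.add(masked.replace("?", "-"))
            else:
                collapsed.update(masked.replace("?", v) for v in seen)
        current = collapsed
    # Trailing hyphens are dropped, following the lexicon convention.
    return sorted(m.rstrip("-") for m in current)


expanded = [f"Ncms{case}n{clitic}" for case in "ngdav" for clitic in "ny"]
print(collapse_any(expanded, {4: "ngdav", 6: "ny"}))  # ['Ncms-n']
```

Under these assumptions, the ten expanded entries for 'munte' reduce to a single MSD, which shows where the reduction in lexicon size comes from.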
