Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical
Processing
That is, the positions in a string of characters are numbered
0, 1, 2, etc., and are used in the following way:
- the character at position 0 encodes the part of speech;
- each character at position 1, 2, ..., n encodes the value of one
attribute (person, gender, number, etc.), using its one-character code;
- if an attribute does not apply, the corresponding position in the
string contains the special marker '-' (hyphen).
By convention, trailing hyphens are not included in the lexical MSDs.
Such specifications provide a simple and relatively compact encoding,
and are similar in intent to the feature-structure encodings used in
unification-based grammar formalisms.
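The positional scheme described above can be sketched in a few lines of Python. This is only an illustration: the function names and the attribute list for nouns (its names and their order) are assumptions made for the example and are not taken from the MULTEXT-EAST tables.

```python
def encode_msd(pos, values):
    """Build an MSD: POS code at position 0, one character per attribute after it."""
    # None marks an attribute that does not apply; it is written as '-'.
    chars = [pos] + [v if v is not None else "-" for v in values]
    # By convention, trailing hyphens are dropped from lexical MSDs.
    return "".join(chars).rstrip("-")


def decode_msd(msd, attributes):
    """Map each attribute name to its one-character value ('-' if absent)."""
    # Restore the hyphens that were dropped from the end of the string.
    padded = msd[1:].ljust(len(attributes), "-")
    return {"pos": msd[0], **dict(zip(attributes, padded))}


# Hypothetical attribute order for nouns, for illustration only.
NOUN_ATTRIBUTES = ["type", "gender", "number", "case", "definiteness", "clitic"]

msd = encode_msd("N", ["c", "m", "s", None, None, None])
features = decode_msd(msd, NOUN_ATTRIBUTES)
```

Decoding pads the string back to the full attribute count, so a consumer always sees one value per attribute even though the stored MSD is shorter.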
When the word-form is identical to its lemma, an equal sign ('=')
is written in the lemma field of the entry.
The attributes, and especially their values, were chosen with only
word-level encoding in mind. This is why some values that are described
in grammar textbooks but presuppose compounding (such as compound tenses)
were not considered in the MULTEXT-EAST encoding.
Concerning the "applicability" of a certain feature/value, one has
to distinguish at least two cases:
- a specific attribute is not applicable at all to a given language
(as is the case in Romanian for the Referent_type or Owner_gender
attributes of pronouns);
- a specific combination of feature-values applying to a word-form
of a given category makes some attributes of that category irrelevant
in the case considered (for instance, tense for some non-predicative
verbs, or gender for first- and second-person personal pronouns).
Both cases were encoded by using a special attribute value, namely '-'.
A special problem was raised by token-indeterminable values. This is
the situation in which a given word-form, taken out of context, cannot
be assigned a unique value for a given attribute, but only a subset,
or even all, of the permissible values of that attribute.
A typical example of such an attribute is grammatical case. For the
indefinite forms of nouns and adjectives, nothing precise can be said
about case unless the preceding word(s) are considered. In such a
situation one can only say that all possible values of the attribute
(in this example, case) are applicable. Since the MULTEXT-EAST encoding
schema does not use a special value "any" (which in some encoding
proposals is denoted by a dot '.'), two possible solutions could
be envisaged:
a) to explicitly encode in the word-form lexicon all the combinations
resulting from expanding the would-be 'any' values. This solution is
'clean' but introduces a large degree of redundancy and ambiguity.
Consider the 'dotted' encoding of the word 'munte':
Expanding the two dots (see tables on the next pages), this entry
would have to be replaced by the following 10 fully expanded entries:
munte | munte | Ncmsnnn | munte | munte | Ncmsnny
munte | munte | Ncmsgnn | munte | munte | Ncmsgny
munte | munte | Ncmsdnn | munte | munte | Ncmsdny
munte | munte | Ncmsann | munte | munte | Ncmsany
munte | munte | Ncmsvnn | munte | munte | Ncmsvny
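The expansion in solution (a) amounts to a Cartesian product over the unspecified positions. The sketch below is an illustration under two assumptions: the dotted string `Ncms.n.` is inferred from the ten entries above (it is not quoted from the source), and the value sets for positions 4 and 6 are simply the characters that vary in those entries. `expand_dotted` and `POSITION_VALUES` are hypothetical names.

```python
from itertools import product

# Value sets read off the ten expanded entries above (an assumption;
# the authoritative inventories are in the MULTEXT-EAST tables).
POSITION_VALUES = {
    4: "ngdav",  # characters seen at position 4
    6: "ny",     # characters seen at position 6
}


def expand_dotted(msd, position_values):
    """Return every MSD obtained by replacing each '.' with an allowed value."""
    choices = [
        position_values[i] if ch == "." else ch
        for i, ch in enumerate(msd)
    ]
    # product() iterates the rightmost position fastest.
    return ["".join(combo) for combo in product(*choices)]


# "Ncms.n." is the inferred dotted form, not a quotation from the lexicon.
expanded = expand_dotted("Ncms.n.", POSITION_VALUES)
```

With five values at one position and two at the other, a single dotted entry becomes 5 x 2 = 10 entries, which is where the redundancy comes from.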
In the initial phase of the project we took this approach but, for
computational reasons (both space requirements and lookup performance),
we later opted for a second solution, namely:
b) to further overload the special value '-' with the 'any'
interpretation.
A very simple script checked the word-forms of each lemma for
attributes instantiated with all their possible values (as in the
example above). Then, after these attributes had been indexed for later
recovery, the corresponding positions in the MSD were replaced by '-'.
For similar reasons, we decided to use two special case values (direct
and oblique) to deal with the nominative-accusative and genitive-dative
syncretism, and to eliminate the neuter gender from the lexicon encoding.
With duplicates eliminated, the word-form lexicon shrank by a factor of
about 4.3 (from 1895837 to 440363 entries).
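The collapsing step can be sketched as follows. This is not the script used in the project, only a minimal illustration of the idea, applied to the MSDs of one word-form/lemma pair. It assumes untrimmed MSDs and a table of value inventories per position. The names `collapse_any` and `POSITION_VALUES` are hypothetical, and the inventories are read off the 'munte' example rather than taken from the MULTEXT-EAST tables.

```python
# Assumed value inventories per MSD position (from the 'munte' example).
POSITION_VALUES = {4: "ngdav", 6: "ny"}


def collapse_any(msds, position_values):
    """Replace a position by '-' wherever the MSDs cover all of its values.

    Returns the collapsed set of MSDs and the sorted list of positions that
    were collapsed (kept so the 'any' reading can be recovered later).
    """
    current = set(msds)
    collapsed = set()
    for pos, values in sorted(position_values.items()):
        # Group MSDs that differ only at this position.
        groups = {}
        for msd in current:
            key = (msd[:pos], msd[pos + 1:])
            groups.setdefault(key, set()).add(msd[pos:pos + 1])
        result = set()
        for (head, tail), seen in groups.items():
            if seen >= set(values):
                # Every permissible value occurs: overload '-' as 'any'.
                result.add(head + "-" + tail)
                collapsed.add(pos)
            else:
                result.update(head + v + tail for v in seen)
        current = result
    return current, sorted(collapsed)


munte = {"Ncms" + c + "n" + k for c in "ngdav" for k in "ny"}
merged, positions = collapse_any(munte, POSITION_VALUES)
```

In this sketch the ten expanded entries for 'munte' merge into a single MSD, while a word-form that covers only some of the values at a position is left untouched.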