Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical Processing
In other words, the figures in Table 1.2 take homography¹ into account (more precisely, word-form homography), while those in Table 1 do not. The Lemma column gives the number of distinct lemmas in the lexicon, that is, after eliminating any duplicates caused by lemma homography. The distribution of lemmas over POS is given in Table 1.3. Again, summing the figures in this table yields a higher figure than in Table 1. This is also due to homography, this time at the lemma level (as with duce, a, dar, și, etc.). However, since lemma homography is much less frequent than word-form homography, the difference is much smaller in this case (2098). As Table 1.3 shows, most lemmas are nouns (18086) and adjectives (11052).
POS | N | V | A | P | D | M | R | T | S | C | Q | I | Y | X |
Number of lemmas | 18086 | 4247 | 11052 | 80 | 36 | 115 | 1257 | 7 | 145 | 67 | 7 | 196 | 11 | 51 |
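The comparison with Table 1 can be checked by direct arithmetic. The short Python sketch below sums the per-POS lemma counts of Table 1.3 and subtracts the reported difference of 2098. The variable names are illustrative, and the resulting distinct-lemma count is only what these two figures imply, since Table 1 itself is not reproduced here.

```python
# Per-POS lemma counts copied from Table 1.3.
lemmas_per_pos = {
    "N": 18086, "V": 4247, "A": 11052, "P": 80, "D": 36, "M": 115, "R": 1257,
    "T": 7, "S": 145, "C": 67, "Q": 7, "I": 196, "Y": 11, "X": 51,
}

# A lemma form that belongs to k POSes is counted k times in this sum.
total_with_homographs = sum(lemmas_per_pos.values())

# Difference from the Lemma figure of Table 1, as reported in the text.
reported_difference = 2098

# Implied number of distinct lemma forms (an inference, not a quoted figure).
implied_distinct_lemmas = total_with_homographs - reported_difference

print(total_with_homographs)    # 35357
print(implied_distinct_lemmas)  # 33259
```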
The "=" field in Table 1 gives the number of entries that are themselves lemmas (i.e., have "=" in the lemma field of their entry), without eliminating the lemma homographs. Therefore, the difference between the "=" field and the Lemma field gives an estimate of the number of homographic lemmas².
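To see why this difference is only an estimate, consider the minimal Python sketch below. It uses a toy lexicon, not the actual one: dar as N and C comes from the text, while "xyz" is a made-up lemma placed in three POSes, and the sketch assumes one "=" entry per lemma and POS.

```python
from collections import defaultdict

# Toy entries (word-form, lemma field, POS); "=" marks an entry that is itself a lemma.
# "xyz" is hypothetical and exists only to show a lemma with three POSes.
entries = [
    ("dar", "=", "N"),
    ("dar", "=", "C"),
    ("munte", "=", "N"),
    ("xyz", "=", "N"),
    ("xyz", "=", "A"),
    ("xyz", "=", "V"),
]

# Collect the POSes under which each lemma form occurs.
pos_by_lemma = defaultdict(set)
for word_form, lemma_field, pos in entries:
    if lemma_field == "=":
        pos_by_lemma[word_form].add(pos)

equals_count = sum(1 for _, lemma_field, _ in entries if lemma_field == "=")
distinct_lemmas = len(pos_by_lemma)
estimate = equals_count - distinct_lemmas
actual_homographic = sum(1 for poses in pos_by_lemma.values() if len(poses) > 1)

print(estimate, actual_homographic)  # 3 2
```

The estimate exceeds the true count here because "xyz" contributes two surplus entries rather than one, which is the caveat stated in footnote 2.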
The "MSDs" field gives the total number of distinct MSDs used in the encoding of the lexicon stock. The distribution of the MSDs over parts of speech is given in Table 1.4. As one can see, the highest numbers of MSDs are defined for Pronouns (138), Verbs (135) and Determiners (114).
POS | N | V | A | P | D | M | R | T | S | C | Q | I | Y | X |
Number of MSDs | 57 | 135 | 60 | 138 | 114 | 85 | 11 | 33 | 7 | 8 | 7 | 1 | 17 | 1 |
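The ranking quoted above can be reproduced from the table with a few lines of Python. This is only a sketch with illustrative names; the total it prints equals the number of distinct MSDs only under the assumption that each MSD is counted under exactly one POS.

```python
# Per-POS MSD counts copied from Table 1.4.
msds_per_pos = {
    "N": 57, "V": 135, "A": 60, "P": 138, "D": 114, "M": 85, "R": 11,
    "T": 33, "S": 7, "C": 8, "Q": 7, "I": 1, "Y": 17, "X": 1,
}

# Rank the POS categories by the number of MSDs defined for them.
ranked = sorted(msds_per_pos.items(), key=lambda item: item[1], reverse=True)
top_three = ranked[:3]

print(top_three)                   # [('P', 138), ('V', 135), ('D', 114)]
print(sum(msds_per_pos.values()))  # 674
```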
While the AMB_MSD column in Table 1 shows the number of distinct MSD-ambiguity classes, Table 1.5 shows the number of words that are unambiguous³, two-way ambiguous, etc. The ambiguity is calculated considering word-form homography. The previously exemplified word-form vin has ambiguity level 4.
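The three notions of homography defined in footnote 1 can be made concrete with a small Python sketch. The entries below are the vin and dar examples from that footnote; the function name is illustrative, and taking the first character of an MSD as its POS is an assumption that is consistent with those examples.

```python
from collections import defaultdict

# (word-form, lemma, MSD) triples for "vin" and "dar", as given in footnote 1.
entries = [
    ("vin", "vin", "Ncms-n"),
    ("vin", "veni", "Vmip1s"),
    ("vin", "veni", "Vmip3p"),
    ("vin", "veni", "Vmsp1s"),
    ("dar", "dar", "Ncms-n"),
    ("dar", "dar", "Ccssp"),
]

msds_by_form = defaultdict(set)
pos_by_form = defaultdict(set)
pos_by_lemma = defaultdict(set)
for form, lemma, msd in entries:
    pos = msd[0]  # assumption: the first character of an MSD encodes the POS
    msds_by_form[form].add(msd)
    pos_by_form[form].add(pos)
    pos_by_lemma[lemma].add(pos)

def homography_levels(word):
    """Return (word-form, POS, lemma) homography levels for a string."""
    return (
        len(msds_by_form.get(word, ())),
        len(pos_by_form.get(word, ())),
        len(pos_by_lemma.get(word, ())),
    )

print(homography_levels("vin"))  # (4, 2, 1)
print(homography_levels("dar"))  # (2, 2, 2)
```

A lemma level of 1 means the string is not lemma homographic, which is the case for vin; its word-form level of 4 is the ambiguity level used in Table 1.5.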
The hapaxes labeled with 8 MSDs (2) or 9 MSDs (2) are due to quite rare superpositions of paradigmatic classes of different word-forms. For instance, the word urâți is a 9-way word-form homograph in our lexicon. It can be an N, an A or a V. This word-form could have as its lemma one of the following: urât/ugly (N and A), urâți/to make things ugly (V), urî/to hate (V). Similarly, the word vii can be an N (vie/vineyard), an A (viu/alive), a V (veni/to come) or a numeral written in Roman style (șapte/seven).
Ambiguity level | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Number of words | 284549 | 46635 | 10183 | 2570 | 2864 | 725 | 429 | 3 | 2 |
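The distribution in Table 1.5 can be summarized with a short Python sketch. It assumes the nine counts correspond, in order, to ambiguity levels 1 through 9, and the names are illustrative. Note that these are proportions of lexicon word-forms, not of corpus tokens.

```python
# Number of word-forms per ambiguity level, read from Table 1.5
# (assumption: the counts map in order to levels 1..9).
words_per_level = {
    1: 284549, 2: 46635, 3: 10183, 4: 2570, 5: 2864,
    6: 725, 7: 429, 8: 3, 9: 2,
}

total_words = sum(words_per_level.values())

# Share of word-forms carrying exactly one MSD.
unambiguous_share = words_per_level[1] / total_words

print(total_words)                 # 347960
print(f"{unambiguous_share:.1%}")  # 81.8%
```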
¹ We consider here a restricted definition of homography. We distinguish between word-form homography, meaning identical word-forms tagged with different MSDs; POS homography, meaning identical word-forms tagged with different POSes; and lemma homography, meaning identical lemma forms with different POSes (a restricted case of the second). For instance, the word vin is four-way word-form homographic (Ncms-n, Vmip1s, Vmip3p, Vmsp1s) and two-way POS homographic (N, V), but it is not lemma homographic (the lemmas veni (V) and vin (N) are different). The word dar is two-way word-form homographic (Ncms-n, Ccssp), two-way POS homographic (N, C) and two-way lemma homographic (the same lemma for both the N and C interpretations).
² This difference would equal the number of homographic lemmas only if each such lemma belonged to exactly two POSes.
³ Recall the earlier discussion on the encoding of 'any' values. For instance, the entry munte munte Ncms-n is not MSD-ambiguous unless the case value is considered. With token-indeterminable values fully expanded, the number of unambiguous words becomes much smaller. To be precise, the proportion of unambiguous tokens in "1984" would have been 22% instead of 57%, and in "Republic" 23% instead of 53%.