Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical Processing


In other words, the figures in Table 1.2 take homography1 into account (more precisely, word-form homography), while those in Table 1 do not. The Lemma column gives the number of distinct lemmas in the lexicon, that is, after eliminating any duplicates that arise from lemma homography. The distribution of the lemmas over the parts of speech is given in Table 1.3. Again, summing the figures in this table yields a higher total than in Table 1. This is also due to homography, this time at the lemma level (as with duce, a, dar, și, etc.). However, since lemma homography is much less frequent than word-form homography, the difference is much smaller in this case (2098). As Table 1.3 shows, the largest lemma classes are nouns (18086) and adjectives (11052).

Table 1.3. The distribution of the lemmas over the parts of speech
POS N V A P D M R T S C Q I Y X
Number of lemmas 18086 4247 11052 803611512577 1456771961151

The "=" field in Table 1 provides the number of entries which are themselves lemmas (i.e., have "=" in the lemma field of their entry), without eliminating the lemma homographs. Therefore, the difference between the "=" and the Lemma fields gives an estimate of the number of homographic lemmas2.
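The estimate described above can be illustrated with a minimal Python sketch. It assumes that lexicon entries are (word-form, lemma, MSD) triples with "=" marking an entry that is its own lemma, and that every lemma has such an entry; the toy entries and the function name lemma_homography_estimate are illustrative only and are not taken from the actual lexicon tools.

```python
# Toy entries in (word_form, lemma, MSD) form; "=" means the word-form is its own lemma.
# This is an illustrative sample, not the real lexicon.
ENTRIES = [
    ("dar", "=", "Ncms-n"),
    ("dar", "=", "Ccssp"),
    ("munte", "=", "Ncms-n"),
    ("vin", "=", "Ncms-n"),
    ("vin", "veni", "Vmip1s"),  # inflected form: not counted in the "=" field
]


def lemma_homography_estimate(entries):
    """Return the ("=" count) minus (distinct lemma count) difference."""
    # Entries that are themselves lemmas, with homographs kept apart.
    self_lemma_entries = [form for form, lemma, _msd in entries if lemma == "="]
    # Distinct lemma forms, with homographs collapsed into one.
    distinct_lemmas = set(self_lemma_entries)
    return len(self_lemma_entries) - len(distinct_lemmas)


print(lemma_homography_estimate(ENTRIES))
```

On this sample the difference is 1, contributed by dar, which is a lemma for both N and C. A lemma shared by three parts of speech would contribute 2, which is why the difference is only an estimate.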

The "MSDs" field gives the total number of distinct MSDs used in the encoding of the lexicon stock. The distribution of the MSDs over parts of speech is given in Table 1.4. As one can see, the highest numbers of MSDs are defined for Pronouns (138), Verbs (135) and Determiners (114).

Table 1.4. MSD distribution over the parts of speech
POS N V A P D M R T S C Q I Y X
Number of MSDs 57 135 60 138 114 85113378 71171

The last two columns in Table 1 (AMB_POS and AMB_MSD) provide information about the number of ambiguity classification clusters. An ambiguity classification cluster lists the multiple ways in which a homographic word-form can be classified. If the classification is based on the part of speech (POS), the cluster is called a POS-ambiguity cluster (AMB_POS) and it is the list of parts of speech that can be associated with a given word-form. The AMB_POS for the word-form vin above is: (N, V). If the classification is based on MSD, the cluster is called an MSD-ambiguity cluster (AMB_MSD) and it is the list of all MSDs associated with a given word-form. Considering again the word-form vin, its AMB_MSD is: (Ncms-n, Vmip1s, Vmip3p, Vmsp1s). It is worth mentioning that AMB_POS does not change if the '-' values in an MSD, which are interpretable as 'any' values, are expanded. Obviously, this is not the case for AMB_MSD. The number of ambiguity classes (defined on the basis of either POS or MSD) is a key figure in estimating the space resources needed to construct a statistical language model (such as an HMM) for the morpho-syntactic disambiguation of natural language.
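As a rough illustration of how such clusters could be derived, the following Python sketch groups the MSDs of each word-form and reads the POS off the first character of the MSD. The (word-form, lemma, MSD) triple format, the toy entries and the function names are assumptions made for the example, not the procedure actually used to build Table 1.

```python
from collections import defaultdict

# Toy entries in (word_form, lemma, MSD) form; illustrative sample only.
ENTRIES = [
    ("vin", "=", "Ncms-n"),
    ("vin", "veni", "Vmip1s"),
    ("vin", "veni", "Vmip3p"),
    ("vin", "veni", "Vmsp1s"),
    ("dar", "=", "Ncms-n"),
    ("dar", "=", "Ccssp"),
    ("munte", "=", "Ncms-n"),
]


def ambiguity_clusters(entries):
    """Map each word-form to its POS-ambiguity and MSD-ambiguity cluster."""
    msds_by_form = defaultdict(set)
    for form, _lemma, msd in entries:
        msds_by_form[form].add(msd)
    # Assumption: the first character of an MSD encodes the part of speech.
    amb_pos = {f: tuple(sorted({m[0] for m in ms})) for f, ms in msds_by_form.items()}
    amb_msd = {f: tuple(sorted(ms)) for f, ms in msds_by_form.items()}
    return amb_pos, amb_msd


def count_classes(clusters):
    """Number of distinct ambiguity classes, singleton classes included."""
    return len(set(clusters.values()))


amb_pos, amb_msd = ambiguity_clusters(ENTRIES)
print(amb_pos["vin"], amb_msd["vin"])
print(count_classes(amb_pos), count_classes(amb_msd))
```

Clusters are stored as sorted tuples so that two word-forms with the same set of tags fall into the same class regardless of the order of their entries. Whether unambiguous word-forms are counted as classes of their own is a design choice of this sketch.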

While the AMB_MSD column in Table 1 shows the number of distinct MSD-ambiguity classes, Table 1.5 shows the number of words which are unambiguous3, two-way ambiguous, etc. The ambiguity is calculated on the basis of word-form homography. The previously exemplified word-form vin has ambiguity level 4.

Table 1.5. Word-forms grouped according to the MSD ambiguity level
Ambiguity level 1 2 3 4 5 6 7 8 9
Number of words 28454946635 10183257028647254293 2

The hapaxes labeled with 8 MSDs (2) or 9 MSDs (2) are due to quite rare superpositions of the paradigmatic classes of different word-forms. For instance, the word urâți is a 9-way word-form homograph in our lexicon. It can be an N, an A or a V. This word-form can have as its lemma one of the following: urât/ugly (N and A), urâți/to make things ugly (V), urî/to hate (V). Similarly, the word vii can be an N (vie/vineyard), an A (viu/alive), a V (veni/to come) or a numeral written in Roman style (șapte/seven).
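A grouping of word-forms by MSD ambiguity level, in the spirit of Table 1.5, could be computed along the following lines. This is only a sketch under the assumption that entries are (word-form, lemma, MSD) triples and that '-' values are left unexpanded; the toy entries and the function name ambiguity_histogram are hypothetical.

```python
from collections import Counter, defaultdict

# Toy entries in (word_form, lemma, MSD) form; illustrative sample only.
ENTRIES = [
    ("vin", "=", "Ncms-n"),
    ("vin", "veni", "Vmip1s"),
    ("vin", "veni", "Vmip3p"),
    ("vin", "veni", "Vmsp1s"),
    ("dar", "=", "Ncms-n"),
    ("dar", "=", "Ccssp"),
    ("munte", "=", "Ncms-n"),
]


def ambiguity_histogram(entries):
    """Count word-forms per ambiguity level (number of distinct MSDs)."""
    msds_by_form = defaultdict(set)
    for form, _lemma, msd in entries:
        msds_by_form[form].add(msd)
    return Counter(len(msds) for msds in msds_by_form.values())


histogram = ambiguity_histogram(ENTRIES)
for level in sorted(histogram):
    print(level, histogram[level])
```

In this sample munte is unambiguous (level 1), dar is two-way ambiguous and vin has ambiguity level 4, matching the example discussed in the text.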


1 We consider here a restricted definition of homography. We distinguish between word-form homography, meaning identical word-forms tagged with different MSDs; POS homography, meaning identical word-forms tagged with different POSes; and lemma homography, meaning identical lemma forms with different POSes (this is a restricted case of the second one). For instance, the word vin is four-way word-form homographic (Ncms-n, Vmip1s, Vmip3p, Vmsp1s) and two-way POS homographic (N, V), but it is not lemma homographic (the lemmas veni (V) and vin (N) are different). The word dar is two-way word-form homographic (Ncms-n, Ccssp), two-way POS homographic (N, C) and two-way lemma homographic (the same lemma for both the N and the C interpretations).
2 It would be exactly the number of homographic lemmas only if each such lemma belonged to exactly two POSes.
3 Recall the previous discussion on the encoding of 'any' values. For instance, the entry munte munte Ncms-n is not MSD-ambiguous unless the case value is considered. With token-indeterminable values fully expanded, the number of unambiguous words becomes much smaller. To be precise, the proportion of unambiguous tokens in "1984" would have been 22% instead of 57%, and in "Republic" it would have been 23% instead of 53%.
