Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical Processing




Considering the distribution of lemmas over the parts of speech reveals some interesting facts. For instance, despite the quite productive inflectional potential of nouns, verbs and adjectives, the average number of inflected word-forms per lemma actually used in the two texts is very small: fewer than 2 Noun or Adjective word-forms per Noun or Adjective lemma, and fewer than 4 Verb word-forms per Verb lemma. Table 2.3 also outlines the lexical diversity of the two books, with "1984" having broader lexical coverage in spite of its smaller number of word occurrences.

Table 2.3. The distribution of the lemmas over the parts of speech
POS N V A P D M R T S C Q I Y X
'1984' lemmas 3441 1370 1514 63 44 66 659 10 109 55 7 34 16 42
'Republic' lemmas 2324 996 940 62 47 65 422 6 93 46 5 11 4 1
Common lemmas 1015 656 451 43 36 32 277 5 70 37 5 5 0 1
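The lexical-diversity remark can be checked by summing the rows of Table 2.3. The following is only a minimal sketch: the variable names are illustrative, and the totals count (lemma, POS) pairs, since a lemma listed under two parts of speech is counted once per part of speech.

```python
# Rows of Table 2.3, in the column order N V A P D M R T S C Q I Y X.
POS = "N V A P D M R T S C Q I Y X".split()
lemmas_1984 = [3441, 1370, 1514, 63, 44, 66, 659, 10, 109, 55, 7, 34, 16, 42]
lemmas_republic = [2324, 996, 940, 62, 47, 65, 422, 6, 93, 46, 5, 11, 4, 1]
lemmas_common = [1015, 656, 451, 43, 36, 32, 277, 5, 70, 37, 5, 5, 0, 1]

# Totals per text and for the shared part of the vocabulary.
total_1984 = sum(lemmas_1984)
total_republic = sum(lemmas_republic)
total_common = sum(lemmas_common)

# Lemmas attested in at least one text: shared lemmas are counted once.
total_union = total_1984 + total_republic - total_common

print(total_1984, total_republic, total_common, total_union)
```

Under this reading, "1984" contributes noticeably more lemmas than "Republic", which is consistent with the broader lexical coverage noted above.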

The figures in Table 2.4 provide arguments for the adequacy of the two texts with respect to the morpho-syntactic descriptions of the lexical stock. By comparing the data in Table 1.4 (lexicon) with the data in Table 2.4, one can see that out of the 57 MSDs defined for Nouns, the two texts provide contexts for 47 MSDs (39+35-27=47, the MSDs shared by both texts being counted only once). For the verbal MSDs, out of the 135 defined in the lexicon, the two texts cover 99 (92+74-67=99). With function words, the coverage is practically 100%.

Table 2.4. MSD distribution over the parts of speech
POS N V A P D M R T S C Q I Y X
'1984' MSDs 39 92 27 102 86 34 11 26 7 8 6 1 4 1
'Republic' MSDs 35 74 20 105 83 32 7 21 7 6 6 1 2 1
Common MSDs 27 67 17 85 73 25 7 20 7 6 6 1 2 1
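The coverage figures quoted in the text follow from inclusion-exclusion over the rows of Table 2.4. The sketch below recomputes them; the function and variable names are illustrative, and the lexicon totals (57 noun MSDs, 135 verb MSDs) are the only ones given in the text, so coverage ratios are computed for those two categories only.

```python
# Rows of Table 2.4, in the column order N V A P D M R T S C Q I Y X.
POS = "N V A P D M R T S C Q I Y X".split()
msds_1984 = [39, 92, 27, 102, 86, 34, 11, 26, 7, 8, 6, 1, 4, 1]
msds_republic = [35, 74, 20, 105, 83, 32, 7, 21, 7, 6, 6, 1, 2, 1]
msds_common = [27, 67, 17, 85, 73, 25, 7, 20, 7, 6, 6, 1, 2, 1]

def covered(in_a, in_b, in_both):
    """Size of the union of two sets, given their sizes and their overlap."""
    return in_a + in_b - in_both

# MSDs attested in at least one of the two texts, per part of speech.
covered_by_pos = {
    pos: covered(a, b, c)
    for pos, a, b, c in zip(POS, msds_1984, msds_republic, msds_common)
}

# Lexicon totals stated in the text (Table 1.4 itself is not reproduced here).
lexicon_totals = {"N": 57, "V": 135}
coverage = {pos: covered_by_pos[pos] / total for pos, total in lexicon_totals.items()}

print(covered_by_pos["N"], covered_by_pos["V"])
print({pos: round(100 * share, 1) for pos, share in coverage.items()})
```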

Tables 2.5 and 2.6 show the ambiguity level of the word-forms, computed first by counting the distinct MSDs that lexical lookup attaches to a word-form (Table 2.5) and second by counting the distinct lemmas to which a word-form may be attributed (Table 2.6). For instance, the word-form "urâți", which appears in both texts, is 9-way MSD ambiguous. However, it is only 4-way ambiguous when lemma ambiguity is considered (there are four lemmas to which it may be attributed). Similarly, the word-form "VII", appearing in "Republic" (as such, that is, in upper-case letters), is 9-way MSD ambiguous but only 4-way lemma ambiguous. Since this might not be obvious, the MSDs are Afpfp-n, Afpfson, Afpmp-n, Mc-p-r, Mo-s-r, Ncfp-n, Ncfson, Vmip2s and Vmsp2s, and the lemmas are "șapte/seven", "vie/vineyard", "viu/alive" and "veni/come".
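The two ambiguity measures can be illustrated on the "VII" example. In the sketch below, the lexical lookup result is modelled as a list of (MSD, lemma) pairs. The MSDs and lemmas are those listed in the text, but the pairing of each MSD with a lemma is an assumption inferred from the part-of-speech letter of the MSD (A with "viu", M with "șapte", N with "vie", V with "veni"); the function name is illustrative.

```python
# Hypothetical lookup result for the word-form "VII".
# The MSD/lemma pairing is an assumption, not given explicitly in the text.
analyses_vii = [
    ("Afpfp-n", "viu"),
    ("Afpfson", "viu"),
    ("Afpmp-n", "viu"),
    ("Mc-p-r", "șapte"),
    ("Mo-s-r", "șapte"),
    ("Ncfp-n", "vie"),
    ("Ncfson", "vie"),
    ("Vmip2s", "veni"),
    ("Vmsp2s", "veni"),
]

def ambiguity_levels(analyses):
    """Return (MSD ambiguity, lemma ambiguity) for one word-form."""
    distinct_msds = {msd for msd, _ in analyses}
    distinct_lemmas = {lemma for _, lemma in analyses}
    return len(distinct_msds), len(distinct_lemmas)

msd_level, lemma_level = ambiguity_levels(analyses_vii)
print(msd_level, lemma_level)
```

Whatever the exact pairing, the counts depend only on the number of distinct MSDs and distinct lemmas, so the word-form falls in column 9 of Table 2.5 and in column 4 of Table 2.6.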

Table 2.5. Word-forms grouped according to the MSD ambiguity level
Ambiguity level 1 2 3 4 5 6 7 8 9
'1984' wordforms 9322 2703 1425 301 189 56 44 0 1
'Republic' wordforms 6191 2357 1312 243 174 49 31 0 2
Common wordforms 2237 895 641 133 81 24 15 0 1

Table 2.6. Word-forms grouped according to the lemma ambiguity level
Ambiguity level 1 2 3 4
'1984' wordforms 12432 1524 77 6
'Republic' wordforms 9121 1153 75 7
Common wordforms 3384 580 57 4

If we distinguish only between ambiguous word-forms (ambiguity level greater than 1) and unambiguous word-forms (ambiguity level equal to 1; see footnote 4), the two texts show slightly different ambiguity figures: 33.5% ambiguous words in "1984" and 40.2% in "Republic". The explanation lies in the higher number of definite nouns and adjectives used in "1984" as compared to "Republic": the definite forms of nouns, adjectives and, to some extent, numerals are unambiguous in the vast majority of cases. Judged in terms of lemma ambiguity (Table 2.6), the two texts exhibit practically the same ambiguity level: 11.5% in "1984" and 12% in "Republic".
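The shares of ambiguous word-forms can be recomputed from the rows of Tables 2.5 and 2.6 as a sanity check. This is a minimal sketch with illustrative names. Recomputed this way, the values come out at roughly 33.6% and 40.2% for MSD ambiguity and roughly 11.4% and 11.9% for lemma ambiguity, that is, within about a tenth of a percentage point of the figures quoted in the text; the small differences may be due to rounding or to a slightly different counting base.

```python
# Word-form counts per ambiguity level (level 1 first), from Tables 2.5 and 2.6.
msd_levels = {
    "1984": [9322, 2703, 1425, 301, 189, 56, 44, 0, 1],
    "Republic": [6191, 2357, 1312, 243, 174, 49, 31, 0, 2],
}
lemma_levels = {
    "1984": [12432, 1524, 77, 6],
    "Republic": [9121, 1153, 75, 7],
}

def ambiguous_share(counts):
    """Percentage of word-forms whose ambiguity level is greater than 1."""
    total = sum(counts)
    unambiguous = counts[0]  # first entry is ambiguity level 1
    return 100.0 * (total - unambiguous) / total

msd_share = {text: ambiguous_share(c) for text, c in msd_levels.items()}
lemma_share = {text: ambiguous_share(c) for text, c in lemma_levels.items()}

for text in msd_levels:
    print(text, round(msd_share[text], 1), round(lemma_share[text], 1))
```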

The next two tables provide further evidence on inter- and intra-categorial ambiguity in Romanian. Intra-categorial ambiguity ("horizontal" homography) is much more significant than inter-categorial ambiguity ("vertical" homography). This is in sharp contrast with English and might be considered a characteristic of highly inflectional languages.

Contrast, for instance, the horizontal ambiguity of verbs in Table 2.7 (561, 555, 459) with their vertical ambiguity in Table 2.8 (35, 34, 33). A natural expectation is that, in highly inflectional languages, horizontal ambiguity is much more difficult to resolve than vertical ambiguity.

Table 2.7. MSD distribution over the MSD-ambiguity classes
POS-msd N-msd V-msd A-msd P-msd D-msd M-msd R-msd T-msd S-msd C-msd Q-msd I-msd Y-msd X-msd
'1984' MSD-amb. classes 317 561 246 140 87 41 89 15 21 21 9 18 14 10
'Republic' MSD-amb. classes 289 555 211 142 84 47 87 14 20 19 10 14 14 7
Common MSD-amb. classes 229 459 169 133 79 29 77 14 19 19 9 14 11 7

Table 2.8. POS distribution over the POS-ambiguity-classes
POS N V A P D M R T S C Q I Y X
'1984' POS-amb. classes 28 35 17 27 11 12 29 12 17 12 8 12 11 9
'Republic' POS-amb. classes 25 34 17 26 10 14 27 11 16 10 8 11 10 6
Common POS-amb. classes 24 33 16 26 10 12 27 11 16 10 8 11 9 6
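The contrast between horizontal and vertical ambiguity can be made explicit by dividing, for each part of speech, the number of MSD-ambiguity classes (Table 2.7) by the number of POS-ambiguity classes (Table 2.8). This ratio is only an illustrative indicator introduced here, not a measure defined in the text, and the names are hypothetical.

```python
# Rows of Tables 2.7 and 2.8, in the column order N V A P D M R T S C Q I Y X.
POS = "N V A P D M R T S C Q I Y X".split()
msd_classes = {
    "1984": [317, 561, 246, 140, 87, 41, 89, 15, 21, 21, 9, 18, 14, 10],
    "Republic": [289, 555, 211, 142, 84, 47, 87, 14, 20, 19, 10, 14, 14, 7],
    "Common": [229, 459, 169, 133, 79, 29, 77, 14, 19, 19, 9, 14, 11, 7],
}
pos_classes = {
    "1984": [28, 35, 17, 27, 11, 12, 29, 12, 17, 12, 8, 12, 11, 9],
    "Republic": [25, 34, 17, 26, 10, 14, 27, 11, 16, 10, 8, 11, 10, 6],
    "Common": [24, 33, 16, 26, 10, 12, 27, 11, 16, 10, 8, 11, 9, 6],
}

def class_ratio(row):
    """MSD-ambiguity classes per POS-ambiguity class, for each part of speech."""
    return {
        pos: m / p
        for pos, m, p in zip(POS, msd_classes[row], pos_classes[row])
    }

# Verbs are the example singled out in the text.
verb_ratios = {row: class_ratio(row)["V"] for row in msd_classes}
print({row: round(r, 1) for row, r in verb_ratios.items()})
```

For verbs the ratio is well above 10 in all three rows, while for the closed classes it stays close to 1.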



4 Remember the discussion in section 2 on ambiguity and the conventions we used in encoding the "any" values, the case syncretism, etc.
