Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical Processing
POS | N | V | A | P | D | M | R | T | S | C | Q | I | Y | X |
'1984' lemmas | 3441 | 1370 | 1514 | 63 | 44 | 66 | 659 | 10 | 109 | 55 | 7 | 34 | 16 | 42 |
'Republic' lemmas | 2324 | 996 | 940 | 62 | 47 | 65 | 422 | 6 | 93 | 46 | 5 | 11 | 4 | 1 |
Common lemmas | 1015 | 656 | 451 | 43 | 36 | 32 | 277 | 5 | 70 | 37 | 5 | 5 | 0 | 1 |
The figures in Table 2.4 support the adequacy of the two texts with respect to the morpho-syntactic descriptions of the lexical stock. Comparing the data in Table 1.4 (lexicon) with the data in Table 2.4, one can see that, out of the 57 MSDs defined for nouns, the two texts provide contexts for 47 MSDs (39+35-27=47). For the verbal MSDs, out of the 135 defined in the lexicon, the two texts cover 99 (92+74-67=99). For function words, the coverage is practically 100%.
POS | N | V | A | P | D | M | R | T | S | C | Q | I | Y | X |
'1984' MSDs | 39 | 92 | 27 | 102 | 86 | 34 | 11 | 26 | 7 | 8 | 6 | 1 | 4 | 1 |
'Republic' MSDs | 35 | 74 | 20 | 105 | 83 | 32 | 7 | 21 | 7 | 6 | 6 | 1 | 2 | 1 |
Common MSDs | 27 | 67 | 17 | 85 | 73 | 25 | 7 | 20 | 7 | 6 | 6 | 1 | 2 | 1 |
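The coverage arithmetic quoted for nouns and verbs is an instance of inclusion-exclusion: the MSDs attested in at least one text are those of "1984" plus those of "Republic" minus the ones counted twice. The following is only a minimal sketch of that calculation; the dictionary layout and the function name `covered_msds` are illustrative choices, while the numbers are taken from the table above and from the lexicon totals quoted in the text (57 and 135).

```python
# Counts for the N and V columns, copied from the MSD table above.
# "lexicon" holds the totals quoted in the text (57 noun MSDs, 135 verb MSDs).
msd_counts = {
    "N": {"1984": 39, "Republic": 35, "common": 27, "lexicon": 57},
    "V": {"1984": 92, "Republic": 74, "common": 67, "lexicon": 135},
}


def covered_msds(row):
    """MSDs attested in at least one text: |A u B| = |A| + |B| - |A n B|."""
    return row["1984"] + row["Republic"] - row["common"]


for pos, row in msd_counts.items():
    union = covered_msds(row)
    share = 100 * union / row["lexicon"]
    print(f"{pos}: {union} of {row['lexicon']} MSDs covered ({share:.1f}%)")
```

The same function could be applied to the remaining columns, provided the corresponding lexicon totals from Table 1.4 are available.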
Tables 2.5 and 2.6 show the ambiguity level of the word-forms, computed first by counting the distinct MSDs that lexical lookup attaches to a word-form (Table 2.5) and second by counting the distinct lemmas to which a word-form may be attributed (Table 2.6). For instance, the word-form "urâți", which appears in both texts, is 9-way MSD ambiguous. However, it is only 4-way ambiguous when lemma ambiguity is considered (there are four lemmas to which the word-form "urâți" may be attributed). Similarly, the word-form "VII", appearing in "Republic" (as such, that is, in upper-case letters), is 9-way MSD ambiguous but only 4-way lemma ambiguous. Since this might not be obvious, here are the MSDs and lemmas: Afpfp-n, Afpfson, Afpmp-n, Mc-p-r, Mo-s-r, Ncfp-n, Ncfson, Vmip2s, Vmsp2s and "șapte/seven", "vie/vineyard", "viu/alive", "veni/come".
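The two ambiguity measures can be illustrated on the "VII" example. The sketch below counts distinct MSDs and distinct lemmas over a list of (lemma, MSD) analyses. The MSDs and lemmas are the ones listed in the text, but the pairing of each MSD with a lemma is an assumption made for illustration (adjectival MSDs with "viu", numeral MSDs with "șapte", noun MSDs with "vie", verbal MSDs with "veni"), and the function name `ambiguity_levels` is hypothetical.

```python
# (lemma, MSD) analyses for the word-form "VII".
# ASSUMPTION: the text lists the MSDs and the lemmas separately; the pairing
# below (by part of speech) is an illustrative guess, not taken from the lexicon.
vii_analyses = [
    ("viu", "Afpfp-n"), ("viu", "Afpfson"), ("viu", "Afpmp-n"),
    ("șapte", "Mc-p-r"), ("șapte", "Mo-s-r"),
    ("vie", "Ncfp-n"), ("vie", "Ncfson"),
    ("veni", "Vmip2s"), ("veni", "Vmsp2s"),
]


def ambiguity_levels(analyses):
    """Return (MSD ambiguity, lemma ambiguity) for one word-form."""
    msds = {msd for _, msd in analyses}
    lemmas = {lemma for lemma, _ in analyses}
    return len(msds), len(lemmas)


msd_level, lemma_level = ambiguity_levels(vii_analyses)
print(msd_level, lemma_level)  # 9 4
```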
Ambiguity level | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
'1984' wordforms | 9322 | 2703 | 1425 | 301 | 189 | 56 | 44 | 0 | 1 |
'Republic' wordforms | 6191 | 2357 | 1312 | 243 | 174 | 49 | 31 | 0 | 2 |
Common wordforms | 2237 | 895 | 641 | 133 | 81 | 24 | 15 | 0 | 1 |
Ambiguity level | 1 | 2 | 3 | 4 |
'1984' wordforms | 12432 | 1524 | 77 | 6 |
'Republic' wordforms | 9121 | 1153 | 75 | 7 |
Common wordforms | 3384 | 580 | 57 | 4 |
If we distinguish only between ambiguous word-forms (ambiguity level greater than 1) and unambiguous word-forms⁴ (ambiguity level equal to 1), the two texts reveal slightly different MSD ambiguity figures (Table 2.5): 33.5% ambiguous word-forms in "1984" and 40.2% in "Republic". The explanation lies in the higher number of definite nouns and adjectives in "1984" as compared to "Republic": the definite forms of nouns, adjectives and, to some extent, numerals are in the vast majority of cases unambiguous. Judged in terms of lemma ambiguity (Table 2.6), the two texts exhibit practically the same ambiguity level: 11.5% in "1984" and 12% in "Republic".
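These percentages can be recomputed from the two ambiguity tables as the share of word-forms whose ambiguity level is greater than 1. The sketch below does just that. The variable and function names are illustrative, and the counts are copied from the tables above. Its results agree with the quoted figures up to small rounding differences.

```python
# Word-form counts per ambiguity level (index 0 is level 1), copied from the
# MSD-ambiguity and lemma-ambiguity tables above.
msd_table = {
    "1984": [9322, 2703, 1425, 301, 189, 56, 44, 0, 1],
    "Republic": [6191, 2357, 1312, 243, 174, 49, 31, 0, 2],
}
lemma_table = {
    "1984": [12432, 1524, 77, 6],
    "Republic": [9121, 1153, 75, 7],
}


def ambiguous_share(counts):
    """Percentage of word-forms with ambiguity level greater than 1."""
    total = sum(counts)
    return 100 * (total - counts[0]) / total


for name, table in (("MSD", msd_table), ("lemma", lemma_table)):
    for text, counts in table.items():
        print(f"{name} ambiguity, {text}: {ambiguous_share(counts):.1f}%")
```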
The next two tables provide further evidence on inter- and intra-categorial ambiguity in Romanian. Intra-categorial ambiguity ("horizontal" homography) is much more significant than inter-categorial ambiguity ("vertical" homography). This is in sharp contrast with English and might be considered a characteristic of highly inflectional languages.
Contrast, for instance, the verb horizontal ambiguity in Table 2.7 (561, 555, 459) with the verb vertical ambiguity in Table 2.8 (35, 34, 33). A natural expectation is that, in highly inflectional languages, horizontal ambiguity is much more difficult to resolve than vertical ambiguity.
POS-msd | N-msd | V-msd | A-msd | P-msd | D-msd | M-msd | R-msd | T-msd | S-msd | C-msd | Q-msd | I-msd | Y-msd | X-msd |
'1984' MSD-amb. classes | 317 | 561 | 246 | 140 | 87 | 41 | 89 | 15 | 21 | 21 | 9 | 18 | 14 | 10 |
'Republic' MSD-amb. classes | 289 | 555 | 211 | 142 | 84 | 47 | 87 | 14 | 20 | 19 | 10 | 14 | 14 | 7 |
Common MSD-amb. classes | 229 | 459 | 169 | 133 | 79 | 29 | 77 | 14 | 19 | 19 | 9 | 14 | 11 | 7 |
POS | N | V | A | P | D | M | R | T | S | C | Q | I | Y | X |
'1984' POS-amb. classes | 28 | 35 | 17 | 27 | 11 | 12 | 29 | 12 | 17 | 12 | 8 | 12 | 11 | 9 |
'Republic' POS-amb. classes | 25 | 34 | 17 | 26 | 10 | 14 | 27 | 11 | 16 | 10 | 8 | 11 | 10 | 6 |
Common POS-amb. classes | 24 | 33 | 16 | 26 | 10 | 12 | 27 | 11 | 16 | 10 | 8 | 11 | 9 | 6 |
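The distinction between horizontal and vertical homography can be made concrete for a single word-form. The sketch below is one possible reading, not the procedure used to build the tables above: it assumes that the first character of an MSD is its part-of-speech code (as in the MSDs quoted earlier, e.g. Afpfp-n, Ncfp-n, Vmip2s), and the function name `classify_ambiguity` is hypothetical.

```python
def classify_ambiguity(msds):
    """Return (horizontal, vertical) flags for one word-form's MSD set.

    ASSUMPTION: the first character of an MSD encodes the part of speech.
    horizontal: at least two distinct MSDs share a part of speech.
    vertical:   the MSDs span more than one part of speech.
    """
    by_pos = {}
    for msd in set(msds):
        by_pos.setdefault(msd[0], set()).add(msd)
    horizontal = any(len(group) > 1 for group in by_pos.values())
    vertical = len(by_pos) > 1
    return horizontal, vertical


# MSDs listed in the text for the word-form "VII".
vii = ["Afpfp-n", "Afpfson", "Afpmp-n", "Mc-p-r", "Mo-s-r",
       "Ncfp-n", "Ncfson", "Vmip2s", "Vmsp2s"]
print(classify_ambiguity(vii))                   # (True, True)
print(classify_ambiguity(["Vmip2s", "Vmsp2s"]))  # (True, False)
```

Under this reading, "VII" is both horizontally and vertically ambiguous, whereas a form that carries only the two verbal MSDs would be horizontally ambiguous only.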
4 Remember the discussion in section 2 on ambiguity and the conventions we used in encoding the "any" values, the case syncretism, etc.