Romanian Language Technology

Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical Processing

Considering the distribution of lemmas over the parts of speech reveals some interesting facts. For instance, in spite of the quite productive inflectional potential of nouns, verbs and adjectives, the average number of inflected word-forms of a given lemma that are actually used in the two texts is very small: fewer than 2 Noun or Adjective word-forms per Noun or Adjective lemma, and fewer than 4 Verb word-forms per Verb lemma. Table 2.3 also outlines the lexical diversity of the two books, with "1984" having broader lexical coverage in spite of its smaller number of word occurrences.

Table 2.3. The distribution of the lemmas over the parts of speech
POS	N	V	A	P	D	M	R	T	S	C	Q	I	Y	X
'1984' lemmas	3441	1370	1514	63	44	66	659	10	109	55	7	34	16	42
'Republic' lemmas	2324	996	940	62	47	65	422	6	93	46	5	11	4	1
Common lemmas	1015	656	451	43	36	32	277	5	70	37	5	5	0	1

The figures in Table 2.4 support the adequacy of the two texts with respect to the morpho-syntactic descriptions (MSDs) of the lexical stock. Comparing the data in Table 1.4 (lexicon) with the data in Table 2.4 shows that, out of the 57 MSDs defined for Nouns, the two texts provide contexts for 47 (39 + 35 - 27 = 47, that is, the MSDs attested in "1984" plus those attested in "Republic" minus those common to both). For the verbal MSDs, out of the 135 defined in the lexicon, the two texts cover 99 (92 + 74 - 67 = 99). For functional words, the coverage is practically 100%.

Table 2.4. MSD distribution over the parts of speech
POS	N	V	A	P	D	M	R	T	S	C	Q	I	Y	X
'1984' MSDs	39	92	27	102	86	34	11	26	7	8	6	1	4	1
'Republic' MSDs	35	74	20	105	83	32	7	21	7	6	6	1	2	1
Common MSDs	27	67	17	85	73	25	7	20	7	6	6	1	2	1
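
The coverage figures quoted for Nouns and Verbs follow from a simple inclusion-exclusion count over the rows of Table 2.4. The sketch below reproduces that arithmetic for every part of speech; the variable and function names are illustrative only, and the lexicon totals are limited to the two values (57 and 135) given in the text.

```python
# Counts copied from Table 2.4: distinct MSDs attested per part of speech.
POS = ["N", "V", "A", "P", "D", "M", "R", "T", "S", "C", "Q", "I", "Y", "X"]
MSDS_1984 = [39, 92, 27, 102, 86, 34, 11, 26, 7, 8, 6, 1, 4, 1]
MSDS_REPUBLIC = [35, 74, 20, 105, 83, 32, 7, 21, 7, 6, 6, 1, 2, 1]
MSDS_COMMON = [27, 67, 17, 85, 73, 25, 7, 20, 7, 6, 6, 1, 2, 1]

# Lexicon totals quoted in the text (Table 1.4); other categories are not
# given in this section, so they are deliberately left out.
LEXICON_MSDS = {"N": 57, "V": 135}


def union_sizes(first, second, common):
    """Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|, per category."""
    return [a + b - c for a, b, c in zip(first, second, common)]


covered = dict(zip(POS, union_sizes(MSDS_1984, MSDS_REPUBLIC, MSDS_COMMON)))

print(covered["N"], covered["V"])  # 47 99
for pos, total in LEXICON_MSDS.items():
    print(f"{pos}: {covered[pos]} of {total} lexicon MSDs attested")
```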

Tables 2.5 and 2.6 show the ambiguity level of the word-forms, computed first by considering the distinct MSDs attached by lexical lookup to each word-form (Table 2.5) and second by considering the distinct lemmas to which a word-form may be attributed (Table 2.6). For instance, the word-form "urâți", which appears in both texts, is 9-way MSD ambiguous. However, it is only 4-way ambiguous when lemma ambiguity is considered (there are four lemmas to which the word-form "urâți" may be attributed). Similarly, the word-form "VII", appearing in "Republic" (as such, that is, in upper-case letters), is 9-way MSD ambiguous but only 4-way lemma ambiguous. Since this might not be obvious, here are its MSDs and lemmas: Afpfp-n, Afpfson, Afpmp-n, Mc-p-r, Mo-s-r, Ncfp-n, Ncfson, Vmip2s, Vmsp2s and "șapte/seven", "vie/vineyard", "viu/alive", "veni/come".
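
The two ambiguity measures can be illustrated on the "VII" example. In the minimal sketch below, the MSDs and lemmas are those listed above, but the pairing of each MSD with a lemma is an assumption inferred from the part-of-speech letter that opens the MSD (A with "viu", M with "șapte", N with "vie", V with "veni"); the names `Reading`, `msd_ambiguity` and `lemma_ambiguity` are hypothetical.

```python
from collections import namedtuple

# One lexical reading of a word-form: an MSD together with its lemma.
Reading = namedtuple("Reading", ["msd", "lemma"])

# Readings returned by lexical lookup for "VII".
# ASSUMPTION: the MSD-to-lemma pairing is inferred from the leading POS letter.
VII_READINGS = [
    Reading("Afpfp-n", "viu"),
    Reading("Afpfson", "viu"),
    Reading("Afpmp-n", "viu"),
    Reading("Mc-p-r", "șapte"),
    Reading("Mo-s-r", "șapte"),
    Reading("Ncfp-n", "vie"),
    Reading("Ncfson", "vie"),
    Reading("Vmip2s", "veni"),
    Reading("Vmsp2s", "veni"),
]


def msd_ambiguity(readings):
    """Number of distinct MSDs attached to the word-form (as in Table 2.5)."""
    return len({r.msd for r in readings})


def lemma_ambiguity(readings):
    """Number of distinct lemmas the word-form may belong to (as in Table 2.6)."""
    return len({r.lemma for r in readings})


print(msd_ambiguity(VII_READINGS), lemma_ambiguity(VII_READINGS))  # 9 4
```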

Table 2.5. Word-forms grouped according to the MSD ambiguity level
Ambiguity level	1	2	3	4	5	6	7	8	9
'1984' wordforms	9322	2703	1425	301	189	56	44	0	1
'Republic' wordforms	6191	2357	1312	243	174	49	31	0	2
Common wordforms	2237	895	641	133	81	24	15	0	1

Table 2.6. Word-forms grouped according to the lemma ambiguity level
Ambiguity level	1	2	3	4
'1984' wordforms	12432	1524	77	6
'Republic' wordforms	9121	1153	75	7
Common wordforms	3384	580	57	4

If we distinguish only between ambiguous word-forms (ambiguity level greater than 1) and unambiguous word-forms⁴ (ambiguity level equal to 1), the two texts reveal slightly different MSD ambiguity figures: 33.5% ambiguous word-forms in "1984" and 40.2% in "Republic". The explanation lies in the higher number of definite nouns and adjectives used in "1984" as compared to "Republic": the definite forms of nouns, adjectives and, to some extent, numerals are unambiguous in the vast majority of cases. Judged in terms of lemma ambiguity (as shown in Table 2.6), the two texts exhibit practically the same ambiguity level: 11.5% in "1984" and 12% in "Republic".
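
These percentages can be recomputed from the rows of Tables 2.5 and 2.6 as the share of word-forms whose ambiguity level exceeds 1. The sketch below is only an illustration of that calculation (the names are hypothetical); the values it yields fall within a few tenths of a percentage point of the figures quoted above, and any residual difference may be due to rounding or counting conventions not spelled out in this section.

```python
# Word-form counts per ambiguity level, copied from Tables 2.5 and 2.6.
# Index 0 holds level 1 (unambiguous), index 1 holds level 2, and so on.
MSD_LEVELS = {
    "1984": [9322, 2703, 1425, 301, 189, 56, 44, 0, 1],
    "Republic": [6191, 2357, 1312, 243, 174, 49, 31, 0, 2],
}
LEMMA_LEVELS = {
    "1984": [12432, 1524, 77, 6],
    "Republic": [9121, 1153, 75, 7],
}


def ambiguous_share(counts):
    """Percentage of word-forms with ambiguity level greater than 1."""
    total = sum(counts)
    return 100.0 * (total - counts[0]) / total


for text in ("1984", "Republic"):
    msd_pct = ambiguous_share(MSD_LEVELS[text])
    lemma_pct = ambiguous_share(LEMMA_LEVELS[text])
    print(f"{text}: MSD-ambiguous {msd_pct:.1f}%, lemma-ambiguous {lemma_pct:.1f}%")
```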

The next two tables provide further evidence on inter- and intra-categorial ambiguity in Romanian. The intra-categorial ambiguity ("horizontal" homography) is much more significant than the inter-categorial ambiguity ("vertical" homography). This is in sharp contrast with English and might be considered a characteristic of highly inflectional languages.

Contrast, for instance, the verb horizontal ambiguity in Table 2.7 (561, 555, 459) with the verb vertical ambiguity in Table 2.8 (35, 34, 33). A natural expectation is that, in highly inflectional languages, horizontal ambiguity is much more difficult to resolve than vertical ambiguity.

Table 2.7. MSD distribution over the MSD-ambiguity classes
POS-msd	N-msd	V-msd	A-msd	P-msd	D-msd	M-msd	R-msd	T-msd	S-msd	C-msd	Q-msd	I-msd	Y-msd	X-msd
'1984' MSD-amb. classes	317	561	246	140	87	41	89	15	21	21	9	18	14	10
'Republic' MSD-amb. classes	289	555	211	142	84	47	87	14	20	19	10	14	14	7
Common MSD-amb. classes	229	459	169	133	79	29	77	14	19	19	9	14	11	7

Table 2.8. POS distribution over the POS-ambiguity-classes
POS	N	V	A	P	D	M	R	T	S	C	Q	I	Y	X
'1984' POS-amb. classes	28	35	17	27	11	12	29	12	17	12	8	12	11	9
'Republic' POS-amb. classes	25	34	17	26	10	14	27	11	16	10	8	11	10	6
Common POS-amb. classes	24	33	16	26	10	12	27	11	16	10	8	11	9	6
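
To make the contrast between Tables 2.7 and 2.8 concrete, the sketch below divides, for each part of speech, the number of MSD-ambiguity classes by the number of POS-ambiguity classes. This ratio is not a measure defined in the text; it is used here only as a rough, illustrative indicator of how far horizontal ambiguity outweighs vertical ambiguity, and the names are hypothetical.

```python
POS = ["N", "V", "A", "P", "D", "M", "R", "T", "S", "C", "Q", "I", "Y", "X"]

# Rows copied from Table 2.7 (MSD-ambiguity classes per part of speech).
MSD_AMB_CLASSES = {
    "1984": [317, 561, 246, 140, 87, 41, 89, 15, 21, 21, 9, 18, 14, 10],
    "Republic": [289, 555, 211, 142, 84, 47, 87, 14, 20, 19, 10, 14, 14, 7],
    "Common": [229, 459, 169, 133, 79, 29, 77, 14, 19, 19, 9, 14, 11, 7],
}
# Rows copied from Table 2.8 (POS-ambiguity classes per part of speech).
POS_AMB_CLASSES = {
    "1984": [28, 35, 17, 27, 11, 12, 29, 12, 17, 12, 8, 12, 11, 9],
    "Republic": [25, 34, 17, 26, 10, 14, 27, 11, 16, 10, 8, 11, 10, 6],
    "Common": [24, 33, 16, 26, 10, 12, 27, 11, 16, 10, 8, 11, 9, 6],
}


def class_ratio(row, pos):
    """MSD-ambiguity classes per POS-ambiguity class for one table row."""
    i = POS.index(pos)
    return MSD_AMB_CLASSES[row][i] / POS_AMB_CLASSES[row][i]


for row in MSD_AMB_CLASSES:
    print(f"{row}: verb ratio {class_ratio(row, 'V'):.2f}")
```

For verbs, the ratio is roughly 14 to 16 in all three rows, whereas for categories such as conjunctions or particles it stays close to 1 or 2.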

⁴ Recall the discussion in Section 2 on ambiguity and the conventions we used in encoding the "any" values, case syncretism, etc.