Romanian Language Technology

Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical Processing

When considering lemma homography, the MSD-ambiguity figures are shown in Table 1.6.

Table 1.6. Lemmas grouped according to the MSD ambiguity level
Ambiguity level	1	2	3	4
Number of lemmas	31218	1927	107	7
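
As a quick check on Table 1.6, the counts can be totalled to see what share of the lemmas is MSD-ambiguous. The sketch below is illustrative only: the variable names are assumptions, and the total is derived from the table rather than stated in the text.

```python
# Lemma counts per MSD-ambiguity level, copied from Table 1.6.
lemmas_per_level = {1: 31218, 2: 1927, 3: 107, 4: 7}

# Total number of lemmas (derived from the table, not quoted in the text).
total_lemmas = sum(lemmas_per_level.values())

# A lemma is MSD-ambiguous when its ambiguity level is greater than 1.
ambiguous_lemmas = sum(n for level, n in lemmas_per_level.items() if level > 1)
ambiguous_share = 100 * ambiguous_lemmas / total_lemmas

print(f"total lemmas: {total_lemmas}")
print(f"ambiguous lemmas: {ambiguous_lemmas} ({ambiguous_share:.2f}%)")
```

Under these counts, roughly 6% of the lemmas carry more than one MSD.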

The following two tables in this section show the distribution of the MSDs over the MSD-ambiguity classes and of POSes over POS-ambiguity classes, respectively. Table 1.7 reads as follows: a given POS-msd (an MSD belonging to a given part of speech) appears in k MSD-ambiguity classes. For instance, verbal MSDs (V-msd) appear in 490 out of the total 981 MSD-ambiguity classes, while nominal MSDs (N-msd) appear in 445 MSD-ambiguity classes.

Table 1.7. MSD distribution over the MSD-ambiguity classes
POS-msd	N-msd	V-msd	A-msd	P-msd	D-msd	M-msd	R-msd	T-msd	S-msd	C-msd	Q-msd	I-msd	Y-msd	X-msd
No. of MSD-amb. classes	445	490	334	131	93	73	102	15	22	18	7	19	34	9

If one considers only the part of speech and the POS-ambiguity classes, the corresponding distribution is shown in Table 1.8. Out of the total number of 90 POS-ambiguity classes, 34 contain the verb (V), 30 contain the noun (N) and so on.

Table 1.8. POS distribution over the POS-ambiguity classes
POS	N	V	A	P	D	M	R	T	S	C	Q	I	Y	X
No. of POS-amb. classes	30	34	18	28	12	14	28	10	16	11	6	11	16	9

Comparing the figures in Table 1.7 and Table 1.8, one may draw some interesting conclusions. For instance, considering the word-forms with more than one MSD (63411, i.e. 18.22% of the total number of word-forms), any such ambiguous word-form will have one or more verbal readings in almost 50% of the cases; if only the part of speech is considered, an ambiguous word-form will have a verb interpretation in almost 38% of the cases. Table 1.9 summarises this comparison for all parts of speech.

Table 1.9. Percentage distribution of POS over the MSD- and POS-ambiguity classes
%	N	V	A	P	D	M	R	T	S	C	Q	I	Y	X
MSD	45	50	34	13	9	7	10	2	2	2	0.7	2	3	0.9
POS	33	38	20	31	13	16	31	11	18	12	7	12	18	10
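
The percentages in Table 1.9 follow directly from the counts in Tables 1.7 and 1.8 divided by the totals quoted in the text (981 MSD-ambiguity classes and 90 POS-ambiguity classes). A minimal sketch of that arithmetic is given here; the function name and the rounding rule (one decimal below 1%, whole numbers otherwise) are assumptions inferred from how the table is printed, not something the text specifies.

```python
# Counts copied from Tables 1.7 and 1.8; totals are those quoted in the text.
POS = ["N", "V", "A", "P", "D", "M", "R", "T", "S", "C", "Q", "I", "Y", "X"]
MSD_CLASS_COUNTS = [445, 490, 334, 131, 93, 73, 102, 15, 22, 18, 7, 19, 34, 9]
POS_CLASS_COUNTS = [30, 34, 18, 28, 12, 14, 28, 10, 16, 11, 6, 11, 16, 9]
TOTAL_MSD_CLASSES = 981
TOTAL_POS_CLASSES = 90


def percentage(count, total):
    """Share of ambiguity classes, as a percentage.

    Assumed rounding: values below 1% keep one decimal,
    all other values are rounded to whole numbers.
    """
    pct = 100 * count / total
    return round(pct, 1) if pct < 1 else round(pct)


msd_row = [percentage(c, TOTAL_MSD_CLASSES) for c in MSD_CLASS_COUNTS]
pos_row = [percentage(c, TOTAL_POS_CLASSES) for c in POS_CLASS_COUNTS]

# Print one line per part of speech: MSD-level share, then POS-level share.
for pos, msd_pct, pos_pct in zip(POS, msd_row, pos_row):
    print(f"{pos}\tMSD {msd_pct}%\tPOS {pos_pct}%")
```

With this rounding rule, both computed rows agree with the MSD and POS rows of Table 1.9.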

While these figures are hardly useful for disambiguating running text, they show that resolving intra-categorial ambiguity is harder than discriminating among categories. Therefore, in designing the corpus tags (the set of codes an automatic tagger is supposed to work with), we concentrated on those attribute values which would maximally discriminate among the MSDs belonging to the same part of speech. In this way, out of the 674 possible morpho-syntactic descriptions of the word-forms in the lexicon, we defined a tagset containing 73 corpus tags. Since there is no generally accepted methodology for designing tagsets, apart from the empirical trial-and-error approach, this tagset is considered just a working hypothesis towards a corpus-supported proposal for a Romanian tagset.

The automatic tagging literature reports excellent results in statistical disambiguation (accuracy above 95-96%), but none of these successful works considered highly inflectional languages such as Romanian (or any other language in the MULTEXT-EAST project). Moreover, the number of tags we selected as a starting point is approximately four times higher than the number of tags used for English, the language with the most successful results in automatic tagging. That is why we expect the final proposal for a Romanian tagset to be the result of intensive experiments on large volumes of data. These experiments require large amounts of training data, prepared by semi-automatic annotation and manual validation. The next section of the paper addresses this very issue.