Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical Processing


When lemma homography is taken into account, the MSD-ambiguity figures are those shown in Table 1.6.

Table 1.6. Lemmas grouped according to the MSD ambiguity level
Ambiguity level     1       2      3     4
Number of lemmas    31218   1927   107   7

The following two tables in this section show the distribution of the MSDs over the MSD-ambiguity classes and of POSes over POS-ambiguity classes, respectively. Table 1.7 reads as follows: a given POS-msd (an MSD belonging to a given part of speech) appears in k MSD-ambiguity classes. For instance, verbal MSDs (V-msd) appear in 490 out of the total 981 MSD-ambiguity classes, while nominal MSDs (N-msd) appear in 445 MSD-ambiguity classes.

Table 1.7. MSD distribution over the MSD-ambiguity classes
POS-msd                   N-msd  V-msd  A-msd  P-msd  D-msd  M-msd  R-msd  T-msd  S-msd  C-msd  Q-msd  I-msd  Y-msd  X-msd
No. of MSD-amb. classes   445    490    334    131    93     73     102    15     22     18     7      19     34     9

If one considers only the part of speech and the POS-ambiguity classes, the corresponding distribution is shown in Table 1.8. Out of the total number of 90 POS-ambiguity classes, 34 contain the verb (V), 30 contain the noun (N) and so on.

Table 1.8. POS distribution over the POS-ambiguity-classes
POS                       N    V    A    P    D    M    R    T    S    C    Q    I    Y    X
No. of POS-amb. classes   30   34   18   28   12   14   28   10   16   11   6    11   16   9
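The counting behind Tables 1.7 and 1.8 can be illustrated with a minimal sketch. It assumes that an ambiguity class is the set of labels (MSDs or parts of speech) shared by at least one word-form, that the first character of an MSD encodes its part of speech, and that unambiguous forms contribute classes of size one. The lexicon entries and MSD codes below are hypothetical toy data, not taken from the Romanian lexicon.

```python
from collections import Counter

# Toy lexicon (hypothetical entries): word-form -> set of MSD codes.
# Assumption: the first character of each code is the part of speech.
lexicon = {
    "w1": {"Nc1", "Vm1"},
    "w2": {"Nc1", "Vm1"},          # same MSD-ambiguity class as w1
    "w3": {"Nc1", "Nc2"},
    "w4": {"Vm1", "Vm2", "Af1"},
    "w5": {"Af1"},                 # unambiguous form: class of size one
    "w6": {"Nc2", "Nc3"},
}


def pos_of(label):
    """Part of speech of an MSD (or of a POS label itself)."""
    return label[0]


def ambiguity_classes(lexicon, key=lambda msd: msd):
    """Return the distinct ambiguity classes as frozensets of labels."""
    return {frozenset(key(m) for m in msds) for msds in lexicon.values()}


def classes_per_pos(classes):
    """For each POS, count the classes holding at least one label of that POS."""
    counts = Counter()
    for cls in classes:
        for pos in {pos_of(label) for label in cls}:
            counts[pos] += 1
    return counts


msd_classes = ambiguity_classes(lexicon)              # analogue of Table 1.7
pos_classes = ambiguity_classes(lexicon, key=pos_of)  # analogue of Table 1.8
msd_counts = classes_per_pos(msd_classes)
pos_counts = classes_per_pos(pos_classes)

print(len(msd_classes), dict(msd_counts))
print(len(pos_classes), dict(pos_counts))
```

In this toy data the forms w3 and w6 fall into two different MSD-ambiguity classes but into the same POS-ambiguity class, which is why the number of classes shrinks when only the part of speech is considered.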

Comparing the figures in Table 1.7 and Table 1.8, one may draw some interesting conclusions. For instance, considering the word-forms with more than one MSD (63411, i.e. 18.22% of the total number of word-forms), any such ambiguous word-form will have, in almost 50% of the cases, one or more verbal readings; considering only the part of speech, in almost 38% of the cases an ambiguous word-form would have a verb interpretation. Table 1.9 summarises this comparison for all parts of speech.

Table 1.9. POS percentual distribution over the POS and MSD-ambiguity-classes
%     N    V    A    P    D    M    R    T    S    C    Q     I    Y    X
MSD   45   50   34   13   9    7    10   2    2    2    0.7   2    3    0.9
POS   33   38   20   31   13   16   31   11   18   12   7     12   18   10
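The percentages of Table 1.9 follow from dividing the class counts of Tables 1.7 and 1.8 by the totals quoted in the text (981 MSD-ambiguity classes and 90 POS-ambiguity classes). The short sketch below recomputes them; the variable and function names are ours and serve only for illustration.

```python
# Totals quoted in the text.
MSD_CLASSES_TOTAL = 981
POS_CLASSES_TOTAL = 90

# Parts of speech in the column order of the tables.
POS_ORDER = "N V A P D M R T S C Q I Y X".split()

# Class counts as given in Table 1.7 (MSD) and Table 1.8 (POS).
MSD_COUNTS = [445, 490, 334, 131, 93, 73, 102, 15, 22, 18, 7, 19, 34, 9]
POS_COUNTS = [30, 34, 18, 28, 12, 14, 28, 10, 16, 11, 6, 11, 16, 9]


def percent(count, total):
    """Share of ambiguity classes, as a percentage."""
    return 100.0 * count / total


# POS -> (percentage of MSD-amb. classes, percentage of POS-amb. classes)
table_1_9 = {
    pos: (percent(m, MSD_CLASSES_TOTAL), percent(p, POS_CLASSES_TOTAL))
    for pos, m, p in zip(POS_ORDER, MSD_COUNTS, POS_COUNTS)
}

for pos, (msd_pct, pos_pct) in table_1_9.items():
    print(f"{pos}: MSD {msd_pct:.1f}%  POS {pos_pct:.1f}%")
```

For the verb, for example, this gives 490/981, i.e. about 50%, against 34/90, i.e. about 38%, which are the two figures discussed above.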

While these figures are hardly useful for disambiguating running text, they show that intra-categorial ambiguity is harder to resolve than discrimination among categories. Therefore, in designing the corpus tags (the set of codes an automatic tagger is supposed to work with), we concentrated on those attribute-values which would maximally discriminate among the MSDs belonging to the same part of speech. This way, out of the 674 possible morpho-syntactic descriptions of the word-forms in the lexicon, we defined a tagset containing 73 corpus tags. Since there is no generally accepted methodology for designing tagsets, except for the empirical trial-and-error approach, this tagset is considered just a working hypothesis towards a corpus-supported proposal for a Romanian tagset. The automatic tagging literature reports excellent results in statistical disambiguation (accuracy above 95-96%), but none of these successful works considered highly inflectional languages such as Romanian (or any other language in the MULTEXT-EAST project). Moreover, the number of tags we selected as a starting point is approximately four times higher than the number of tags used for English, the language with the most successful results in automatic tagging. That is why we expect the final proposal for a Romanian tagset to be the result of intensive experiments on large volumes of data. These experiments require large training data, prepared by semi-automatic annotation and manual validation. The next section of the paper addresses this very issue.
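The reduction from MSDs to corpus tags can be pictured as a projection that keeps, for each part of speech, only some attribute positions of the MSD. The sketch below is an illustration under our own assumptions and not the actual mapping behind the 73 corpus tags: the MSD strings are simplified positional codes invented for the example, and the positions listed in KEPT_POSITIONS are hypothetical.

```python
from collections import defaultdict

# Hypothetical choice of attribute positions kept per part of speech
# (position 0 is the POS itself). Not the actual Romanian tagset design.
KEPT_POSITIONS = {
    "N": (0, 1, 3),   # e.g. category, type, number
    "V": (0, 2),      # e.g. category, verb form
}


def corpus_tag(msd):
    """Project an MSD onto the positions kept for its part of speech.

    Parts of speech without an entry are reduced to the POS letter alone.
    """
    positions = KEPT_POSITIONS.get(msd[0], (0,))
    return "".join(msd[i] for i in positions if i < len(msd))


def reduce_tagset(msds):
    """Group MSDs by the corpus tag they are mapped to."""
    groups = defaultdict(set)
    for msd in msds:
        groups[corpus_tag(msd)].add(msd)
    return dict(groups)


# Illustrative MSD codes (invented for this example).
msds = ["Ncms", "Ncfs", "Ncmp", "Vmip", "Vmis", "Vmn", "Afp"]
reduced = reduce_tagset(msds)

for tag, members in sorted(reduced.items()):
    print(tag, sorted(members))
```

Here seven MSDs collapse into five corpus tags. Which attributes to keep is exactly the empirical question discussed above: dropping an attribute shrinks the tagset, but the MSDs merged under one tag can no longer be told apart by the tagger.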

