Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical Processing




Table 2 shows the global figures for the sub-corpora processed as described above. Since in the first phase of this research we were concerned with parallel corpora, the selected texts were Orwell's "1984" and Plato's "Republic". These two integral texts belong to two different registers (fiction and philosophy) and, as will be shown, have different distributional properties. The Occurrences column contains the number of word-form occurrences in each of the two books, duplicates included. The Wordforms column shows the number of distinct word-forms in each book. The MSDs column shows the number of distinct MSDs used in each of the books. The AMB_MSD and AMB_POS columns give, respectively, the numbers of MSD-ambiguity clusters and POS-ambiguity clusters found in the corpora. The Common row displays, for each category, what the two texts have in common.
As one can see from Table 2, although "Republic" contains more occurrences than "1984", its number of different word-forms and lemmas is significantly smaller. This is not surprising given the two registers of the considered texts: in a philosophical text, most words are used in a rather technical way, with as little stylistic variation as possible. Although the two texts together contain more than 200,000 words, the number of different word-forms is just 20,371 (less than 10%), with only 4,016 word-forms occurring in both texts. Therefore, from the point of view of lexical coverage, the two texts are far from sufficient for drawing significant conclusions on word-usage frequencies in Romanian. However, for distributional analysis and morpho-syntactic disambiguation purposes, the selected texts offer enough evidence to extract reliable data and draw realistic conclusions.
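
The figure of 20,371 distinct word-forms follows from the Table 2 counts by inclusion-exclusion. The short Python sketch below reproduces that arithmetic, together with the ratio of distinct word-forms to running words; the variable names are illustrative assumptions, and only the numbers come from Table 2.

```python
# Figures copied from Table 2; variable names are illustrative only.
occurrences = {"1984": 101460, "Republic": 114720}
wordforms = {"1984": 14037, "Republic": 10350}
common_wordforms = 4016

# Inclusion-exclusion: forms found in either text = sum of both minus overlap.
distinct_wordforms = sum(wordforms.values()) - common_wordforms
total_occurrences = sum(occurrences.values())

# Ratio of distinct word-forms to running words, overall and per text.
overall_ratio = distinct_wordforms / total_occurrences
per_text_ratio = {text: wordforms[text] / occurrences[text] for text in occurrences}

print(distinct_wordforms)        # 20371
print(total_occurrences)         # 216180
print(f"overall: {overall_ratio:.2%}")
for text, ratio in per_text_ratio.items():
    print(f"{text}: {ratio:.2%}")
```

The per-text ratios make the register contrast explicit: "Republic" has more running words but a lower proportion of distinct word-forms than "1984".
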
Out of the 674 MSDs defined in the lexicon, these texts contain 500 (more than 74%), which is a very good figure considering that the distinct word-forms occurring in the selected texts represent approximately 5% of the lexical stock. This is to say that most of the words raising problems (from the ambiguity point of view) appear in at least one of the two texts. This statement is even better supported by the last column of Table 2: "1984" contains all but 4 of the POS-ambiguity clusters, while "Republic" misses only 10. The MSD-ambiguity clusters are also very well represented in the selected texts (more than 63% of all MSD-ambiguity clusters defined in the lexicon).
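
The MSD coverage can be checked in the same way. The following Python sketch is only an arithmetic illustration with assumed variable names; the lexicon total of POS-ambiguity clusters is not stated in the text and is inferred here from the "all but 4" and "misses only 10" remarks.

```python
# MSD figures from the text and from Table 2; names are illustrative only.
lexicon_msds = 674
msds = {"1984": 444, "Republic": 400}
common_msds = 344

# MSDs attested in at least one text (inclusion-exclusion).
attested_msds = msds["1984"] + msds["Republic"] - common_msds
msd_coverage = attested_msds / lexicon_msds

# POS-ambiguity clusters: found per text (Table 2) and missing per text (prose).
amb_pos_found = {"1984": 86, "Republic": 80}
amb_pos_missing = {"1984": 4, "Republic": 10}

# Assumption: found + missing gives the lexicon total, which the text never states.
implied_totals = {t: amb_pos_found[t] + amb_pos_missing[t] for t in amb_pos_found}

print(attested_msds)             # 500
print(f"{msd_coverage:.1%}")
print(implied_totals)            # both texts imply the same total
```

Both texts imply the same lexicon total of POS-ambiguity clusters, so the two remarks in the paragraph are mutually consistent.
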

Table 2. Corpora overview
Sub-Corpus   Occurrences  Wordforms  Lemmas  MSDs  AMB_MSD  AMB_POS
'1984'            101460      14037    7019   444      529       86
'Republic'        114720      10350    4697   400      498       80
Common             56804       4016    2527   344      402       78

Table 2.1. The distribution of the word-form occurrences over the parts of speech in "1984" and "Republic"
POS                   N      V      A      P      D      M      R      T      S      C      Q      I      Y      X
'1984' occrs      22570  19031   6202  10967   2376   1228  10633   4236  12621   6389   4967     99     72     69
'Republic' occrs  19936  24890   6302  13577   3069    633  12965   4079  12141  10506   6383    199     37      3
Common occrs       5345   7768   2045   8294   1506    416   6556   2730  10542   5797   4482     36      0      1

The distribution of the word-form occurrences over the parts of speech shows some differences that are motivated by the different registers of the two texts. As one would expect, these differences show up in the counts for common content words (N, V, A, R), whereas for the functional words (S, C, Q, T, P) the number of common occurrences is very high. This observation is better supported when comparing the counts of distinct word-forms, that is, without counting duplicates (Table 2.2). The abbreviations (Y) and residuals (X) are also very different in the two texts. This is easily understood if one takes into account that abbreviations in most cases stand for content words. The residual class is poorly represented in Plato's "Republic", but it has several instances in Orwell's "1984" (mainly where the Newspeak jargon is in use).
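
One way to quantify the contrast between content and functional words is to relate the Common row of Table 2.1 to the smaller of the two per-text counts for each part of speech. The Python sketch below does this; the normalization by the smaller count is an assumption made for illustration and is not a measure used in the text.

```python
# Occurrence counts copied from Table 2.1 (POS tags as in the table header).
pos_tags = ["N", "V", "A", "P", "D", "M", "R", "T", "S", "C", "Q", "I", "Y", "X"]
occ_1984 = [22570, 19031, 6202, 10967, 2376, 1228, 10633, 4236, 12621, 6389, 4967, 99, 72, 69]
occ_republic = [19936, 24890, 6302, 13577, 3069, 633, 12965, 4079, 12141, 10506, 6383, 199, 37, 3]
occ_common = [5345, 7768, 2045, 8294, 1506, 416, 6556, 2730, 10542, 5797, 4482, 36, 0, 1]

content_tags = ["N", "V", "A", "R"]
function_tags = ["S", "C", "Q", "T", "P"]


def common_share(tag):
    """Common occurrences relative to the smaller per-text count (assumed measure)."""
    i = pos_tags.index(tag)
    return occ_common[i] / min(occ_1984[i], occ_republic[i])


shares = {tag: common_share(tag) for tag in content_tags + function_tags}
for tag, share in shares.items():
    kind = "content" if tag in content_tags else "function"
    print(f"{tag} ({kind}): {share:.1%}")
```

Under this measure, every functional class has a higher common share than every content class, which is in line with the observation above.
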

Table 2.2. The distribution of the word-forms over the parts of speech
POS                        N     V     A     P     D     M     R     T     S     C     Q     I     Y     X
'1984' wordforms        6177  4518  2473   189   137   114   683    29   127    62     9    34    16    42
'Republic' wordforms    4537  3603  1682   179   128   111   424    22   101    49     7    12     4     1
Common wordforms        1467  1344   607   136   105    48   277    21    74    40     7     5     0     1

Comparing the figures in the previous tables, the discursive nature of Plato's book is quite obvious when contrasted with "1984": the smaller number of nouns is compensated by an increased number of pronouns, and the large number of verbal occurrences ensures an alert argumentative rhythm.
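
The comparison can be made independent of text length by expressing the counts as shares of each text's total occurrences. A minimal Python sketch follows, using the N, V and P counts from Table 2.1 and the totals from Table 2; the names are illustrative assumptions.

```python
# N, V, P occurrence counts from Table 2.1; totals from Table 2 (Occurrences).
occ = {
    "1984": {"N": 22570, "V": 19031, "P": 10967},
    "Republic": {"N": 19936, "V": 24890, "P": 13577},
}
total = {"1984": 101460, "Republic": 114720}

# Relative frequency of each part of speech within its own text.
rel = {
    text: {tag: count / total[text] for tag, count in counts.items()}
    for text, counts in occ.items()
}

for text, shares in rel.items():
    line = ", ".join(f"{tag}={share:.1%}" for tag, share in shares.items())
    print(f"{text}: {line}")
```

Relative to its length, "Republic" has a lower share of nouns and higher shares of verbs and pronouns than "1984".
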

