Romanian Language Technology

Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical Processing

Table 2 shows the global figures for the sub-corpora processed as described above. Since the first phase of this research was concerned with parallel corpora, the selected texts were Orwell's "1984" and Plato's "Republic". These two integral texts belong to two different registers (fiction and philosophy) and, as will be shown, have different distributional properties. The Occurrences column contains the number of word-form occurrences in each of the two books, without eliminating duplicates. The Wordforms column shows the number of distinct word-forms in each book, and the Lemmas column the number of distinct lemmas. The MSDs column shows the number of distinct MSDs used in each of the books. The AMB_MSD and AMB_POS columns give, respectively, the numbers of MSD-ambiguity clusters and POS-ambiguity clusters found in the corpora. The Common row displays, for each category, what the two texts have in common.

As one can see from Table 2, although "Republic" contains more occurrences than "1984", its number of different word-forms and lemmas is significantly smaller. This is not surprising given the registers of the two texts: in a philosophical text, most words are used in a rather technical way, with as little stylistic variation as possible. Although the two texts together contain more than 200,000 words, the number of different word-forms is just 20,371 (less than 10%), with only 4,016 word-forms occurring in both texts. Therefore, from the point of view of lexical coverage, the two texts are far from sufficient for drawing significant conclusions about word-usage frequencies in Romanian. However, for the purposes of distributional analysis and morpho-syntactic disambiguation, the selected texts offer enough evidence to extract reliable data and draw realistic conclusions.
Out of the 674 MSDs defined in the lexicon, these texts contain 500 (more than 74%), which is a very good figure considering that the number of distinct word-forms occurring in the selected texts represents approximately 5% of the lexical stock. This is to say that most of the words raising problems (from the ambiguity point of view) appeared in at least one of the two texts. This statement is even better supported if one considers the last column in Table 2: "1984" contains all but 4 of the POS-ambiguity clusters, while "Republic" misses only 10. The MSD-ambiguity clusters are also very well represented in the selected texts (more than 63% of all MSD-ambiguity clusters defined in the lexicon).

Table 2. Corpora overview
Sub-Corpus	Occurrences	Wordforms	Lemmas	MSDs	AMB_MSD	AMB_POS
'1984'	101460	14037	7019	444	529	86
'Republic'	114720	10350	4697	400	498	80
Common	56804	4016	2527	344	402	78
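
The aggregate figures quoted in the text (20,371 distinct word-forms, 500 distinct MSDs, the "less than 10%" and "more than 74%" ratios) can be re-derived from Table 2 by inclusion-exclusion. The following is a minimal sketch, not the authors' procedure; the dictionary layout and the names `table2`, `union_size` and `LEXICON_MSDS` are illustrative assumptions, and only the numbers come from the table and the text.

```python
# Figures copied from Table 2; the variable names are illustrative only.
table2 = {
    "1984":     {"occurrences": 101460, "wordforms": 14037, "msds": 444},
    "Republic": {"occurrences": 114720, "wordforms": 10350, "msds": 400},
    "Common":   {"occurrences": 56804,  "wordforms": 4016,  "msds": 344},
}
LEXICON_MSDS = 674  # number of MSDs defined in the lexicon, as stated in the text


def union_size(column):
    """Inclusion-exclusion on distinct items: |A u B| = |A| + |B| - |A n B|."""
    return (table2["1984"][column]
            + table2["Republic"][column]
            - table2["Common"][column])


# Running words in both books (duplicates kept, so a plain sum).
total_occurrences = (table2["1984"]["occurrences"]
                     + table2["Republic"]["occurrences"])
distinct_wordforms = union_size("wordforms")
distinct_msds = union_size("msds")

print(total_occurrences)                                       # 216180
print(distinct_wordforms)                                      # 20371
print(distinct_msds)                                           # 500
print(round(100 * distinct_wordforms / total_occurrences, 1))  # 9.4
print(round(100 * distinct_msds / LEXICON_MSDS, 1))            # 74.2
```

Inclusion-exclusion is applied only to the columns that count distinct items; the Occurrences column counts running words, so the two books are simply added.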

Table 2.1. The distribution of the word-form occurrences over the parts of speech in "1984" and "Republic"
POS	N	V	A	P	D	M	R	T	S	C	Q	I	Y	X
'1984' occrs	22570	19031	6202	10967	2376	1228	10633	4236	12621	6389	4967	99	72	69
Republic occrs	19936	24890	6302	13577	3069	633	12965	4079	12141	10506	6383	199	37	3
Common occrs	5345	7768	2045	8294	1506	416	6556	2730	10542	5797	4482	36	0	1

The distribution of the word-form occurrences over the parts of speech shows some differences that are motivated by the different registers of the two texts. As one would expect, these differences show up in the counts for the content words (N, V, A, R), where relatively few occurrences are common to both texts. For the function words (S, C, Q, T, P), the number of common occurrences is very high. This observation is better supported by comparing the counts of distinct word-forms, that is, without counting duplicates (Table 2.2). The abbreviations and residuals also differ considerably between the two texts. For the abbreviations this is easily understood, since in most cases they stand for content words. The residual class is barely represented in Plato's "Republic", but it has several instances in Orwell's "1984" (mainly where the Newspeak jargon is in use).
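
The contrast between content and function words can be checked directly against the rows of Table 2.1. The sketch below is only an illustration: it assumes that the "Common occrs" row may be compared with each text's own counts, and the names `occurrences`, `group_total` and `common_share` are hypothetical helpers, not part of the original study.

```python
# Rows copied from Table 2.1 (word-form occurrences per part of speech).
POS = ["N", "V", "A", "P", "D", "M", "R", "T", "S", "C", "Q", "I", "Y", "X"]
occurrences = {
    "1984":     [22570, 19031, 6202, 10967, 2376, 1228, 10633,
                 4236, 12621, 6389, 4967, 99, 72, 69],
    "Republic": [19936, 24890, 6302, 13577, 3069, 633, 12965,
                 4079, 12141, 10506, 6383, 199, 37, 3],
    "Common":   [5345, 7768, 2045, 8294, 1506, 416, 6556,
                 2730, 10542, 5797, 4482, 36, 0, 1],
}

# Groupings as given in the text.
CONTENT = ("N", "V", "A", "R")
FUNCTION = ("S", "C", "Q", "T", "P")


def group_total(row, tags):
    """Sum the counts of one table row over a group of POS tags."""
    counts = dict(zip(POS, occurrences[row]))
    return sum(counts[tag] for tag in tags)


def common_share(text, tags):
    """Common occurrences as a fraction of the text's own occurrences."""
    return group_total("Common", tags) / group_total(text, tags)


for text in ("1984", "Republic"):
    print(text,
          f"content: {common_share(text, CONTENT):.1%}",
          f"function: {common_share(text, FUNCTION):.1%}")
```

Under this reading, the common share is markedly higher for the function-word group than for the content-word group in both texts, in line with the observation above.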

Table 2.2. The distribution of the word-forms over the parts of speech
POS	N	V	A	P	D	M	R	T	S	C	Q	I	Y	X
'1984' wordforms	6177	4518	2473	189	137	114	683	29	127	62	9	34	16	42
'Republic' wordforms	4537	3603	1682	179	128	111	424	22	101	49	7	12	4	1
Common wordforms	1467	1344	607	136	105	48	277	21	74	40	7	5	0	1

A comparison of the figures in the preceding tables makes the discursive nature of Plato's book quite obvious when it is contrasted with "1984": the smaller number of nouns is compensated by a larger number of pronouns, and the large number of verb occurrences ensures an alert argumentative rhythm.
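
Since the two books differ in length, the noun, pronoun and verb counts are easier to compare as shares of each text's running words. The following sketch normalises the relevant figures from Table 2.1 by the Occurrences totals of Table 2; the names `totals`, `counts` and `share` are illustrative assumptions.

```python
# Occurrences column of Table 2.
totals = {"1984": 101460, "Republic": 114720}

# Noun, pronoun and verb occurrences from Table 2.1.
counts = {
    "N": {"1984": 22570, "Republic": 19936},
    "P": {"1984": 10967, "Republic": 13577},
    "V": {"1984": 19031, "Republic": 24890},
}


def share(pos, text):
    """Fraction of a text's running words that carry the given POS."""
    return counts[pos][text] / totals[text]


for pos in counts:
    print(pos,
          f"1984: {share(pos, '1984'):.1%}",
          f"Republic: {share(pos, 'Republic'):.1%}")
```

With these figures, nouns account for a smaller share of "Republic" than of "1984", while pronouns and verbs account for a larger one.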