Romanian Language Technology

Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical Processing

The table below provides a comparative overview of the number of attributes that were considered in each of the MULTEXT-EAST languages for each category (a '-' in a cell signifies that the corresponding POS is not relevant for the considered language).

POS	Rom	Bul	Cz	Slo	Est	Hun
Noun	6	5	5	5	3	7
Verb	7	8	10	8	8	5
Adjective	7	3	7	5	3	8
Pronoun	8	8	12	10	4	7
Adverb	3	1	2	2	0	4
Adposition	4	1	3	3	1	1
Conjunction	5	2	3	2	1	3
Numeral	7	5	7	5	4	7
Interjection	0	1	0	0	0	1
Residual	0	0	0	0	0	0
Abbreviation	5	0	0	0	3	0
Particle	2	2	0	0	-	-
Determiner	8	-	-	-	-	-
Article	5	-	-	-	-	1

3. A statistical account of the lexicon

Once the harmonised set of morpho-syntactic specifications for the six MULTEXT-EAST languages had been developed, lexicons incorporating these specifications were created for each language. The Romanian lexicon was built from a 35,000-lemma lexicon by means of our EGLU natural language processing platform [15]. Several words in the corpus were not in the EGLU lexicon; most of them were manually lemmatised, introduced into the unification-based lexicon and later expanded to the full paradigm of each new lemma. The Romanian word-form lexicon actually consists of two parts: the main one contains only words attested by the Explanatory Dictionary of Romanian (DEX), while all the other words appearing in the corpus were entered in an auxiliary lexicon. The auxiliary lexicon contains, among other things, proper names, technical terms and the made-up words of the Newspeak dialect in Orwell's "1984".

The table below provides information on the data content of the main dictionary that is used for the corpus analysis. When interpreting the figures, one has to consider the treatment of syncretic forms as previously discussed. The "any" interpretation of the '-' attribute value was not considered in counting these figures, since the expansion of the token-indeterminable values is external to the lexicon.

Table 1. Lexicon overview
Entries	Wordforms	Lemmas	=	MSDs	AMB_POS	AMB_MSD
440363	347960	33259	35421	674	90	981

The first column (Entries) provides the number of dictionary entries, that is, triplets of the form:

<word-form lemma MSD>

The distribution of the entries over the parts of speech is shown in Table 1.1. As one would expect, given the large paradigms of verbs, the largest number of entries are verbal (192843). The percentage of noun and adjective word-forms is also very high.

Table 1.1. The distribution of the entries over the parts of speech
POS	N	V	A	P	D	M	R	T	S	C	Q	I	Y	X
Number of entries	124135	192843	119420	347	231	693	1357	46	176	81	10	196	777	51
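
As a quick consistency check, the per-POS counts of Table 1.1 can be summed and turned into shares of the whole lexicon. The Python sketch below is only illustrative: the names `entries_per_pos` and `shares` are ours, not part of the original tooling, and the figures are simply copied from the table.

```python
# Entry counts per part of speech, copied from Table 1.1.
entries_per_pos = {
    "N": 124135, "V": 192843, "A": 119420, "P": 347, "D": 231,
    "M": 693, "R": 1357, "T": 46, "S": 176, "C": 81,
    "Q": 10, "I": 196, "Y": 777, "X": 51,
}


def shares(counts):
    """Return each category's fraction of the total (hypothetical helper)."""
    total = sum(counts.values())
    return {pos: n / total for pos, n in counts.items()}


total_entries = sum(entries_per_pos.values())
entry_shares = shares(entries_per_pos)

# The total should agree with the Entries column of Table 1 (440363).
print(total_entries)
for pos in ("V", "N", "A"):
    print(pos, f"{entry_shares[pos]:.1%}")
```

With these figures the total matches the Entries column of Table 1, and nouns, verbs and adjectives together account for more than 99% of the entries.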

The second column (Wordforms) in Table 1 gives the number of distinct word-forms appearing in the lexicon, irrespective of their lemma and MSD. For instance, the following entries:

vin = Ncms-n
vin veni Vmip1s
vin veni Vmip3p
vin veni Vmsp1s

will contribute only one item (vin) to the Wordforms column. The distribution of the word-forms over the parts of speech is shown in Table 1.2.
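
The counting rule illustrated by the vin entries can be sketched in a few lines of Python. This is an illustration under our own assumptions, not the tool used to build the lexicon: entries are taken to be whitespace-separated triplets, `=` in the lemma field is read as "lemma identical to the word-form" (our interpretation of the first vin entry), and the function names are hypothetical.

```python
def parse_entry(line):
    """Split a '<word-form lemma MSD>' line into a triplet.

    Assumption: '=' in the lemma field means the lemma equals the word-form.
    """
    wordform, lemma, msd = line.split()
    if lemma == "=":
        lemma = wordform
    return wordform, lemma, msd


def count_entries_and_wordforms(lines):
    """Return (number of entries, number of distinct word-forms)."""
    entries = [parse_entry(line) for line in lines if line.strip()]
    wordforms = {wordform for wordform, _lemma, _msd in entries}
    return len(entries), len(wordforms)


sample = [
    "vin = Ncms-n",
    "vin veni Vmip1s",
    "vin veni Vmip3p",
    "vin veni Vmsp1s",
]

n_entries, n_wordforms = count_entries_and_wordforms(sample)
print(n_entries, n_wordforms)  # 4 1
```

The four triplets add four to the Entries count but only one to the Wordforms count, because the set keeps a single copy of vin.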

Table 1.2. The distribution of the word-forms over the parts of speech
POS	N	V	A	P	D	M	R	T	S	C	Q	I	Y	X
Number of wordforms	113285	142438	111274	320	224	615	1336	45	176	77	10	196	665	51

Summing up the numbers in Table 1.2 gives 370712, which is 22752 more than the figure shown in the Wordforms column of Table 1. The difference arises because identical word-forms, which were counted only once in the Wordforms column of Table 1, may differ by MSD, by lemma or by both; when these differences place a word-form under more than one part of speech, it is counted once under each of them in Table 1.2.
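
The arithmetic, and the reason for the surplus, can be illustrated as follows. The sketch assumes, as in the MSD notation used here, that the first character of an MSD is the part-of-speech code; the function and variable names are hypothetical, and the sample is limited to the vin entries quoted earlier.

```python
from collections import defaultdict

# Word-form counts per part of speech, copied from Table 1.2.
wordforms_per_pos = {
    "N": 113285, "V": 142438, "A": 111274, "P": 320, "D": 224,
    "M": 615, "R": 1336, "T": 45, "S": 176, "C": 77,
    "Q": 10, "I": 196, "Y": 665, "X": 51,
}
distinct_wordforms = 347960  # Wordforms column of Table 1

summed = sum(wordforms_per_pos.values())
surplus = summed - distinct_wordforms
print(summed, surplus)  # 370712 22752


def wordforms_by_pos(entries):
    """Group distinct word-forms by the POS letter of their MSD.

    Assumption: the first character of the MSD is the POS code.
    """
    by_pos = defaultdict(set)
    for wordform, _lemma, msd in entries:
        by_pos[msd[0]].add(wordform)
    return by_pos


# The 'vin' entries: one distinct word-form, but two parts of speech.
vin_entries = [
    ("vin", "vin", "Ncms-n"),
    ("vin", "veni", "Vmip1s"),
    ("vin", "veni", "Vmip3p"),
    ("vin", "veni", "Vmsp1s"),
]
by_pos = wordforms_by_pos(vin_entries)
per_pos_total = sum(len(forms) for forms in by_pos.values())
overall = len({wordform for wordform, _lemma, _msd in vin_entries})
print(per_pos_total, overall)  # 2 1
```

In this small sample vin is counted once as a noun and once as a verb, so the per-POS total exceeds the overall number of distinct word-forms by one; the same effect, accumulated over the whole lexicon, yields the surplus of 22752.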
