Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical Processing


The table below provides a comparative overview of the number of attributes that were considered in each of the MULTEXT-EAST languages for each category (a '-' in a cell signifies that the corresponding POS is not relevant for the considered language).

POS             Rom  Bul   Cz  Slo  Est  Hun
Noun              6    5    5    5    3    7
Verb              7    8   10    8    8    5
Adjective         7    3    7    5    3    8
Pronoun           8    8   12   10    4    7
Adverb            3    1    2    2    0    4
Adposition        4    1    3    3    1    1
Conjunction       5    2    3    2    1    3
Numeral           7    5    7    5    4    7
Interjection      0    1    0    0    0    1
Residual          0    0    0    0    0    0
Abbreviation      5    0    0    0    3    0
Particle          2    2    0    0    -    -
Determiner        8    -    -    -    -    -
Article           5    -    -    -    -    1

3. A statistical account of the lexicon

Once the harmonised set of morpho-syntactic specifications for the six MULTEXT-EAST languages was developed, lexicons incorporating these specifications were created for each language. The Romanian lexicon was created from a 35,000-lemma lexicon by means of our EGLU natural language processing platform [15]. Since several words in the corpus were not in the EGLU lexicon, most of them were manually lemmatised, introduced into the unification-based lexicon and later expanded to the full paradigms of every new lemma. The Romanian word-form lexicon is actually made of two parts: the main one contains only words attested by the Explanatory Dictionary of Romanian (DEX); all the other words appearing in the corpus were entered in an auxiliary lexicon. The auxiliary lexicon contains, among other things, proper names, technical terms and the weird (made-up) words from Orwell's "1984" (the Newspeak dialect).

The table below provides information on the data content of the main dictionary used for the corpus analysis. When interpreting the figures, one has to consider the treatment of syncretic forms as previously discussed. The 'any' interpretation of the '-' attribute value was not considered in counting these figures, since the expansion of the token-indeterminable values is external to the lexicon.

Table 1. Lexicon overview
Entries  Wordforms  Lemmas= MSDsAMB_POS  AMB_MSD
440363   347960     3325935421674        90981

The first column (Entries) provides the number of dictionary entries, that is, triplets of the form:

<word-form lemma MSD>

The distribution of the entries over the parts of speech is shown in Table 1.1. As one would expect, given the large paradigms of verbs, the largest number of entries are verbal (192843). The percentage of noun and adjective word-forms is also very high.

Table 1.1. The distribution of the entries over the parts of speech
POS                     N       V       A    P    D    M     R   T    S   C   Q    I    Y   X
Number of entries  124135  192843  119420  347  231  693  1357  46  176  81  10  196  777  51
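As a consistency check, the per-category counts of Table 1.1 can be added up and compared with the Entries figure of Table 1. The sketch below is only illustrative: the split of the figures into one value per category is our reading of the table, and the names `ENTRIES_BY_POS`, `TOTAL_ENTRIES` and `share` are hypothetical helpers, not part of any lexicon tool.

```python
# Per-category entry counts, as read from Table 1.1 (assumed split of the figures).
ENTRIES_BY_POS = {
    "N": 124135, "V": 192843, "A": 119420, "P": 347, "D": 231,
    "M": 693, "R": 1357, "T": 46, "S": 176, "C": 81,
    "Q": 10, "I": 196, "Y": 777, "X": 51,
}
TOTAL_ENTRIES = 440363  # Entries column of Table 1


def share(pos: str) -> float:
    """Fraction of all entries that belong to the given category."""
    return ENTRIES_BY_POS[pos] / sum(ENTRIES_BY_POS.values())


total = sum(ENTRIES_BY_POS.values())
print(total, total == TOTAL_ENTRIES)

# Relative weight of the three largest categories.
for pos in ("V", "N", "A"):
    print(pos, f"{share(pos):.1%}")
```

Under this reading the category counts add up exactly to the Entries figure, and verbs, nouns and adjectives together account for almost all of the entries.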

The second column (Wordforms) in Table 1 gives the number of distinct word-forms appearing in the lexicon, irrespective of their lemma and MSD. For instance, the following entries:

vin  =     Ncms-n
vin  veni  Vmip1s
vin  veni  Vmip3p
vin  veni  Vmsp1s

contribute only one item (vin) to the Wordforms column. The distribution of the word-forms over the parts of speech is shown in Table 1.2.

Table 1.2. The distribution of the word-forms over the parts of speech
POS                       N       V       A    P    D    M     R   T    S   C   Q    I    Y   X
Number of wordforms  113285  142438  111274  320  224  615  1336  45  176  77  10  196  665  51

Summing up the numbers in Table 1.2 gives 370712, which is 22752 more than the figure shown in the Wordforms column of Table 1. The difference arises because identical word-forms, which were counted only once in the Wordforms column of Table 1, may differ by MSD, by lemma or by both; a word-form that belongs to more than one part of speech is therefore counted once under each of them in Table 1.2.
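The counting rule can be illustrated on the four vin entries quoted earlier. The following is a minimal sketch, not the procedure actually used to build the tables. It assumes that "=" in the lemma slot means that the lemma is identical to the word-form, and that the first character of an MSD is the part-of-speech code; `ENTRIES` and `count_wordforms` are hypothetical names.

```python
from collections import defaultdict

# Toy lexicon: the four example entries, as (word-form, lemma, MSD) triplets.
# Assumption: "=" in the lemma slot means "lemma identical to the word-form".
ENTRIES = [
    ("vin", "=", "Ncms-n"),
    ("vin", "veni", "Vmip1s"),
    ("vin", "veni", "Vmip3p"),
    ("vin", "veni", "Vmsp1s"),
]


def count_wordforms(entries):
    """Return (distinct word-forms overall, distinct word-forms per POS)."""
    overall = set()
    per_pos = defaultdict(set)
    for wordform, _lemma, msd in entries:
        overall.add(wordform)
        per_pos[msd[0]].add(wordform)  # assumed: first MSD character = POS code
    return len(overall), {pos: len(forms) for pos, forms in per_pos.items()}


distinct, by_pos = count_wordforms(ENTRIES)
surplus = sum(by_pos.values()) - distinct  # extra counts caused by POS ambiguity

print(len(ENTRIES), distinct, by_pos, surplus)

# The same arithmetic on the published figures: per-POS sum minus surplus.
print(370712 - 22752)
```

In this toy lexicon the four entries yield a single distinct word-form, which is nevertheless counted once under N and once under V; the surplus of one plays the same role as the 22752 surplus between Table 1.2 and Table 1.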

