Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical Processing
The table below provides a comparative overview of the number of attributes that were considered in each of the MULTEXT-EAST languages for each category (a '-' in a cell signifies that the corresponding POS is not relevant for the considered language).
POS | Rom | Bul | Cz | Slo | Est | Hun |
Noun | 6 | 5 | 5 | 5 | 3 | 7 |
Verb | 7 | 8 | 10 | 8 | 8 | 5 |
Adjective | 7 | 3 | 7 | 5 | 3 | 8 |
Pronoun | 8 | 8 | 12 | 10 | 4 | 7 |
Adverb | 3 | 1 | 2 | 2 | 0 | 4 |
Adposition | 4 | 1 | 3 | 3 | 1 | 1 |
Conjunction | 5 | 2 | 3 | 2 | 1 | 3 |
Numeral | 7 | 5 | 7 | 5 | 4 | 7 |
Interjection | 0 | 1 | 0 | 0 | 0 | 1 |
Residual | 0 | 0 | 0 | 0 | 0 | 0 |
Abbreviation | 5 | 0 | 0 | 0 | 3 | 0 |
Particle | 2 | 2 | 0 | 0 | - | - |
Determiner | 8 | - | - | - | - | - |
Article | 5 | - | - | - | - | 1 |
Table 1 provides information on the data content of the main dictionary used for the corpus analysis. When interpreting the figures, one has to consider the treatment of syncretic forms discussed previously. The possible interpretations of the '-' attribute value were not considered in counting these figures, since the expansion of token-indeterminable values is external to the lexicon.
The first column (Entries) gives the number of dictionary entries, that is, triplets of word-form, lemma and MSD. The distribution of the entries over the parts of speech is shown in Table 1.1. As one would expect, given the large paradigms of verbs, the largest number of entries is verbal (192843). The share of noun and adjective word-forms is also very high.
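To put the figures for the three largest classes in proportion, the following minimal sketch computes their shares of the total number of entries. It assumes the counts for N, V and A are the first three values of Table 1.1 in reading order and that the total is the Entries figure of Table 1; the names `ENTRIES_BY_POS` and `share` are illustrative only.

```python
# Entry counts for the three largest classes, as read from Table 1.1
# (assumed mapping: values taken in reading order N, V, A).
ENTRIES_TOTAL = 440363  # Entries column of Table 1
ENTRIES_BY_POS = {"N": 124135, "V": 192843, "A": 119420}


def share(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


shares = {pos: share(n, ENTRIES_TOTAL) for pos, n in ENTRIES_BY_POS.items()}
for pos, pct in shares.items():
    print(f"{pos}: {pct}%")
```

Under these assumptions, verbs account for roughly 44% of the entries, nouns for about 28% and adjectives for about 27%.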
The second column (Wordforms) of Table 1 gives the number of distinct word-forms appearing in the lexicon, irrespective of their lemma and MSD. For instance, the four example entries for vin listed below, alongside the tables, contribute only one item (vin) to the Wordforms column. The distribution of word-forms over the parts of speech is shown in Table 1.2.
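The difference between the Entries and Wordforms columns can be sketched as follows. This is a minimal illustration, assuming the lexicon is held in memory as (word-form, lemma, MSD) triplets and that '=' stands for a lemma identical to the word-form, as in the vin example; the `Entry` type and the function name are hypothetical and not part of the authors' tooling.

```python
from collections import namedtuple

# Hypothetical representation of one dictionary entry.
Entry = namedtuple("Entry", ["wordform", "lemma", "msd"])

# The four example entries for the word-form "vin".
ENTRIES = [
    Entry("vin", "=", "Ncms-n"),
    Entry("vin", "veni", "Vmip1s"),
    Entry("vin", "veni", "Vmip3p"),
    Entry("vin", "veni", "Vmsp1s"),
]


def count_entries_and_wordforms(entries):
    """Return (number of distinct triplets, number of distinct word-forms)."""
    n_entries = len(set(entries))
    n_wordforms = len({e.wordform for e in entries})
    return n_entries, n_wordforms


n_entries, n_wordforms = count_entries_and_wordforms(ENTRIES)
print(n_entries, n_wordforms)
```

The four triplets count as four entries but as a single word-form, since the lemma and MSD are ignored when word-forms are counted.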
Summing the numbers in Table 1.2 gives 370712, which is 22752 more than the figure shown in the Wordforms column of Table 1 (347960). The difference arises because identical word-forms, which are counted only once in the Wordforms column of Table 1, may differ by MSD, by lemma or by both; counting them with these differences taken into account yields the larger figure.
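One reading of this explanation is that a word-form occurring under several parts of speech is counted once per part of speech in Table 1.2. The sketch below illustrates this on the vin entries and then checks the arithmetic against the published figures. It assumes that the first character of an MSD encodes the part of speech; the function name and the in-memory representation are illustrative only.

```python
from collections import defaultdict

# Hypothetical in-memory lexicon: (word-form, lemma, MSD) triplets for "vin".
ENTRIES = [
    ("vin", "=", "Ncms-n"),
    ("vin", "veni", "Vmip1s"),
    ("vin", "veni", "Vmip3p"),
    ("vin", "veni", "Vmsp1s"),
]


def wordforms_per_pos(entries):
    """Count distinct word-forms separately for each POS (first MSD character)."""
    forms = defaultdict(set)
    for wordform, _lemma, msd in entries:
        forms[msd[0]].add(wordform)
    return {pos: len(wfs) for pos, wfs in forms.items()}


per_pos = wordforms_per_pos(ENTRIES)
distinct = len({wordform for wordform, _lemma, _msd in ENTRIES})
# "vin" is counted once as a noun and once as a verb, but only once overall.
print(per_pos, sum(per_pos.values()), distinct)

# Per-POS word-form counts from Table 1.2 and the Wordforms figure from Table 1.
TABLE_1_2_COUNTS = [113285, 142438, 111274, 320, 224, 615, 1336, 45,
                    176, 77, 10, 196, 665, 51]
TABLE_1_WORDFORMS = 347960
surplus = sum(TABLE_1_2_COUNTS) - TABLE_1_WORDFORMS
print(sum(TABLE_1_2_COUNTS), surplus)
```

The per-POS counts for vin sum to two although there is only one distinct word-form, which is the kind of double counting that adds up to the surplus of 22752 over the whole lexicon.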
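Table 1. Data content of the main dictionary
Entries | Wordforms | Lemmas | = | MSDs | AMB_POS | AMB_MSD |
440363 | 347960 | 33259 | 35421 | 674 | 90 | 981 |

Table 1.1. Distribution of the entries over the parts of speech
POS | N | V | A | P | D | M | R | T | S | C | Q | I | Y | X |
Number of entries | 124135 | 192843 | 119420 | 347 | 231 | 693 | 1357 | 46 | 176 | 81 | 10 | 196 | 777 | 51 |

Example entries for the word-form vin (word-form, lemma, MSD):
vin = Ncms-n
vin veni Vmip1s
vin veni Vmip3p
vin veni Vmsp1s

Table 1.2. Distribution of the word-forms over the parts of speech
POS | N | V | A | P | D | M | R | T | S | C | Q | I | Y | X |
Number of wordforms | 113285 | 142438 | 111274 | 320 | 224 | 615 | 1336 | 45 | 176 | 77 | 10 | 196 | 665 | 51 |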