Table 2 shows the global figures
for the sub-corpora processed as described above. Since in the
first phase of this research we were concerned with parallel corpora,
the selected texts were Orwell's "1984" and Plato's
"Republic". These two integral texts belong to two different
registers (fiction and philosophy) and, as will be shown, have
different distributional properties. The Occurrences column
contains the number of word-form occurrences in each of the two
books, duplicates included. The Wordforms column
shows the number of distinct word-forms in each book, and the
Lemmas column the number of distinct lemmas. The
MSDs column shows the number of distinct MSDs used
in each of the books. The AMB_MSD and AMB_POS columns
give, respectively, the numbers of MSD-ambiguity clusters and
POS-ambiguity clusters found in the corpora. The Common
row displays, for each column, what the two texts have in
common. As one can see from Table 2, although "Republic"
contains more occurrences than "1984", its number of
distinct word-forms and lemmas is considerably smaller. This
is not surprising given the registers of the two texts:
in a philosophical text, most words are used in a rather technical
way, with as little stylistic variation as possible. Although
the two texts together contain more than 200,000 word occurrences, the
number of distinct word-forms is just 20,371 (less than 10%), with only
4,016 word-forms occurring in both texts. Therefore, from the
point of view of lexical coverage, the two texts are far from
sufficient for drawing significant conclusions about word-usage
frequencies in Romanian. However, for the distributional analysis
and morpho-syntactic disambiguation purposes, the selected texts
offer enough evidence to extract reliable data and draw realistic
conclusions. Out of the 674 MSDs defined in the lexicon, these
texts contain 500 (more than 74%), which is a very good figure
considering that the distinct word-forms occurring
in the selected texts represent approximately 5% of the lexical
stock. That is to say, most of the words that raise problems
(from the ambiguity point of view) appear in at least one of
the two texts. This is further supported by the last column
of Table 2: "1984" contains
all but 4 of the POS-ambiguity clusters, while "Republic" misses
only 10. The MSD-ambiguity clusters are also very well represented
in the selected texts (more than 63% of all MSD-ambiguity clusters
defined in the lexicon).
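As a minimal sketch of how the Occurrences and Wordforms figures, and the
Common entry for word-forms, could be obtained, assume each text is already
tokenized into a list of word-forms. The toy token lists and the function
name `overview` are hypothetical and are not taken from the tools used for
this corpus; the definition of common occurrences is not spelled out in the
text, so it is left out.

```python
from collections import Counter

def overview(tokens_a, tokens_b):
    """Return occurrence and distinct word-form counts for two token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Word-forms attested in both texts (set intersection of the keys).
    common = counts_a.keys() & counts_b.keys()
    return {
        "occurrences": (len(tokens_a), len(tokens_b)),   # duplicates kept
        "wordforms": (len(counts_a), len(counts_b)),     # duplicates removed
        "common_wordforms": len(common),
    }

# Toy data (hypothetical), standing in for the two tokenized books.
text_a = ["the", "war", "is", "peace", "the", "party"]
text_b = ["the", "just", "city", "is", "the", "city"]
stats = overview(text_a, text_b)
print(stats)
```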
Table 2. Corpora overview
Sub-Corpus | Occurrences | Wordforms | Lemmas | MSDs | AMB_MSD | AMB_POS |
'1984' | 101460 | 14037 | 7019 | 444 | 529 | 86 |
'Republic' | 114720 | 10350 | 4697 | 400 | 498 | 80 |
Common | 56804 | 4016 | 2527 | 344 | 402 | 78 |
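The derived figures quoted in the text (20371 distinct word-forms, under 10%
of all occurrences, and 500 of 674 MSDs) can be rechecked from Table 2 by
inclusion-exclusion. The sketch below assumes that the Common row counts the
items shared by both texts; the variable and function names are illustrative
only.

```python
# Figures copied from Table 2.
occurrences = {"1984": 101460, "Republic": 114720}
wordforms = {"1984": 14037, "Republic": 10350, "Common": 4016}
msds = {"1984": 444, "Republic": 400, "Common": 344}
LEXICON_MSDS = 674  # number of MSDs defined in the lexicon, per the text

def union_size(a, b, common):
    """Inclusion-exclusion: items found in either text, counted once."""
    return a + b - common

total_occurrences = sum(occurrences.values())
distinct_wordforms = union_size(
    wordforms["1984"], wordforms["Republic"], wordforms["Common"])
distinct_msds = union_size(msds["1984"], msds["Republic"], msds["Common"])

wordform_ratio = distinct_wordforms / total_occurrences
msd_coverage = distinct_msds / LEXICON_MSDS

print(f"{total_occurrences} occurrences, "
      f"{distinct_wordforms} distinct word-forms ({wordform_ratio:.1%})")
print(f"{distinct_msds} of {LEXICON_MSDS} MSDs covered ({msd_coverage:.1%})")
```

Under that assumption the union of word-forms is 14037 + 10350 - 4016 = 20371
and the union of MSDs is 444 + 400 - 344 = 500, which agrees with the figures
given in the text.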
Table 2.1. The distribution of the word-form occurrences over the parts of speech in "1984" and "Republic"
POS | N | V | A | P | D | M | R | T | S | C | Q | I | Y | X |
'1984' occurrences | 22570 | 19031 | 6202 | 10967 | 2376 | 1228 | 10633 | 4236 | 12621 | 6389 | 4967 | 99 | 72 | 69 |
'Republic' occurrences | 19936 | 24890 | 6302 | 13577 | 3069 | 633 | 12965 | 4079 | 12141 | 10506 | 6383 | 199 | 37 | 3 |
Common occurrences | 5345 | 7768 | 2045 | 8294 | 1506 | 416 | 6556 | 2730 | 10542 | 5797 | 4482 | 36 | 0 | 1 |
The distribution of the word-form occurrences over the parts of
speech shows some differences that are motivated by the different
registers. As one would expect, the register difference shows up
in the counts for common content words (N, V, A, R), which are
comparatively low. For the function words (S,
C, Q, T, P) the number of common occurrences is very high. This
observation is better supported by comparing the word-form counts,
that is, without counting duplicates (Table 2.2). The abbreviations and
residuals also differ considerably between the two texts. This is
easily understood if one takes into account that abbreviations
in most cases stand for content words. The residual class is poorly
represented in Plato's "Republic", but it has several
instances in Orwell's "1984" (mainly where the Newspeak
jargon is in use).
Table 2.2. The distribution of the word-forms over the parts of speech
POS | N | V | A | P | D | M | R | T | S | C | Q | I | Y | X |
'1984' wordforms | 6177 | 4518 | 2473 | 189 | 137 | 114 | 683 | 29 | 127 | 62 | 9 | 34 | 16 | 42 |
'Republic' wordforms | 4537 | 3603 | 1682 | 179 | 128 | 111 | 424 | 22 | 101 | 49 | 7 | 12 | 4 | 1 |
Common wordforms | 1467 | 1344 | 607 | 136 | 105 | 48 | 277 | 21 | 74 | 40 | 7 | 5 | 0 | 1 |
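One way to quantify the contrast between content and function words in
Table 2.2 is a per-class overlap ratio: shared word-forms divided by the
word-forms found in either text. This Jaccard-style measure is an
illustrative choice and is not used in the original analysis; the sketch
assumes the Common row is the intersection of the two word-form sets.

```python
# Figures copied from Table 2.2, in the column order of the table.
POS = ["N", "V", "A", "P", "D", "M", "R", "T", "S", "C", "Q", "I", "Y", "X"]
wf_1984 = [6177, 4518, 2473, 189, 137, 114, 683, 29, 127, 62, 9, 34, 16, 42]
wf_republic = [4537, 3603, 1682, 179, 128, 111, 424, 22, 101, 49, 7, 12, 4, 1]
wf_common = [1467, 1344, 607, 136, 105, 48, 277, 21, 74, 40, 7, 5, 0, 1]

def jaccard(a, b, common):
    """Shared items divided by items found in either set."""
    union = a + b - common
    return common / union if union else 0.0

overlap = {
    pos: jaccard(a, b, c)
    for pos, a, b, c in zip(POS, wf_1984, wf_republic, wf_common)
}

CONTENT = ["N", "V", "A", "R"]          # content-word classes named in the text
FUNCTIONAL = ["S", "C", "Q", "T", "P"]  # function-word classes named in the text
for label, group in (("content", CONTENT), ("functional", FUNCTIONAL)):
    print(label, {pos: round(overlap[pos], 2) for pos in group})
```

With the figures above, every function-word class has a higher overlap than
every content-word class, which is consistent with the observation that the
register difference is carried mainly by the content words.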
The figures in the previous tables make the discursive nature
of Plato's book quite obvious when contrasted with "1984":
the smaller number of nouns is compensated by an increased number
of pronouns, and the large number of verb occurrences ensures
a lively argumentative rhythm.
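Since the two books differ in length, the comparison is easier to read as
relative frequencies. The sketch below divides the noun, pronoun and verb
occurrence counts from Table 2.1 by the total occurrences of each text from
Table 2; the helper name `shares` is illustrative only.

```python
# Occurrence counts copied from Table 2.1 (N = noun, V = verb, P = pronoun).
occ_1984 = {"N": 22570, "V": 19031, "P": 10967}
occ_republic = {"N": 19936, "V": 24890, "P": 13577}
# Total occurrences per text, copied from Table 2.
TOTAL_1984 = 101460
TOTAL_REPUBLIC = 114720

def shares(counts, total):
    """Relative frequency of each part of speech within one text."""
    return {pos: n / total for pos, n in counts.items()}

share_1984 = shares(occ_1984, TOTAL_1984)
share_republic = shares(occ_republic, TOTAL_REPUBLIC)

for pos in ("N", "P", "V"):
    print(f"{pos}: '1984' {share_1984[pos]:.1%}  "
          f"'Republic' {share_republic[pos]:.1%}")
```

On these figures, nouns account for roughly 22% of the occurrences in "1984"
against roughly 17% in "Republic", while the shares of pronouns and verbs
are both higher in "Republic".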