4. A statistical account of the Romanian sub-corpus
of the multilingual corpus
From our collection of plain texts (about 15 million
words), spanning different genres and registers and more than
100 authors (poetry: Eminescu's complete poetic work; original
Romanian fiction: 14 novels; translations: Orwell's "1984"
and Plato's "Republic"; journalism; technical reports
in computer science and linguistics; informal notes; electronic
mail; etc.), we selected about 220,000 words for standardised
corpus encoding, with the purpose of supporting research in
statistical natural language processing (lexicographic statistics,
automatic morpho-syntactic tagging, stochastic parsing, parallel
text alignment, statistics-based machine translation).
The European projects MULTEXT (LRE) and EAGLES (in
particular, the EAGLES Text Representation subgroup), together
with the Vassar/CNRS collaboration (supported by the U.S. National
Science Foundation), have joined efforts to develop a Corpus Encoding
Specification (CES) optimally suited for use in language engineering,
which can serve as a widely accepted set of encoding standards
for corpus-based work. The overall goal is the identification
of a minimal encoding level that corpora must achieve to be considered
standardized in terms of descriptive representation (marking of
structural and linguistic information) as well as general architecture
(so as to be maximally suited for use in a text database). It
also provides encoding conventions for more extensive encoding
and for linguistic annotation. The CES is an application of SGML
(ISO 8879, Standard Generalized Markup Language), conformant to
the Text Encoding Initiative's Guidelines for Electronic Text
Encoding and Interchange (TEI) [6].
The TEI Guidelines are expressly
designed to be applicable across a broad range of applications
and disciplines; they therefore treat a vast array of textual
phenomena and aim for maximum generality and flexibility. Most
applications use only those parts of the TEI that are required
to meet their needs. The CES is one such application: it utilizes
the TEI modular DTD and the TEI customization mechanisms to select
those pieces of the TEI that are appropriate for corpus encoding.
The TEI is an ongoing project and is not yet complete in some areas;
as a result, there are some areas of importance for corpus encoding
that the TEI Guidelines do not cover. Developing the CES has
therefore involved not only selecting from, but in some cases also
extending, the TEI Guidelines to meet the specific needs of
corpus-based work in language engineering. All results and
specifications developed for the CES are fed back to the TEI as
input for further revisions of the Guidelines. The current version
of the CES is a first draft of the standard. It has not yet been
widely implemented, and the intention is to continue developing
the CES on the basis of input and feedback from users once it is
put to greater use. The specification will therefore continue to
evolve and should not be regarded as "final".
All current CES documents and DTDs will continue to be available
at the following site:
http://www.cs.vassar.edu/CES/. Anyone actively
implementing the standard should consult this site regularly.
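To give the flavour of such an encoding, the fragment below sketches
the general shape of a CES document: a header carrying bibliographic
and encoding information, followed by the text proper with paragraph
and sentence markup. The element names follow common TEI/CES practice,
but the fragment is an illustrative sketch rather than an excerpt
from the CES DTDs.

    <cesDoc version="1.0">
      <cesHeader>
        <!-- bibliographic description and encoding declarations -->
      </cesHeader>
      <text>
        <body>
          <p>
            <s>First sentence of a paragraph.</s>
            <s>Second sentence.</s>
          </p>
        </body>
      </text>
    </cesDoc>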
The statistics given in the previous chapter referred
to the dictionary. In running text, as one would expect, the
figures are different, reflecting the actual usage of the lexical
stock. The procedure for obtaining the numbers in this section
involved automatic segmentation of the text, MSD (morpho-syntactic
description) annotation of the segmented text, and manual
disambiguation of approximately 220,000 lexical tokens.
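To make the annotation and disambiguation steps concrete, the
following minimal sketch (in Python) shows their general shape: a
form lexicon assigns each token its ambiguity class of candidate
MSDs, and a selection function (standing in for the human annotator)
keeps exactly one. The word forms and MSD codes are illustrative
examples in the MULTEXT-East style, not excerpts from our actual
resources.

    # Minimal sketch of the annotate-then-disambiguate pipeline; the
    # lexicon entries and MSD codes are illustrative, not taken from
    # the actual Romanian resources.

    LEXICON = {
        "copiii": ["Ncmpry"],            # unambiguous noun form
        "vin":    ["Vmip3p", "Ncms-n"],  # 'they come' vs. 'wine'
        "de_la":  ["Sp"],                # merged compound preposition
    }

    def annotate(tokens):
        """Attach the full ambiguity class to each token."""
        return [(tok, LEXICON.get(tok, ["X"])) for tok in tokens]

    def disambiguate(annotated, choose):
        """Keep a single MSD per token; `choose` stands in for the
        human annotator who resolved the ambiguities manually."""
        return [(tok, choose(tok, msds)) for tok, msds in annotated]

    ambiguous = annotate(["copiii", "vin", "de_la"])
    resolved = disambiguate(ambiguous, lambda tok, msds: msds[0])
    # [('copiii', 'Ncmpry'), ('vin', 'Vmip3p'), ('de_la', 'Sp')]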
Please note that a token (a lexical
unit as identified by the segmenter) is not necessarily a word:
one orthographic word may be split into several tokens (the Romanian
"dă-mi-l" 'give it to me' is split into three tokens), or several
orthographic words may be combined into one token (the Romanian
words "de la" are combined into the single token "de_la").
Figure 1 shows a fragment of text from our corpus, Figure 2 shows
the segmented text, Figure 3 gives the ambiguously MSD-annotated
text, and Figure 4 presents the MSD-disambiguated text.
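The split-and-merge behaviour just described can be pictured in a
few lines of code. The sketch below is a toy illustration under
assumed resource lists (a compound table and a clitic separator);
it is not the MULTEXT tokenizer itself, and a real clitic resource
would also specify which hyphenated clusters may be split.

    # Toy illustration of splitting clitic clusters and merging
    # compound tokens; the resource lists are assumed examples only.

    COMPOUNDS = {("de", "la"): "de_la"}  # multiword units merged into one
    CLITIC_SEP = "-"                     # separator inside clitic clusters

    def tokenize(words):
        tokens, i = [], 0
        while i < len(words):
            # Merge orthographic word sequences listed as compounds.
            if i + 1 < len(words) and (words[i], words[i + 1]) in COMPOUNDS:
                tokens.append(COMPOUNDS[(words[i], words[i + 1])])
                i += 2
                continue
            # Split hyphenated clitic clusters into their components.
            if CLITIC_SEP in words[i]:
                tokens.extend(words[i].split(CLITIC_SEP))
            else:
                tokens.append(words[i])
            i += 1
        return tokens

    print(tokenize(["dă-mi-l", "de", "la", "el"]))
    # ['dă', 'mi', 'l', 'de_la', 'el']: one orthographic word becomes
    # three tokens, and two orthographic words become one token.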
The selected texts from the corpora were segmented
by means of a tokenizer, part of the tool-set implemented within
the MULTEXT project, used with the language resources we developed
within MULTEXT-EAST. The segmenter is a language-independent,
configurable processor that tokenizes an input text given
in one of three possible formats: plain text (as in Figure 1),
a normalized SGML form (nSGML) as output by another MULTEXT
tool (MTSgmlQl), or a tabular format (also specific to the MULTEXT
processing chain). The output of the segmenter is a tokenized
form of the input text, with paragraph and sentence boundaries
marked up. Punctuation, lexical items, numbers, and various
alphanumeric sequences (such as dates and times) are annotated
with tags drawn from a hierarchically structured tagset. The
language-specific behavior of the segmenter is driven by several
language resources (abbreviations, compounds, clitics, etc.). The
general behavior of the segmenter (valid across several languages)
can also be parametrised by means of external resources, such as
definitions of whitespace and punctuation, number orthography,
sentence delimiters, etc.
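As an illustration of how such external resources can drive
segmentation, the following sketch splits plain text into sentences
using an assumed abbreviation list and sentence-delimiter set, and
marks paragraph and sentence boundaries with generic <p>/<s> tags.
Both the resources and the output shape are invented for the example;
they do not reproduce the MULTEXT tool's actual configuration or
output format.

    # Illustrative resource-driven sentence segmenter; the resources
    # and the <p>/<s> output shape are assumed for this sketch only.

    ABBREVIATIONS = {"etc.", "dl.", "nr."}  # final '.' is not a full stop
    SENTENCE_DELIMITERS = {".", "!", "?"}   # configurable per language

    def segment(paragraph):
        sentences, current = [], []
        for word in paragraph.split():
            current.append(word)
            ends_sentence = (word[-1] in SENTENCE_DELIMITERS
                             and word.lower() not in ABBREVIATIONS)
            if ends_sentence:
                sentences.append(" ".join(current))
                current = []
        if current:  # flush a trailing, undelimited sentence
            sentences.append(" ".join(current))
        return sentences

    def mark_up(paragraph):
        marked = ["<p>"]
        for sent in segment(paragraph):
            marked.append("  <s>" + sent + "</s>")
        marked.append("</p>")
        return "\n".join(marked)

    print(mark_up("Dl. Popescu a sosit. El a plecat."))
    # <p>
    #   <s>Dl. Popescu a sosit.</s>
    #   <s>El a plecat.</s>
    # </p>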