

4. A statistical account of the Romanian sub-corpus of the multilingual corpus

From our collection of plain texts (about 15 million words), covering different genres and registers and more than 100 authors (poetry: Eminescu's complete poetic work; original Romanian fiction: 14 novels; translations: Orwell's "1984" and Plato's "Republic"; journalism; technical reports in computer science and linguistics; informal notes; electronic mail; etc.), we selected about 220,000 words for standardised corpus encoding, with the purpose of supporting research in statistical natural language processing (lexicographic statistics, automatic morpho-syntactic tagging, stochastic parsing, parallel text alignment, statistics-based machine translation).

The European projects MULTEXT (LRE) and EAGLES (in particular, the EAGLES Text Representation subgroup), together with the Vassar/CNRS collaboration (supported by the U.S. National Science Foundation), have joined efforts to develop a Corpus Encoding Specification (CES) optimally suited for use in language engineering, one that can serve as a widely accepted set of encoding standards for corpus-based work. The overall goal is to identify a minimal encoding level that corpora must achieve to be considered standardized, in terms both of descriptive representation (marking of structural and linguistic information) and of general architecture (so as to be maximally suited for use in a text database). The CES also provides encoding conventions for more extensive encoding and for linguistic annotation.

The CES is an application of SGML (ISO 8879, Standard Generalized Markup Language), conformant to the Guidelines for Electronic Text Encoding and Interchange of the Text Encoding Initiative (TEI) [6]. The TEI Guidelines are expressly designed to be applicable across a broad range of applications and disciplines; they therefore treat a vast array of textual phenomena and aim at maximum generality and flexibility. Most applications use only those parts of the TEI required to meet their needs. The CES is one such application: it uses the TEI modular DTD and the TEI customization mechanisms to select those pieces of the TEI that are appropriate for corpus encoding. The TEI is an ongoing project and is not yet complete in all areas; as a result, some areas of importance for corpus encoding are not covered by the TEI Guidelines. Developing the CES has therefore involved not only selecting from, but in some cases also extending, the TEI Guidelines to meet the specific needs of corpus-based work in language engineering. All results and specifications developed for the CES are fed back to the TEI as input for further revisions of the Guidelines.

The current version of the CES is a first draft of the standard. It has not been widely implemented, and the intention is to continue developing the CES on the basis of input and feedback from users as it is put to greater use. The document will therefore continue to evolve and should not be regarded as "final". All current CES documents and DTDs will continue to be available at http://www.cs.vassar.edu/CES/; anyone actively implementing the standard should consult this site regularly.

The statistics given in the previous chapter referred to the dictionary. In running text, as one would expect, the figures are different, reflecting the actual usage of the lexical stock. The procedure for obtaining the numbers in this section involved automatic segmentation of the text, MSD annotation of the segmented text, and manual disambiguation of approximately 220,000 lexical tokens.
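To make the procedure concrete, the following minimal Python sketch assumes a hypothetical record format in which each token carries the list of candidate MSDs proposed by the ambiguous annotation step and the single MSD retained after manual disambiguation; the MSD strings are made-up placeholders, not our actual tagset entries. Such records directly support usage statistics over the disambiguated tokens, for instance the mean number of candidate MSDs per running-text token.

    from dataclasses import dataclass

    @dataclass
    class Token:
        form: str              # token as produced by the segmenter
        candidates: list[str]  # ambiguous MSD annotations
        msd: str = ""          # MSD retained after manual disambiguation

    def average_ambiguity(tokens: list[Token]) -> float:
        """Mean number of candidate MSDs per token in running text."""
        return sum(len(t.candidates) for t in tokens) / len(tokens)

    # Illustrative values only; the MSD codes are placeholders.
    corpus = [
        Token("copil", ["Ncms-n"], "Ncms-n"),
        Token("a", ["Qz", "Tf", "Va--3s"], "Va--3s"),
    ]
    print(average_ambiguity(corpus))  # -> 2.0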

Please note that a token (a lexical unit as identified by the segmenter) is not necessarily an orthographic word: one orthographic word may be split into several tokens (the Romanian "dă-mi-l", "give it to me", is split into three tokens), or several orthographic words may be combined into one token (the Romanian words "de la" are combined into the single token "de_la"). Figure 1 shows a fragment of text from our corpus, Figure 2 shows the segmented text, Figure 3 gives the ambiguously MSD-annotated text, and Figure 4 presents the MSD-disambiguated text.
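The splitting and joining behavior can be illustrated with a small sketch; the clitic and compound tables below are toy stand-ins for the language resources the real segmenter consults, not its actual rules or data.

    # A minimal sketch of the split/merge idea, not the MULTEXT segmenter.
    CLITIC_SPLITS = {"dă-mi-l": ["dă", "-mi", "-l"]}  # one word -> three tokens
    COMPOUNDS = {("de", "la"): "de_la"}               # two words -> one token

    def tokenize(words: list[str]) -> list[str]:
        tokens: list[str] = []
        i = 0
        while i < len(words):
            # join a multi-word compound into a single token
            if i + 1 < len(words) and (words[i], words[i + 1]) in COMPOUNDS:
                tokens.append(COMPOUNDS[(words[i], words[i + 1])])
                i += 2
            # split a clitic cluster into several tokens
            elif words[i] in CLITIC_SPLITS:
                tokens.extend(CLITIC_SPLITS[words[i]])
                i += 1
            else:
                tokens.append(words[i])
                i += 1
        return tokens

    print(tokenize(["dă-mi-l", "de", "la", "el"]))
    # -> ['dă', '-mi', '-l', 'de_la', 'el']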

The selected texts from the corpora were segmented by means of a tokenizer that is part of the tool-set implemented within the MULTEXT project, using the language resources we developed within MULTEXT-EAST. The segmenter is a language-independent, configurable processor that tokenizes an input text given in one of three possible formats: plain text (as in Figure 1), a normalized SGML form (nSGML) as output by another MULTEXT tool (MTSgmlQl), or a tabular format (also specific to the MULTEXT processing chain). The output of the segmenter is a tokenized form of the input text, with paragraph and sentence boundaries marked up. Punctuation, lexical items, numbers, and several alphanumeric sequences (such as dates and times) are annotated with tags drawn from a hierarchically structured tagset. The language-specific behavior of the segmenter is driven by several language resources (abbreviations, compounds, clitics, etc.). Its general behavior (valid across several languages) can also be parametrised by means of external resources, such as definitions of space and punctuation characters, number orthography, sentence delimiters, etc.
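As an illustration of this resource-driven design, the sketch below shows a sentence splitter whose behavior is parametrised by an abbreviation list and a set of sentence delimiters; the interface and resource contents are assumptions for the example, not the MULTEXT tool's API.

    import re

    class Segmenter:
        def __init__(self, abbreviations: set[str], sentence_delims: str = ".!?"):
            self.abbreviations = abbreviations  # e.g. loaded from a resource file
            self.delims = sentence_delims       # configurable per language

        def sentences(self, text: str) -> list[str]:
            out, buf = [], []
            for piece in re.findall(r"\S+", text):
                buf.append(piece)
                # a delimiter ends a sentence unless the word is a known abbreviation
                if piece[-1] in self.delims and piece not in self.abbreviations:
                    out.append(" ".join(buf))
                    buf = []
            if buf:
                out.append(" ".join(buf))
            return out

    seg = Segmenter(abbreviations={"etc.", "Dl."})
    print(seg.sentences("Dl. Popescu a venit. El a plecat."))
    # -> ['Dl. Popescu a venit.', 'El a plecat.']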


