Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical Processing


Within the MULTEXT-EAST Copernicus project (http://www.lpl.univ-aix.fr/MTE and http://nl.ijs.si/ME), a parallel corpus was developed consisting of six integral translations of George Orwell's "1984": besides the original English version, the corpus contains translations into Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene. The corpus is consistently SGML-encoded across the seven languages (CES-conformant [4] up to level 3), sentence-aligned and morpho-syntactically annotated. MULTEXT-EAST also produced a second multilingual corpus, balanced but not parallel, consisting of original novels and newspaper articles for each language.

A second parallel corpus was developed within the TELRI Copernicus Concerted Action (Trans European Language Resource Infrastructure - http://www.ids-mannheim.de/TELRI). It consists of integral translations of Plato's "Republic": besides the original Greek version, the sub-corpus contains translations into Bulgarian, Czech, English, German, Lithuanian, Polish, Romanian, Slovak and Slovene.

Both "1984" and "Republic" are available in many other languages, but at the time of writing those versions had not yet been brought into line, in terms of text encoding, with the conventions used for the languages mentioned above. The encoding of the Romanian version of "Republic" is described in [5].

There were several reasons for choosing "1984" and "Republic" as the source texts for the parallel corpora, but the most important were the availability of translations in all the languages of the partners involved and copyright permissions. The two corpora are complementary as far as linguistic registers and (human) translation techniques are concerned: whereas "1984" was translated in a more literary manner, in the spirit of the book, "Republic" was translated in a rather technical way, providing support for Sinclair's hypothesis on "translation protected elements".

In order to develop reusable resources, it was essential to establish standardized methods and specifications for these resources. For corpus encoding, MULTEXT-EAST and, to some extent, TELRI adopted the SGML-based CES (Corpus Encoding Specifications) schema [4], which is in turn based on the TEI recommendations [6, 7]. While the corpus-encoding schema posed no significant problems in accommodating the language-specific renderings of the parallel texts, harmonising the lexicon encoding was a serious task. Starting from the specifications developed in EAGLES [8] and their extension by the MULTEXT project [9] to six western European languages (English, French, Dutch, Italian, German, Spanish), the MULTEXT-EAST project further extended these specifications to cover new languages (Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene) with different characteristics. Because these languages exhibit many features and properties not found in western European languages, such as heavy inflection and agglutination, adapting specifications initially developed for western European languages posed many interesting and difficult problems and demanded substantial assessment and modification of the pre-existing specifications. The work carried out in MULTEXT-EAST has thus broadened the base and contributed significantly to defining a general mechanism for lexical specification (<URL: http://www.lpl.univ-aix.fr/pub/multext/docs/ME1.1.tex> and <URL: http://www.ijs.si/ME/docs/>). A significant benefit of conforming to the corpus and lexicon specifications originating in EAGLES and MULTEXT was that corpus-annotation tools initially developed within the MULTEXT project could be easily adapted or extended.

Since the MULTEXT-EAST and TELRI corpora are presented from a multilingual point of view elsewhere [10, 11, 12], the following sections concentrate on the Romanian component.

2. Word-form Dictionary and morpho-syntactic descriptions

For corpus morpho-lexical processing purposes (fast text lemmatisation, language model construction, automatic tagging, etc.), the MULTEXT-EAST consortium developed several language-specific word-form dictionaries covering at least the words appearing in the corpus. A dictionary entry has the following structure:

word-form <TAB> lemma <TAB> MSD <TAB> comments

where word-form represents an inflected form of the lemma, characterised by a combination of feature values encoded by the MSD (Morpho-Syntactic Description) code. The fourth column, comments, is optional and currently ignored; it may contain either comments or information processable by other tools. The morpho-syntactic descriptions are provided as strings, using a linear encoding. In this notation, each position in the string of characters corresponds to an attribute, and the specific character in that position indicates the value of the corresponding attribute.
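The entry format and the positional reading of an MSD string can be illustrated with a minimal Python sketch. It assumes tab-separated fields exactly as in the structure shown above; the names `Entry`, `parse_entry` and `decode_msd`, the attribute list and the sample entry are all hypothetical and invented for illustration, not taken from the MULTEXT-EAST specifications or dictionaries.

```python
from typing import Dict, List, NamedTuple, Optional


class Entry(NamedTuple):
    """One dictionary line: word-form, lemma, MSD, optional comments."""
    word_form: str
    lemma: str
    msd: str
    comments: Optional[str]


def parse_entry(line: str) -> Entry:
    """Split a tab-separated dictionary line into its columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least word-form, lemma and MSD")
    # The fourth column is optional; an absent or empty one becomes None.
    comments = fields[3] if len(fields) > 3 and fields[3] else None
    return Entry(fields[0], fields[1], fields[2], comments)


def decode_msd(msd: str, attributes: List[str]) -> Dict[str, str]:
    """Map each character of the MSD to the attribute at the same position."""
    if len(msd) > len(attributes):
        raise ValueError("MSD has more positions than known attributes")
    return {name: value for name, value in zip(attributes, msd)}


# Hypothetical attribute order and sample entry, for illustration only.
ATTRIBUTES = ["category", "type", "gender", "number"]
entry = parse_entry("houses\thouse\tNcnp")
features = decode_msd(entry.msd, ATTRIBUTES)
```

In this sketch the attribute order is passed in explicitly, because which attribute each position stands for is defined by the lexical specifications rather than by the string itself.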

