The MULTEXT-East Resources Revisited

Tomaz Erjavec
Institute Jozef Stefan, Ljubljana, Slovenia

Published in ElsNews 10.1, Spring 2001

The MULTEXT-East project (Multilingual Text Tools and Corpora for Eastern and Central European Languages) was a spin-off of the EU MULTEXT project and was financed under the INCO-Copernicus programme. The project ran from 1995 to 1997 and developed language resources for six languages: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene, as well as for English, the 'hub' language of the project. The main results of the project were morpholexical resources and an annotated multilingual corpus for the seven languages. The centrepiece of the corpus is the novel "1984" in the English original and its translations; the novel is sentence-aligned and its words are annotated with context-disambiguated lemmas and morphosyntactic descriptions.

This makes the corpus a unique dataset for studying word-class syntactic tagging, bilingual lexicon extraction and other issues relevant to language engineering applications for a number of Eastern and Central European languages. With free word order and rich inflection or agglutination, these languages present linguistic problems significantly different from those of the languages of Western Europe.

One of the objectives of MULTEXT-East has been to make its resources freely available for research purposes. In the scope of the TELRI concerted action (Trans-European Language Resources Infrastructure), the results of MULTEXT-East were released in 1998 on the second volume of a double CD-ROM and, recently, made available via the TRACTOR (TELRI Research Archive of Computational Tools and Resources) Web site. In the years since the CD-ROM release, the MULTEXT-East resources have served as models for reference corpora and have been applied to new languages. They have been used in a number of experiments, e.g. in evaluating part-of-speech tagger performance, developing new taggers and lemmatisers, automatically extracting bi- and multilingual lexicons, and studying multilingual sense disambiguation.

For most of the languages in question, the original MULTEXT-East annotation work was a pioneering effort, so it is hardly surprising that a number of errors and inconsistencies in the data and specifications were discovered during use. These errors were subsequently corrected, but because the work was done at different sites and in different ways, the corpus encodings began to drift apart.

The EU project Concede (Consortium for Central European Dictionary Encoding), which ran from 1998 to 2000 and comprised most of the same partners as MULTEXT-East, offered the opportunity to put the versions back on a common footing. Although Concede was primarily devoted to machine-readable dictionaries and lexical databases, one of its workpackages did consider the integration of the dictionary data with the MULTEXT-East corpus. In the scope of this workpackage, the corrected "1984" corpus was normalised and the primary data re-encoded according to the TEI (Text Encoding Initiative) guidelines and, largely, to the XML recommendation.

This "Concede" version of the resources has recently been pre-released. Version 2 of the MULTEXT-East resources contains: the revised and expanded MULTEXT- and EAGLES-based morphosyntactic specifications, in print form and as (over 5,000) TEI feature structures; the morphosyntactic lexica, totalling at least 15,000 lemmas per language; and the corrected and TEI-encoded "1984" annotated corpus, with about 100,000 words per language. The corpus includes 2-way and 7-way sentence alignments in CES (Corpus Encoding Standard).
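To give a rough idea of how word-level annotation of this kind might be read by a program, here is a minimal Python sketch. It is not taken from the release: the element and attribute names (s, w, c, lemma, ana), the sentence identifier and the morphosyntactic description strings are all illustrative assumptions, and the actual encoding should be checked against the distributed documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. The element and attribute names (s, w, c, lemma, ana),
# the id value and the MSD strings are assumptions made for illustration;
# they are not copied from the MULTEXT-East release.
SAMPLE = """
<s id="s1">
  <w lemma="it" ana="Pp3ns">It</w>
  <w lemma="be" ana="Vmis3s">was</w>
  <w lemma="a" ana="Di">a</w>
  <w lemma="day" ana="Ncns">day</w>
  <c>.</c>
</s>
"""


def read_tokens(xml_text):
    """Return (form, lemma, msd) triples for every <w> element, in document order."""
    sentence = ET.fromstring(xml_text.strip())
    return [(w.text, w.get("lemma"), w.get("ana")) for w in sentence.iter("w")]


tokens = read_tokens(SAMPLE)
for form, lemma, msd in tokens:
    # One token per line: word form, lemma, morphosyntactic description
    print(f"{form}\t{lemma}\t{msd}")
```

Under these assumptions, each word yields a triple of word form, lemma and morphosyntactic description; punctuation (the assumed c element) is skipped.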

In the same spirit as version 1, the second release is also being made available to the research community free of charge. The resources will be incorporated in the TRACTOR archive and also mounted on the MULTEXT-East Web site, where interested parties will be able to download them after filling out a Web-based licensing agreement for non-commercial use. Commercial exploitation is more complex, not least because the resource owners span seven countries. However, we hope to reach an agreement with ELRA, which was set up especially to make such dissemination possible.

The Concede team posing above the Danube. Photo taken during a project meeting held in April 1999.

