
Overview

The language industries rely increasingly on the availability of large-scale language resources, appropriate software tools, and standards that make them maximally reusable. However, while the development of resources, tools, and standards is well under way for EU languages, there have been very few comparable efforts for the languages of Central and Eastern Europe (CEE). MULTEXT-East is intended to fill this gap by developing resources in CEE languages and adapting existing tools and standards to them.

MULTEXT-East is a spin-off of the LRE project MULTEXT, one of the largest EU projects in the domain of language tools and resources. MULTEXT has three main objectives:

Standardization: development of a software standard based on a "software Lego" approach for corpus handling tools, together with TEI-based encoding conventions specifically suited to multilingual corpora and language engineering applications.
Tool and corpus development: development of an extensive set of tools for corpus annotation and exploitation as well as the first annotated large-scale multilingual corpus for EU languages, intended to serve as a reference and test-bed for multilingual tools and applications.
Industrial validation: integration of project results by six major European companies into high-level NLP applications such as term extraction and machine translation lexicon generation, thus providing a first indication of downstream applicability.

MULTEXT-East will extend the scope of MULTEXT by transferring its expertise, methodologies, and tools to CEE countries. Because projects funded under Copernicus will begin at approximately MULTEXT's mid-point, MULTEXT's tools and methods will be well enough developed to extend to additional languages. At the same time, the timing will enable MULTEXT to incorporate feedback from the application of its tools to vastly different language types (especially Slavic and Finno-Ugric) while those tools are still under development.

Together, MULTEXT and MULTEXT-East will create a unique network of more than twenty academic research centers and companies, all developing and using common lingware and methodologies, as well as producing the first annotated large-scale multilingual corpus for 12 EU and CEE languages. MULTEXT-East will include the following work packages:

Linguistic resource building, to provide lexicons and morphological rules for CEE languages;
Tool application on CEE language corpora, followed by adaptation of the tools and hand validation of the data.

All of the work within the project will be performed in conjunction with EAGLES and the TEI, and will thus provide an extension of their work on standardization to a new range of languages.


Tomaz Erjavec
Mon May 20 13:01:13 MDT 1996