The language industries rely increasingly on the availability of large-scale language resources, appropriate software tools, and standards to make them maximally reusable. However, we are still quite some way from ready availability and complete reusability of corpora and tools. A number of tools for corpus annotation and exploitation have been developed in industry and private research centers, but very few are generally available and/or reusable (or, in particular, applicable to a wide range of languages). Existing methods, such as stochastic methods for corpus annotation, are well-developed for English only, although adaptation to other EU languages is under way.
As for corpora, there exist several international data collection initiatives amassing and distributing large amounts of mainly English-language data (e.g., the ACL Data Collection Initiative and the U.S. Linguistic Data Consortium). In Europe, data collection initiatives such as the European Corpus Initiative (ECI) and the Multi-Lingual Corpus Collection (MLCC) project have improved the widespread and low-cost availability of corpora comprising texts in a range of EU languages as well as parallel translations. In general, all these initiatives collect whatever data they can acquire, without serious concern for balance or representativeness, and provide only minimal cleanup of the original data.
Beyond providing tools and data, it is essential to develop standards in order to maximize their usability and reusability. Efforts to develop encoding standards are under way: the Text Encoding Initiative (TEI), whose Guidelines for the Encoding and Interchange of Machine-readable Texts were published in May 1994, provides a comprehensive and general set of encoding solutions for texts across a wide range of domains and applications that serve as a basis for such standards. However, the TEI standard needs to be adapted and extended to serve the specific needs of language engineering applications. Efforts to provide standards for software are less well-developed. As a result, there is currently a serious lack of generally usable, widely available tools for manipulating and analyzing text corpora in research, especially for multilingual applications.
A number of recent efforts have been initiated to develop reliable, reusable resources and tools for EU languages, and significant headway is being made to enable their widespread availability. However, there have been no comparable efforts for the languages of Central and Eastern Europe (CEE). No large-scale, systematic attempts at corpus collection currently exist (in particular for multilingual, parallel corpora in these languages), and tools specifically adapted to corpora in CEE languages are not freely available.
MULTEXT-East is intended to fill this gap by developing significant resources in CEE languages and adapting existing tools and standards to them. MULTEXT-East is a spin-off of the LRE project MULTEXT, one of the largest EU projects in the domain of language tools and resources. MULTEXT consists of three main aspects:
MULTEXT-East will extend the scope of MULTEXT by transferring its expertise, methodologies, and tools to CEE countries, thus enabling the validation of these tools and methods on CEE languages. At the same time, it will enable the development of linguistic resources for these languages, especially corpora.
Because projects funded under Copernicus are beginning at approximately MULTEXT's mid-point, MULTEXT's tools and methods are well-developed enough to extend to additional languages. At the same time, the timing enables MULTEXT to incorporate feedback from the application of its tools to vastly different language types (especially Slavic and Finno-Ugric) while the tools are still under development.
Together, MULTEXT and MULTEXT-East will create a unique network of more than twenty academic research centers and companies, all developing and using common lingware and methodologies as well as producing the first annotated large-scale multilingual corpus for 12 EU and CEE languages.