The MULTEXT-East resources are a multilingual dataset for language engineering research and development. For Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Lithuanian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian, the dataset contains some or all of the following language resources: the MULTEXT-East morphosyntactic specifications, lexica, and annotated "1984" corpus; the MULTEXT-East parallel and comparable text and speech corpora; and associated documentation.
The MULTEXT-East project was a spin-off of MULTEXT and ran from 1995 to 1997. MULTEXT-East developed language resources for six languages (Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene), as well as for English, the 'hub' language of the project. It also adapted existing tools and standards to these languages. The main results of the project were an annotated multilingual corpus and lexical resources for the seven languages.
The extended results of the project were made available in 1998, first on CD-ROM and then via TRACTOR, the TELRI Research Archive of Computational Tools and Resources. This first release is also mirrored here.
Within the scope of the Concede project, a new release was made available in 2002; it contained only the (updated and corrected) morphosyntactic resources from the first release. This second release was made freely available for research use via the Web and is available here.
The third release was made in 2004. It brought together the first two releases, added the Serbian annotated "1984" corpus and the Resian morphosyntactic specifications, and corrected errors from the previous two versions. It is also available via the Web, here.
Finally, the fourth release, made in 2010, extends the resources with five new languages and presents all the resources uniformly encoded in TEI P5. It is available here.