MULTEXT-East Home Page

MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages

The MULTEXT-East resources are a multilingual dataset for language engineering research and development. They consist of (1) the MULTEXT-East morphosyntactic specifications, which define categories (parts of speech), their morphosyntactic features (attributes and values), and the compact MSD tagset representations; (2) morphosyntactic lexica; (3) the annotated parallel "1984" corpus; and (4) some comparable text and speech corpora. The specifications are available for the following macrolanguages, languages and language varieties: Albanian, Bulgarian, Chechen, Czech, Damaskini, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbo-Croatian, Slovak, Slovene, Torlak, and Ukrainian, while the other resources are available for a subset of these languages.

MULTEXT-East resources Version 6

What's new in V6:

Related resources

Key Publications

History of MULTEXT-East

The MULTEXT-East project was a spin-off of the MULTEXT project and ran from 1995 to 1997. MULTEXT-East developed language resources for six languages: Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene, as well as for English, as the 'hub' language of the project. It also adapted existing tools and standards to these languages. The main results of the project were an annotated multilingual corpus and lexical resources for the seven languages.

The extended results of the project were made available in 1998, first on CD-ROM and then via TRACTOR, the TELRI Research Archive of Computational Tools and Resources. This first release is also mirrored here.

In the scope of the EU Concede project, a new release was made available in 2002; it contained only the (updated and corrected) morphosyntactic resources from the first release. This second release was made freely available for research use via the Web and is available here.

The third release was made in 2004. It brought together the first two, added the Serbian annotated "1984" and the Resian morphosyntactic specifications, and corrected some errors from the previous versions. It is also available via the Web, here.

The fourth release was made in 2010. It extended the resources with five new languages and presented all the resources uniformly encoded in TEI P5. It is available here.

The fifth release was never finalised and remains in a draft state, with only the morphosyntactic specifications available. Apart from introducing a new category for punctuation, the main change was the addition of Bosnian to the specifications and updating the Croatian specifications, which had never been operationalised. However, with a suite of resources and tools that used the "Bosnian" specifications but in fact also covered Croatian and Serbian, this attempt was abandoned. For reference, Version 5 is available here.

The sixth release removes the Croatian, Serbian, and Bosnian specifications and instead introduces specifications for Serbo-Croatian, which cover the Croatian, Serbian, Bosnian and Montenegrin languages. Furthermore, it updates the Macedonian specifications and introduces new ones for Albanian, for the Torlak dialect of Serbian, and the so-called "Damaskini" specifications, developed especially for a diachronic corpus of Balkan Slavic texts from the 16th to 19th centuries. Finally, the maintenance of the specifications was moved to GitHub. Version 6 is available here.

Page, last updated 2021-06-04, Tomaž Erjavec
