MULTEXT-East Resources Version 6 "CLARIN"

This is the home page of Version 6 of the MULTEXT-East resources, a multilingual dataset for language engineering research and development. The dataset contains the MULTEXT-East morphosyntactic specifications Version 6, which define harmonised word-level syntactic features and their mapping to morphosyntactic description (MSD) tagsets, for Albanian, Bulgarian, Chechen, Czech, Damaskini, English, Estonian, Hungarian, Lithuanian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbo-Croatian, Slovak, Slovene, Torlak, and Ukrainian:

about
read the specifications
browse the data
clone or report issues on GitHub (note that GitHub also contains scripts for conversion and maintenance)
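
To illustrate what "mapping features to MSD tagsets" means in practice, the following is a minimal sketch of decoding a positional MSD string into attribute-value pairs. It assumes the usual MSD layout, in which the first character names the category, each later character gives the value of one attribute, and a hyphen marks an attribute that does not apply. The table below is a toy example invented for illustration, not a copy of the official specifications, and the function name `decode_msd` is hypothetical; the real per-language tables should be taken from the specifications themselves.

```python
# Toy positional table for one category. This is an illustrative assumption,
# NOT the official MULTEXT-East specification for any language.
CATEGORIES = {"N": "Noun"}

ATTRIBUTES = {
    "N": [
        ("Type", {"c": "common", "p": "proper"}),
        ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
        ("Number", {"s": "singular", "p": "plural"}),
        ("Case", {"n": "nominative", "g": "genitive", "a": "accusative"}),
    ],
}


def decode_msd(msd):
    """Expand a positional MSD string into a dict of attribute-value pairs."""
    if not msd or msd[0] not in CATEGORIES:
        raise ValueError(f"unknown category in MSD: {msd!r}")
    features = {"Category": CATEGORIES[msd[0]]}
    # Pair each remaining character with the attribute at the same position.
    for code, (name, values) in zip(msd[1:], ATTRIBUTES[msd[0]]):
        if code == "-":  # hyphen: attribute not applicable, skip it
            continue
        if code not in values:
            raise ValueError(f"unknown value {code!r} for {name} in {msd!r}")
        features[name] = values[code]
    return features


if __name__ == "__main__":
    print(decode_msd("Ncmsn"))
```

Keeping the table as plain data, separate from the decoding loop, means a different language or category can be supported by swapping in another table without touching the function.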

The other resources are still at version 4 (but will be upgraded in time) and consist of:

MULTEXT-East lexicons 4.0, containing word-forms, lemmas and MSDs
- about
- download free lexicons from CLARIN.SI (Bulgarian, Czech, English, Estonian, French, Hungarian, Romanian, Slovak, Slovenian, Ukrainian);
- download non-commercial lexicons from CLARIN.SI (Persian, Macedonian, Polish, Russian, Serbian)
MULTEXT-East "1984" annotated corpus 4.0, sentence-aligned parallel "1984" corpus with words tagged with lemma and MSD:
- about
- download from CLARIN.SI (Bulgarian, Czech, English, Estonian, Persian, Hungarian, Macedonian, Polish, Romanian, Slovak, Slovenian, Serbian)
MULTEXT-East "1984" document corpus 4.0, sentence-aligned parallel "1984" corpus with structural markup only:
- about
- download from CLARIN.SI (Bulgarian, Czech, English, Estonian, Hungarian, Lithuanian, Romanian, Russian, Slovenian, Serbian)
MULTEXT-East parallel speech corpus 4.0, consisting of 40 short blocks of about 5 sentences each, as text and speech:
- about;
- browse;
- download
and associated documentation:
- read;
- download

The specifications and corpora use the TEI Guidelines for their XML encoding; the schema and its documentation are available in the schema/ directory.
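
As a rough sketch of what working with TEI-encoded annotated text can look like, the example below reads word tokens with their lemma and MSD using Python's standard library. The TEI namespace is the standard one; everything else is an assumption made for illustration: the sample fragment is invented (and is not schema-valid TEI), and the use of `w` elements with `lemma` and `ana` attributes, as well as the function name `read_tokens`, should be checked against the schema and its documentation before being relied on.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace, bound to the prefix "tei" for path lookups.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment: element and attribute usage is an assumption,
# and the MSD values are illustrative only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s xml:id="s1">
      <w lemma="day" ana="#Ncns">day</w>
      <w lemma="clock" ana="#Ncnp">clocks</w>
    </s>
  </p></body></text>
</TEI>"""


def read_tokens(xml_text):
    """Return (word form, lemma, MSD) triples for every <w> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iterfind(".//tei:w", TEI_NS):
        # The ana attribute is assumed to point at an MSD via "#"; drop the "#".
        msd = w.get("ana", "").lstrip("#")
        tokens.append((w.text, w.get("lemma"), msd))
    return tokens


if __name__ == "__main__":
    for form, lemma, msd in read_tokens(SAMPLE):
        print(form, lemma, msd, sep="\t")
```

Passing the namespace map explicitly matters here: without it, a search for a bare `w` would find nothing, because every element in a TEI document lives in the TEI namespace.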

In published research please acknowledge the use of MULTEXT-East resources by citing one of the following papers:

Tomaž Erjavec (2017). MULTEXT-East. In Nancy Ide, James Pustejovsky (eds.): Handbook of Linguistic Annotation, pp. 441-462. Springer. DOI 10.1007/978-94-024-0881-2_17. [BibTeX].
Tomaž Erjavec (2012). MULTEXT-East: Morphosyntactic Resources for Central and Eastern European Languages. Language Resources and Evaluation, 46/1, pp. 131-142. DOI 10.1007/s10579-011-9174-8. [BibTeX].

Page last updated 2022-03-24, Tomaž Erjavec