This is the home page of Version 6 of the MULTEXT-East resources, a
multilingual dataset for language engineering research and
development. This dataset contains, for
Albanian, Bulgarian, Chechen, Czech, Damaskini, English, Estonian,
Hungarian, Lithuanian, Macedonian, Persian, Polish, Resian, Romanian,
Russian, Serbo-Croatian, Slovak, Slovene, Torlak, and Ukrainian,
the MULTEXT-East morphosyntactic specifications Version 6,
defining harmonised word-level syntactic features and their mapping to MSD tagsets.
The other resources are still at version 4 (but will be upgraded in time)
and consist of:
- MULTEXT-East lexicons 4.0, containing word-forms, lemmas and MSDs
- about
- download free lexicons from CLARIN.SI
(Bulgarian, Czech, English, Estonian, French, Hungarian, Romanian, Slovak, Slovenian, Ukrainian);
- download non-commercial lexicons from CLARIN.SI
(Persian, Macedonian, Polish, Russian, Serbian)
- MULTEXT-East "1984" annotated corpus 4.0,
sentence-aligned parallel "1984" corpus with words tagged with lemma and MSD:
- about
- download from CLARIN.SI
(Bulgarian, Czech, English, Estonian, Persian, Hungarian, Macedonian, Polish, Romanian, Slovak, Slovenian, Serbian)
- MULTEXT-East "1984" document corpus 4.0,
sentence-aligned parallel "1984" corpus with structural markup only:
- MULTEXT-East parallel speech corpus 4.0,
consisting of 40 short blocks of about 5 sentences each, as text and speech:
- and associated documentation:
The specifications and corpora use the
TEI Guidelines for their XML encoding;
the schema and its documentation are available in the
schema/ directory.
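
As a rough illustration of working with a TEI-encoded corpus in which words are tagged with a lemma and an MSD, the Python sketch below pulls (word-form, lemma, MSD) triples out of a small inline sample. The sample document, the use of `w` elements, and the `lemma` and `ana` attribute names are assumptions made for this example only; the schema in the schema/ directory is the authority on the actual markup. The function name `extract_tokens` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample, NOT taken from the distributed corpus.
# Assumption: each word is a <w> element carrying its lemma in @lemma
# and a pointer to its MSD in @ana (written as "#MSD").
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s xml:id="s1">
      <w lemma="it" ana="#Pp3ns">It</w>
      <w lemma="be" ana="#Vmis3s">was</w>
      <c>.</c>
    </s>
  </p></body></text>
</TEI>"""


def extract_tokens(xml_text):
    """Return a list of (word-form, lemma, MSD) tuples for every <w> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter(f"{{{TEI_NS}}}w"):
        form = (w.text or "").strip()
        lemma = w.get("lemma")
        # Drop the leading "#" of the pointer to get the bare MSD string.
        msd = (w.get("ana") or "").lstrip("#")
        tokens.append((form, lemma, msd))
    return tokens


if __name__ == "__main__":
    for form, lemma, msd in extract_tokens(SAMPLE):
        print(form, lemma, msd, sep="\t")
```

Punctuation (`c` elements in the sample) is skipped here; a real reader would probably keep it and iterate sentence by sentence so that the alignment between languages is preserved.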
In published research please acknowledge the use of MULTEXT-East resources by citing one of the following papers:
Page last updated 2022-03-24,
Tomaž Erjavec