This is the home page of Version 4 of the MULTEXT-East resources, a
multilingual dataset for language engineering research and
development. This dataset contains, for
Bulgarian,
Croatian,
Czech,
English,
Estonian,
Hungarian,
Lithuanian,
Macedonian,
Persian,
Polish,
Resian,
Romanian,
Russian,
Serbian,
Slovak,
Slovene,
Ukrainian,
some or all of the following resources:
- MULTEXT-East morphosyntactic specifications 4.0,
defining harmonised word-level syntactic features and their mapping to MSD tagsets,
- MULTEXT-East lexicons 4.0, containing word-forms, lemmas and MSDs
- about
- download free lexicons from CLARIN.SI
(Bulgarian, Czech, English, Estonian, French, Hungarian, Romanian, Slovak, Slovenian, Ukrainian);
- download non-commercial lexicons from CLARIN.SI
(Persian, Macedonian, Polish, Russian, Serbian)
- MULTEXT-East "1984" annotated corpus 4.0,
sentence-aligned parallel "1984" corpus with words tagged with lemma and MSD:
- about
- download from CLARIN.SI
(Bulgarian, Czech, English, Estonian, Persian, Hungarian, Macedonian, Polish, Romanian, Slovak, Slovenian, Serbian)
- MULTEXT-East "1984" document corpus 4.0,
sentence-aligned parallel "1984" corpus with structural markup only,
- MULTEXT-East parallel speech corpus 4.0,
consisting of 40 short blocks of about 5 sentences each, as text and speech,
- and associated documentation.
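
As a rough illustration of the lexicon entries described above (word-form, lemma and MSD), the following Python sketch reads such triples. It assumes one entry per line with three tab-separated fields; that layout, the names `LexiconEntry` and `parse_lexicon`, and the sample lines are assumptions made for this example and should be checked against the lexicon documentation.

```python
from typing import Iterable, Iterator, NamedTuple


class LexiconEntry(NamedTuple):
    """One lexicon entry: word-form, lemma and MSD (hypothetical container)."""
    word_form: str
    lemma: str
    msd: str


def parse_lexicon(lines: Iterable[str]) -> Iterator[LexiconEntry]:
    """Yield entries from lines ASSUMED to be 'word-form<TAB>lemma<TAB>MSD'."""
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"line {number}: expected 3 tab-separated fields")
        yield LexiconEntry(fields[0], fields[1], fields[2])


# Illustrative sample lines, not copied from the distributed lexicons.
sample = ["walks\twalk\tVmip3s\n", "\n", "walk\twalk\tNcns\n"]
entries = list(parse_lexicon(sample))
for entry in entries:
    print(entry.word_form, entry.lemma, entry.msd)
```

To read a downloaded lexicon, an opened file object (with the appropriate encoding) can be passed in place of `sample`.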
The specifications and corpora use the
TEI P5 Guidelines for the XML encoding;
the schema and its documentation are available in the
schema/ directory.
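
Because the corpora are TEI P5 XML with words tagged with lemma and MSD, a reader for the annotated corpus might look roughly like the Python sketch below. The TEI namespace is the standard one, but the element and attribute names used here (`w`, `lemma`, `ana` holding a `#`-prefixed MSD reference) and the sample sentence are assumptions for illustration; the schema in the schema/ directory is the authority on the actual encoding.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI P5 namespace

# Hypothetical sentence fragment; the attribute names are assumptions.
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0">
  <w lemma="be" ana="#Vmis3s">was</w>
  <w lemma="cold" ana="#Af">cold</w>
  <c>.</c>
</s>"""


def tagged_words(xml_text):
    """Return (word, lemma, MSD) triples for every <w> element."""
    root = ET.fromstring(xml_text)
    triples = []
    for w in root.iter(f"{{{TEI_NS}}}w"):
        msd = w.get("ana", "").lstrip("#")  # drop the assumed '#' prefix
        triples.append((w.text or "", w.get("lemma", ""), msd))
    return triples


for word, lemma, msd in tagged_words(SAMPLE):
    print(word, lemma, msd)
```

For a full corpus file, `ET.parse(path).getroot()` could replace `ET.fromstring`, with the same namespace-qualified lookup.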
In published research please acknowledge the use of MULTEXT-East resources by citing the
following paper:
Tomaž Erjavec (2012):
MULTEXT-East: Morphosyntactic Resources for Central and Eastern European Languages.
Language Resources and Evaluation, 46/1, pp. 131-142.
Publications:
- Tomaž Erjavec (2012):
MULTEXT-East: Morphosyntactic Resources for Central and Eastern European Languages.
Language Resources and Evaluation, 46/1, pp. 131-142.
- Tomaž Erjavec (2010):
MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora.
In: Proc. of LREC 2010, Malta, 19-21 May 2010.
[PDF]
- Tomaž Erjavec (2009):
MULTEXT-East Morphosyntactic Specifications: Towards Version 4.
In: Proc. of the MONDILEX Third Open Workshop, Bratislava, Slovakia, 15-16 April 2009.
[PDF]
- Tomaž Erjavec (2004):
MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora.
In: Proc. of the Fourth Intl. Conf. on Language Resources and Evaluation, LREC'04, ELRA, Paris, 2004.
[PDF]
Page last updated 2015-06-15,
Tomaž Erjavec