This is the home page of Version 4 of the MULTEXT-East resources, a
multilingual dataset for language engineering research and
development. This dataset contains, for
Bulgarian, Croatian, Czech, English, Estonian, Hungarian,
Lithuanian, Macedonian, Persian, Polish, Resian, Romanian,
Russian, Serbian, Slovak, Slovene, and Ukrainian,
some or all of the following resources:
- Morphosyntactic Specifications,
defining harmonised word-level syntactic features and their mapping to MSD tagsets
[about, read, browse, download]
New:
Morphosyntactic specifications in OWL
(released: 2011-06-18)
- Morphosyntactic Lexica, containing word-forms, lemmas and MSDs
[about, licence: browse, download]
- Annotated Parallel "1984" Corpus,
aligned by sentences with words tagged with lemma and MSD
[about, licence: browse, download]
- Structurally Annotated Corpus,
consisting of the parallel "1984" corpus, a comparable corpus (novels, news), and the text of the speech corpus, with structural markup
[about, licence: browse, download]
- Parallel Speech Corpus,
consisting of 40 short blocks of about 5 sentences each, as text and speech
[about, browse, download]
- and associated documentation.
The specifications and corpora use the
TEI P5 Guidelines for the XML encoding;
the schema and its documentation are available in the
schema/ directory.
Publications:
- Tomaž Erjavec (2012):
MULTEXT-East: Morphosyntactic Resources for Central and Eastern European Languages.
Language Resources and Evaluation, 46/1, pp. 131-142.
- Tomaž Erjavec:
MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora.
Proc. of the LREC 2010, Malta, 19-21 May, 2010.
[PDF]
- Tomaž Erjavec:
MULTEXT-East Morphosyntactic Specifications: Towards Version 4.
In: Proc. of the MONDILEX Third Open Workshop, Bratislava, Slovakia, 15-16 April, 2009.
[PDF]
- Tomaž Erjavec:
MULTEXT-East Version 3:
Multilingual Morphosyntactic Specifications, Lexicons and Corpora.
In: Proc. of the Fourth Intl. Conf. on
Language Resources and Evaluation,
LREC'04,
ELRA, Paris, 2004.
[PDF]
[about]
- free download:
- MULTEXT-East morphosyntactic specifications and documentation;
- MULTEXT-East speech corpus.
- licenced download:
- MULTEXT-East morphosyntactic resources (lexica, linguistically annotated "1984" corpus);
- MULTEXT-East cesDoc structurally annotated corpus.
To get access to the licenced resources, please
read the licence under which they are available and,
if you agree with it, send an email requesting the resources to
Tomaž Erjavec.
You will then receive by email a username and password for full
access to the resources, which you can browse on-line or download.
The download files unpack into a mirror of this WWW site.
In published research please acknowledge the use of MULTEXT-East resources by citing the
following paper:
Tomaž Erjavec (2012):
MULTEXT-East: Morphosyntactic Resources for Central and Eastern European Languages.
Language Resources and Evaluation, 46/1, pp. 131-142
Page last updated 2012-12-22,
Tomaž Erjavec