MULTEXT-East Resources, Concede Edition

A brief description of this release is given in the ElsNews article "The MULTEXT-East Resources Revisited", and a longer one in:
Tomaž Erjavec: Harmonised Morphosyntactic Tagging for Seven Languages and Orwell's 1984.
In the Proceedings of the 6th Natural Language Processing Pacific Rim Symposium, NLPRS'01, pp. 487-492, 2001.

Version 2 of the MULTEXT-East language resources contains, for English, Romanian, Czech, Slovene, Bulgarian, Estonian, and Hungarian:

  1. The revised and expanded MULTEXT- and EAGLES-based lexical morphosyntactic specifications, in print form (HTML, PDF, LaTeX) and as TEI-encoded feature structures.
    The specifications are freely available for downloading or browsing at

  2. The morphosyntactic lexica, totalling at least 15,000 lemmas per language, where each entry contains the word-form, its lemma, and its morphosyntactic description; also included is a high-precision, automatically generated 7-way multilingual lexicon.

  3. The corrected and TEI-encoded "1984" morphosyntactically annotated corpus, with about 100,000 words per language. The corpus includes 2-way and 7-way sentence alignments in CES (Corpus Encoding Standard).

The lexica and corpus are freely available for research use. To obtain them, please fill out and submit the license agreement.
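
As an illustration of the lexicon entries described in item 2 (word-form, lemma, morphosyntactic description), the following is a minimal sketch of how such triples might be read and grouped by lemma. It assumes one entry per line with three tab-separated fields; that file layout, the names LexEntry, parse_lexicon and group_by_lemma, and the sample entries and tag strings are all hypothetical and have not been checked against the distributed lexica or the specifications.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class LexEntry(NamedTuple):
    """One lexicon entry: word-form, lemma, morphosyntactic description."""
    wordform: str
    lemma: str
    msd: str


def parse_lexicon(lines: Iterable[str]) -> list[LexEntry]:
    """Parse entries, assuming three tab-separated fields per line."""
    entries = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 fields, got {len(fields)}")
        entries.append(LexEntry(*fields))
    return entries


def group_by_lemma(entries: Iterable[LexEntry]) -> dict[str, list[LexEntry]]:
    """Collect all word-forms that share a lemma."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry.lemma].append(entry)
    return dict(groups)


# Hypothetical sample entries; the tag strings are illustrative only.
SAMPLE = [
    "houses\thouse\tNcnp",
    "house\thouse\tNcns",
    "",
    "walked\twalk\tVmis",
]

entries = parse_lexicon(SAMPLE)
by_lemma = group_by_lemma(entries)
for lemma, forms in by_lemma.items():
    print(lemma, [(e.wordform, e.msd) for e in forms])
```

Grouping by lemma is only one possible view of the data; the same parsed entries could equally be indexed by word-form for lookup during tagging.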
Page, last updated 2002-12-09, Tomaž Erjavec