MULTEXT-East Language Resources Version 3
1. Introduction

This document describes the third edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes annotated parallel, comparable and speech corpora with morphosyntactic lexica and specifications. The most important component is the linguistically annotated corpus consisting of Orwell's novel ‘1984’ in the English original and translations.

This release builds on the results of several EU projects: MULTEXT-East (which produced linked resources for Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian and English), TELRI (which added resources for Lithuanian, Croatian, Serbian, and Russian; first release), and CONCEDE (validation and re-encoding; partial re-release); and on the TEI Task Force on SGML to XML migration (conversion to XML).

Version 3 of MULTEXT-East resources brings together the first two releases (TELRI and CONCEDE), makes them available in TEI P4 XML, and introduces further extensions, e.g. the annotated ‘1984’ in Serbian and the specification for Resian, a dialect of Slovene. This dataset, unique in terms of languages and the wealth of encoding, is extensively documented, and freely available for research purposes, according to the MULTEXT-East research licence; a local text copy is provided for reference.

The published paper describing this version of the resources is:
Tomaž Erjavec: MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In the Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC'04 (ELRA). Paris 2004. [PDF]

Please acknowledge the use of resources in publications by citing the above paper, or, if more relevant, others from the MULTEXT-East annotated bibliography.

The rest of this document is structured as follows. The next subsection gives a synopsis of the languages for which this distribution provides resources. Section 2 ‘MULTEXT-East resources’ describes the resources in detail, with subsections on the MULTEXT-East morphosyntactic resources, the MULTEXT-East cesDoc corpus, and the MULTEXT-East ‘1984’ corpus; its final subsection, ‘Distribution’, discusses the availability of the resources. Section 3 ‘Bibliography and related research’ points to the annotated bibliography, and Section 4 ‘Contributors’ lists the contact points for particular languages and acknowledges those who were involved in producing the resources.

1.1. The languages of MULTEXT-East

Below we list the languages represented in the resources, with the link to their respective Ethnologue ISO 639 entries.

2. MULTEXT-East resources

This section gives the details of the MULTEXT-East resources Version 3: Section ‘MULTEXT-East morphosyntactic resources’ details the three linked word-level syntactic resources: the specification for morphosyntactic descriptions, the morphosyntactic lexica, and word-level annotated corpus (‘1984’); Section ‘MULTEXT-East cesDoc corpus’ introduces the structurally marked up corpus, consisting of a parallel part (again ‘1984’), two comparable parts (fiction, newspapers), and a small speech corpus; Section ‘MULTEXT-East ‘1984’ corpus’ revisits this corpus and recaps the relevant data in somewhat more detail.

2.1. MULTEXT-East morphosyntactic resources

By far the most useful part of the MULTEXT-East project deliverables proved to be the morphosyntactic resources, which were later re-released in the CONCEDE edition and consist of three layers:

  1. The morphosyntactic specifications, which set out the grammar and vocabulary of valid morphosyntactic descriptions, MSDs. The specifications determine what, for each language, is a valid MSD and what it means, e.g.,
    Ncnp = PoS:Noun, Type:common, Gender:neuter, Number:plural
  2. The morphosyntactic lexicons, which contain the full inflectional paradigms of at least 15,000 lemmas and cover the ‘1984’ corpus. Each entry gives the word-form, its lemma and MSD, e.g.,
    clocks clock Ncnp
  3. The morphosyntactically annotated ‘1984’ corpus, where each word is assigned its context disambiguated MSD and lemma, e.g.,
    <w lemma="clock" ana="Ncnp">clocks</w>

In the next sections we detail each of these layers in turn.

2.1.1. Morphosyntactic specifications

Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian, Croatian, Resian.

The syntax and semantics of the morphosyntactic descriptions (MSDs) are given in the MULTEXT-East morphosyntactic specifications, which have been developed in the formalism and on the basis of specifications for six Western European languages of the MULTEXT project and in cooperation with EAGLES, the Expert Advisory Group on Language Engineering Standards.

Originally, these specifications were released as a report of the MULTEXT-East project, but they have been revised for both subsequent releases. The complete specifications are structured as a report: introductory chapters are followed by the list of defined categories (parts-of-speech) and then, for each category, by a table of attribute-values together with the languages each feature is appropriate for. These so-called common tables are followed by language-specific sections. Each language section is further subdivided, and can contain feature co-occurrence restrictions, examples, notes, and full lists of valid MSDs, as well as localisation information. The formal core of the specifications resides in the common tables, as they define the features, their codes for MSD representation, and their appropriateness for each language.
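As a minimal sketch of how the common tables drive the interpretation of an MSD, the following Python fragment decodes an MSD positionally. The table NOUN_TABLE is a hypothetical, hand-written excerpt: only the codes needed for the Ncnp example come from this document, while the remaining codes, the treatment of the hyphen, and the function name decode_msd are assumptions made for illustration. The real tables cover all categories and record which features apply to which language.

```python
# Toy excerpt of a "common table" for one category (Noun).
# Each position after the category letter names an attribute and maps
# one-letter codes to values. Only c / n / p are attested in this document;
# the other codes are illustrative assumptions.
CATEGORIES = {"N": "Noun"}
NOUN_TABLE = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
]
TABLES = {"N": NOUN_TABLE}


def decode_msd(msd):
    """Expand a positional MSD string into an attribute-value dict."""
    category = msd[0]
    if category not in TABLES:
        raise ValueError(f"unknown category code: {category!r}")
    features = {"PoS": CATEGORIES[category]}
    for (attribute, codes), code in zip(TABLES[category], msd[1:]):
        if code == "-":  # assumed: a hyphen marks a non-applicable attribute
            continue
        if code not in codes:
            raise ValueError(f"invalid code {code!r} for {attribute}")
        features[attribute] = codes[code]
    return features


print(decode_msd("Ncnp"))
```

Keeping the table as data rather than code mirrors the role the common tables play in the specifications: adding a language or an attribute should only require changing the table.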

Technically, the complete specifications are a LaTeX document (with derived PDF and HTML renderings), where the common tables are plain ASCII in a strictly defined format. This format is suitable for viewing, and reasonably manageable for modification and the addition of new languages. However, it is not appropriate for processing, in particular for enabling smooth manipulation and linking to an XML encoded corpus using the MSDs. We have therefore implemented a (Perl) conversion of the common tables into XML, to the TEI.fs module, a tagset devoted to encoding feature-structures. This tagset is currently being used as the basis of an evolving ISO standard (currently a Draft International Standard), as part of the work of ISO/TC 37/SC4 Language Resource Management.

In the distribution, the specifications are given in the original and several derived encodings; furthermore, closely related reports of MULTEXT and EAGLES are provided:

Copies of or links to the reports that served as the basis for the MULTEXT-East morphosyntactic specifications:
  • Nuria Bel, Nicoletta Calzolari and Monica Monachini: MULTEXT Deliverable D1.6.1B Common Specifications and Notation for Lexicon Encoding and Preliminary Proposal for the Tagsets. 1995, ILC, Pisa. [ HTML, PDF, PDF2up, LaTeX ]
  • Simone Teufel and Christine Stockert: EAGLES specification for German morphosyntax. 1996. [ PDF, PS.gz ]
  • Geoffrey Leech and Andrew Wilson: EAGLES Report EAG-TCWG-MAC/R Recommendations for the Morphosyntactic Annotation of Corpora. 1996, ILC, Pisa.
  • Nicoletta Calzolari and Monica Monachini (eds.): EAGLES Report EAG-CLWG-MORPHSYN/R Synopsis and Comparison of Morphosyntactic Phenomena Encoded in Lexicons and Corpora: A Common Proposal and Applications to European Languages. 1996, ILC. Pisa.

2.1.2. Lexicons

Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.

The MULTEXT-East morphosyntactic lexicons have a simple structure, where each lexical entry is composed of three fields:

  • the word-form, which is the inflected form of the word, as it appears in the text, modulo sentence-initial capitalisation;
  • the lemma, which is the base-form of the word; where the entry is itself the base-form, the lemma is given as the equal sign;
  • the MSD, i.e., the morphosyntactic description.
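The three-field structure can be read with a few lines of Python. This is only a sketch: it assumes whitespace-separated fields with no spaces inside a field, the second sample line (clock = Ncns) is a hypothetical entry added to show the equal-sign convention, and the names Entry and parse_entry are illustrative.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    wordform: str
    lemma: str
    msd: str


def parse_entry(line):
    """Parse one 'word-form lemma MSD' line, resolving '=' to the word-form."""
    wordform, lemma, msd = line.split()
    if lemma == "=":  # the entry is itself the base-form
        lemma = wordform
    return Entry(wordform, lemma, msd)


# First line is the example from this document; the second is hypothetical.
sample = ["clocks clock Ncnp", "clock = Ncns"]
entries = [parse_entry(line) for line in sample]
for entry in entries:
    print(entry.wordform, entry.lemma, entry.msd)
```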

To produce the lexica, the token lists of the MULTEXT-East corpus were first fed through morphological analysers in order to produce the lemma list; this list was further extended from the comparable corpus, to arrive at a minimum of 15,000 lemmas. Some languages have extended this further, e.g., Romanian to 41,000 lemmas. In the next step, the lemmas were fed back to morphological generators (except for the agglutinative languages) in order to produce the complete inflected lists, i.e., the full paradigms of the lemmas, which constituted the final lexica of the project.

The MULTEXT-East lexica serve as medium sized morphological lexica for the languages. In addition to explicating the inflectional behaviour of the most common (and, typically, morphologically the most complex) words of the languages, the lexica also serve to establish the definitive set of valid MSDs for the languages.

To serve as a standard registry of MSDs, we converted the lexical MSDs to TEI feature structure libraries, fsLib, one for each category. Here each MSD is expressed as a feature structure specifying its id, the language(s) it is appropriate for, and its decomposition into features.
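To make the idea of an MSD expressed as a feature structure more concrete, the sketch below builds one such element with Python's standard library. It is not taken from the distributed libraries: the element names fs, f and sym follow the TEI feature-structure tagset, but the use of a select attribute for the language list, the language code en, and the function name msd_to_fs are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def msd_to_fs(msd, languages, features):
    """Build an <fs> element for one MSD from its decomposed features."""
    # 'id' carries the MSD string; 'select' (language list) is an assumed
    # attribute name, not confirmed by this document.
    fs = ET.Element("fs", {"id": msd, "select": " ".join(languages)})
    for name, value in features.items():
        f = ET.SubElement(fs, "f", {"name": name})
        ET.SubElement(f, "sym", {"value": value})
    return fs


fs = msd_to_fs(
    "Ncnp",
    ["en"],  # illustrative language list
    {"Type": "common", "Gender": "neuter", "Number": "plural"},
)
print(ET.tostring(fs, encoding="unicode"))
```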

The structure and contents of the lexica are explained in MULTEXT-East report MTE D1.2 M: Language-specific Resources (but note that the lexica have in the meantime been revised, so the details in this report are no longer correct).

2.1.3. Linguistically annotated 1984

Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.

This is the centrepiece of the resources, as it contains word-level markup, namely context-disambiguated lemmas and MSDs, e.g., <w lemma="it" ana="Pp3ns">it</w>. This so-called cesAna corpus is suitable for PoS tagging experiments; because it was the first such resource for many of the languages, it was also the most difficult to produce, as the work had to proceed mostly manually.
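For tagging experiments, the word-level markup can be reduced to (word-form, lemma, MSD) triples. The following is a sketch under assumptions: the two w elements are the examples given in this document, but the wrapping s element, its id value and the function name read_tokens are hypothetical, and real corpus files may need a parser that resolves their DTD and entities.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence wrapper around the two <w> examples from this document.
SAMPLE = (
    '<s id="example.1">'
    '<w lemma="it" ana="Pp3ns">it</w> '
    '<w lemma="clock" ana="Ncnp">clocks</w>'
    "</s>"
)


def read_tokens(xml_text):
    """Return (word-form, lemma, MSD) triples for every <w> element."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("lemma"), w.get("ana")) for w in root.iter("w")]


for form, lemma, msd in read_tokens(SAMPLE):
    print(form, lemma, msd)
```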

2.2. MULTEXT-East cesDoc corpus

The extended MULTEXT-East resources were first released in the scope of the TELRI concerted action, first on a CD-ROM and later through TRACTOR, the TELRI Research Archive of Computational Tools and Resources.

For Version 3, the structurally annotated texts from this edition (the so-called cesDoc corpus, as this was the format it was originally encoded in) were converted from SGML to XML. The corpus is stored as one TEI P4 document with the <teiCorpus.2> root element, comprising the header and the component texts, i.e. the multilingual parallel speech, comparable fiction and news, and parallel ‘1984’ texts. The corpus is further documented in its corpus and text headers, and also in the original MULTEXT-East report D2.1 F: Corpus Collection and Preparation, although the information there is no longer current in all respects.
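The corpus-level organisation can be illustrated with a short Python sketch that lists the component texts of a <teiCorpus.2> document. The inline sample is a hypothetical skeleton, not an excerpt of the distribution: the id values and the function name component_ids are invented for illustration, and the sketch assumes each component text is a TEI.2 element carrying an id attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical skeleton: one corpus header followed by two component texts.
SAMPLE = """<teiCorpus.2>
  <teiHeader type="corpus"/>
  <TEI.2 id="text-a"><teiHeader type="text"/><text/></TEI.2>
  <TEI.2 id="text-b"><teiHeader type="text"/><text/></TEI.2>
</teiCorpus.2>"""


def component_ids(xml_text):
    """Return the id of every <TEI.2> component text in a corpus document."""
    root = ET.fromstring(xml_text)
    if root.tag != "teiCorpus.2":
        raise ValueError(f"unexpected root element: {root.tag}")
    # iter() also finds texts inside nested corpora, should there be any
    return [text.get("id") for text in root.iter("TEI.2")]


print(component_ids(SAMPLE))
```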

2.2.1. Quick HTML view

The table below links to the HTML view of the MULTEXT-East cesDoc corpus, namely to the teiHeaders and the texts.

Table 1. MULTEXT-East cesDoc corpus, HTML view

Language    "1984"       Speech       Fiction      News
English     Header Text  Header Text  -            -
Bulgarian   Header Text  Header Text  Header Text  Header Text
Czech       Header Text  Header Text  Header Text  Header Text
Estonian    Header Text  Header Text  Header Text  Header Text
Hungarian   Header Text  Header Text  Header Text  Header Text
Romanian    Header Text  Header Text  Header Text  Header Text
Slovene     Header Text  Header Text  Header Text  Header Text
Lithuanian  Header Text  -            -            -
Serbian     Header Text  -            -            -
Russian     Header Text  -            -            -

2.2.2. Speech corpus components

Languages: Romanian, Slovene, Estonian, Hungarian, (English, Czech, Bulgarian)

MULTEXT-East produced a small corpus of spoken texts taken from the EUROM-1 speech corpus. It comprises the translations (from English) of forty short passages of five thematically connected sentences. For four languages, the texts have also been read, recorded and included in the distribution. The corpus texts contain links to the spoken passages, which have for V3 been normalised in terms of volume, and stored as .wav files. The speech files are, due to their size, stored and distributed in a separate bundle.

2.2.3. Comparable corpus components

Languages: Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian.

The multilingual comparable corpus contains a fiction part and a news part, where the data is comparable across the languages in terms of the number and size of texts; each of the 12 parts has approx. 100,000 words. The corpus is structurally marked up with over 40 different elements; however, sub-paragraph markup has not been harmonised across the languages.

2.2.4. Structural 1984 and alignments

Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, (Latvian), Lithuanian, Serbian, Russian.

The multilingual parallel corpus consists of the novel ‘1984’, about 100,000 words in length. The corpus contains extensive headers and markup for document structure, sentences, and various sub-sentence annotations; the markup is similar to that of the comparable corpus, but better harmonised across languages.

The translations of ‘1984’ have been automatically sentence aligned with the English original, and the alignments hand-validated. The bilingual alignments are valid to xcesAlign.dtd, i.e., are stored not with the primary data but in separate documents, as references to sentence IDs, e.g., <link xtargets="Osl. ; Ocs. Ocs."/>.
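A link element of this kind can be unpacked as sketched below, assuming that the xtargets value lists the sentence IDs of the two sides separated by a semicolon, with multiple IDs on one side separated by spaces. The IDs in the example are hypothetical placeholders, not identifiers from the corpus, and parse_link is an illustrative name.

```python
import xml.etree.ElementTree as ET


def parse_link(link_xml):
    """Split a <link> element's xtargets into one list of sentence IDs per side."""
    link = ET.fromstring(link_xml)
    sides = link.get("xtargets").split(";")
    # split() with no argument also copes with an empty side (e.g. a 0:1 link)
    return [side.split() for side in sides]


# Hypothetical IDs: one sentence on the first side aligned with two on the second.
example = '<link xtargets="sl.1 ; cs.1 cs.2"/>'
first_side, second_side = parse_link(example)
print(first_side, second_side)
```

Because the alignments only hold references, the same pattern works regardless of how the sentences themselves are marked up; the IDs are resolved against the primary documents in a separate step.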

The cesDoc encoded novel served as the basis for producing the linguistically annotated version. The link between the two is maintained via sentence identifiers.

2.3. MULTEXT-East 1984 corpus

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This annotated parallel corpus contains the novel in the English original (about 100,000 words in length), and its translations into a number of languages.

George Orwell
Nineteen Eighty-Four

Big Brother is Watching You!
It was an enormous pyramidal structure of glittering white concrete, soaring up, terrace after terrace, 300 metres into the air. From where Winston stood it was just possible to read, picked out on its white face in elegant lettering, the three slogans of the Party:
War is peace Freedom is slavery Ignorance is strength
Războiul este pace Libertatea este sclavie Ignoranţa este putere
Vojna je mir Svoboda je suženjstvo Nevednost je moč
Válka je mír Svoboda je otroctví Nevědomost je síla
Войната е мир Свободата е робство Невежеството е сила
Sõda on rahu Vabadus on orjus Teadmatus on jõud
Rat je mir Sloboda je ropstvo Neznanje je moć
A háború: béke A szabadság: szolgaság A tudatlanság: erő
Karas — tai taika Laisvė — tai vergija Nežinomas — tai jėga
Rat je mir Sloboda je ropstvo Neznanje je moć
Война — это мир Свобода — это рабство Незнание — сила

2.3.1. Quick HTML view

The table below gives, for each language, the HTML rendering of: the Text of the novel from the cesDoc corpus; the relevant section of the MULTEXT-East D2.1 Report; the TEI cesDoc header of "1984"; and, where available, the TEI cesAna header together with the language-specific section of the MULTEXT-East Morphosyntactic Specifications (MSD).

Table 4. MULTEXT-East "1984", HTML view

All         Text  Report  cesDoc header  cesAna header  MSD
English     Text  Report  cesDoc header  cesAna header  MSD
Bulgarian   Text  Report  cesDoc header  cesAna header  MSD
Czech       Text  Report  cesDoc header  cesAna header  MSD
Estonian    Text  Report  cesDoc header  cesAna header  MSD
Hungarian   Text  Report  cesDoc header  cesAna header  MSD
Romanian    Text  Report  cesDoc header  cesAna header  MSD
Slovene     Text  Report  cesDoc header  cesAna header  MSD
Latvian     -     Report  -              -              -
Lithuanian  Text  Report  cesDoc header  -              -
Serbian     Text  Report  cesDoc header  cesAna header  MSD
Russian     Text  -       cesDoc header  -              -

2.3.2. Description

The novel is encoded in TEI P4 and exists in two versions:

  • as part of the cesDoc corpus (originally from the TELRI release), where it is structurally marked up and harmonised down to paragraph level; various sub-paragraph markup is also included; this version is also available in HTML, via the TEI XSLT stylesheet.
  • as the cesAna corpus (from the CONCEDE release), where the texts have been structurally normalised and tokenised, and each word tagged with hand-validated context-disambiguated lemma and morphosyntactic description, so it can be used e.g. in PoS tagging and morphological analysis experiments.

Apart from the marked-up texts themselves, the corpus also has two other components:

  • validated sentence alignments (mostly two-way with the English original), which are stored in accordance with the CES conventions for parallel text alignment, i.e. in separate documents containing links to <s>entence elements of the texts.
  • the formal core of the MULTEXT-East morphosyntactic specification, which comprises the TEI header, feature-structure and feature libraries. The former contain the full set of valid morphosyntactic descriptions, and the latter their decomposition into features; this specification also constitutes a part of the cesAna corpus.

The first edition of the "1984" corpus is detailed in the MULTEXT-East project reports MTE D2.1 F "‘1984’ Corpus Collection" (with Appendix) and MTE D2.3 F "Sentence Alignment" and "Morphosyntactic Tagging". However, note that the encoding has changed substantially from the first MULTEXT-East version described in these reports. The details of the encoding for Version 3 are to be found in the TEI headers of the two corpora, i.e. the cesDoc teiHeader and the cesAna teiHeader. The details of the morphosyntactic descriptions used to word-tag V3 of the corpus are given in the MULTEXT-East Morphosyntactic Specifications, and also in the MSD library teiHeader, which is a part of the cesAna corpus.

2.4. Distribution

The resources are mounted on the Web, on the MULTEXT-East Version 3 Web site. The documentation and the morphosyntactic specification are freely available, while the corpora and lexica are restricted to research use only. To get access to these resources, the Web based MULTEXT-East research licence should be filled out and submitted. The password is then sent by email.

Registered users can browse the full resources on-line, or they can download them; they are distributed as gzipped tar files (.tgz). Due to the size of the resources and the fact that different users are likely to use only parts, they are available not only as a complete download, but also split by resource type:

At some point we plan to distribute the MULTEXT-East V3 resources also on CD-ROM. If this would be of interest, please get in touch.

3. Bibliography and related research

An annotated bibliography is available in a separate folder bib/, where it is available in HTML and PDF. Some papers are also mirrored there, in particular the one describing Version 3 of the MULTEXT-East resources:

Tomaž Erjavec: MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In the Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC'04 (ELRA). Paris 2004

4. Contributors

This section lists the contributors to the MULTEXT-East resources (Version 3), starting with the partners of the MULTEXT-East project, followed by the partners of the TELRI concerted action that contributed resources to MULTEXT-East, additional contributors, and, finally, acknowledgements to all the other people who made the production of these resources possible. Further information on who did what can be found in the corpus headers and on the title page of the morphosyntactic specifications. Each partner's responsibility is listed with their entry; for resources of a particular language, the corresponding contact point should also be used for further inquiries.

4.1. MULTEXT-East Partners

Listed below are the partners of the original Copernicus MULTEXT-East project.

4.1.1. Aix-en-Provence

MULTEXT-East coordinator
Jean Véronis (multext at
Laboratoire Parole et Langage
Centre National de la Recherche Scientifique
Aix-en-Provence, France
29, Av. Robert Schuman
13621 Aix-en-Provence Cedex 1

4.1.2. Vassar

Corpus encoding, English data
Nancy Ide (ide at
Greg Priest-Dorman (priestdo at
Dept. of Computer Science
Vassar College
124 Raymond Avenue
Poughkeepsie, NY 12604-0520

4.1.3. Pisa

Morphosyntactic specification (associate partner)
Nicoletta Calzolari (glottolo at
Monica Monachini (corpmon2 at
Istituto di Linguistica Computazionale
Consiglio Nazionale delle Ricerche
Pisa, Italy

4.1.4. Sofia

Bulgarian data
Radoslav Pavlov,
Ludmila Dimitrova (ludmila at
Lydia Sinapova (lydia at
Kiril Simov (kivs at
Department of Mathematical Linguistics
Institute of Mathematics and Informatics
Bulgarian Academy of Sciences
8 Acad. G. Bonchev Street
BG-1113 Sofia
+359 2 713 3831
+359 2 971 36 49

4.1.5. Prague

Czech data
Vladimír Petkevič (vladimir.petkevic at
Institute of Theoretical and Computational Linguistics
Faculty of Philosophy
Charles University
Celetna 13
11000 Prague 1
Czech Republic
+42 2 2481-1870 ext. 252
+42 2 2481-2166

Zdenek Laciga
BYLL Software, Ltd.
Anenske nam. 2
110 00 Prague 1
Czech Republic

4.1.6. Tartu

Estonian data
Heiki-Jaan Kaalep (hkaalep at
Department of General Linguistics
Tartu University
Ülikooli 18
EE-2400 Tartu
+372 74 30 803
+372 74 35 440

4.1.7. Budapest

Hungarian data
Laszlo Tihanyi (Tihanyi at
Csaba Oravecz (oravecz at
Research Institute for Linguistics
Hungarian Academy of Sciences
Szinhaz u. 5-9 - P.O. Box 19
H-1250 Budapest
+36 1 1758 011 ext. 155
+36 1 1758 285
+36 1 21 22 050

Gabor Proszeky
Fo u. 56-58 I/3
H-1011 Budapest
+36 1 2018355
+36 1 2018355

4.1.8. Bucharest

Romanian data
Dan Tufiş (tufis at
Ştefan Bruda (bruda at
Mihai Ciocoiu (mihaic at
Center for Research in Machine Learning, Natural Language Processing and Conceptual Modelling
Romanian Academy of Sciences
Casa Academiei "13 Septembrie" 13
Bucharest 71102
+40 1 410-4113
+40 1 411-3916

4.1.9. Ljubljana

Slovene data, corpus workpackage leader
Tomaž Erjavec (tomaz.erjavec at
Dept. of Knowledge Technologies
Jožef Stefan Institute
Jamova 39
SI-1000 Ljubljana
+386 61 1773-507
+386 61 1258-058

Miro Romih (miro.romih at
Peter Holozan (peter.holozan at
AMEBIS d.o.o.
Jakopičeva 6
61240 Kamnik
+386 61 612-829
+386 61 811-035

4.2. TELRI Partners

4.2.1. Riga

Latvian data
Andrejs Spektors (aspekt at
Artificial Intelligence Laboratory
Institute of Mathematics and Computer Science
University of Latvia
29, Raina bulv.
LV-1459 Riga

4.2.2. Kaunas

Lithuanian data
Andrius Utka (andrius at
Ruta Marcinkeviciene (ruta.marcinkeviciene at
Center of Computational Linguistics
Faculty of Humanities
Vytautas Magnus University
S. Daukanto 28
3000 Kaunas

4.2.3. Belgrade

Serbian data
Cvetana Krstev (cvetana at
Duško Vitas (vitas at
Faculty of Mathematics
Belgrade University
Studentski trg 16
11000 Belgrade
Serbia and Montenegro

4.2.4. Zagreb

Croatian data
Marko Tadić (marko.tadic at
Department of linguistics
Faculty of Philosophy,
University of Zagreb
Ivana Lučića 3
HR-10000 Zagreb

4.3. Additional Contributors

4.3.1. Severodonetsk

Russian data
Paul Sokolovsky (Paul.Sokolovsky at
Sergey Sryvkin
Severodonetsk Institute of Technology
East-Ukraine State University
Sovetsky st., bl. 3a
Severodonetsk, Lugansk reg.

4.3.2. Padova

Resian data
Han Steenwijk (han.steenwijk at
Department of Anglo-Germanic and Slavic Linguistics and Literature,
Padova University

4.4. Acknowledgments

Apart from the institutions and people listed above, the following people have greatly contributed to the production of the MULTEXT-East language resources: Renata Anžič, Liviu Anca, Ana-Maria Barbu, Aleksandra Bizjak, Damjan Bojadžiev, Lydia Bozhilova, Aleš Dobinikar, Külli Habicht, Daniel Hirst, Milena Hnátková, Primož Jakopin, Riina Mosna, Kadri Muischnek, Mircea Nicolescu, Heili Orav, Vasile Pătraşcu, Leho Paldre, Tsvetan Petrov, Helen Potter, Andriela Rääbis, Georgiana Rotariu, Matjaž Sešek, Tanja Semen, Urve Talvik, Bojana Todorovič, Elias Treeman, Viire Villandi, and Olga Vuković.

The work on MULTEXT-East resources was supported by the European Union project MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages, Copernicus 106, and in part by the US National Science Foundation grant IRI-9413451. The MULTEXT-East project results have been greatly enhanced by the EU Concerted Action TELRI: Trans-European Language Resources Infrastructure. In particular, additional language resources have been produced, and the project's material organised for CD distribution. Work on the second release of the MULTEXT-East resources was supported by EU Copernicus Project PL96-1142 CONCEDE: Consortium for Central European Dictionary Encoding, while the work on the third release was partially funded by grants from the National Endowment for the Humanities, in the scope of the TEI Task Force on SGML to XML migration. The work on the resources has been additionally supported by national funding bodies and individual partners' grants and contracts.

