MULTEXT-East Language Resources Version 4
Documentation
Tomaž Erjavec
2010-05-13


1. Introduction

This document describes the fourth, "MONDILEX" edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes annotated parallel, comparable and speech corpora with morphosyntactic lexica and specifications.

Version 4 of MULTEXT-East resources, a substantial part of which was produced in the EU MONDILEX project, adds new languages and makes the resources uniformly available in TEI P5 XML. This dataset, unique in terms of languages and the wealth of encoding, is freely available for research purposes, according to the MULTEXT-East research licence; a local text copy is provided for reference.

The paper describing this version of the resources is:
Tomaž Erjavec: MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Malta, 19-21 May 2010. ELRA, Paris, 2010.

Please acknowledge the use of resources in publications by citing the above paper, or, if more relevant, others from the home page of MULTEXT-East Version 4.

The rest of this document is structured as follows: in the next subsection, we give a synopsis of the languages that this distribution provides resources for. Section 2 ‘MULTEXT-East resources’ describes the resources offered in detail, Section 3 revisits the ‘1984’ corpus, Section 4 presents the distribution of the resources, and Section 5 lists the contributors, i.e. contact points for particular languages, and acknowledges those who were involved in producing the resources.

1.1. The languages of MULTEXT-East

Below we list the languages represented in the resources, with the link to their respective Ethnologue ISO 639 entries.

2. MULTEXT-East resources

This section gives the details of the MULTEXT-East resources Version 4: Section ‘MULTEXT-East morphosyntactic resources’ details the three linked word-level syntactic resources: the specification for morphosyntactic descriptions, the morphosyntactic lexica, and word-level annotated corpus (‘1984’); Section ‘MULTEXT-East cesDoc corpus’ introduces the structurally marked up corpus, consisting of a parallel part (again ‘1984’), two comparable parts (fiction, newspapers), and a small speech corpus; Section ‘MULTEXT-East ‘1984’ corpus’ revisits this corpus and recaps the relevant data in somewhat more detail.

2.1. MULTEXT-East morphosyntactic resources

The morphosyntactic resources consist of three layers:
  1. The morphosyntactic specifications, which set out the grammar and vocabulary of morphosyntactic features and descriptions. They specify, for each language, the set of valid morphosyntactic descriptions (MSDs) and the features they correspond to, e.g. they specify that the MSD Ncnp is valid for English and maps to the feature structure Noun, Type:common, Gender:neuter, Number:plural.
  2. The morphosyntactic lexicons, which can contain either the full inflectional paradigms of selected lemmas or entries for corpus attested word-forms. Each entry gives the word-form, its lemma and MSD, e.g.,
    clocks clock Ncnp
  3. The morphosyntactically annotated ‘1984’ corpus, where each word is assigned its context disambiguated MSD and lemma, e.g.,
    <w lemma="clock" ana="#Ncnp">clocks</w>

In the next sections we detail each of these layers in turn.
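The mapping from an MSD string to its features can be illustrated with a minimal Python sketch. The lookup tables below are hypothetical and cover only the single example given above (Ncnp for English); the authoritative tables are those in the morphosyntactic specifications. The treatment of "-" as a placeholder for a non-applicable attribute is likewise an assumption of this sketch.

```python
# Hypothetical lookup tables covering only the English noun example from the
# text; the real attribute-value tables are defined in the specifications.
CATEGORIES = {"N": "Noun"}
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common"}),
    ("Gender", {"n": "neuter"}),
    ("Number", {"p": "plural"}),
]


def decode_msd(msd):
    """Expand a positional MSD string into (attribute, value) pairs."""
    features = [("CATEGORY", CATEGORIES[msd[0]])]
    # Each character after the first encodes the value of one attribute,
    # in the fixed order given by the attribute table.
    for code, (name, values) in zip(msd[1:], NOUN_ATTRIBUTES):
        if code == "-":  # assumed marker for a non-applicable attribute
            continue
        features.append((name, values[code]))
    return features


print(decode_msd("Ncnp"))
```

The sketch treats the MSD as a positional code: the first character selects the category, and each following character selects one value of the attribute at that position.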

2.1.1. Morphosyntactic specifications

Languages: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, Ukrainian.

The syntax and semantics of the morphosyntactic descriptions (MSDs) are given in the MULTEXT-East morphosyntactic specifications, which have been developed in the formalism and on the basis of specifications for six Western European languages of the MULTEXT project and in cooperation with EAGLES, the Expert Advisory Group on Language Engineering Standards.

Originally, these specifications were released as a report of the MULTEXT-East project, but they have since been extensively revised. Nevertheless, the specifications are still structured as a report, and contain introductory chapters, followed by the list of defined categories (parts-of-speech), and then, for each category, a table of attribute-values and the languages the features are appropriate for. These so-called common tables are followed by language-particular sections. Each language section is further subdivided, and can contain feature co-occurrence restrictions, examples, notes, and full lists of valid MSDs, as well as localisation information.

The complete specifications are an XML document, encoded as a TEI P5 schema. Further information about these specifications is available in their TEI header.

In the distribution, the specifications are given in the source TEI P5 encoding and in several derived formats:
msd/html/
The specifications as HTML
msd/tables/
Conversion tables for MSDs in various formats
msd/xml/
The specifications in the source TEI XML
msd/xslt/
XSLT scripts for various conversions over the specifications. They work, after a fashion, but could definitely be better coded and documented.

2.1.2. Lexicons

Languages: Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, Ukrainian.

The MULTEXT-East morphosyntactic lexicons have a simple structure, where each lexical entry is composed of three fields:
  • the word-form, which is the inflected form of the word, as it appears in the text, modulo sentence-initial capitalisation;
  • the lemma, which is the base-form of the word;
  • the MSD, i.e., the morphosyntactic description of the word-form.
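A lexical entry of this shape could be read as in the following Python sketch. It assumes that the three fields are separated by whitespace, as in the sample entry "clocks clock Ncnp" shown earlier; the actual field separator and any further conventions are documented in the README of the lex/ directory, and the names LexEntry and parse_entry are illustrative only.

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    """One lexicon line: word-form, lemma and MSD (illustrative type)."""
    word_form: str
    lemma: str
    msd: str


def parse_entry(line):
    """Split a lexicon line into its three fields.

    Assumes whitespace-separated fields; raises ValueError otherwise.
    """
    word_form, lemma, msd = line.split()
    return LexEntry(word_form, lemma, msd)


entry = parse_entry("clocks clock Ncnp")
print(entry.word_form, entry.lemma, entry.msd)
```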

The sizes of the MULTEXT-East lexica vary considerably between the languages. A few are quite limited, and serve more as a proof of concept and to assign lexical entries to the MSDs, but most contain over 20,000 lemmas and can serve as medium-sized morphological lexica for the languages. In addition to explicating the inflectional behaviour of the most common (and, typically, morphologically the most complex) words of the languages, the lexica also served to establish the definitive set of valid MSDs for the languages.

The lexica themselves are available in the directory lex/, where there is also a README file (note that WWW access to this directory is restricted to licence holders).

2.1.3. Linguistically annotated ‘1984’

Languages: Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovene.

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel corpus contains the novel in the English original (about 100,000 words in length), and its translations into a number of languages.

This corpus, also called the cesAna corpus, contains word level markup, namely context disambiguated lemmas and MSDs, e.g., <w lemma="it" ana="#Pp3ns">it</w>.
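Reading the word-level markup can be sketched in Python with the standard library XML parser. The snippet parses only the isolated element quoted above; the actual corpus files are full TEI documents and may declare a namespace, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# The sample token from the text, parsed on its own.
token = ET.fromstring('<w lemma="it" ana="#Pp3ns">it</w>')

word_form = token.text
lemma = token.get("lemma")
# The ana attribute is a pointer of the form "#MSD"; drop the leading "#".
msd = token.get("ana").lstrip("#")

print(word_form, lemma, msd)
```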

Each MSD is linked to a TEI feature-structure library (automatically derived from the specifications, and written in the back matter of the novel), which gives for each MSD its decomposition into features, e.g.
<fs xml:id="Pp3ns" xml:lang="en" feats="#P0. #P1.p #P2.3 #P3.n #P4.s"/>
and, in the feature library:
<f name="CATEGORY" xml:id="P0." xml:lang="en"><symbol value="Pronoun"/></f>
<f name="Type" xml:id="P1.p" xml:lang="en"><symbol value="personal"/></f>
<f name="Person" xml:id="P2.3" xml:lang="en"><symbol value="third"/></f>
...
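Following the pointers from an MSD to its features can be sketched in Python as below. The fragments quoted above are wrapped in a hypothetical <lib> element so that they parse as one document; only the three features shown in the text are included, so the pointers to the elided features (#P3.n, #P4.s) are skipped rather than guessed. The function name expand is illustrative.

```python
import xml.etree.ElementTree as ET

# Expanded name under which ElementTree exposes the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The fragments from the text, wrapped in a hypothetical <lib> root element.
SAMPLE = """<lib>
<fs xml:id="Pp3ns" xml:lang="en" feats="#P0. #P1.p #P2.3 #P3.n #P4.s"/>
<f name="CATEGORY" xml:id="P0." xml:lang="en"><symbol value="Pronoun"/></f>
<f name="Type" xml:id="P1.p" xml:lang="en"><symbol value="personal"/></f>
<f name="Person" xml:id="P2.3" xml:lang="en"><symbol value="third"/></f>
</lib>"""

root = ET.fromstring(SAMPLE)

# Index the feature library by xml:id.
features = {
    f.get(XML_ID): (f.get("name"), f.find("symbol").get("value"))
    for f in root.iter("f")
}


def expand(msd):
    """Return the (name, value) pairs that the <fs> for this MSD points to."""
    for fs in root.iter("fs"):
        if fs.get(XML_ID) == msd:
            refs = [ref.lstrip("#") for ref in fs.get("feats").split()]
            # Features not present in this truncated sample are skipped.
            return [features[ref] for ref in refs if ref in features]
    raise KeyError(msd)


print(expand("Pp3ns"))
```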

The translations of ‘1984’ are sentence aligned with the English original, with hand-validated alignments. From these en-xx alignments the other bilingual alignments and one multilingual alignment were produced automatically. The sentence alignments are stand-off, i.e. encoded in separate files with pointers to the aligned sentence identifiers. The alignments are encoded as TEI link groups, containing links of the form:
<link n="1:2" targets="oana-bg.xml#Obg.1.1.6.7 oana-ro.xml#Oro.1.2.7.7 oana-ro.xml#Oro.1.2.7.8"/>
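A stand-off link of this form can be unpacked into per-file sentence identifiers, as in the following Python sketch. It parses only the sample link quoted above; the function name parse_link is illustrative. In this sample the n attribute value "1:2" matches the number of sentences on each side (one Bulgarian, two Romanian), but that reading is an inference from the example rather than a statement from the corpus documentation.

```python
import xml.etree.ElementTree as ET

# The sample link from the text.
link = ET.fromstring(
    '<link n="1:2" targets="oana-bg.xml#Obg.1.1.6.7 '
    'oana-ro.xml#Oro.1.2.7.7 oana-ro.xml#Oro.1.2.7.8"/>'
)


def parse_link(link):
    """Group the sentence identifiers of one link by the file they point into."""
    by_file = {}
    for target in link.get("targets").split():
        # Each target has the form "filename#sentence-id".
        filename, _, sentence_id = target.partition("#")
        by_file.setdefault(filename, []).append(sentence_id)
    return by_file


aligned = parse_link(link)
for filename, sentence_ids in aligned.items():
    print(filename, sentence_ids)
```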

The corpus is further documented in its TEI headers. Component files are available in the directory ana/ (note that WWW access to this directory is restricted to licence holders).

2.2. MULTEXT-East cesDoc corpus

Languages: Bulgarian, Czech, English, Estonian, Hungarian, Romanian, Slovene, Lithuanian, (partial).

This corpus encodes mostly (quite rich) structural information about the component texts. It consists of two comparable corpora (fiction, news), the parallel ‘1984’ corpus and a small multilingual parallel speech corpus. The corpus and its components are further documented in their TEI headers, and also in the original MULTEXT-East report D2.1 F: Corpus Collection and Preparation, although the information there is no longer current in all respects.

The corpus is available in the directory crp/, where the driver file is ‘mte-cesdoc.xml’ (note that WWW access to this directory is restricted to licence holders).

2.2.1. Comparable corpus components

Languages: Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian.

The multilingual comparable corpus contains a fiction part and a news part, where the data is comparable across the languages in terms of the number and size of texts; each of the 12 parts has approx. 100,000 words. The corpus is structurally marked up with over 40 different elements; however, sub-paragraph markup has not been harmonised across the languages.

2.2.2. Structural ‘1984’ and alignments

Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, (Latvian), Lithuanian, Serbian, Russian.

The multilingual parallel corpus consists of the novel ‘1984’, about 100,000 words in length. The corpus contains extensive headers and markup for document structure, sentences, and various sub-sentence annotations, similar to those of the comparable corpus but better harmonised across languages.

The translations of ‘1984’ are sentence aligned with the English original, with hand-validated alignments. From these en-xx alignments the other bilingual alignments and one multilingual alignment were produced automatically. The sentence alignments are stand-off, i.e. encoded in separate files with pointers to the aligned sentence identifiers. The alignments are encoded as TEI link groups, containing links of the form:
<link n="1:2" targets="oana-bg.xml#Obg.1.1.6.7 oana-ro.xml#Oro.1.2.7.7 oana-ro.xml#Oro.1.2.7.8"/>

2.2.3. Speech corpus components

Languages: Romanian, Slovene, Estonian, Hungarian, (English, Czech, Bulgarian)

MULTEXT-East produced a small corpus of spoken texts taken from the EUROM-1 speech corpus. It comprises the translations (from English) of forty short passages of five thematically connected sentences. This written part of the corpus is included together with the other parts of the cesDoc corpus.

For four languages, the texts have also been read, recorded and included in the distribution. The corpus texts contain links to the spoken passages stored as .wav files. The speech files are, due to their size, stored and distributed in a separate bundle.

The speech corpus and its components are further documented in their TEI headers. The corpus is available in the directory spc/.

3. MULTEXT-East ‘1984’ corpus

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This annotated parallel corpus contains the novel in the English original (about 100,000 words in length), and its translations into a number of languages.

George Orwell
Nineteen Eighty-Four


Big Brother is Watching You!
It was an enormous pyramidal structure of glittering white concrete, soaring up, terrace after terrace, 300 metres into the air. From where Winston stood it was just possible to read, picked out on its white face in elegant lettering, the three slogans of the Party:
War is peace Freedom is slavery Ignorance is strength
Războiul este pace Libertatea este sclavie Ignoranţa este putere
Vojna je mir Svoboda je suženjstvo Nevednost je moč
Válka je mír Svoboda je otroctví Nevědomost je síla
Войната е мир Свободата е робство Невежеството е сила
Sõda on rahu Vabadus on orjus Teadmatus on jõud
Rat je mir Sloboda je ropstvo Neznanje je moć
A háború: béke A szabadság: szolgaság A tudatlanság: erő
Karas — tai taika Laisvė — tai vergija Nežinomas — tai jėga
Rat je mir Sloboda je ropstvo Neznanje je moć
Война — это мир Свобода — это рабство Незнание — сила
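The novel exists in two versions: the structurally annotated cesDoc version (‘orwl’ files) and the morphosyntactically annotated cesAna version (‘oana’ files).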

The translations of ‘1984’ are sentence aligned with the English original, with hand-validated alignments. From these en-xx alignments the other bilingual alignments and one multilingual alignment were produced automatically. The sentence alignments are stand-off, i.e. encoded in separate files with pointers to the aligned sentence identifiers. The alignments are encoded as TEI link groups, containing links of the form:
<link n="1:2" targets="oana-bg.xml#Obg.1.1.6.7 oana-ro.xml#Oro.1.2.7.7 oana-ro.xml#Oro.1.2.7.8"/>

The first edition of the "1984" corpus is detailed in the MULTEXT-East project reports MTE D2.1 F "‘1984’ Corpus Collection" (with Appendix) and MTE D2.3 F "Sentence Alignment" and "Morphosyntactic Tagging". However, note that the encoding has changed substantially from the first MULTEXT-East version described in these reports.

The details on the encoding for Version 4 are to be found in the TEI headers in the two corpora, i.e. the cesDoc "orwl" TEI headers and cesAna "oana" TEI headers.

The details on the morphosyntactic descriptions, MSDs, used to word-tag V4 of the corpus are given in the MULTEXT-East Morphosyntactic Specifications.

4. Distribution

The resources are mounted on the Web, on the MULTEXT-East Version 4 Web site. The documentation and the morphosyntactic specification are freely available, while the corpora and lexica are restricted to research use only. To get access to these resources, the Web based MULTEXT-East research licence should be filled out and submitted. The password is then sent by email.

Registered users can browse the full resources on-line and download individual files. The resources are also distributed as ZIP archives. Due to the size of the resources and the fact that different users are likely to use only parts, they are available not only as a complete download, but also split by resource type:
mteV4(date).zip (168 MB zip / 615 MB unpacked)
The complete V4 resources: documentation, MSD specification, corpora, lexica, speech
mteV4(date)-doc.zip (17 MB / 120 MB)
The MSD specifications and documentation - those fetching other partial archives are advised to download also the -doc.
mteV4(date)-ana.zip (43 MB / 330 MB)
The morphosyntactic lexica, and cesAna (MSD annotated "1984") corpus.
mteV4(date)-crp.zip (9 MB / 45 MB)
The cesDoc (structurally annotated) corpus.
mteV4(date)-spc.zip (101 MB / 120 MB)
The speech corpus.
After downloading, unpack the archives. For example, suppose the distribution was made on 2010-05-14 and you want everything except speech. You would then download -doc, -crp, and -ana and, on a Unix machine, run:
$ unzip mteV4-2010-05-14-doc.zip
$ unzip mteV4-2010-05-14-crp.zip
$ unzip mteV4-2010-05-14-ana.zip
All three archives unpack into the same directory, mteV4-2010-05-14/.

5. Contributors

This section lists the contributors of the MULTEXT-East resources (Version 4), starting with the partners of the MULTEXT-East project, the partners of the TELRI concerted action that contributed resources to MULTEXT-East, followed by MONDILEX and other contributors, and, finally, the acknowledgements to other people that made the production of these resources possible. Further information on who did what can be found in the corpus headers and the title page of the morphosyntactic specifications. Next to each partner is listed also their responsibility; for resources of the particular languages, this contact point should also be used for further inquiries.

5.1. MULTEXT-East Partners

Listed below are the partners of the original Copernicus MULTEXT-East project.

5.1.1. Aix-en-Provence

Responsibility:
MULTEXT-East coordinator
People:
Jean Véronis (multext at univ-aix.fr)
Organisation:
Laboratoire Parole et Langage
Centre National de la Recherche Scientifique
Aix-en-Provence, France
Address:
29, Av. Robert Schuman
13621 Aix-en-Provence Cedex 1

5.1.2. Vassar

Responsibility:
Corpus encoding, English data
People:
Nancy Ide (ide at cs.vassar.edu)
Greg Priest-Dorman (priestdo at cs.vassar.edu)
Organisation:
Dept. of Computer Science
Vassar College
Address:
124 Raymond Avenue
Poughkeepsie, NY 12604-0520

5.1.3. Pisa

Responsibility:
Morphosyntactic specification (associate partner)
People:
Nicoletta Calzolari
Monica Monachini
Organisation:
Istituto di Linguistica Computazionale
Consiglio Nazionale delle Ricerche
Pisa, Italy

5.1.4. Sofia

Responsibility:
Bulgarian data
People:
Radoslav Pavlov
Ludmila Dimitrova
Lydia Sinapova
Kiril Simov
Organisation:
Department of Mathematical Linguistics
Institute of Mathematics and Informatics
Bulgarian Academy of Sciences
Address:
8 Acad. G. Bonchev Street
BG-1113 Sofia
Bulgaria

5.1.5. Prague

Responsibility:
Czech data
People:
Vladimír Petkevič (vladimir.petkevic at ff.cuni.cz)
Organisation:
Institute of Theoretical and Computational Linguistics
Faculty of Philosophy
Charles University
Address:
Celetna 13
11000 Prague 1
Czech Republic
Subcontractor:
People:
Zdenek Laciga
Organisation:
BYLL Software, Ltd.
Address:
Anenske nam. 2
110 00 Prague 1
Czech Republic

5.1.6. Tartu

Responsibility:
Estonian data
People:
Heiki-Jaan Kaalep (hkaalep at psych.ut.ee)
Organisation:
Department of General Linguistics
Tartu University
Address:
Ülikooli 18
EE-2400 Tartu
Estonia

5.1.7. Budapest

Responsibility:
Hungarian data
People:
Laszlo Tihanyi (Tihanyi at nytud.hu)
Csaba Oravecz (oravecz at nytud.hu)
Organisation:
Research Institute for Linguistics
Hungarian Academy of Sciences
Address:
Szinhaz u. 5-9 - P.O. Box 19
H-1250 Budapest
Hungary
Subcontractor:
People:
Gabor Proszeky
Organisation:
MorphoLogic
Address:
Fo u. 56-58 I/3
H-1011 Budapest
Hungary

5.1.8. Bucharest

Responsibility:
Romanian data
People:
Dan Tufiş (tufis at racai.ro)
Ştefan Bruda
Mihai Ciocoiu
Organisation:
Center for Research in Machine Learning, Natural Language Processing and Conceptual Modelling
Romanian Academy of Sciences
Address:
Casa Academiei "13 Septembrie" 13
Bucharest 71102
Romania

5.1.9. Ljubljana

Responsibility:
Slovene data, corpus workpackage leader
People:
Tomaž Erjavec
Organisation:
Dept. of Knowledge Technologies
Jožef Stefan Institute
Address:
Jamova cesta 39
SI-1000 Ljubljana
Slovenia
Subcontractor:
People:
Miro Romih (miro.romih at amebis.si)
Peter Holozan (peter.holozan at amebis.si)
Organisation:
AMEBIS d.o.o.
Address:
Jakopičeva 6
61240 Kamnik
Slovenia

5.2. TELRI Partners

5.2.1. Riga

Responsibility:
Latvian data
People:
Andrejs Spektors (aspekt at mii.lu.lv)
Organisation:
Artificial Intelligence Laboratory
Institute of Mathematics and Computer Science
University of Latvia
Address:
29, Raina bulv.
LV-1459 Riga
Latvia

5.2.2. Kaunas

Responsibility:
Lithuanian data
People:
Andrius Utka (andrius at donelaitis.vdu.lt)
Ruta Marcinkeviciene (ruta.marcinkeviciene at vdu.lt)
Organisation:
Center of Computational Linguistics
Faculty of Humanities
Vytautas Magnus University
Address:
S. Daukanto 28
3000 Kaunas
Lithuania

5.2.3. Belgrade

Responsibility:
Serbian data
People:
Cvetana Krstev (cvetana at matf.bg.ac.rs)
Duško Vitas (vitas at matf.bg.ac.rs)
Organisation:
Faculty of Mathematics
Belgrade University
Address:
Studentski trg 16
11000 Belgrade
Republic of Serbia

5.2.4. Zagreb

Responsibility:
Croatian data
People:
Marko Tadić (marko.tadic at ffzg.hr)
Organisation:
Department of linguistics
Faculty of Philosophy,
University of Zagreb
Address:
Ivana Lučića 3
HR-10000 Zagreb
Croatia

5.3. Additional Contributors

5.3.1. Severodonetsk

Responsibility:
Russian cesDoc "1984"
People:
Paul Sokolovsky (Paul.Sokolovsky at technologist.com)
Sergey Sryvkin
Organisation:
Severodonetsk Institute of Technology
East-Ukraine State University
Address:
Sovetsky st., bl. 3a
Severodonetsk, Lugansk reg.
Ukraine

5.3.2. Leeds

Responsibility:
Russian morphosyntactic specifications and lexicon
Person:
Serge Sharoff
Organisation:
University of Leeds,
Person:
Mikhail Kopotev
Organisation:
University of Helsinki
Person:
Tomaž Erjavec
Organisation:
Dept. of Knowledge Technologies
Jožef Stefan Institute
Person:
Anna Feldman
Organisation:
Montclair State University,
Person:
Dagmar Divjak
Organisation:
University of Sheffield

5.3.3. Padova

Responsibility:
Resian data
People:
Han Steenwijk (han.steenwijk at unipd.it)
Organisation:
Department of Anglo-Germanic and Slavic Linguistics and Literature,
Padova University

5.3.4. Skopje

Responsibility:
Macedonian data
People:
Katerina Čundeva-Zdravkova, Aleksandar Petrovski
Organisation:
Institute for Informatics
Ss. Cyril and Methodius University of Skopje

5.3.5. Galway

Responsibility:
Persian data
People:
Behrang QasemiZadeh
Organisation:
Unit for Natural Language Processing, DERI
National University of Ireland, Galway

5.4. Contributors in the scope of the MONDILEX project

5.4.1. Bratislava

Responsibility:
Slovak data
People:
Radovan Garabík
Organisation:
Ľudovít Štúr Institute of Linguistics
Slovak Academy of Sciences

5.4.2. Warsaw

Responsibility:
Polish data
Person:
Natalia Kotsyba (natalia at ibi.uw.edu.pl)
Organisation:
Institute for Interdisciplinary Studies "Artes Liberales"
Warsaw University
Address:
Krakowskie Przedmieście 26/28
00-046 Warsaw
Poland
Person:
Adam Radziszewski
Organisation:
Department of Artificial Intelligence
Institute of Informatics
Wroclaw University of Technology
Address:
ul. Wybrzeze Wyspianskiego 27
50-370 Wroclaw
Poland
Person:
Ivan Derzhanski
Organisation:
Department of Mathematical Linguistics
Institute for Mathematics and Computer Science
Address:
8 Acad G Bonchev St
1113 Sofia
Bulgaria

5.4.3. Kyiv

Responsibility:
Ukrainian data
Person:
Natalia Kotsyba (natalia at ibi.uw.edu.pl)
Organisation:
Institute for Interdisciplinary Studies "Artes Liberales"
Warsaw University
Address:
Krakowskie Przedmieście 26/28
00-046 Warsaw
Poland
Person:
Igor Shevchenko
Organisation:
Ukrainian Linguistic-Information Fund
National Academy of Sciences of Ukraine
Address:
54 Volodymyrska str.
01601 Kyiv
Ukraine
Person:
Ivan Derzhanski
Organisation:
Department of Mathematical Linguistics
Institute for Mathematics and Computer Science
Address:
8 Acad G Bonchev St
1113 Sofia
Bulgaria

5.5. Acknowledgments

Apart from the institutions and people listed above, the following people have greatly contributed to the production of the MULTEXT-East language resources: Renata Anžič, Liviu Anca, Ana-Maria Barbu, Aleksandra Bizjak, Damjan Bojadžiev, Lydia Bozhilova, Aleš Dobinikar, Külli Habicht, Daniel Hirst, Milena Hnátková, Primož Jakopin, Riina Mosna, Kadri Muischnek, Mircea Nicolescu, Heili Orav, Vasile Pătraşcu, Leho Paldre, Tsvetan Petrov, Helen Potter, Andriela Rääbis, Georgiana Rotariu, Zygmunt Saloni, Matjaž Sešek, Tanja Semen, Urve Talvik, Bojana Todorovič, Elias Treeman, Viire Villandi, Olga Vuković, Marcin Woliński.

Work on MULTEXT-East resources was supported by the European Union project MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages, Copernicus 106, and in part by the US National Science Foundation grant IRI-9413451. The MULTEXT-East project results have been greatly enhanced by the EU Concerted Action TELRI: Trans-European Language Resources Infrastructure. In particular, additional language resources have been produced, and the project's material organised for CD distribution. Work on the second release of the MULTEXT-East resources was supported by EU Copernicus Project PL96-1142 CONCEDE: Consortium for Central European Dictionary Encoding, while the work on the third release was partially funded by grants from the National Endowment for the Humanities, in the scope of the TEI Task Force on SGML to XML migration. The fourth release was largely supported by the EU FP7 Capacities - Research Infrastructures project MONDILEX "Conceptual Modelling of Networking of Centres for High-Quality Research in Slavic Lexicography and Their Digital Resources". The work on the resources has been additionally supported by bilateral projects between Slovenia and Serbia and between Slovenia and Macedonia, and by individual partners' grants and contracts.



Tomaž Erjavec. Date: 2010-05-13
This work is licensed under the Creative Commons Attribution-No Derivative Works 3.0.