2024-09-23

1. Introduction

This document describes the sixth, "CLARIN" edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes annotated parallel, comparable and speech corpora with morphosyntactic lexica and specifications.

Version 6 of the MULTEXT-East resources, a substantial part of which was produced in the scope of the CLARIN research infrastructure, updates the Macedonian morphosyntactic specifications and adds specifications for Serbo-Croatian (meant to cover the Croatian, Serbian, Bosnian and Montenegrin languages), for Albanian, for the Torlak dialect of Serbian, and the so-called "Damaskini" specifications, developed especially for a diachronic corpus of Balkan Slavic texts from the 16th-19th centuries. The maintenance of the specifications has also moved to GitHub. The other resources, in particular the corpora and lexica, currently remain at version 4.

There are so far no publications describing this version of the resources, but a general overview of MULTEXT-East is available in:

Please acknowledge the use of the resources in publications by citing one of the above papers, or, if more relevant, others from the home page of MULTEXT-East Version 6.

The rest of this document is structured as follows: in the next subsection, we give a synopsis of the languages that this distribution provides resources for. Section 2 describes the resources offered in detail, Section 3 revisits the ‘1984’ corpus, and Section 4 lists the contributors, i.e. contact points for particular languages, and acknowledges those who were involved in producing the resources.

1.1. The languages of MULTEXT-East

Below we list the languages represented in the resources, with the link to their respective Wikipedia entries.

2. MULTEXT-East resources

This section gives the details of the MULTEXT-East resources Version 6. Section 2.1 (MULTEXT-East morphosyntactic resources) details the three linked word-level syntactic resources: the specifications for morphosyntactic descriptions, the morphosyntactic lexica, and the word-level annotated corpus (‘1984’). Section 2.2 (MULTEXT-East cesDoc corpora) introduces the structurally marked-up corpus, consisting of a parallel part (again ‘1984’), two comparable parts (fiction, newspapers), and a small speech corpus. Section 3 (MULTEXT-East ‘1984’ corpus) revisits this corpus and recaps the relevant data in somewhat more detail.

2.1. MULTEXT-East morphosyntactic resources

The morphosyntactic resources consist of three layers:

  1. The morphosyntactic specifications, which set out the grammar and vocabulary of morphosyntactic features and descriptions. They specify, for each language, the set of valid morphosyntactic descriptions, MSDs (or, more simply, PoS tags), and the features they correspond to. For example, they specify that the MSD Ncnp is valid for English and maps to the feature structure Noun, Type:common, Gender:neuter, Number:plural.
  2. The morphosyntactic lexicons, which can contain either the full inflectional paradigms of selected lemmas or entries for corpus attested word-forms. Each entry gives the word-form, its lemma and MSD, e.g.,
    clocks clock Ncnp
  3. The morphosyntactically annotated ‘1984’ corpus, where each word is assigned its context disambiguated MSD and lemma, e.g.,
    <w lemma="clock" ana="#Ncnp">clocks</w>
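
The mapping from an MSD to its features can be illustrated with a minimal Python sketch. It assumes, as the Ncnp example suggests, that the first character of an MSD encodes the category and each following character the value of one attribute, in a fixed order. The tables CATEGORIES and ATTRIBUTES and the function decode_msd are hypothetical and cover only the English example given above; the authoritative tables are those in the morphosyntactic specifications.

```python
# Hypothetical, partial tables covering only the example in this document;
# the real attribute-value tables are defined in the specifications.
CATEGORIES = {"N": "Noun"}
ATTRIBUTES = {
    "N": [
        ("Type", {"c": "common"}),
        ("Gender", {"n": "neuter"}),
        ("Number", {"p": "plural"}),
    ],
}


def decode_msd(msd):
    """Expand a positional MSD string into (attribute, value) pairs."""
    category_code, value_codes = msd[0], msd[1:]
    features = [("CATEGORY", CATEGORIES[category_code])]
    # Pair each attribute slot of the category with the code at that position.
    for (attribute, values), code in zip(ATTRIBUTES[category_code], value_codes):
        features.append((attribute, values[code]))
    return features


print(decode_msd("Ncnp"))
```

A lookup failure (KeyError) in this sketch simply means the code is not in the toy tables, not that the MSD is invalid.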

In the next sections we detail each of these layers in turn.

2.1.1. Morphosyntactic specifications

The syntax and semantics of the morphosyntactic descriptions (MSDs) are given in the MULTEXT-East morphosyntactic specifications. The specifications have been developed in the formalism and on the basis of the specifications for six Western European languages of the EU MULTEXT project from the 1990s, and in cooperation with EAGLES, the Expert Advisory Group on Language Engineering Standards. The first version of these specifications was released as a report of the MULTEXT-East project, but it has since been extensively revised. Nevertheless, the specifications are still structured as a report: they contain introductory chapters, followed by the list of defined categories (parts-of-speech), and then, for each category, a table of attribute-values and the languages the features are appropriate for. These so-called common tables are followed by language-particular sections. Each language section is further subdivided, and can contain feature co-occurrence restrictions, examples, notes, and full lists of valid MSDs, as well as localisation information.

The complete specifications are an XML document, encoded according to the TEI Guidelines.

The specifications are maintained on https://github.com/clarinsi/mte-msd, and, for continuity, mirrored here, i.e. on https://nl.ijs.si/ME/V6/msd/

The source TEI encoding is down-converted to several derived formats:

  • html: readable format;
  • tables: MSD tables, including TEI MSDs as feature structures;
  • cooked TEI: canonical specifications in TEI;
  • source TEI: editable TEI specifications on GitHub;
  • xslt: conversion XSLT (and Perl) scripts on GitHub.

2.1.2. Morphosyntactic Lexicons

Languages: Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, Ukrainian.

The MULTEXT-East morphosyntactic lexicons have a simple structure, where each lexical entry is composed of the following fields:

  • the word-form, which is the inflected form of the word, as it appears in the text, modulo sentence-initial capitalisation;
  • the lemma, which is the base-form of the word;
  • the MSD, i.e., the morphosyntactic description of the word-form;
  • optionally, the frequency of the triplet in some reference corpus.
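
As an illustration of this entry structure, the following Python sketch reads one entry into a named tuple. It assumes that the fields are separated by whitespace, as in the example "clocks clock Ncnp", and that the optional frequency, when present, is an integer in a fourth field; the actual field separator and layout of the distributed lexica should be checked against the files themselves. The names LexiconEntry and parse_entry are hypothetical.

```python
from typing import NamedTuple, Optional


class LexiconEntry(NamedTuple):
    word_form: str
    lemma: str
    msd: str
    frequency: Optional[int] = None  # only present in some lexica


def parse_entry(line):
    """Split one lexicon line into word-form, lemma, MSD and optional frequency."""
    fields = line.split()  # assumption: whitespace-separated fields
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 fields, got {len(fields)}: {line!r}")
    frequency = int(fields[3]) if len(fields) == 4 else None
    return LexiconEntry(fields[0], fields[1], fields[2], frequency)


entry = parse_entry("clocks clock Ncnp")
print(entry.word_form, entry.lemma, entry.msd, entry.frequency)
```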

The sizes of the MULTEXT-East lexica vary considerably between the languages. A few are quite limited, and serve more as a proof of concept and to assign lexical entries to the MSDs, but most contain over 20,000 lemmas and can serve as medium-sized morphological lexica for the languages. In addition to explicating the inflectional behaviour of the most common (and, typically, morphologically the most complex) words of the languages, the lexica also served to establish the definitive set of valid MSDs for the languages.

The lexica are available from the CLARIN.SI repository:

2.1.3. Linguistically annotated ‘1984’

Languages: Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovene.

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel corpus contains the novel in the English original (about 100,000 words in length), and its translations into a number of languages.

This corpus, also called the cesAna corpus, contains word level markup, namely context disambiguated lemmas and MSDs, e.g., <w lemma="it" ana="#Pp3ns">it</w>.
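
A token annotated in this way can be read with Python's standard xml.etree.ElementTree module, as in the sketch below. It works on the isolated fragment shown above; in the full corpus files the elements may be in a namespace, which this sketch does not handle. The function name read_token is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_token(xml_fragment):
    """Return the word-form, lemma and MSD of a single <w> element."""
    w = ET.fromstring(xml_fragment)
    return {
        "form": w.text,
        "lemma": w.get("lemma"),
        # The ana attribute is a pointer such as "#Pp3ns"; drop the leading "#".
        "msd": w.get("ana").lstrip("#"),
    }


print(read_token('<w lemma="it" ana="#Pp3ns">it</w>'))
```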

Each MSD is linked to a TEI feature-structure library (automatically derived from the specifications, and written in the back matter of the novel), which gives for each MSD its decomposition into features, e.g.
<fs xml:id="Pp3ns" xml:lang="en" feats="#P0. #P1.p #P2.3 #P3.n #P4.s"/>
and, in the feature library:
<f name="CATEGORY" xml:id="P0." xml:lang="en"><symbol value="Pronoun"/></f>
<f name="Type" xml:id="P1.p" xml:lang="en"><symbol value="personal"/></f>
<f name="Person" xml:id="P2.3" xml:lang="en"><symbol value="third"/></f>
...
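
The following Python sketch shows one way to follow the pointers in the feats attribute to the feature definitions. It wraps the excerpt above in a hypothetical lib root element so that it parses as one document, and it silently skips pointers whose targets are not part of the excerpt (here #P3.n and #P4.s). The function name resolve_features is likewise an assumption; namespaces used in the real corpus files are not handled.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The excerpt from this section, wrapped in a hypothetical <lib> root.
SAMPLE = """<lib>
<fs xml:id="Pp3ns" xml:lang="en" feats="#P0. #P1.p #P2.3 #P3.n #P4.s"/>
<f name="CATEGORY" xml:id="P0." xml:lang="en"><symbol value="Pronoun"/></f>
<f name="Type" xml:id="P1.p" xml:lang="en"><symbol value="personal"/></f>
<f name="Person" xml:id="P2.3" xml:lang="en"><symbol value="third"/></f>
</lib>"""


def resolve_features(root, msd):
    """Return (name, value) pairs for the features an MSD points to."""
    features = {f.get(XML_ID): f for f in root.iter("f")}
    fs = next(e for e in root.iter("fs") if e.get(XML_ID) == msd)
    resolved = []
    for pointer in fs.get("feats").split():
        f = features.get(pointer.lstrip("#"))
        if f is None:
            continue  # definition not included in this excerpt
        resolved.append((f.get("name"), f.find("symbol").get("value")))
    return resolved


print(resolve_features(ET.fromstring(SAMPLE), "Pp3ns"))
```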

The translations of ‘1984’ are sentence-aligned with the English original, with hand-validated alignments. From these en-xx alignments, the other bilingual alignments and one multilingual alignment were produced automatically. The sentence alignments are stand-off, i.e. encoded in separate files with pointers to the aligned sentence identifiers. The alignments are encoded as TEI link groups, containing links of the form:
<link n="1:2" targets="oana-bg.xml#Obg.1.1.6.7 oana-ro.xml#Oro.1.2.7.7 oana-ro.xml#Oro.1.2.7.8"/>
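
A stand-off link of this form can be unpacked with a short Python sketch: each pointer in targets is split at "#" into a file name and a sentence identifier, and the identifiers are grouped per file. The function name parse_link is hypothetical. The n attribute is returned unchanged; in this example its value "1:2" coincides with one Bulgarian and two Romanian sentences, but the sketch does not rely on that reading.

```python
import xml.etree.ElementTree as ET

LINK = (
    '<link n="1:2" targets="oana-bg.xml#Obg.1.1.6.7 '
    'oana-ro.xml#Oro.1.2.7.7 oana-ro.xml#Oro.1.2.7.8"/>'
)


def parse_link(xml_fragment):
    """Return the n attribute and the sentence identifiers grouped by file."""
    link = ET.fromstring(xml_fragment)
    grouped = {}
    for target in link.get("targets").split():
        # Each pointer has the form file.xml#sentence-id.
        file_name, _, sentence_id = target.partition("#")
        grouped.setdefault(file_name, []).append(sentence_id)
    return link.get("n"), grouped


n, grouped = parse_link(LINK)
for file_name, sentence_ids in grouped.items():
    print(file_name, sentence_ids)
```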

The corpus is further documented in its TEI headers.

The corpus is available from the CLARIN.SI repository:

2.2. MULTEXT-East cesDoc corpora

Languages: Bulgarian, Czech, English, Estonian, Hungarian, Romanian, Slovene, Lithuanian, (partial).

The so-called "cesDoc" corpora mostly encode (quite rich) structural information about the component texts. The corpora consist of the parallel ‘1984’ corpus, two comparable corpora (fiction, news), and a small multilingual parallel speech corpus. The corpus and its components are further documented in their TEI headers, and also in the original MULTEXT-East report D2.1 F: Corpus Collection and Preparation, although the information there is no longer current in all respects.

2.2.1. Structural ‘1984’ and alignments

Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Lithuanian, Serbian, Russian.

The multilingual parallel corpus consists of the novel ‘1984’, about 100,000 words in length. The corpus contains extensive headers and markup for document structure, sentences, and various sub-sentence annotations, the latter similar to those in the comparable corpus, but better harmonised across languages.

The translations of ‘1984’ are sentence-aligned with the English original, with hand-validated alignments. From these en-xx alignments, the other bilingual alignments and one multilingual alignment were produced automatically. The sentence alignments are stand-off, i.e. encoded in separate files with pointers to the aligned sentence identifiers. The alignments are encoded as TEI link groups, containing links of the form:
<link n="1:2" targets="oana-bg.xml#Obg.1.1.6.7 oana-ro.xml#Oro.1.2.7.7 oana-ro.xml#Oro.1.2.7.8"/>

The corpus is available from the CLARIN.SI repository:

2.2.2. Comparable corpus components

Languages: Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian.

The multilingual comparable corpus contains a fiction part and a news part, where the data is comparable across the languages in terms of the number and size of texts; each of the 12 parts has approx. 100,000 words. The corpus is structurally marked up with over 40 different elements; however, sub-paragraph markup has not been harmonised across the languages.

The comparable corpus and its components are further documented in their TEI headers. The ‘cmp’ corpus is available for download in the directory herecrp/.

2.2.3. Speech corpus components

Languages: Romanian, Slovene, Estonian, Hungarian, (English, Czech, Bulgarian)

MULTEXT-East produced a small corpus of spoken texts taken from the EUROM-1 speech corpus. It comprises the translations (from English) of forty short passages of five thematically connected sentences. This written part of the corpus is included together with the other parts of the cesDoc corpus.

For four languages, the texts have also been read, recorded and included in the distribution. The corpus texts contain links to the spoken passages stored as .wav files. The speech files are, due to their size, stored and distributed in a separate bundle.

The speech corpus and its components are further documented in their TEI headers. The corpus is available in the directory spc/.

3. MULTEXT-East ‘1984’ corpus

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This annotated parallel corpus contains the novel in the English original (about 100,000 words in length), and its translations into a number of languages.

George Orwell
Nineteen Eighty-Four


Big Brother is Watching You!
It was an enormous pyramidal structure of glittering white concrete, soaring up, terrace after terrace, 300 metres into the air. From where Winston stood it was just possible to read, picked out on its white face in elegant lettering, the three slogans of the Party:
War is peace / Freedom is slavery / Ignorance is strength
Războiul este pace / Libertatea este sclavie / Ignoranţa este putere
Vojna je mir / Svoboda je suženjstvo / Nevednost je moč
Válka je mír / Svoboda je otroctví / Nevědomost je síla
Войната е мир / Свободата е робство / Невежеството е сила
Sõda on rahu / Vabadus on orjus / Teadmatus on jõud
A háború: béke / A szabadság: szolgaság / A tudatlanság: erő
Karas — tai taika / Laisvė — tai vergija / Nežinomas — tai jėga
Rat je mir / Sloboda je ropstvo / Neznanje je moć
Война — это мир / Свобода — это рабство / Незнание — сила

The novel exists in two versions:

The translations of ‘1984’ are sentence-aligned with the English original, with hand-validated alignments. From these en-xx alignments, the other bilingual alignments and one multilingual alignment were produced automatically. The sentence alignments are stand-off, i.e. encoded in separate files with pointers to the aligned sentence identifiers. The alignments are encoded as TEI link groups, containing links of the form:
<link n="1:2" targets="oana-bg.xml#Obg.1.1.6.7 oana-ro.xml#Oro.1.2.7.7 oana-ro.xml#Oro.1.2.7.8"/>

The first edition of the "1984" corpus is detailed in the MULTEXT-East project reports MTE D2.1 F "‘1984’ Corpus Collection" (with Appendix) and MTE D2.3 F "Sentence Alignment" and "Morphosyntactic Tagging". Note, however, that the encoding has changed substantially from the first MULTEXT-East version described in these reports.

The details on the encoding are to be found in the TEI headers in the two corpora, i.e. the cesDoc "orwl" TEI headers and cesAna "oana" TEI headers.

The details on the morphosyntactic descriptions, MSDs, used to word-tag V6 of the corpus are given in the MULTEXT-East Morphosyntactic Specifications.

4. Contributors

This section lists the contributors of the MULTEXT-East resources (Version 4), starting with the partners of the MULTEXT-East project and the partners of the TELRI concerted action that contributed resources to MULTEXT-East, followed by MONDILEX and other contributors, and, finally, the acknowledgements to other people who made the production of these resources possible. Further information on who did what can be found in the corpus headers and the title page of the morphosyntactic specifications. Each partner's responsibility is listed as well; for the resources of a particular language, the corresponding contact point should also be used for further inquiries.

4.1. MULTEXT-East Partners

Listed below are the partners of the original Copernicus MULTEXT-East project.

4.1.1. Aix-en-Provence

Responsibility:
MULTEXT-East coordinator
People:
Jean Véronis (multext at univ-aix.fr)
Organisation:
Laboratoire Parole et Langage
Centre National de la Recherche Scientifique
Aix-en-Provence, France
Address:
29, Av. Robert Schuman
13621 Aix-en-Provence Cedex 1

4.1.2. Vassar

Responsibility:
Corpus encoding, English data
People:
Nancy Ide (ide at cs.vassar.edu)
Greg Priest-Dorman (priestdo at cs.vassar.edu)
Organisation:
Dept. of Computer Science
Vassar College
Address:
124 Raymond Avenue
Poughkeepsie, NY 12604-0520

4.1.3. Pisa

Responsibility:
Morphosyntactic specification (associate partner)
People:
Nicoletta Calzolari
Monica Monachini
Organisation:
Istituto di Linguistica Computazionale
Consiglio Nazionale delle Ricerche
Pisa, Italy

4.1.4. Sofia

Responsibility:
Bulgarian data
People:
Radoslav Pavlov
Ludmila Dimitrova
Lydia Sinapova
Kiril Simov
Organisation:
Department of Mathematical Linguistics
Institute of Mathematics and Informatics
Bulgarian Academy of Sciences
Address:
8 Acad. G. Bonchev Street
BG-1113 Sofia
Bulgaria

4.1.5. Prague

Responsibility:
Czech data
People:
Vladimír Petkevič (vladimir.petkevic at ff.cuni.cz)
Organisation:
Institute of Theoretical and Computational Linguistics
Faculty of Philosophy
Charles University
Address:
Celetna 13
11000 Prague 1
Czech Republic
Subcontractor:
People:
Zdenek Laciga
Organisation:
BYLL Software, Ltd.
Address:
Anenske nam. 2
110 00 Prague 1
Czech Republic

4.1.6. Tartu

Responsibility:
Estonian data
People:
Heiki-Jaan Kaalep (hkaalep at psych.ut.ee)
Organisation:
Department of General Linguistics
Tartu University
Address:
Ülikooli 18
EE-2400 Tartu
Estonia

4.1.7. Budapest

Responsibility:
Hungarian data
People:
Laszlo Tihanyi (Tihanyi at nytud.hu)
Csaba Oravecz (oravecz at nytud.hu)
Organisation:
Research Institute for Linguistics
Hungarian Academy of Sciences
Address:
Szinhaz u. 5-9 - P.O. Box 19
H-1250 Budapest
Hungary
Subcontractor:
People:
Gabor Proszeky
Organisation:
MorphoLogic
Address:
Fo u. 56-58 I/3
H-1011 Budapest
Hungary

4.1.8. Bucharest

Responsibility:
Romanian data
People:
Dan Tufiş (tufis at racai.ro)
Ştefan Bruda
Mihai Ciocoiu
Organisation:
Center for Research in Machine Learning, Natural Language Processing and Conceptual Modelling
Romanian Academy of Sciences
Address:
Casa Academiei "13 Septembrie" 13
Bucharest 71102
Romania

4.1.9. Ljubljana

Responsibility:
Slovene data, corpus workpackage leader
People:
Tomaž Erjavec
Organisation:
Dept. of Knowledge Technologies
Jožef Stefan Institute
Address:
Jamova cesta 39
SI-1000 Ljubljana
Slovenia
Subcontractor:
People:
Miro Romih (miro.romih at amebis.si)
Peter Holozan (peter.holozan at amebis.si)
Organisation:
AMEBIS d.o.o.
Address:
Jakopičeva 6
61240 Kamnik
Slovenia

4.2. TELRI Partners

4.2.1. Riga

Responsibility:
Latvian data
People:
Andrejs Spektors (aspekt at mii.lu.lv)
Organisation:
Artificial Intelligence Laboratory
Institute of Mathematics and Computer Science
University of Latvia
Address:
29, Raina bulv.
LV-1459 Riga
Latvia

4.2.2. Kaunas

Responsibility:
Lithuanian data
People:
Andrius Utka (andrius at donelaitis.vdu.lt)
Ruta Marcinkeviciene (ruta.marcinkeviciene at vdu.lt)
Organisation:
Center of Computational Linguistics
Faculty of Humanities
Vytautas Magnus University
Address:
S. Daukanto 28
3000 Kaunas
Lithuania

4.2.3. Belgrade

Responsibility:
Serbian data
People:
Cvetana Krstev, cvetana at matf.bg.ac.rs
Duško Vitas, vitas at matf.bg.ac.rs
Organisation:
Faculty of Mathematics
Belgrade University
Address:
Studentski trg 16
11000 Belgrade
Republic of Serbia

4.3. Additional Contributors

4.3.1. Severodonetsk

Responsibility:
Russian cesDoc "1984"
People:
Paul Sokolovsky (Paul.Sokolovsky at technologist.com)
Sergey Sryvkin
Organisation:
Severodonetsk Institute of Technology
East-Ukraine State University
Address:
Sovetsky st., bl. 3a
Severodonetsk, Lugansk reg.
Ukraine

4.3.2. Leeds

Responsibility:
Russian morphosyntactic specifications and lexicon
Person:
Serge Sharoff
Organisation:
University of Leeds
Person:
Mikhail Kopotev
Organisation:
University of Helsinki
Person:
Tomaž Erjavec
Organisation:
Dept. of Knowledge Technologies
Jožef Stefan Institute
Person:
Anna Feldman
Organisation:
Montclair State University
Person:
Dagmar Divjak
Organisation:
University of Sheffield

4.3.3. Padova

Responsibility:
Resian data
People:
Han Steenwijk (han.steenwijk at unipd.it)
Organisation:
Department of Anglo-Germanic and Slavic Linguistics and Literature,
Padova University

4.3.4. Galway

Responsibility:
Persian data
People:
Behrang QasemiZadeh
Organisation:
Unit for Natural Language Processing, DERI
National University of Ireland, Galway

4.4. Contributors in the scope of the MONDILEX project

4.4.1. Bratislava

Responsibility:
Slovak data
People:
Radovan Garabík
Organisation:
Ľudovít Štúr Institute of Linguistics
Slovak Academy of Sciences

4.4.2. Warsaw

Responsibility:
Polish data
Person:
Natalia Kotsyba, natalia at ibi.uw.edu.pl
Organisation:
Institute for Interdisciplinary Studies "Artes Liberales"
Warsaw University
Address:
Krakowskie Przedmieście 26/28
00-046 Warsaw
Poland
Person:
Adam Radziszewski
Organisation:
Department of Artificial Intelligence
Institute of Informatics
Wroclaw University of Technology
Address:
ul. Wybrzeze Wyspianskiego 27
50-370 Wroclaw
Poland
Person:
Ivan Derzhanski
Organisation:
Department of Mathematical Linguistics
Institute for Mathematics and Computer Science
Address:
8 Acad G Bonchev St
1113 Sofia
Bulgaria

4.4.3. Kyiv

Responsibility:
Ukrainian data
Person:
Natalia Kotsyba, natalia at ibi.uw.edu.pl
Organisation:
Institute for Interdisciplinary Studies "Artes Liberales"
Warsaw University
Address:
Krakowskie Przedmieście 26/28
00-046 Warsaw
Poland
Person:
Igor Shevchenko
Organisation:
Ukrainian Linguistic-Information Fund
National Academy of Sciences of Ukraine
Address:
54 Volodymyrska str.
01601 Kyiv
Ukraine
Person:
Ivan Derzhanski
Organisation:
Department of Mathematical Linguistics
Institute for Mathematics and Computer Science
Address:
8 Acad G Bonchev St
1113 Sofia
Bulgaria

4.5. Contributors in the scope of the CLARIN infrastructure

4.5.1. Skopje

Responsibility:
Macedonian data
People:
Katerina Zdravkova, Aleksandar Petrovski
Organisation:
Institute for Informatics
Ss. Cyril and Methodius University of Skopje

4.5.2. Tuebingen

Responsibility:
Chechen data
People:
Dietmar Fiesel
Organisation:
Faculty of Humanities
Eberhard Karls Universitaet Tuebingen

4.5.3. Berlin

Responsibility:
Albanian data
People:
Dalian Zogaj, Philipp Wasserscheidt
Organisation:
Faculty of Language, Literature and Humanities
Humboldt-Universität zu Berlin

4.5.4. Zürich

Responsibility:
Torlak data
People:
Teodora Vuković
Organisation:
Slavisches Seminar
Universität Zürich
Responsibility:
Damaskini data
People:
Ivan Šimko
Organisation:
Slavisches Seminar
Universität Zürich

4.6. Acknowledgments

Apart from the institutions and people listed above, the following people have greatly contributed to the production of the MULTEXT-East language resources: Renata Anžič, Liviu Anca, Ana-Maria Barbu, Aleksandra Bizjak, Damjan Bojadžiev, Lydia Bozhilova, Aleš Dobinikar, Külli Habicht, Daniel Hirst, Milena Hnátková, Primož Jakopin, Riina Mosna, Kadri Muischnek, Mircea Nicolescu, Heili Orav, Vasile Pătraşcu, Leho Paldre, Tsvetan Petrov, Helen Potter, Andriela Rääbis, Georgiana Rotariu, Zygmunt Saloni, Matjaž Sešek, Tanja Semen, Urve Talvik, Bojana Todorovič, Elias Treeman, Viire Villandi, Olga Vuković, Marcin Woliński.

Work on the MULTEXT-East resources was supported by the European Union project MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages, Copernicus 106, and in part by the US National Science Foundation grant IRI-9413451. The MULTEXT-East project results were greatly enhanced by the EU Concerted Action TELRI: Trans-European Language Resources Infrastructure. In particular, additional language resources were produced, and the project's material was organised for CD distribution. Work on the second release of the MULTEXT-East resources was supported by the EU Copernicus Project PL96-1142 CONCEDE: Consortium for Central European Dictionary Encoding, while the work on the third release was partially funded by grants from the National Endowment for the Humanities, in the scope of the TEI Task Force on SGML to XML migration. The fourth release was largely supported by the EU FP7 Capacities - Research Infrastructures project MONDILEX "Conceptual Modelling of Networking of Centres for High-Quality Research in Slavic Lexicography and Their Digital Resources". The fifth and sixth releases were also supported by the research infrastructure CLARIN.SI. The work on the resources has been additionally supported by bilateral projects between Slovenia and Serbia and between Slovenia and Macedonia, and by individual partners' grants and contracts.

Tomaž Erjavec. Date: 2022-03-24