This document describes the sixth, "CLARIN" edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes annotated parallel, comparable and speech corpora with morphosyntactic lexica and specifications.
Version 6 of the MULTEXT-East resources, a substantial part of which was produced within the scope of the CLARIN research infrastructure, updates the Macedonian morphosyntactic specifications and adds specifications for Serbo-Croatian (meant to cover the Croatian, Serbian, Bosnian and Montenegrin languages), for Albanian, for the Torlak dialect of Serbian, and the so-called "Damaskini" specifications, developed especially for a diachronic corpus of Balkan Slavic texts from the 16th to 19th centuries. The maintenance of the specifications has also moved to GitHub. The other resources, in particular the corpora and lexica, currently remain at version 4.
There are so far no publications describing this version of the resources, but a general overview of MULTEXT-East is available in:
Please acknowledge the use of the resources in publications by citing one of the above papers or, if more relevant, others from the home page of MULTEXT-East Version 6.
The rest of this document is structured as follows: In the next subsection, we give a synopsis of the languages that this distribution provides resources for. Section 2 describes the resources offered in detail, and Section 3 lists the contributors, i.e. contact points for particular languages, and acknowledges those who were involved in producing the resources.
Below we list the languages represented in the resources, with links to their respective Wikipedia entries.
This section gives the details of the MULTEXT-East resources Version 6: Section ‘MULTEXT-East morphosyntactic resources’ details the three linked word-level syntactic resources: the specification for morphosyntactic descriptions, the morphosyntactic lexica, and word-level annotated corpus (‘1984’); Section ‘MULTEXT-East cesDoc corpora’ introduces the structurally marked up corpus, consisting of a parallel part (again ‘1984’), two comparable parts (fiction, newspapers), and a small speech corpus; Section ‘MULTEXT-East ‘1984’ corpus’ revisits this corpus and recaps the relevant data in somewhat more detail.
The morphosyntactic resources consist of three layers:
In the next sections we detail each of these layers in turn.
The syntax and semantics of the morphosyntactic descriptions (MSDs) are given in the MULTEXT-East morphosyntactic specifications. The specifications have been developed in the formalism and on the basis of specifications for six Western European languages of the EU MULTEXT project from the 1990s and in cooperation with EAGLES, the Expert Advisory Group on Language Engineering Standards. The first version of these specifications was released as a report of the MULTEXT-East project, but they have since been extensively revised. Nevertheless, the specifications are still structured as a report: they contain introductory chapters, followed by the list of defined categories (parts of speech), and then, for each category, a table of attribute-values and the languages the features are appropriate for. These so-called common tables are followed by language-particular sections. Each language section is further subdivided, and can contain feature co-occurrence restrictions, examples, notes, and full lists of valid MSDs, as well as localisation information.
The complete specifications are an XML document, encoded according to the TEI Guidelines.
The specifications are maintained at https://github.com/clarinsi/mte-msd and, for continuity, mirrored here, i.e. at https://nl.ijs.si/ME/V6/msd/
The source TEI encoding is down-converted to several derived formats:
Languages: Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, Ukrainian.
The MULTEXT-East morphosyntactic lexicons have a simple structure, where each lexical entry is composed of the following fields:
The sizes of the MULTEXT-East lexica vary considerably between the languages. A few are quite limited, and serve more as a proof of concept and to assign lexical entries to the MSDs, but most contain over 20,000 lemmas and can serve as medium-sized morphological lexica for the languages. In addition to explicating the inflectional behaviour of the most common (and, typically, morphologically the most complex) words of the languages, the lexica also served to establish the definitive set of valid MSDs for the languages.
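As a rough illustration of how a lexicon can be used to count lemmas and collect the set of MSDs it attests, the sketch below reads a small in-memory sample. It assumes one entry per line with three tab-separated fields (word-form, lemma, MSD); this layout and the sample entries are assumptions made for the example, not taken from the distributed lexica, so the actual files should be checked before relying on it.

```python
# Hypothetical sample: three made-up entries, one per line, with
# tab-separated fields word-form, lemma, MSD (an assumed layout).
SAMPLE_LEXICON = "it\tit\tPp3ns\ncat\tcat\tNcns\ncats\tcat\tNcnp\n"


def read_lexicon(text):
    """Parse lexicon text into a list of (wordform, lemma, msd) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        wordform, lemma, msd = line.split("\t")
        entries.append((wordform, lemma, msd))
    return entries


entries = read_lexicon(SAMPLE_LEXICON)
lemmas = {lemma for _, lemma, _ in entries}   # distinct lemmas
msds = {msd for _, _, msd in entries}         # MSDs attested in the lexicon

print(len(lemmas), "lemmas;", len(msds), "distinct MSDs")
```

Collecting the MSDs into a set mirrors the role the lexica play in fixing the inventory of valid MSDs for a language.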
The lexica are available from the CLARIN.SI repository:
Languages: Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovene.
The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel corpus contains the novel in the English original (about 100,000 words in length), and its translations into a number of languages.
This corpus, also called the cesAna corpus, contains word level markup, namely context disambiguated lemmas and MSDs, e.g., <w lemma="it" ana="#Pp3ns">it</w>.
Each MSD is linked to a TEI feature-structure library (automatically derived from the specifications, and written in the back matter of the novel), which gives for each MSD its decomposition into features, e.g.
<fs xml:id="Pp3ns" xml:lang="en" feats="#P0. #P1.p #P2.3 #P3.n #P4.s"/>
and, in the feature library:
<f name="CATEGORY" xml:id="P0." xml:lang="en"><symbol value="Pronoun"/></f>
<f name="Type" xml:id="P1.p" xml:lang="en"><symbol value="personal"/></f>
<f name="Person" xml:id="P2.3" xml:lang="en"><symbol value="third"/></f>
...
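The link from a word token to its features can be followed programmatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`. It wraps the fragments shown above in a hypothetical `<lib>` root element and ignores XML namespaces; the real corpus files are TEI documents, so element names there would most likely need the TEI namespace. The function name `decode_msd` is also made up for the example.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predeclared, so xml:id is exposed under this key.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Fragments from the examples above, wrapped in a hypothetical <lib> root.
# Only the three <f> elements shown in the text are included.
SAMPLE = """<lib>
  <fs xml:id="Pp3ns" xml:lang="en" feats="#P0. #P1.p #P2.3 #P3.n #P4.s"/>
  <f name="CATEGORY" xml:id="P0." xml:lang="en"><symbol value="Pronoun"/></f>
  <f name="Type" xml:id="P1.p" xml:lang="en"><symbol value="personal"/></f>
  <f name="Person" xml:id="P2.3" xml:lang="en"><symbol value="third"/></f>
</lib>"""


def decode_msd(root, msd):
    """Return {feature name: value} for an MSD.

    Pointers whose <f> element is not present in the library are skipped.
    """
    features = {f.get(XML_ID): f for f in root.iter("f")}
    for fs in root.iter("fs"):
        if fs.get(XML_ID) != msd:
            continue
        result = {}
        for pointer in fs.get("feats", "").split():
            f = features.get(pointer.lstrip("#"))
            if f is None:
                continue
            symbol = f.find("symbol")
            result[f.get("name")] = symbol.get("value") if symbol is not None else None
        return result
    raise KeyError(msd)


root = ET.fromstring(SAMPLE)
word = ET.fromstring('<w lemma="it" ana="#Pp3ns">it</w>')
decoded = decode_msd(root, word.get("ana").lstrip("#"))
print(word.text, word.get("lemma"), decoded)
```

Because the sample contains only the first three features, the decoded result covers CATEGORY, Type and Person; with the full library the remaining pointers would resolve as well.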
The translations of ‘1984’ are sentence aligned with the English original, with hand-validated alignments. From these en-xx alignments the other bilingual alignments and one multilingual alignment were produced automatically. The sentence alignments are stand-off, i.e. encoded in separate files with pointers to the aligned sentence identifiers. The alignments are encoded as TEI link groups, containing links of the form:
<link n="1:2" targets="oana-bg.xml#Obg.1.1.6.7 oana-ro.xml#Oro.1.2.7.7 oana-ro.xml#Oro.1.2.7.8"/>
The corpus is further documented in its TEI headers.
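The stand-off alignment links shown above can be unpacked with a few lines of code. The sketch below parses the example link and groups the sentence identifiers by the file they point into. It assumes that `targets` is a whitespace-separated list of `file#id` pointers, as in the example; the function name `parse_link` is made up for the illustration, and the `n` attribute is returned as-is without interpretation.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The example link from the text.
LINK = ('<link n="1:2" targets="oana-bg.xml#Obg.1.1.6.7 '
        'oana-ro.xml#Oro.1.2.7.7 oana-ro.xml#Oro.1.2.7.8"/>')


def parse_link(link_xml):
    """Return (n attribute, {file name: [sentence ids]}) for one <link>."""
    link = ET.fromstring(link_xml)
    by_file = defaultdict(list)
    for target in link.get("targets").split():
        filename, _, sentence_id = target.partition("#")
        by_file[filename].append(sentence_id)
    return link.get("n"), dict(by_file)


kind, aligned = parse_link(LINK)
for filename, sentence_ids in aligned.items():
    print(filename, sentence_ids)
```

In this example one Bulgarian sentence is aligned with two Romanian sentences, which is consistent with the value `1:2` of the `n` attribute.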
The corpus is available from the CLARIN.SI repository:
Languages: Bulgarian, Czech, English, Estonian, Hungarian, Romanian, Slovene, Lithuanian, (partial).
The so-called "cesDoc" corpora encode mostly (quite rich) structural information about the component texts. The corpora consist of the parallel ‘1984’ corpus, two comparable corpora (fiction, news), and a small multilingual parallel speech corpus. The corpus and its components are further documented in their TEI headers, and also in the original MULTEXT-East report D2.1 F: Corpus Collection and Preparation, although the information there is no longer current in all respects.
The corpus is further documented in its TEI headers.
Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Lithuanian, Serbian, Russian.
The multilingual parallel corpus consists of the novel ‘1984’, about 100,000 words in length. The corpus contains extensive headers and markup for document structure, sentences, and various sub-sentence annotations; these are similar to those of the comparable corpus, but better harmonised across languages.
The translations of ‘1984’ are sentence aligned with the English original, with hand-validated alignments. From these en-xx alignments the other bilingual alignments and one multilingual alignment were produced automatically. The sentence alignments are stand-off, i.e. encoded in separate files with pointers to the aligned sentence identifiers. The alignments are encoded as TEI link groups, containing links of the form:
<link n="1:2" targets="oana-bg.xml#Obg.1.1.6.7 oana-ro.xml#Oro.1.2.7.7 oana-ro.xml#Oro.1.2.7.8"/>
The corpus is available from the CLARIN.SI repository:
Languages: Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian.
The multilingual comparable corpus contains a fiction part and a news part, where the data is comparable across the languages in terms of the number and size of texts; each of the 12 parts has approx. 100,000 words. The corpus is structurally marked up with over 40 different elements; however, sub-paragraph markup has not been harmonised across the languages.
The comparable corpus and its components are further documented in their TEI headers. The ‘cmp’ corpus is available for download in the directory herecrp/.
Languages: Romanian, Slovene, Estonian, Hungarian, (English, Czech, Bulgarian)
MULTEXT-East produced a small corpus of spoken texts taken from the EUROM-1 speech corpus. It comprises the translations (from English) of forty short passages of five thematically connected sentences. This written part of the corpus is included together with the other parts of the cesDoc corpus.
For four languages, the texts have also been read, recorded and included in the distribution. The corpus texts contain links to the spoken passages stored as .wav files. The speech files are, due to their size, stored and distributed in a separate bundle.
The speech corpus and its components are further documented in their TEI headers. The corpus is available in the directory spc/.
The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This annotated parallel corpus contains the novel in the English original (about 100,000 words in length), and its translations into a number of languages.
George Orwell, Nineteen Eighty-Four: "It was an enormous pyramidal structure of glittering white concrete, soaring up, terrace after terrace, 300 metres into the air. From where Winston stood it was just possible to read, picked out on its white face in elegant lettering, the three slogans of the Party:"
The novel exists in two versions:
The translations of ‘1984’ are sentence aligned with the English original, with hand-validated alignments. From these en-xx alignments the other bilingual alignments and one multilingual alignment were produced automatically. The sentence alignments are stand-off, i.e. encoded in separate files with pointers to the aligned sentence identifiers. The alignments are encoded as TEI link groups, containing links of the form:
<link n="1:2" targets="oana-bg.xml#Obg.1.1.6.7 oana-ro.xml#Oro.1.2.7.7 oana-ro.xml#Oro.1.2.7.8"/>
The first edition of the "1984" corpus is detailed in the MULTEXT-East project reports MTE D2.1 F "‘1984’ Corpus Collection" (with Appendix) and MTE D2.3 F "Sentence Alignment" and "Morphosyntactic Tagging". However, note that the encoding has changed substantially from the first MULTEXT-East version described in these reports.
The details on the encoding are to be found in the TEI headers in the two corpora, i.e. the cesDoc "orwl" TEI headers and cesAna "oana" TEI headers.
The details on the morphosyntactic descriptions, MSDs, used to word-tag V6 of the corpus are given in the MULTEXT-East Morphosyntactic Specifications.
This section lists the contributors of the MULTEXT-East resources (Version 4), starting with the partners of the MULTEXT-East project and the partners of the TELRI concerted action that contributed resources to MULTEXT-East, followed by MONDILEX and other contributors, and, finally, the acknowledgements to other people who made the production of these resources possible. Further information on who did what can be found in the corpus headers and the title page of the morphosyntactic specifications. The responsibility of each partner is listed next to it; for resources of the particular languages, this contact point should also be used for further inquiries.
Listed below are the partners of the original Copernicus MULTEXT-East project.
Apart from the institutions and people listed above, the following people have greatly contributed to the production of the MULTEXT-East language resources: Renata Anžič, Liviu Anca, Ana-Maria Barbu, Aleksandra Bizjak, Damjan Bojadžiev, Lydia Bozhilova, Aleš Dobinikar, Külli Habicht, Daniel Hirst, Milena Hnátková, Primož Jakopin, Riina Mosna, Kadri Muischnek, Mircea Nicolescu, Heili Orav, Vasile Pătraşcu, Leho Paldre, Tsvetan Petrov, Helen Potter, Andriela Rääbis, Georgiana Rotariu, Zygmunt Saloni, Matjaž Sešek, Tanja Semen, Urve Talvik, Bojana Todorovič, Elias Treeman, Viire Villandi, Olga Vuković, Marcin Woliński.
Work on MULTEXT-East resources was supported by the European Union project MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages, Copernicus 106, and in part by the US National Science Foundation grant IRI-9413451. The MULTEXT-East project results have been greatly enhanced by the EU Concerted Action TELRI: Trans-European Language Resources Infrastructure. In particular, additional language resources have been produced, and the project's material organised for CD distribution. Work on the second release of the MULTEXT-East resources was supported by EU Copernicus Project PL96-1142 CONCEDE: Consortium for Central European Dictionary Encoding, while the work on the third release was partially funded by grants from the National Endowment for the Humanities, in the scope of the TEI Task Force on SGML to XML migration. The fourth release was largely supported by the EU FP7 Capacities - Research Infrastructures project MONDILEX "Conceptual Modelling of Networking of Centres for High-Quality Research in Slavic Lexicography and Their Digital Resources". The fifth and sixth releases were also supported by the research infrastructure CLARIN.SI. The work on the resources has been additionally supported by bilateral projects between Slovenia and Serbia and between Slovenia and Macedonia, and by individual partners' grants and contracts.