This document describes the fourth, "MONDILEX" edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes annotated parallel, comparable and speech corpora with morphosyntactic lexica and specifications.
Version 4 of MULTEXT-East resources, a substantial part of which was produced in the EU MONDILEX project, adds new languages and makes the resources uniformly available in TEI P5 XML. This dataset, unique in terms of languages and the wealth of encoding, is freely available for research purposes, according to the MULTEXT-East research licence; a local text copy is provided for reference.
The paper describing this version of the resources is:
Tomaž Erjavec: MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Malta, 19-21 May 2010. ELRA, Paris, 2010.
Please acknowledge the use of resources in publications by citing the above paper, or, if more relevant, others from the home page of MULTEXT-East Version 4.
The rest of this document is structured as follows. The next subsection gives a synopsis of the languages that this distribution provides resources for. Section 2 ‘MULTEXT-East resources’ describes the resources in detail, Section 3 presents the distribution of the corpus, and Section 4 lists the contributors, i.e. contact points for particular languages, and acknowledges those who were involved in producing the resources.
This section gives the details of the MULTEXT-East resources Version 4: Section ‘MULTEXT-East morphosyntactic resources’ details the three linked word-level syntactic resources: the specification for morphosyntactic descriptions, the morphosyntactic lexica, and word-level annotated corpus (‘1984’); Section ‘MULTEXT-East cesDoc corpus’ introduces the structurally marked up corpus, consisting of a parallel part (again ‘1984’), two comparable parts (fiction, newspapers), and a small speech corpus; Section ‘MULTEXT-East ‘1984’ corpus’ revisits this corpus and recaps the relevant data in somewhat more detail.
In the next sections we detail each of these layers in turn.
Languages: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, Ukrainian.
The syntax and semantics of the morphosyntactic descriptions (MSDs) are given in the MULTEXT-East morphosyntactic specifications, which have been developed in the formalism and on the basis of specifications for six Western European languages of the MULTEXT project and in cooperation with EAGLES, the Expert Advisory Group on Language Engineering Standards.
Originally, these specifications were released as a report of the MULTEXT-East project, but they have since been extensively revised. Nevertheless, the specifications are still structured as a report: they contain introductory chapters, followed by the list of defined categories (parts-of-speech), and then, for each category, a table of attribute-values and the languages the features are appropriate for. These so-called common tables are followed by language-particular sections. Each language section is further subdivided, and can contain feature co-occurrence restrictions, examples, notes, and full lists of valid MSDs, as well as localisation information.
The complete specifications are an XML document, encoded as a TEI P5 schema. Further information about these specifications is available in their TEI header.
Languages: Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, Ukrainian.
The sizes of the MULTEXT-East lexica vary considerably between the languages. A few are quite limited, and serve more as a proof of concept and to assign lexical entries to the MSDs, but most contain over 20,000 lemmas and can serve as medium-sized morphological lexica for the languages. In addition to explicating the inflectional behaviour of the most common (and, typically, morphologically the most complex) words of the languages, the lexica also served to establish the definitive set of valid MSDs for the languages.
The lexica themselves are available in the directory lex/, where there is also a README file (note that WWW access to this directory is restricted to licence holders).
Languages: Bulgarian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Romanian, Serbian, Slovak, Slovene.
The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel corpus contains the novel in the English original (about 100,000 words in length), and its translations into a number of languages.
This corpus, also called the cesAna corpus, contains word level markup, namely context disambiguated lemmas and MSDs, e.g., <w lemma="it" ana="#Pp3ns">it</w>.
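The word-level markup can be read with any XML parser. The following is a minimal sketch in Python, assuming only the <w> element shape shown above; the wrapping <s> element, the SAMPLE string and the function name read_tokens are illustrative and not part of the distribution. Real corpus files are TEI P5 documents and are likely to use an XML namespace, so the sketch compares local element names only.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment: the <w> element is the example from the text,
# the enclosing <s> is an assumption made so that the string parses.
SAMPLE = '<s><w lemma="it" ana="#Pp3ns">it</w></s>'


def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def read_tokens(xml_text):
    """Return (word form, lemma, MSD) triples for every <w> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for element in root.iter():
        if local_name(element.tag) != "w":
            continue
        # The ana attribute is a pointer, so drop the leading '#'.
        msd = element.get("ana", "").lstrip("#")
        tokens.append((element.text, element.get("lemma"), msd))
    return tokens


print(read_tokens(SAMPLE))
```

Each triple pairs the running word form with its disambiguated lemma and MSD, which is the information this corpus adds over the structurally marked-up version.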
Each MSD is linked to a TEI feature-structure library (automatically derived from the specifications, and stored in the back matter of the novel), which gives for each MSD its decomposition into features, e.g.
<fs xml:id="Pp3ns" xml:lang="en" feats="#P0. #P1.p #P2.3 #P3.n #P4.s"/>
and, in the feature library:
<f name="CATEGORY" xml:id="P0." xml:lang="en"><symbol value="Pronoun"/></f>
<f name="Type" xml:id="P1.p" xml:lang="en"><symbol value="personal"/></f>
<f name="Person" xml:id="P2.3" xml:lang="en"><symbol value="third"/></f>
...
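The link from an MSD to its features can be followed programmatically. The sketch below, in Python, is one possible way to do this under stated assumptions: it uses only the <fs> and <f> elements quoted above, wrapped in a hypothetical <back> element so that the string parses, and the function name decompose is illustrative. Features that are elided in the example (P3.n and P4.s) are not invented here; pointers without a matching <f> element are skipped.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# The <fs> and <f> elements are copied from the text; <back> is an
# assumed wrapper, and the elided features are deliberately left out.
LIBRARY = """<back>
<fs xml:id="Pp3ns" xml:lang="en" feats="#P0. #P1.p #P2.3 #P3.n #P4.s"/>
<f name="CATEGORY" xml:id="P0." xml:lang="en"><symbol value="Pronoun"/></f>
<f name="Type" xml:id="P1.p" xml:lang="en"><symbol value="personal"/></f>
<f name="Person" xml:id="P2.3" xml:lang="en"><symbol value="third"/></f>
</back>"""


def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def decompose(xml_text, msd):
    """Return (attribute, value) pairs for the features of one MSD."""
    root = ET.fromstring(xml_text)
    features = {}  # feature id -> (name, value)
    bundles = {}   # MSD id -> list of feature ids
    for element in root.iter():
        ident = element.get(XML_NS + "id")
        kind = local_name(element.tag)
        if kind == "f":
            symbol = next(
                (c for c in element if local_name(c.tag) == "symbol"), None
            )
            value = symbol.get("value") if symbol is not None else None
            features[ident] = (element.get("name"), value)
        elif kind == "fs":
            pointers = element.get("feats", "").split()
            bundles[ident] = [p.lstrip("#") for p in pointers]
    # Skip pointers whose <f> definition is not present in the input.
    return [features[p] for p in bundles[msd] if p in features]


print(decompose(LIBRARY, "Pp3ns"))
```

Keeping the feature definitions in a separate library means that the corpus text only carries the compact MSD, while the full attribute-value reading remains recoverable by pointer resolution.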
The translations of ‘1984’ are sentence aligned with the English original, and the alignments have been hand-validated. From these en-xx alignments, the other bilingual alignments and one multilingual alignment were produced automatically. The sentence alignments are stand-off, i.e. encoded in separate files with pointers to the aligned sentence identifiers. The alignments are encoded as TEI link groups, containing links of the form:
<link n="1:2" targets="oana-bg.xml#Obg.1.1.6.7 oana-ro.xml#Oro.1.2.7.7 oana-ro.xml#Oro.1.2.7.8"/>
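Because the alignments are stand-off, a consumer has to split each pointer into a file name and a sentence identifier. The following Python sketch does this for the single link shown above; the function name parse_link is illustrative, and the reading of the n attribute as the alignment type (here one sentence in the first file against two in the second) is an assumption based on this one example.

```python
import xml.etree.ElementTree as ET

# The link element from the text, on a single line.
LINK = (
    '<link n="1:2" targets="oana-bg.xml#Obg.1.1.6.7 '
    'oana-ro.xml#Oro.1.2.7.7 oana-ro.xml#Oro.1.2.7.8"/>'
)


def parse_link(xml_text):
    """Return the n attribute and the sentence ids grouped by file."""
    link = ET.fromstring(xml_text)
    by_file = {}
    for target in link.get("targets", "").split():
        # Each target has the form 'file.xml#sentence-id'.
        filename, _, sentence_id = target.partition("#")
        by_file.setdefault(filename, []).append(sentence_id)
    return link.get("n"), by_file


alignment_type, sentences = parse_link(LINK)
print(alignment_type)
for filename, ids in sentences.items():
    print(filename, ids)
```

Grouping by file name keeps the sketch independent of how many languages a link covers, so the same function would apply to the multilingual alignment as long as it uses the same pointer syntax.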
The corpus is further documented in its TEI headers. Component files are available in the directory ana/ (note that WWW access to this directory is restricted to licence holders).
Languages: Bulgarian, Czech, English, Estonian, Hungarian, Romanian, Slovene, Lithuanian, (partial).
This corpus encodes mostly (quite rich) structural information about the component texts. It consists of two comparable corpora (fiction, news), the parallel ‘1984’ corpus and a small multilingual parallel speech corpus. The corpus and its components are further documented in their TEI headers, and also in the original MULTEXT-East report D2.1 F: Corpus Collection and Preparation, although the information there is no longer current in all respects.
The corpus is available in the directory crp/, where the driver file is ‘mte-cesdoc.xml’ (note that WWW access to this directory is restricted to licence holders).
Languages: Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian.
The multilingual comparable corpus contains a fiction part and a news part, where the data is comparable across the languages in terms of the number and size of texts; each of the 12 parts has approx. 100,000 words. The corpus is structurally marked up with over 40 different elements; however, sub-paragraph markup has not been harmonised across the languages.
Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, (Latvian), Lithuanian, Serbian, Russian.
The multilingual parallel corpus consists of the novel ‘1984’, about 100,000 words in length. The corpus contains extensive headers and markup for document structure, sentences, and various sub-sentence annotations; these are similar to those of the comparable corpus, but better harmonised across the languages.
The translations of ‘1984’ are sentence aligned with the English original, and the alignments have been hand-validated. From these en-xx alignments, the other bilingual alignments and one multilingual alignment were produced automatically. The sentence alignments are stand-off, i.e. encoded in separate files with pointers to the aligned sentence identifiers. The alignments are encoded as TEI link groups, containing links of the form:
<link n="1:2" targets="oana-bg.xml#Obg.1.1.6.7 oana-ro.xml#Oro.1.2.7.7 oana-ro.xml#Oro.1.2.7.8"/>
Languages: Romanian, Slovene, Estonian, Hungarian, (English, Czech, Bulgarian)
MULTEXT-East produced a small corpus of spoken texts taken from the EUROM-1 speech corpus. It comprises the translations (from English) of forty short passages of five thematically connected sentences. This written part of the corpus is included together with the other parts of the cesDoc corpus.
For four languages, the texts have also been read, recorded and included in the distribution. The corpus texts contain links to the spoken passages stored as .wav files. The speech files are, due to their size, stored and distributed in a separate bundle.
The speech corpus and its components are further documented in their TEI headers. The corpus is available in the directory spc/.
The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This annotated parallel corpus contains the novel in the English original (about 100,000 words in length), and its translations into a number of languages.
George Orwell: Nineteen Eighty-Four
It was an enormous pyramidal structure of glittering white concrete, soaring up, terrace after terrace, 300 metres into the air. From where Winston stood it was just possible to read, picked out on its white face in elegant lettering, the three slogans of the Party:
The translations of ‘1984’ are sentence aligned with the English original, and the alignments have been hand-validated. From these en-xx alignments, the other bilingual alignments and one multilingual alignment were produced automatically. The sentence alignments are stand-off, i.e. encoded in separate files with pointers to the aligned sentence identifiers. The alignments are encoded as TEI link groups, containing links of the form:
<link n="1:2" targets="oana-bg.xml#Obg.1.1.6.7 oana-ro.xml#Oro.1.2.7.7 oana-ro.xml#Oro.1.2.7.8"/>
The first edition of the "1984" corpus is detailed in the MULTEXT-East project reports MTE D2.1 F "‘1984’ Corpus Collection" (with Appendix) and MTE D2.3 F "Sentence Alignment" and "Morphosyntactic Tagging". Note, however, that the encoding has changed substantially from the first MULTEXT-East version described in these reports.
The details on the encoding for Version 4 are to be found in the TEI headers in the two corpora, i.e. the cesDoc "orwl" TEI headers and cesAna "oana" TEI headers.
The details on the morphosyntactic descriptions, MSDs, used to word-tag V4 of the corpus are given in the MULTEXT-East Morphosyntactic Specifications.
The resources are mounted on the Web, on the MULTEXT-East Version 4 Web site. The documentation and the morphosyntactic specification are freely available, while the corpora and lexica are restricted to research use only. To get access to these resources, the Web based MULTEXT-East research licence should be filled out and submitted. The password is then sent by email.
This section lists the contributors to the MULTEXT-East resources (Version 4), starting with the partners of the MULTEXT-East project and the partners of the TELRI concerted action that contributed resources to MULTEXT-East, followed by MONDILEX and other contributors, and, finally, acknowledgements to other people who made the production of these resources possible. Further information on who did what can be found in the corpus headers and on the title page of the morphosyntactic specifications. Each partner's responsibility is listed next to its name; for resources of the particular languages, this contact point should also be used for further inquiries.
Listed below are the partners of the original Copernicus MULTEXT-East project.
Apart from the institutions and people listed above, the following people have greatly contributed to the production of the MULTEXT-East language resources: Renata Anžič, Liviu Anca, Ana-Maria Barbu, Aleksandra Bizjak, Damjan Bojadžiev, Lydia Bozhilova, Aleš Dobinikar, Külli Habicht, Daniel Hirst, Milena Hnátková, Primož Jakopin, Riina Mosna, Kadri Muischnek, Mircea Nicolescu, Heili Orav, Vasile Pătraşcu, Leho Paldre, Tsvetan Petrov, Helen Potter, Andriela Rääbis, Georgiana Rotariu, Zygmunt Saloni, Matjaž Sešek, Tanja Semen, Urve Talvik, Bojana Todorovič, Elias Treeman, Viire Villandi, Olga Vuković, Marcin Woliński.
Work on MULTEXT-East resources was supported by the European Union project MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages, Copernicus 106, and in part by the US National Science Foundation grant IRI-9413451. The MULTEXT-East project results have been greatly enhanced by the EU Concerted Action TELRI: Trans-European Language Resources Infrastructure. In particular, additional language resources have been produced, and the project's material organised for CD distribution. Work on the second release of the MULTEXT-East resources was supported by EU Copernicus Project PL96-1142 CONCEDE: Consortium for Central European Dictionary Encoding, while the work on the third release was partially funded by grants from the National Endowment for the Humanities, in the scope of the TEI Task Force on SGML to XML migration. The fourth release was largely supported by the EU FP7 Capacities - Research Infrastructures project MONDILEX "Conceptual Modelling of Networking of Centres for High-Quality Research in Slavic Lexicography and Their Digital Resources". The work on the resources has been additionally supported by bilateral projects between Slovenia and Serbia and between Slovenia and Macedonia, and by individual partners' grants and contracts.