MULTEXT-East Language Resources Version 3
Tomaž Erjavec
2004-05-13
This document describes the third edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes annotated parallel, comparable and speech corpora with morphosyntactic lexica and specifications. The most important component is the linguistically annotated corpus consisting of Orwell's novel ‘1984’ in the English original and translations.
This release builds on the results of several EU projects: MULTEXT-East (produced linked resources for Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian and English), TELRI (added resources for Lithuanian, Croatian, Serbian, and Russian; first release), and CONCEDE (validation, re-encoding; partial re-release); and on the TEI Task Force on SGML to XML migration (conversion to XML).
Version 3 of MULTEXT-East resources brings together the first two releases (TELRI and CONCEDE), makes them available in TEI P4 XML, and introduces further extensions, e.g. the annotated ‘1984’ in Serbian and the specification for Resian, a dialect of Slovene. This dataset, unique in terms of languages and the wealth of encoding, is extensively documented, and freely available for research purposes, according to the MULTEXT-East research licence; a local text copy is provided for reference.
The published paper describing this version of the resources is:
Tomaž Erjavec: MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In the Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC'04 (ELRA). Paris 2004. [PDF]
Please acknowledge the use of resources in publications by citing the above paper, or, if more relevant, others from the MULTEXT-East annotated bibliography.
The rest of this document is structured as follows. In the next subsection, we give a synopsis of the languages that this distribution provides resources for. Section 2 ‘MULTEXT-East resources’ describes the resources offered in detail, with sections on the MULTEXT-East morphosyntactic resources, the MULTEXT-East cesDoc corpus, and the MULTEXT-East ‘1984’ cesAna corpus. Section 3 ‘Bibliography and related research’ essentially points to the annotated bibliography, Section 4 ‘Distribution’ discusses the availability of the corpus, and Section 5 ‘Contributors’ lists the contact points for particular languages and acknowledges those who were involved in producing the resources.
Below we list the languages represented in the resources, with the link to their respective Ethnologue ISO 639 entries.
This section gives the details of the MULTEXT-East resources Version 3: Section ‘MULTEXT-East morphosyntactic resources’ details the three linked word-level syntactic resources: the specification for morphosyntactic descriptions, the morphosyntactic lexica, and word-level annotated corpus (‘1984’); Section ‘MULTEXT-East cesDoc corpus’ introduces the structurally marked up corpus, consisting of a parallel part (again ‘1984’), two comparable parts (fiction, newspapers), and a small speech corpus; Section ‘MULTEXT-East ‘1984’ corpus’ revisits this corpus and recaps the relevant data in somewhat more detail.
By far the most useful part of the MULTEXT-East project deliverables proved to be the morphosyntactic resources, which were later re-released in the CONCEDE edition and consist of three layers:
In the next sections we detail each of these layers in turn.
Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian, Croatian, Resian.
The syntax and semantics of the morphosyntactic descriptions (MSDs) are given in the MULTEXT-East morphosyntactic specifications, which have been developed in the formalism and on the basis of specifications for six Western European languages of the MULTEXT project and in cooperation with EAGLES, the Expert Advisory Group on Language Engineering Standards.
Originally, these specifications were released as a report of the MULTEXT-East project but have been revised for both subsequent releases. The complete specifications are structured as a report, and contain introductory chapters, followed by the list of defined categories (parts-of-speech), and then, for each category, a table of attribute-values, and the languages the features are appropriate for. These so called common tables are followed by language particular sections. Each language section is further subdivided, and can contain feature co-occurrence restrictions, examples, notes, and full lists of valid MSDs, as well as localisation information. The formal core of the specifications resides in the common tables, as they define the features, their codes for MSD representation, and their appropriateness for each language.
Technically, the complete specifications are a LaTeX document (with derived PDF and HTML renderings), where the common tables are plain ASCII in a strictly defined format. This format is suitable for viewing, and reasonably manageable for modification and for the addition of new languages. However, it is not appropriate for processing, in particular for enabling smooth manipulation and linking to an XML encoded corpus using the MSDs. We have therefore implemented a (Perl) conversion of the common tables into XML, to the TEI.fs module, a tagset devoted to encoding feature structures. This tagset is currently being used as the basis of an evolving ISO standard (currently a Draft International Standard), as part of the work of ISO/TC 37/SC4 Language Resource Management.
In the distribution, the specifications are given in the original and several derived encodings; furthermore, closely related reports of MULTEXT and EAGLES are provided:
Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.
The MULTEXT-East morphosyntactic lexicons have a simple structure, where each lexical entry is composed of three fields:
To produce the lexica, the token lists of the MULTEXT-East corpus were first fed through morphological analysers in order to produce the lemma list; this list was further extended from the comparable corpus, to arrive at a minimum of 15,000 lemmas. Some languages have extended this further, e.g., Romanian to 41,000 lemmas. In the next step, the lemmas were fed back to morphological generators (except for the agglutinative languages) in order to produce the complete inflected lists, i.e., the full paradigms of the lemmas, which constituted the final lexica of the project.
The MULTEXT-East lexica serve as medium sized morphological lexica for the languages. In addition to explicating the inflectional behaviour of the most common (and, typically, morphologically the most complex) words of the languages, the lexica also serve to establish the definitive set of valid MSDs for the languages.
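As a rough illustration of how a lexicon can act as a registry of valid MSDs, the sketch below reads lexical entries and collects the MSDs they use. It assumes, and this is an assumption rather than something stated here, that each entry is one tab-separated line holding a word-form, a lemma and an MSD. The sample entries and the function name `collect_msds` are invented for the example.

```python
from collections import defaultdict


def collect_msds(lines):
    """Map each MSD found in the entries to the set of lemmas it occurs with."""
    msds = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        # Assumed entry layout: word-form <TAB> lemma <TAB> MSD
        wordform, lemma, msd = line.split("\t")
        msds[msd].add(lemma)
    return dict(msds)


# Invented sample entries, not copied from the distributed lexica.
sample = [
    "it\tit\tPp3ns",
    "houses\thouse\tNcnp",
    "cats\tcat\tNcnp",
]

registry = collect_msds(sample)
print(sorted(registry))  # the set of MSDs attested in the sample
```

With a real lexicon file, the same function could be given the open file object in place of the `sample` list.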
To serve as a standard registry of MSDs, we converted the lexical MSDs to TEI feature structure libraries, fsLib, one for each category. Here each MSD is expressed as a feature structure specifying its id, the language(s) it is appropriate for, and its decomposition into features.
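To make the decomposition concrete, here is a minimal sketch that expands an MSD string such as Pp3ns into attribute-value pairs. It assumes a positional reading, where the first character names the category and each following character gives the value of the attribute at that position, with "-" taken to mean "not applicable". The small tables are hand-written for this one example; the authoritative attributes, codes and their order are those given in the common tables of the specifications.

```python
# Illustrative fragment only: the attribute order and value codes below are
# assumptions made for this example, not a copy of the common tables.
CATEGORIES = {"P": "Pronoun"}
ATTRIBUTES = {
    "P": [
        ("Type", {"p": "personal"}),
        ("Person", {"1": "first", "2": "second", "3": "third"}),
        ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
        ("Number", {"s": "singular", "p": "plural"}),
    ],
}


def decode_msd(msd):
    """Expand a positional MSD string into a dict of features."""
    category = msd[0]
    features = {"Category": CATEGORIES[category]}
    for code, (attribute, values) in zip(msd[1:], ATTRIBUTES[category]):
        if code == "-":
            continue  # assumed marker for a non-applicable attribute
        features[attribute] = values[code]
    return features


print(decode_msd("Pp3ns"))
```

A dictionary of this kind mirrors, in spirit, what a feature structure in an fsLib records for one MSD, although the XML encoding itself is not reproduced here.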
The structure and contents of the lexica are explained in MULTEXT-East report MTE D1.2 M: Language-specific Resources (but note that the lexica have in the meantime been revised, so the details in this report are no longer correct).
The lexica themselves are available in the directory lex/, where there is also a README file and possibly a file giving various counts on the lexica (note that WWW access to this directory is restricted).
Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Serbian.
This is the centrepiece of the resources, as it contains word-level markup, namely context-disambiguated lemmas and MSDs, e.g., <w lemma="it" ana="Pp3ns">it</w>. This so-called cesAna corpus is suitable for PoS tagging experiments; because it was the first such resource for many of the languages, it was also the most difficult to produce, as the work had to proceed mostly manually.
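For tagging experiments, the word-level markup can be read into (token, lemma, MSD) triples. The sketch below does this with Python's standard XML parser on a small inline sample. Only the <w> element with its lemma and ana attributes comes from the example above; the enclosing <s> element, its id and the second word are assumptions added to make the sample well-formed.

```python
import xml.etree.ElementTree as ET

# Invented sample: the <s> wrapper, its id and the second <w> are assumptions.
SAMPLE = """
<s id="sample.1">
  <w lemma="it" ana="Pp3ns">it</w>
  <w lemma="rain" ana="Vmip3s">rains</w>
</s>
"""


def tagged_tokens(xml_text):
    """Return a list of (token, lemma, MSD) triples for every <w> element."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("lemma"), w.get("ana")) for w in root.iter("w")]


for token, lemma, msd in tagged_tokens(SAMPLE):
    print(token, lemma, msd, sep="\t")
```

The resulting triples can be fed directly to most tagger training tools, which typically expect one token and one tag per line.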
The corpus is available in the directory crp/, where the driver file is mte-cesana.xml (note that WWW access to this directory is restricted).
The extended MULTEXT-East resources were first released in the scope of the TELRI concerted action, first on a CD-ROM and later through TRACTOR, the TELRI Research Archive of Computational Tools and Resources.
For Version 3 the structurally annotated texts from this edition (so called cesDoc corpus, as this was the format it was originally encoded in) were converted from SGML to XML. The corpus is stored as one TEI P4 document with the <teiCorpus.2> root element, comprising the header, and the component texts, i.e. the multilingual parallel speech, comparable fiction and news, and parallel ‘1984’ texts. The corpus is further documented in its corpus and text headers, and also in the original MULTEXT-East report D2.1 F: Corpus Collection and Preparation, although the information there is no longer current in all respects.
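The corpus layout described above, a <teiCorpus.2> root holding a corpus header followed by component texts, can be explored with a few lines of Python. The sketch below runs on an invented miniature corpus: the ids, the titles and the assumption that each component is a <TEI.2> element with its own teiHeader are illustrative. The real driver file may pull in its components by other means (for example entity references), which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Invented miniature corpus; ids and titles are hypothetical.
SAMPLE = """
<teiCorpus.2>
  <teiHeader>
    <fileDesc><titleStmt><title>Sample corpus</title></titleStmt></fileDesc>
  </teiHeader>
  <TEI.2 id="text-a">
    <teiHeader>
      <fileDesc><titleStmt><title>First text</title></titleStmt></fileDesc>
    </teiHeader>
    <text><body><p>Some text.</p></body></text>
  </TEI.2>
  <TEI.2 id="text-b">
    <teiHeader>
      <fileDesc><titleStmt><title>Second text</title></titleStmt></fileDesc>
    </teiHeader>
    <text><body><p>More text.</p></body></text>
  </TEI.2>
</teiCorpus.2>
"""


def list_texts(xml_text):
    """Return (id, title) for each component text of a teiCorpus.2 document."""
    root = ET.fromstring(xml_text)
    if root.tag != "teiCorpus.2":
        raise ValueError("expected a <teiCorpus.2> root element")
    texts = []
    for tei in root.iter("TEI.2"):
        title = tei.findtext("teiHeader/fileDesc/titleStmt/title")
        texts.append((tei.get("id"), title))
    return texts


print(list_texts(SAMPLE))
```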
The corpus is available in the directory crp/, where the driver file is mte-cesdoc.xml. The complete corpus texts are also available in HTML, in the directory htm/ (driver file), produced with Sebastian Rahtz's TEI stylesheets (note that WWW access to these directories is restricted).
The table below links to the HTML view of the MULTEXT-East cesDoc corpus, namely to the teiHeaders and the texts.
Table 1: MULTEXT-East cesDoc corpus, HTML view
Language | "1984" | Speech | Fiction | News |
English | Header Text | Header Text | | |
Bulgarian | Header Text | Header Text | Header Text | Header Text |
Czech | Header Text | Header Text | Header Text | Header Text |
Estonian | Header Text | Header Text | Header Text | Header Text |
Hungarian | Header Text | Header Text | Header Text | Header Text |
Romanian | Header Text | Header Text | Header Text | Header Text |
Slovene | Header Text | Header Text | Header Text | Header Text |
Lithuanian | Header Text | | | |
Serbian | Header Text | | | |
Russian | Header Text | | | |
Languages: Romanian, Slovene, Estonian, Hungarian, (English, Czech, Bulgarian)
MULTEXT-East produced a small corpus of spoken texts taken from the EUROM-1 speech corpus. It comprises the translations (from English) of forty short passages of five thematically connected sentences. For four languages, the texts have also been read, recorded and included in the distribution. The corpus texts contain links to the spoken passages, which have for V3 been normalised in terms of volume, and stored as .wav files. The speech files are, due to their size, stored and distributed in a separate bundle.
Languages: Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian.
The multilingual comparable corpus contains a fiction part and a news part, where the data is comparable across the languages in terms of the number and size of texts; each of the 12 parts has approx. 100,000 words. The corpus is structurally marked up with over 40 different elements; however, sub-paragraph markup has not been harmonised across the languages.
Languages: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, (Latvian), Lithuanian, Serbian, Russian.
The multilingual parallel corpus consists of the novel ‘1984’, about 100,000 words in length. The corpus contains extensive headers and markup for document structure, sentences, and various sub-sentence annotations; these are similar to those of the comparable corpus, but better harmonised across the languages.
The translations of ‘1984’ have been automatically sentence aligned with the English original, and the alignments hand-validated. The bilingual alignments are valid to xcesAlign.dtd, i.e., are stored not with the primary data but in separate documents, as references to sentence IDs, e.g., <link xtargets="Osl.1.2.6.6 ; Ocs.1.1.5.6 Ocs.1.1.5.7"/>.
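Because the alignments are stored as references to sentence IDs, reading them amounts to splitting the xtargets value. The sketch below parses the link shown above, assuming (as the example suggests) that a semicolon separates the two sides of the alignment and whitespace separates the sentence IDs within a side. The function name `parse_link` is made up for this example.

```python
import xml.etree.ElementTree as ET


def parse_link(link_xml):
    """Split a <link xtargets="..."/> element into two lists of sentence IDs."""
    link = ET.fromstring(link_xml)
    # Assumed syntax: "ids of side one ; ids of side two"
    sides = link.get("xtargets").split(";")
    return [side.split() for side in sides]


example = '<link xtargets="Osl.1.2.6.6 ; Ocs.1.1.5.6 Ocs.1.1.5.7"/>'
print(parse_link(example))
# [['Osl.1.2.6.6'], ['Ocs.1.1.5.6', 'Ocs.1.1.5.7']]
```

Here one sentence on the first side corresponds to two sentences on the second, i.e. a 1:2 alignment. Under the same assumption, an empty side would come out as an empty list.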
The cesDoc encoded novel served as the basis for producing the linguistically annotated version. The link between the two is maintained via sentence identifiers.
The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This annotated parallel corpus contains the novel in the English original (about 100,000 words in length), and its translations into a number of languages.
George Orwell, Nineteen Eighty-Four:
It was an enormous pyramidal structure of glittering white concrete, soaring up, terrace after terrace, 300 metres into the air. From where Winston stood it was just possible to read, picked out on its white face in elegant lettering, the three slogans of the Party:
The table below gives, for each language, the HTML rendering of: the Text of the novel from the cesDoc corpus; the relevant section from the MULTEXT-East D2.1 Report; the TEI cesDoc header of "1984"; and, where available, the TEI cesAna header, together with the language-particular section of the MULTEXT-East Morphosyntactic Specification for MSDs.
Table 4: MULTEXT-East "1984", HTML view
All | Text | Report | cesDoc header | cesAna header | MSD |
English | Text | Report | cesDoc header | cesAna header | MSD |
Bulgarian | Text | Report | cesDoc header | cesAna header | MSD |
Czech | Text | Report | cesDoc header | cesAna header | MSD |
Estonian | Text | Report | cesDoc header | cesAna header | MSD |
Hungarian | Text | Report | cesDoc header | cesAna header | MSD |
Romanian | Text | Report | cesDoc header | cesAna header | MSD |
Slovene | Text | Report | cesDoc header | cesAna header | MSD |
Latvian | | Report | | | |
Lithuanian | Text | Report | cesDoc header | | |
Serbian | Text | Report | cesDoc header | cesAna header | MSD |
Russian | Text | | cesDoc header | | |
The novel is encoded in TEI P4 and exists in two versions:
Apart from the marked-up texts themselves the corpus has also two other components:
The first edition of the "1984" corpus is detailed in the MULTEXT-East project reports MTE D2.1 F "'1984' Corpus Collection" (with Appendix) and MTE D2.3 F "Sentence Alignment" and "Morphosyntactic Tagging". However, note that the encoding has changed substantially from the first MULTEXT-East version described in the preceding reports. The details of the encoding for Version 3 are to be found in the TEI headers of the two corpora, i.e. the cesDoc teiHeader and the cesAna teiHeader. The details of the morphosyntactic descriptions used to word-tag V3 of the corpus are given in the MULTEXT-East Morphosyntactic Specifications, and also in the MSD library teiHeader, which is a part of the cesAna corpus.
The resources are mounted on the Web, on the MULTEXT-East Version 3 Web site. The documentation and the morphosyntactic specification are freely available, while the corpora and lexica are restricted to research use only. To get access to these resources, the Web based MULTEXT-East research licence should be filled out and submitted. The password is then sent by email.
Registered users can browse the full resources on-line, or they can download them; they are distributed as gzipped tar files (.tgz). Due to the size of the resources and the fact that different users are likely to use only parts, they are available not only as a complete download, but also split by resource type:
After downloading, unpack the archives. For example, say the distribution was made on 2004-05-05 and you want everything except speech. Then you download -doc, -crp, and -ana, and, on a Unix machine, run:
$ tar xfz mteV3-2004-05-05-doc.tgz
$ tar xfz mteV3-2004-05-05-crp.tgz
$ tar xfz mteV3-2004-05-05-ana.tgz
All three archives would unpack into the directory mteV3-2004-05-05/
At some point we plan to distribute the MULTEXT-East V3 resources also on CD-ROM. If this would be of interest, please get in touch.
An annotated bibliography is provided in a separate folder, bib/, in HTML and PDF. Some papers are also mirrored there, in particular the one describing Version 3 of the MULTEXT-East resources:
Tomaž Erjavec: MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In the Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC'04 (ELRA). Paris 2004
This section lists the contributors to the MULTEXT-East resources (Version 3), starting with the partners of the MULTEXT-East project, followed by the partners of the TELRI concerted action that contributed resources to MULTEXT-East, additional contributors, and, finally, acknowledgements to all the other people who made the production of these resources possible. Further information on who did what can be found in the corpus headers and on the title page of the morphosyntactic specifications. Each partner's responsibility is listed next to their name; for resources of particular languages, this contact point should also be used for further inquiries.
Listed below are the partners of the original Copernicus MULTEXT-East project.
Apart from the institutions and people listed above, the following people have greatly contributed to the production of the MULTEXT-East language resources: Renata Anžič, Liviu Anca, Ana-Maria Barbu, Aleksandra Bizjak, Damjan Bojadžiev, Lydia Bozhilova, Aleš Dobinikar, Külli Habicht, Daniel Hirst, Milena Hnátková, Primož Jakopin, Riina Mosna, Kadri Muischnek, Mircea Nicolescu, Heili Orav, Vasile Pătraşcu, Leho Paldre, Tsvetan Petrov, Helen Potter, Andriela Rääbis, Georgiana Rotariu, Matjaž Sešek, Tanja Semen, Urve Talvik, Bojana Todorovič, Elias Treeman, Viire Villandi, and Olga Vuković.
The work on MULTEXT-East resources was supported by the European Union project MULTEXT-East: Multilingual Text Tools and Corpora for Central and Eastern European Languages, Copernicus 106, and in part by the US National Science Foundation grant IRI-9413451. The MULTEXT-East project results have been greatly enhanced by the EU Concerted Action TELRI: Trans-European Language Resources Infrastructure. In particular, additional language resources have been produced, and the project's material organised for CD distribution. Work on the second release of the MULTEXT-East resources was supported by EU Copernicus Project PL96-1142 CONCEDE: Consortium for Central European Dictionary Encoding, while the work on the third release was partially funded by grants from the National Endowment for the Humanities, in the scope of the TEI Task Force on SGML to XML migration. The work on the resources has been additionally supported by national funding bodies and individual partners' grants and contracts.