East meets West:
A Compendium of Multilingual Language Resources


Tomaz Erjavec
Dept. for Intelligent Systems,
Institute Jozef Stefan,

Ann Lawson
Abteilung Lexik
Institut für deutsche Sprache

Laurent Romary
Lorraine, Nancy

1. Overview

The TELRI Concerted Action has released a CD-ROM containing multilingual language resources, mainly corpora, and tools for language engineering. This product represents the concrete results of the joint research aspect of the project, which brought together partners across Europe to work together on a close and practical level. The CD-ROM provides standardised resources for a large number of languages, mainly from non-EU countries, for which such resources still tend to be scarce. The making of these resources served to foster the use of existing conventions and recommendations such as TEI, EAGLES, MULTEXT in Central and Eastern European countries.

The CD-ROM consists of two volumes. The contents of the first volume arose entirely within the TELRI action, and were produced in a process of dissemination of expertise through the work on a common corpus. The second volume contains the results of the EU MULTEXT-East project, which applied standards to a large variety of resource types: corpora, lexica and tools. This project finished in 1997 and its results have been enhanced and prepared for CD distribution in the scope of TELRI.

The two volumes are in broad agreement as to the kind of resources they offer, and the type of encoding they use, but they reflect the different organisational structures from which they grew. TELRI was a concerted action acting on a broad front, with no funds devoted to labour. Most members also had no prior experience of SGML/TEI and had no dedicated tools available to annotate their texts. So, for example, the "Plato" corpus was encoded directly in generic TEI and TEI Lite (Sperberg-McQueen and Burnard, 1994), rather than trying to emulate more complex schemes, such as the PAROLE specification (Ridings, 1996). MULTEXT-East, on the other hand, was a Joint Project with dedicated funds and so could be more focused on the kinds of resources it produced. Here the partners shared platforms and tools, could adopt a more unified encoding practice and were involved in the definition of conventions which would be more applicable in a linguistic engineering context. So, for example, the MULTEXT-East corpus was prepared using MULTEXT tools (Ide and Véronis, 1994) and is encoded in accordance with the Corpus Encoding Specification, CES (Ide et al., 1996), both developed for such a context. MULTEXT-East also had a clearly defined set of languages that it aimed to produce resources for, namely Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene, with English serving as the 'hub' of the parallel corpus and as the meta-language of the project.

The production of the two volumes was enhanced by the two teams working in cooperation, as they had to solve similar difficulties (scanning, copyright, etc.) and could share know-how through workshops and exchange various software tools. The concrete result of this cooperation is a double CD, "East meets West: A Compendium of Multilingual Language Resources". In the next sections we give an overview of its contents, concentrating on corpora, lexica, and tools. The paper finishes with a discussion of the CD-ROM's distribution and its prospective uses.

2. The Corpora

Each volume of the CD-ROM contains an annotated multilingual corpus. The two corpora differ in composition and encoding, reflecting the different practices adopted in their creation.

The first volume contains a parallel corpus, comprising Plato's "Republic" in twenty-one languages. This corpus grew out of the work within the TELRI working groups WG7 Joint Research, WG5 TELRI Service Pool, and WG4 Lingware Availability. The "Joint Research" WG was formed with the aim of encouraging collaboration between as many of the TELRI partners as possible. For this reason, it was decided to focus on building a sample parallel corpus of translations of one text. Such corpora are able to furnish researchers with considerable information about language patterning in general, for instance on translational equivalents, collocations and phraseological units. At least initially, the size of this corpus was of less import than the coverage across the languages represented in TELRI, and the active involvement of as many partners as possible.

Plato's "Republic" was chosen as the main text for developing a parallel corpus. As explained above, a single text was selected for reasons of practicability. A well-known classical text was preferred on the grounds that translations in most, if not all, of the TELRI project languages would be likely to exist. In addition, since none of the TELRI project members is a native speaker of Ancient Greek, no partner would be privileged by having the original version in their own language. All the texts produced and examined in the parallel corpus are "target texts", since only the Greek is the "source". Furthermore, the age of the text, and hence also of many of the translations, would, it was hoped, eliminate many troublesome copyright issues.

Each project member sought out suitable translations of the "Republic" in their own language. Translations could be found in almost all the project languages, with the exception of Estonian and Albanian. In several languages, more than one translation was available, such as for English and Czech. Partners tried to find versions of the texts already in electronic form, whenever possible. These could be downloaded from the Internet, bought on CD-ROM or acquired direct from the publishers. If no electronic version was available, partners scanned the texts using OCR scanners. In exceptional cases, when the quality of the type was too poor to be recognised by the scanner, the text was typed in by hand. TELRI was able to offer financial assistance for such technical matters.

The conditions varied greatly from country to country, as was expected. The extremes were on the one hand the English texts, with an abundance of websites and downloadable texts (although also with their own problems) and on the other the situation in, for instance, Latvia. In Latvian, no full translation could be found, so the edited partial translation was encoded and used. In some cases, not only was there no electronic version available, but indeed the translation was out of print and the only copies were to be found in libraries or in private ownership. Since scanning requires a flat page and libraries object to their books being torn apart, in these cases the texts were painstakingly keyed in.

Another important issue was that of copyright. Copyright holders were approached for permission to use their texts in academic research work. A letter was prepared by the Group for partners to use when writing to copyright holders to ask permission to use their text in project work. When texts were not or no longer in copyright, because of their age for instance, no permission was sought but the publisher or source of the text was always acknowledged. The situation differed in each country and various individual solutions were worked out.

The choice of text made for some interesting problems. Recognised as a fundamental philosophical text, "The Republic" has been published in different forms over many centuries. Some of the editions published are treated as "standards". However, for the purposes of the project, they proved not to be very standard. Most versions contained markings or annotations of various kinds, but there was little consistency. Initially, we hoped to be able to use these markings to assist the alignment process and encouraged sites scanning or typing in the texts to retain them. It was then decided to use a particular set of markings, and they were manually inserted into those texts which did not already have them.

We then set about encoding the data. While all partners retained a plain-text version of the "Republic", which is also included on the CD-ROM, it was also encoded according to a widely-accepted convention, namely the TEI guidelines. This involves SGML, a method of encoding electronic texts with information about the structural layout and content of the document. This is of use when analysing the structure of the document, automatically aligning and manipulating it. Most project partners had little or no experience of this particular encoding method at the start of the project. The texts were encoded according to the TEI guidelines using the following conventions:

On the basis of this encoding, the corpus has been automatically aligned up to the level of sentences using the hierarchical aligner devised by Bonhomme et al. (1995) and compiled into a series of HTML files of intertwined texts by pairs of languages. Plain ASCII and HTML versions of each text have also been included on the CD-ROM.

All the texts were then uploaded to the Mannheim TELRI ftp site, along with a brief information file detailing the source, status of encoding and person responsible for the version. In this way, all TELRI partners could download the texts for their own use and, if necessary, further encode or alter the text in collaboration with the partner originally responsible. There was thus an interchange between partners, while the integrity of each version was retained.

The Working Group used the texts to test corpus alignment software. The aligning of texts enables the user to compare the translations of particular words or phrases. This is of use for investigating language, training and improving computer-aided translation tools and also for CALL (Computer Assisted Language Learning). For alignment, some segmentation of the texts had to be undertaken. This commonly involved inserting codes to specify where sentence or paragraph boundaries occur. These could then be recognised by the alignment program, along with the markers described above. Many partners have used the parallel texts to examine specific linguistic features of a chosen language pair. This can reflect on the peculiarities of the languages involved and on the nature of translation generally. Much of this research work can also be found on the CD-ROM, with general information about the project.

The second volume contains the corpus of the MULTEXT-East project, with further additions due to TELRI. The corpus is composed of three parts, namely a multilingual parallel corpus, similar to the "Republic" one, a multilingual comparable corpus and a small multilingual speech corpus. The complete corpus has been marked up with header and structural information, and is encoded in the Corpus Encoding Specification, CES (Ide et al., 1996), a TEI-like encoding scheme.

The parallel corpus contains the novel '1984' by George Orwell in the original, and translations into the six MULTEXT-East languages. In the scope of TELRI, the corpus was extended by three new translations, namely the Lithuanian, Latvian, and Serbian. The parallel corpus has been marked and validated for sentence boundaries and alignment. Alignment is between the English version and each of the ten languages, thus giving ten pair-wise alignments. The alignments themselves are not included in the primary data, but are expressed in a separate document, which contains ID references to the aligned sentences. The alignments are thus 'flat', i.e. they do not attempt to model the hierarchical structure of the aligned documents. Furthermore, the MULTEXT-East seven-language parallel corpus has been tokenised, and each word token assigned its part-of-speech, or, more accurately, its morphosyntactic description. Arriving at such an annotated corpus involved first developing a harmonised set of lexical specifications, as discussed below. It should be noted that this corpus represents the first such effort for most of the languages involved.
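The stand-off arrangement described above, in which the alignment lives in a separate document and refers to sentences by ID, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CES alignment format: the sentence IDs, the placeholder sentences, the language code "xx" and the function name `resolve` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical primary data: each text maps sentence IDs to sentence text.
# Both the IDs and the sentences are placeholders, not corpus content.
english: Dict[str, str] = {
    "en.1": "First English sentence.",
    "en.2": "Second English sentence.",
    "en.3": "Third English sentence.",
}
target: Dict[str, str] = {
    "xx.1": "First target sentence.",
    "xx.2": "Second and third, merged by the translator.",
}

# The alignment is kept apart from the texts and holds only ID references.
# Each link is a flat pair of ID lists, so 1:1 and 2:1 links look alike and
# no document hierarchy (chapters, paragraphs) is modelled.
Link = Tuple[List[str], List[str]]
alignment: List[Link] = [
    (["en.1"], ["xx.1"]),
    (["en.2", "en.3"], ["xx.2"]),
]


def resolve(links: List[Link],
            source: Dict[str, str],
            other: Dict[str, str]) -> List[Tuple[str, str]]:
    """Replace the ID references in each link with the sentence text."""
    pairs = []
    for source_ids, other_ids in links:
        source_text = " ".join(source[i] for i in source_ids)
        other_text = " ".join(other[i] for i in other_ids)
        pairs.append((source_text, other_text))
    return pairs


for source_text, other_text in resolve(alignment, english, target):
    print(source_text, "<=>", other_text)
```

Because the primary texts are never modified, the same English text can take part in any number of such alignment documents, one per target language.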

The MULTEXT-East corpus further contains a million-word multilingual comparable corpus in the six languages of the project. The comparable corpus consists of two parts, the first being 'fiction', comprising either a single novel or excerpts from several novels, and the second the 'news' part, comprising articles from daily newspapers.

Finally, a small parallel speech corpus is also included: it comprises six translations of forty passages from EUROM/SAM, and has been recorded and digitised for four of the languages, namely Estonian, Hungarian, Romanian, and Slovene.

3. Lexical Resources

The MULTEXT-East deliverables include substantial harmonised lexical resources for the six languages of the project. They comprise morphosyntactic descriptions, lexica, and, as mentioned above, the Orwell corpus annotated with this information. The morphosyntactic descriptions follow the EAGLES (Monachini and Calzolari 1995) and MULTEXT (Bel, Calzolari and Monachini eds. 1995) specifications, and define 14 grammatical categories (POS), each of them with a number of specific attributes, and define for each attribute the set of allowed values. These specifications are provided in tabular format, and define the attributes/values and which combinations are allowed for a particular language. So, for example, the POS 'Verb' has altogether 13 attributes, with, at the one extreme, Czech, as a highly inflectional language, utilising 10 of them, and Romanian, at the other extreme, only 5. The tables thus provide a word-level morphosyntactic 'grammar' for the six languages. The morphosyntactic descriptions themselves are written in a compact string notation, with the first character giving the POS and defining the meaning of the remaining characters. For these remaining characters, the position in the string gives the attribute, and a one-letter code its value. Furthermore, the special character '-' is used when a certain attribute is not applicable, either for the language or for the particular combination of features. So, for example, the string 'Vmip1d--y' denotes POS:Verb, Type:main, VForm:indicative, Tense:present, Person:first, Number:dual, Gender:not applicable, Voice:not applicable, Negative:yes.
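The positional notation can be made concrete with a small decoder. The Python sketch below is a minimal illustration, assuming only what the example string above reveals: the attribute order and the handful of value codes are read off 'Vmip1d--y', so the tables are deliberately partial and do not reproduce the full specification. The names `VERB_ATTRIBUTES`, `VERB_VALUES` and `decode_msd` are hypothetical.

```python
# Attribute positions for verbs, as far as the example 'Vmip1d--y' shows
# them. The full specification defines more verb attributes than this.
VERB_ATTRIBUTES = ["Type", "VForm", "Tense", "Person",
                   "Number", "Gender", "Voice", "Negative"]

# Only the value codes that occur in the example string are listed.
VERB_VALUES = {
    "Type": {"m": "main"},
    "VForm": {"i": "indicative"},
    "Tense": {"p": "present"},
    "Person": {"1": "first"},
    "Number": {"d": "dual"},
    "Negative": {"y": "yes"},
}


def decode_msd(msd: str) -> dict:
    """Expand a verb MSD string into attribute/value pairs."""
    if not msd or msd[0] != "V":
        raise ValueError("this sketch only knows the verb table")
    features = {"POS": "Verb"}
    # The position in the string selects the attribute; the character
    # at that position is the one-letter code of its value.
    for attribute, code in zip(VERB_ATTRIBUTES, msd[1:]):
        if code == "-":
            features[attribute] = "not applicable"
        else:
            # Codes missing from the partial table are passed through.
            features[attribute] = VERB_VALUES.get(attribute, {}).get(code, code)
    return features


print(decode_msd("Vmip1d--y"))
```

A complete decoder would need one such attribute table per POS, since the first character determines how the remaining positions are interpreted.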

In the lexica of the project, each entry consists of three fields: the word-form, its lemma and its morphosyntactic description. The lexica provide at least 15,000 lemmas for each language, and cover the corpus of the MULTEXT-East project. Except for Estonian and Hungarian, where this is not possible due to the nature of the languages, all the possible word-forms of the lemmas are included in the lexica.
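The three-field entry structure lends itself to simple processing. The Python sketch below reads such entries and groups word-forms by lemma. It rests on assumptions that the text above does not confirm: that the fields are tab-separated with one entry per line, and the sample entries, which are invented for illustration and not taken from the CD-ROM lexica. The names `Entry`, `parse_lexicon` and `forms_by_lemma` are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Entry(NamedTuple):
    word_form: str
    lemma: str
    msd: str


# Invented sample entries; the tab separator is an assumption.
SAMPLE = "walks\twalk\tVmip3s\nwalking\twalk\tVmpp\n"


def parse_lexicon(text: str) -> List[Entry]:
    """Parse one entry per line: word-form, lemma, MSD."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
        entries.append(Entry(*fields))
    return entries


def forms_by_lemma(entries: List[Entry]) -> Dict[str, List[str]]:
    """Collect the word-forms listed under each lemma."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.lemma].append(entry.word_form)
    return dict(grouped)


print(forms_by_lemma(parse_lexicon(SAMPLE)))
```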

4. Corpus Tools

Both volumes of the CD-ROM also contain a variety of software tools. These are either in the public domain, or have been produced by the project partners. Included is a full SGML environment, comprising an SGML mode for the well-known editor Emacs, an SGML parser, and DTDs (TEI/TEI Lite/CES), together with some basic SGML-aware tools (lexical statistics, dictionary look-up etc.). Other tools, namely aligners, taggers and concordancing systems, are also provided.

5. Perspectives

Although the nature of the TELRI and MULTEXT-East resources is clearly specific, the work has more wide-ranging consequences. The CD-ROM provides standardised resources for over 20 languages, and tools to produce further, compatible resources, or exploit the ones already provided. Its possible applications are thus numerous. It could provide data for lexical studies, in particular those of translation equivalents; for teaching languages or training translators; or serve as learning data for taggers, aligners, tokenisers, and similar trainable programs. Aligned language pairings are provided which would otherwise be virtually impossible to find. The CD-ROM uses HTML to structure its contents and provide documentation on the resources. These pages include external HTTP links, pointing to useful environments and documentation. The CD-ROM can thus also serve as a 'primer' for language engineering applications, being especially useful for sites with poor Internet connections. Another important result, though less tangible, lies in the sharing of knowledge and expertise throughout the work. Institutes, indeed countries, with little or no experience in the Language Engineering field have gained expertise in corpus selection, collection, encoding and manipulation. This expertise is now being used to produce new corpora and to encode existing corpus material.

