
Description of the Corpus

The MULTEXT-East text corpus has two main components: one set that serves as the basis for translation studies, and two sets that allow comparable studies to be carried out in different languages.

The first set (7 × 100k words) is the Multilingual Parallel Corpus, consisting of the English original and the translated data in the six languages of the project. The parallel data chosen for the project is the novel "1984" by George Orwell. The choice of data was motivated by the availability of translations in all six languages and the availability of the English original in digital form from the Oxford Text Archive via the European Corpus Initiative.

Currently, all seven "1984" components of the multilingual parallel corpus have been collected and SGML validated. Details of this set can be found in Chapter 2.

The second two sets (2 × 6 × 100k words) constitute the Multilingual Comparable Corpus. The intention was for one set to consist of six complete novels written by native speakers of the respective languages, and for the other to consist of daily newspaper articles from the six countries. This goal was not met completely, owing to copyright and obtainability problems; the composition of the two sets is given below.

Multilingual Comparable Corpus, Fiction:

Bulgarian: one novel and four collections of short stories
Czech: excerpts from the novel: Anna Hostomská "Opera -- průvodce operní tvorbou"
Estonian: 51 excerpts from novels
Hungarian: excerpts from four novels
Romanian: two novelettes and a novel by Mihai Radulescu
Slovene: the novel "Galjot" by Drago Jancar

Details of this set can be found in Chapter 3.

Multilingual Comparable Corpus, Newspapers:

Bulgarian: "Kontinent" daily
Czech: "Lidové noviny" daily
Estonian: articles from 11 newspapers
Hungarian: "Magyar Hirlap" daily
Romanian: "România Liberă" daily
Slovene: "Dnevnik" daily

Currently, a large majority of the multilingual comparable corpus has been collected and SGML validated. Details of this set can be found in Chapter 4.

Licence agreements

Most of the partners have by now obtained written licence agreements for their component corpora. However, securing such agreements has been very time-consuming and, in some cases, unsuccessful in the first instance, which required modifying the corpus structure. This is hardly surprising, as there is little precedent in these matters in Central and Eastern European countries: ready-made licence agreements were not available, and publishers were in most cases wary of agreeing to have their data publicly available, fearing potential loss of royalties from, e.g., third-party reprints of their materials. The situation is somewhat contradictory: on the one hand, the publishers regard their texts as highly sensitive materials, while, on the other, there is practically no mechanism for enforcing copyright laws in most of the countries of the project. The details of the particular licence agreements for the component corpora are given in the language sections of the next three chapters.

Our intention is that the MULTEXT-East corpora will be made publicly available, probably under the auspices of the new European Language Resources Association (ELRA). In the second year of the project, an agreement should be set up with ELRA, specifying the conditions of the final MULTEXT-East corpus distribution.


Tomaz Erjavec
Sat May 18 20:25:31 MDT 1996