The MULTEXT-East text corpus has two main components: one set that serves as the basis for translation studies, and two sets that allow comparable studies to be carried out in different languages.
The first set (7 × 100k words) is the Multilingual Parallel
Corpus, consisting of the English original and the translated data in
the six languages of the project. The parallel data chosen for the
project is the novel "1984" by George Orwell. The choice of data
was motivated by the availability of translations in all six
languages and the availability of the English original in digital form
from the Oxford Text Archive via the European Corpus Initiative.
Currently, all seven "1984" components of the multilingual parallel corpus have been collected and SGML validated. Details of this set can be found in Chapter 2.
The remaining two sets (2 × 6 × 100k words) constitute the
Multilingual Comparable Corpus. The intention was for one set to
consist of six complete novels written by native speakers of the
respective languages, and for the other to consist of daily newspaper
articles from the six countries. This goal was not fully met because
of copyright and availability problems; the composition of the two
sets is given below.
Multilingual Comparable Corpus, Fiction:
Multilingual Comparable Corpus, Newspapers:
Currently, a large majority of the multilingual comparable corpus has been collected and SGML validated. Details of this set can be found in Chapter 4.
Most of the partners have by now obtained written license agreements for their component corpora. However, securing such agreements has been very time-consuming and, in some cases, unsuccessful in the first instance, which required modifying the corpus structure. As there is little precedent in these matters in Central and Eastern European countries, this is hardly surprising: ready-made license agreements were not available, and publishers were in most cases wary of agreeing to have their data publicly available, fearing a potential loss of royalties from, e.g., third-party reprints of their materials. This situation is somewhat contradictory: on the one hand, the publishers regard their texts as highly sensitive materials, while, on the other, there is practically no mechanism for enforcing copyright laws in most of the countries of the project. The details of the particular license agreements for the component corpora are given in the language sections of the next three chapters.
Our intention is that the MULTEXT-East corpora will be made publicly available, probably under the auspices of the new European Language Resources Association (ELRA). In the second year of the project, an agreement should be set up with ELRA, specifying the conditions of the final MULTEXT-East corpus distribution.