
Description of the Corpus

The MULTEXT-East corpus is divided into a text corpus and a speech corpus. The text corpus consists of a parallel part, which serves as the basis for translation studies, and two comparable parts, which allow comparative studies to be carried out across the different languages. The overall composition is thus as follows:

1. Multilingual Parallel Corpus: 1984
This set (7 x 100k words) is the multilingual parallel corpus, consisting of the English original and its translations into the six languages of the project. The text chosen for the parallel corpus was the novel "1984" by George Orwell. The choice was motivated by the availability of translations in all six languages and by the availability of the English original in digital form from the Oxford Text Archive via the European Corpus Initiative. Details of this set can be found in Chapter 2.
2. Multilingual Comparable Corpus: Fiction
The intention was to make this first part of the comparable corpus (6 x 100k words) a set of six complete novels written by native speakers of the MULTEXT-East languages. This goal was not met completely, owing to copyright restrictions and difficulties in obtaining texts. In some cases the language components of this part are therefore composed of either excerpts from novels or collections of short stories. Details of this set can be found in Chapter 3.
3. Multilingual Comparable Corpus: News
This second part of the comparable corpus (6 x 100k words) is composed of newspaper articles from the six countries of the project. Details of this set can be found in Chapter 4.
4. Multilingual Speech Corpus: EUROM
Finally, the MULTEXT-East corpus comprises a small parallel speech corpus. For this corpus, a sample (200 sentences) of the English part of the EUROM1 multilingual speech database was selected. This text was translated into the six languages and, for all languages except Bulgarian and Czech, recorded by one male speaker and digitised in accordance with the EUROM recommendations. The texts were also encoded in CES format, to make them part of the overall MULTEXT-East CES corpus. Details of this set can be found in Chapter 5.

For each corpus component that was obtained in digital form, the original file has also been preserved.

