The MULTEXT-East corpus is divided into a text corpus and a speech
corpus. The text corpus consists of a parallel part, which serves as
the basis for translation studies, and two comparable parts, which
allow comparable studies to be carried out across the different
languages. The overall composition is thus as follows:
1. Multilingual Parallel Corpus: 1984
This set (7 x 100k words) is the multilingual parallel
corpus, consisting of the English original and its translations
into the six languages of the project. The text chosen for the
parallel corpus was the novel "1984" by George Orwell. The choice
was motivated by the availability of translations in all six
languages and the availability of the English original in digital
form from the Oxford Text Archive via the European Corpus
Initiative. Details of this set can be found in Chapter 2.
2. Multilingual Comparable Corpus: Fiction
The intention was to make this first part of the comparable
corpus (6 x 100k words) a set of six complete novels written
by native speakers of the MULTEXT-East languages. This goal was not
fully met, because of copyright restrictions and difficulties in
obtaining the texts. In some cases the language components of this
part therefore consist either of excerpts from novels or of
collections of short stories.
Details of this set can be found in Chapter 3.
3. Multilingual Comparable Corpus: News
This second part of the comparable corpus (6 x 100k words) is
composed of newspaper articles from the six countries
of the project. Details of this set can be found in
Chapter 4.
4. Multilingual Speech Corpus: EUROM
Finally, the MULTEXT-East corpus includes a small parallel speech corpus.
For this corpus, a sample (200 sentences) of the English part of the
EUROM1 multilingual speech database was selected. This text was
translated into the six languages and, for all languages except
Bulgarian and Czech, recorded by one male speaker and digitised in
accordance with the EUROM recommendations. The texts were also encoded
in CES format, so that they form part of the overall MULTEXT-East CES
corpus. Details of this set can be found in Chapter 5.
For each corpus component that was obtained in digital form, the
original file has also been preserved.
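
The nominal sizes quoted above for the text corpus can be tallied in a few lines. The following is only an illustrative sketch: the class name `TextComponent`, the list `TEXT_COMPONENTS` and the function `nominal_text_words` are hypothetical and not part of any MULTEXT-East tooling, and the figures are the round target sizes from the list above, not measured word counts. The speech corpus is left out because its size is given in sentences (200) rather than in words.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextComponent:
    """One part of the text corpus, as described in the list above."""
    name: str
    kind: str                 # "parallel" or "comparable"
    language_versions: int    # number of language versions in this part
    words_per_version: int    # nominal target size, not a measured count
    chapter: int              # chapter holding the details of this part


# Hypothetical summary of the three text corpus parts (nominal sizes only).
TEXT_COMPONENTS = [
    # English original plus six translations.
    TextComponent("Multilingual Parallel Corpus: 1984", "parallel", 7, 100_000, 2),
    TextComponent("Multilingual Comparable Corpus: Fiction", "comparable", 6, 100_000, 3),
    TextComponent("Multilingual Comparable Corpus: News", "comparable", 6, 100_000, 4),
]


def nominal_text_words(components):
    """Sum the nominal word counts over all language versions."""
    return sum(c.language_versions * c.words_per_version for c in components)


if __name__ == "__main__":
    for c in TEXT_COMPONENTS:
        size = c.language_versions * c.words_per_version
        print(f"{c.name}: {size:,} words (Chapter {c.chapter})")
    print(f"Nominal text corpus total: {nominal_text_words(TEXT_COMPONENTS):,} words")
```

Under these assumptions the text corpus has a nominal size of 7 x 100k + 6 x 100k + 6 x 100k words; the actual sizes of the individual components are documented in the chapters cited above.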