
Overview

The novel ``1984'' by George Orwell is the central component of the MULTEXT-East corpus. It is the project's parallel text: the English original is sentence-aligned with its translations into the six project languages, and each translation is tagged for part of speech. Although this parallel corpus is small (roughly 7 x 100k words), it is a valuable linguistic resource for the MULTEXT-East languages, especially as the project also delivers lexica that cover the word-forms of ``1984''.

It was therefore important to ensure that the CES markup of ``1984'' is as similar across the languages as the differences between the translations allow. Harmonisation proceeded in a cyclic fashion, with the errors found in the initial alignments guiding the corrections. The greatest care was devoted to checking the structural markup (paragraphs, sentences, lines, items); the density of sub-sentence markup differs more across the languages.

Below we give the estimated number of words for each language; the dot in the figures is a thousands separator. The word counts were produced by removing the SGML tags from the texts and then applying a `wc'-like procedure.

English	104.302
Romanian	101.460
Slovene	91.619
Bulgarian	87.235
Czech	80.366
Hungarian	81.147
Estonian	79.334
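
The counting procedure described above can be sketched as follows. This is only an illustration, not the script used by the project: the regular expression, the function names and the sample markup (including the id attributes) are assumptions made for the example. It treats everything between `<' and `>' as a tag, which is a simplification of SGML, and it counts entity references as ordinary characters.

```python
import re

# Assumption: a tag is any run of characters between '<' and '>'.
# Real SGML also allows comments, marked sections and '>' inside
# attribute values, none of which this pattern handles.
TAG_RE = re.compile(r"<[^>]*>")


def strip_sgml_tags(text: str) -> str:
    """Replace each tag with a space so that words on either side
    of a tag are not fused into one token."""
    return TAG_RE.sub(" ", text)


def count_words(text: str) -> int:
    """Count whitespace-delimited tokens, in the manner of `wc -w`."""
    return len(strip_sgml_tags(text).split())


# Hypothetical markup; the element names and ids are illustrative only.
sample = (
    '<p><s id="s1">It was a bright cold day in April,</s> '
    '<s id="s2">and the clocks were striking thirteen.</s></p>'
)

print(count_words(sample))  # prints 14
```

Replacing a tag with a space rather than with nothing is a design choice: it keeps two words apart when only a tag separates them, at the cost of splitting a word that has a tag in its middle. Punctuation attached to a word is counted as part of that word, as with `wc'.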

The following sections give the details of the encoding of the ``1984'' corpus as a whole. They are followed by sections that detail the MULTEXT-East encoding of the English original and of the Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene translations. For each language we describe the corpus, its structure, the structure of the original from which the CES version was derived, and the markup process that led from one to the other.

Note that the added `linguistic' markup of the ``1984'' corpus (i.e. alignment and part-of-speech annotation) is contained in separate SGML documents hyperlinked to the ``1984'' corpus; it is described in the MULTEXT-East Deliverable D23.


