The novel ``1984'' by George Orwell is the central component of the MULTEXT-East corpus: it is the parallel text, in which the English original is sentence-aligned with its translations into the six languages of the project, and each translation is tagged for part-of-speech. Despite its small size (roughly 7 x 100k words), this parallel corpus can constitute a valuable linguistic resource for the MULTEXT-East languages, especially as the project also delivers lexica that cover the word-forms of ``1984''.
It was therefore important to ensure that the CES markup of ``1984'' is as similar across the languages as the differences between the translations allow. The harmonisation proceeded in a cyclic fashion, with errors in the initial alignments guiding the corrections. The greatest care was devoted to checking the structural markup (paragraphs, sentences, lines, items); the density of sub-sentence markup differs more across the languages.
Below we give an estimate of the number of words by language. The word counts were produced by removing the SGML tags from the texts and then applying a 'wc'-like procedure.
Language | Words |
English | 104.302 |
Romanian | 101.460 |
Slovene | 91.619 |
Bulgarian | 87.235 |
Czech | 80.366 |
Hungarian | 81.147 |
Estonian | 79.334 |
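The counting procedure can be sketched as follows. This is a minimal illustration and not the script actually used in the project: the function names, the regular expression, and the sample fragment (including its `id` value) are assumptions made for the example. It treats anything between `<` and the next `>` as a tag, replaces it with a space, and counts whitespace-delimited tokens, roughly as `wc -w` does. SGML entity references are left in place and counted as part of the token they occur in.

```python
import re

# Assumption: a tag is anything between '<' and the next '>'. This ignores
# SGML subtleties such as '>' inside quoted attribute values.
TAG_RE = re.compile(r"<[^>]*>")


def strip_sgml_tags(text: str) -> str:
    """Replace every tag with a space so that neighbouring words are not fused."""
    return TAG_RE.sub(" ", text)


def count_words(text: str) -> int:
    """Count whitespace-delimited tokens in the tag-free text."""
    return len(strip_sgml_tags(text).split())


# Hypothetical fragment in the spirit of the paragraph/sentence markup;
# the id value is invented for illustration.
sample = (
    '<p><s id="s1">It was a bright cold day in April, '
    "and the clocks were striking thirteen.</s></p>"
)

if __name__ == "__main__":
    print(count_words(sample))
```

Because punctuation stays attached to the neighbouring word and tags always act as word boundaries, counts obtained this way are only estimates, which is consistent with how the figures above are presented.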
The following sections first give the details of the encoding of the ``1984'' corpus as a whole, and then describe the MULTEXT-East encoding of the English original and of the Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene translations. For each language we describe the corpus, its structure, the structure of the original from which the CES version was derived, and the markup process that led from one to the other.
Note that the added 'linguistic' markup of the ``1984'' corpus (i.e. alignment and part-of-speech annotation), which is contained in separate SGML documents, hyperlinked to the ``1984'' corpus, is described in the MULTEXT-East Deliverable D23.