MULTEXT-East "1984" Corpus

The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel corpus contains the English original of the novel and its translations into Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene.

The corpus is marked up in SGML in accordance with the TEI-like Corpus Encoding Specification (CES). The annotation includes header information and structural markup harmonised down to paragraph level. Some sub-paragraph markup is also included, most importantly validated <s> (sentence) markup. Structural elements of the corpus are marked with IDs. The corpus is described in detail in the Corpus Collection Report: Orwell's "1984".

The six translations of "1984" are sentence-aligned with the English original, and the alignments have been validated. The alignment documents do not contain primary data, but are encoded in accordance with the CES conventions for parallel text alignment, i.e. they contain hyperlinks to <s> elements of the translations and the original. The alignment of the MULTEXT-East "1984" corpus is further explained in the Corpus Markup Report: Sentence Alignment.
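To make the idea of an alignment document that holds only pointers more concrete, here is a minimal Python sketch that reads such links into pairs of sentence-ID lists. The `<link xtargets="...">` syntax, the `;` separator between the two sides, and the ID shapes (`Osl.1.2.2.1`, `Oen.1.1.1.1`) are assumptions made for illustration; the Corpus Markup Report: Sentence Alignment is the authority on the actual encoding. The sample links are invented, not taken from the corpus.

```python
import re
from typing import List, Tuple

# Assumed link syntax: <link xtargets="IDS-OF-ONE-TEXT ; IDS-OF-THE-OTHER">
# (hypothetical; check the alignment report for the real element and attribute names).
LINK_RE = re.compile(r'<link\s+xtargets\s*=\s*"([^"]*)"', re.IGNORECASE)

Alignment = Tuple[List[str], List[str]]


def parse_targets(value: str) -> Alignment:
    """Split one attribute value into the <s> IDs of each side."""
    left, sep, right = value.partition(";")
    if not sep:
        raise ValueError(f"no ';' separator in link targets: {value!r}")
    # An empty side yields an empty list, i.e. a sentence with no counterpart.
    return left.split(), right.split()


def read_links(text: str) -> List[Alignment]:
    """Collect all alignment links found in a piece of alignment markup."""
    return [parse_targets(value) for value in LINK_RE.findall(text)]


# Invented sample: a 1-1 link, a 2-1 link, and a 0-1 link.
SAMPLE = """
<link xtargets="Osl.1.2.2.1 ; Oen.1.1.1.1">
<link xtargets="Osl.1.2.3.1 Osl.1.2.3.2 ; Oen.1.1.2.1">
<link xtargets=" ; Oen.1.1.2.2">
"""

for translation_ids, original_ids in read_links(SAMPLE):
    print(f"{len(translation_ids)}-{len(original_ids)}", translation_ids, original_ids)
```

Because the links carry only IDs, the sentence text itself has to be looked up in the separately stored primary documents.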

Thanks to the Copernicus concerted action TELRI, it was possible to add three new translations (Lithuanian, Latvian, Serbo-Croatian) to the MULTEXT-East seven-language "1984" corpus. These translations have been structurally marked up and aligned in the same way as the MULTEXT-East ones, and are documented in the Appendix of the Corpus Collection Report. As a volunteer effort, an eleventh "1984", the Russian translation, was added to the corpus.

Finally, each of the seven MULTEXT-East "1984" texts is tagged for part-of-speech. The annotated corpus is stored in separate documents, in accordance with the CES conventions for segmentation and grammatical annotation. The documents contain tokenised primary data, with linguistic annotation given for the (word) tokens. This annotation consists of the token's lemma, its part-of-speech tag and/or its morphosyntactic description. The morphosyntactic descriptions are further explained in the Specifications and Notation for Lexicon Encoding Report. The tagged corpus is described in more detail in the Corpus Markup Report: Morphosyntactic Tagging.
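The per-token annotation described above can be pictured as a small record. The Python sketch below is only an illustration of that data shape: the class and field names (`Token`, `form`, `lemma`, `ctag`, `msd`) are hypothetical and are not the element names used in the annotated documents, and the example values are invented rather than taken from the corpus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """One annotated word token (hypothetical field names)."""

    form: str                   # the token as it appears in the text
    lemma: str                  # its lemma
    ctag: Optional[str] = None  # part-of-speech tag, if given
    msd: Optional[str] = None   # morphosyntactic description, if given

    def __post_init__(self) -> None:
        # The text says "tag and/or MSD", so at least one of them must be present.
        if self.ctag is None and self.msd is None:
            raise ValueError("a token needs a part-of-speech tag, an MSD, or both")

    @property
    def best_tag(self) -> str:
        """Prefer the MSD when present, otherwise fall back to the tag."""
        return self.msd if self.msd is not None else self.ctag


# Invented example values, for illustration only.
token = Token(form="clocks", lemma="clock", msd="Ncnp")
print(token.form, token.lemma, token.best_tag)
```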

Selected parts of the "1984" corpus have been converted to HTML 3.2. Several tools were used for this conversion (rendering): FRED for the CES2HTML conversion, NSL for knitting the alignment documents, and LaTeX2HTML for converting the project reports.

The table below gives, for each language: the documentation on the "1984" corpus, i.e. the language section from the D2.1 Report; the CES Header of the "1984"; its first chapter as a sample of the CES Document primary data; the Alignment of the first chapter with English; and the Header of the annotated "1984".

The document and alignment samples are provided in ISO-8859-2 (ISO Latin-2) for Bg, Cs, Hu, Ro, Sh, Sl, and, sloppily, for Et, Lv, Lt; and in ISO-8859-5 (ISO Cyrillic) for Bg and Ru.
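When reading the samples programmatically, the encoding has to be chosen by language. The Python sketch below simply records the assignment given in the paragraph above and decodes raw bytes accordingly. The function and table names are hypothetical; `iso8859_2` and `iso8859_5` are the standard Python codec names for the two character sets. Bg is listed under both encodings in the text, so the sketch asks the caller to choose explicitly rather than guess.

```python
from typing import Dict, List, Optional

# Encodings per language code, as listed in the text above.
SAMPLE_ENCODINGS: Dict[str, List[str]] = {
    "Bg": ["iso8859_2", "iso8859_5"],  # appears under both in the text
    "Cs": ["iso8859_2"],
    "Hu": ["iso8859_2"],
    "Ro": ["iso8859_2"],
    "Sh": ["iso8859_2"],
    "Sl": ["iso8859_2"],
    "Et": ["iso8859_2"],  # approximate fit, per the text
    "Lv": ["iso8859_2"],  # approximate fit, per the text
    "Lt": ["iso8859_2"],  # approximate fit, per the text
    "Ru": ["iso8859_5"],
}


def decode_sample(raw: bytes, lang: str, encoding: Optional[str] = None) -> str:
    """Decode a sample's bytes using the encoding listed for its language."""
    candidates = SAMPLE_ENCODINGS[lang]
    if encoding is None:
        if len(candidates) != 1:
            # Both codecs accept almost any byte sequence, so trial decoding
            # cannot tell them apart; the caller has to decide.
            raise ValueError(f"{lang} has several encodings, pick one of {candidates}")
        encoding = candidates[0]
    elif encoding not in candidates:
        raise ValueError(f"{encoding} is not listed for {lang}")
    return raw.decode(encoding)


# Round-trip checks with short invented strings.
print(decode_sample("Český".encode("iso8859_2"), "Cs"))
print(decode_sample("Оруэлл".encode("iso8859_5"), "Ru"))
```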

English	Report	Header	Document		Annotation Header
Bulgarian	Report	Header	Document	Alignment	Annotation Header
Czech	Report	Header	Document	Alignment	Annotation Header
Estonian	Report	Header	Document	Alignment	Annotation Header
Hungarian	Report	Header	Document	Alignment	Annotation Header
Romanian	Report	Header	Document	Alignment	Annotation Header
Slovene	Report	Header	Document	Alignment	Annotation Header
Latvian	Report	Header	Document	Alignment
Lithuanian	Report	Header	Document	Alignment
Serbo-Croatian	Report	Header	Document	Alignment
Russian		Header	Document

The "1984" corpus can be found in the /corp/1984/ directory of the CD.
WWW access is restricted for copyright reasons.


Last updated 21-December-1997 by et