TELRI & MULTEXT-East MTE D2.1 F; TELRI Annex I. ``1984'' Corpus
Workpackage Coordinator: Tomaz Erjavec
Contributors:
Latvian: Andrejs Spektors
Lithuanian: Andrius Utka
Serbo-Croat: Cvetana Krstev, Dusko Vitas
This report is an appendix to the MTE D2.1 F report, which documents Deliverable D2.1, carried out within the framework of the Copernicus 106 MULTEXT-East (Multilingual Text Tools and Corpora for Eastern and Central European Languages) Project. This deliverable consisted of collecting and annotating a multilingual corpus totaling approximately 2 million words. The central component of this corpus is the parallel corpus, composed of Orwell's novel ``1984'' in the English original, along with translations into Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene. In addition to basic structural encoding, this corpus was sentence-aligned between the translations and English, and each language component was tagged for part of speech.
Thanks to the Copernicus concerted action TELRI (Trans-European Language Resources Infrastructure), it was possible to add three new translations to the existing seven-language ``1984'' corpus. This Appendix documents these new ``1984'' language components, namely the Latvian, Lithuanian, and Serbo-Croat translations.
To make this Appendix as self-contained as possible, the following sections give the details of the encoding of the TELRI ``1984'' additions as a whole, highlighting the points common to both the MULTEXT-East and the TELRI ``1984'' corpus. These sections are very similar to the Overview section of the ``1984'' corpus collection report, i.e. MULTEXT-East D2.1 F, Section 2.1.
These overview sections are followed by sections detailing the encoding of the TELRI translations. For each language we describe the corpus, its structure, the structure of the original from which the CES (Corpus Encoding Standard) version was derived, and the markup process that led from one to the other.
Note that the added alignment markup of the TELRI ``1984'' corpus, which is contained in separate SGML documents hyperlinked to the ``1984'' corpus, is structurally identical to the alignments of the MULTEXT-East ``1984'' corpus. The reader is referred to the MULTEXT-East Deliverable D2.3 F for a description of these alignments.