COP project 106 MULTEXT-East Deliverable D2.3 F: Introduction
This report documents Deliverable D2.3, carried out within the framework of the Copernicus 106 MULTEXT-East (Multilingual Text Tools and Corpora for Eastern and Central European Languages) Project. MULTEXT-East task 2.3 consists of adding linguistic annotations to the parallel part of the MULTEXT-East multilingual corpus collection. The parallel corpus consists of the novel "1984" by G. Orwell in the seven languages of the project.
The markup of the complete MULTEXT-East corpus to CES Level 1 (header and basic structure markup, and sentence segmentation) is described in the D2.1 report "Sample Corpus Collection and Preparation". The Level 1 markup includes a TEI-conformant header (file, encoding, profile and revision descriptions) and universal text elements down to the level of the paragraph, e.g. textual divisions, paragraphs, titles and headings, footnotes, tables and poems. Some CES Level 2 markup has also been included, e.g. quoted material, rendition information and, to varying degrees, abbreviations, dates, names and numbers. Finally, the parallel "1984" corpus has been sentence-segmented, the segmentation manually validated, and the structural elements of the corpus marked with unique identifiers.
The present report details the additional markup performed on the "1984" MULTEXT-East corpus, which consists of:
This additional markup was not encoded in the primary data but, in line with the MULTEXT philosophy, stored in separate documents that are hyperlinked to the primary "1984" data. These documents are also encoded in SGML and conform to the Corpus Encoding Specification (CES) DTD. The CES DTD, along with its documentation, can be obtained from http://www.cs.vassar.edu/CES/ . The rest of this report explains the structure of the above two annotations to the parallel corpus.
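The general idea of keeping annotations in a separate document that points into the primary data can be illustrated with a minimal sketch. It is not the CES DTD: the element names (`text`, `s`, `annotations`, `note`), the attribute `target`, the identifiers and the sample sentences are all hypothetical, and XML parsed with Python's standard library stands in for SGML.

```python
import xml.etree.ElementTree as ET

# Hypothetical primary document: structural elements carry unique identifiers.
PRIMARY = """
<text>
  <p id="p1">
    <s id="s1">First sentence.</s>
    <s id="s2">Second sentence.</s>
  </p>
</text>
"""

# Hypothetical stand-off document: each annotation points at an identifier
# in the primary document instead of being embedded in it.
STANDOFF = """
<annotations>
  <note target="s2" type="comment"/>
</annotations>
"""


def resolve(primary_xml, standoff_xml):
    """Pair each stand-off annotation with the text of the element it targets."""
    primary = ET.fromstring(primary_xml)
    # Index every primary element that has an id attribute.
    by_id = {el.get("id"): el for el in primary.iter() if el.get("id")}
    resolved = []
    for note in ET.fromstring(standoff_xml).iter("note"):
        target = by_id[note.get("target")]
        resolved.append((note.get("target"), note.get("type"), target.text))
    return resolved


if __name__ == "__main__":
    for target_id, kind, text in resolve(PRIMARY, STANDOFF):
        print(target_id, kind, text)
```

Because the annotation refers to the primary text only through its identifier, the primary document stays untouched and several independent annotation documents can point into the same data.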