This report documents Deliverable D2.1, carried out within the framework of the Copernicus 106 MULTEXT-East (Multilingual Text Tools and Corpora for Eastern and Central European Languages) project. The deliverable consisted of collecting and annotating a multilingual corpus for the six languages of the project, totaling approximately 2 million words. The MULTEXT-East corpus has four components: a parallel corpus, two collections of comparable material, and a small parallel speech corpus.
In the scope of D2.1, the text collections were obtained, in most cases in pre-existing digital form, and were up-translated to at least Level 1 (header and basic structure markup) of the Corpus Encoding Specification. The speech corpus was translated and, for most of the languages, recorded, digitised and encoded in accordance with EUROM recommendations. The parallel corpus has additionally been sentence-segmented, and its structural elements have been marked with unique identifiers.
In addition to basic structure encoding, the six translations of the parallel text corpus have been sentence-aligned to the (English) original. Furthermore, the parallel corpus has been tokenised and annotated for part of speech. For this additional annotation, which is stored in separate SGML documents, refer to MULTEXT-East Deliverable D23.
As modern-day corpora go, the MULTEXT-East textual corpus is of modest size: each of the three sets has approximately 100k words per language. While its small size limits the utility of such a corpus, it nevertheless represents a valuable resource: it is the first such collection for Central and East European languages and, for some of the partners, the first standardised corpus of their language. At least the parallel part is also significantly annotated in a common scheme. Furthermore, the corpora are only one part of the MULTEXT-East deliverables; the lexicons, for example, cover the words in the corpus and so add value to the corpus itself.
In this report we document the acquisition and preparation of the data delivered as the MULTEXT-East corpus. The structure of the report is similar to that of the MLCC (Multilingual Corpora for Cooperation) report. For each corpus component, we begin with a general overview of the data, followed by a detailed description of the (English), Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene corpora.