
Background

This report documents the Intermediate M Deliverable D2.1, carried out within the framework of the Copernicus 106 MULTEXT-East (Multilingual Text Tools and Corpora for Eastern and Central European Languages) Project. MULTEXT-East task 2.1 consists of collecting a small (ca. 2M words) multilingual corpus for the six languages of the project. In the first year of the project the text collections were obtained, in most cases in pre-existing digital form, and were up-translated to Level 1 (header and basic structure markup) of the TEI-compliant Corpus Encoding Standard.

The final goal of the MULTEXT-East project is to produce a multilingual corpus with two main components, a parallel corpus of translations and two multilingual collections of comparable material, as well as a small speech corpus. The parallel corpus is to be sentence-aligned, and parts of the corpus are to be marked up with morphosyntactic tags. The data is intended to be made available at cost with minimal restrictions.

As modern-day corpora go, the MULTEXT-East corpus is of modest size: each of the three sets for the six languages has approximately 100k words. While its small size limits the utility of such a corpus, it does represent a first step in developing a multilingual corpus for Central and East European languages and, for some of the partners, a first standardised corpus of their languages. Furthermore, the corpora are only a part of the MULTEXT-East deliverables; the lexicons covering the words in the corpus, for example, give added value to the corpus itself.

In the first year of the project the corpus-related effort of the partners has gone mainly into finalising the exact contents of the corpus, obtaining the text collections, installing the necessary SGML corpus processing software, and up-translating the data to comply with the current version of the Corpus Encoding Standard (CES).

However, not all the component corpora have been obtained yet, and a few are not yet CES-1 conformant. Further harmonisation is also required for the present corpus, refinements may be made in the definition of CES, and the MULTEXT tools are to be used to add further markup to the corpus. Therefore, the final version of the MULTEXT-East corpus will not be available until project completion.

In this report we document the acquisition and preparation of the data delivered by milestone M in fulfilment of the MULTEXT-East contract. The structure of the report is similar to that of the final MLCC (Multilingual Corpora for Cooperation) report. We begin with a general overview of the data, followed by a detailed description of the Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene corpora.






Tomaz Erjavec
Sat May 18 20:25:31 MDT 1996