
Task 2.1. Sample corpus collection and preparation

This task consists of collecting a small corpus for the project. For written texts, pre-existing digital materials will be used whenever possible. For the small speech corpus, the Laboratoire Parole et Langage in Aix-en-Provence can make recordings for partners that lack appropriate equipment and recording facilities. Native speakers of all MULTEXT-East languages are available at the Université de Provence. The overall target is a speech corpus comparable to MULTEXT's, but on a reduced scale. Availability of data may affect the content of the corpus, so its definition may change over the course of the project and may not be finalized until the project's completion. This task will specify exactly which parts of the corpus, and how much of it, are hand-validated.

This task also covers the cleanup, markup and validation of the multilingual comparable corpus and the multilingual parallel corpus up to level 1 (which includes level 0 markup). Corpus preparation should be completed in the first period of the project. However, because the definitions of level 0 and level 1 markup may be refined later in the project, the final version of the corpus may not be available until project completion.

Deliverables:

2.1. Sample corpus marked and validated for level 1 (DATA+REPORT, public)



Tomaz Erjavec
Mon May 20 13:01:13 MDT 1996