
Sample corpus

 

Given its modest size, MULTEXT-East cannot provide a large-scale, fully annotated and hand-validated multilingual corpus. MULTEXT-East will provide only a small sample corpus, composed of material comparable to that of MULTEXT, whose primary goal is to provide an example and test-bed for:

The sample corpus will be prepared in TEI-conformant SGML format and annotated for basic structural features as well as sub-paragraph segmentation, part of speech (POS), and alignment of parallel texts.

The current plan for the composition of the sample corpus is outlined below. However, data collection is difficult to plan and depends on what material is available. The content of the corpus may therefore need to be redefined during the course of the project.

The sample corpus will be composed of three major parts, each comprising six languages (Bulgarian, Czech, Estonian, Hungarian, Romanian, Slovenian):

  1. Multilingual Comparable Corpus (at least 1,200,000 words):

    For each of the six MULTEXT-East languages, the comparable corpus will include two subsets of at least 100,000 words each, from domains such as financial journalism, general journalism, or a technical sublanguage (to be determined). The data should be comparable across the six languages in terms of the number and size of texts. Selection criteria will be determined and applied to each subset to ensure quality. The entire multilingual comparable corpus will be prepared in Level 1 format, manually or using ad-hoc tools, and then automatically annotated using the project tools. For each language, some part, to be determined after evaluation of the effort involved, will be hand validated for Level 2 and POS.

  2. Multilingual Parallel Corpus (at least 600,000 words):

    For the six MULTEXT-East languages, the parallel corpus will include at least 100,000 words per language. It may prove difficult to find parallel texts for several CEE languages, but texts in CEE languages with parallel English translations do appear to exist. The entire multilingual parallel corpus will be marked and validated for Level 1. For each language, half of the corpus will be marked and validated for alignment and sentence boundaries. The entire multilingual parallel corpus will be prepared in Level 1 format, manually or using ad-hoc tools, and then automatically annotated using the project tools. Some part, to be determined after evaluation of the effort involved, will be hand validated. Alignment will be between English and each of the six MULTEXT-East languages, thus constituting six pair-wise alignments, as in MULTEXT.

  3. Multilingual Speech Corpus

    MULTEXT-East will record a small corpus of spoken texts in each of the six languages, analogous to the one used in MULTEXT, that is, comprising 40 short passages of 5 thematically connected sentences, each spoken by several native speakers, with phonemic and orthographic transcriptions. The number of speakers will be determined after evaluation of the effort involved. MULTEXT-East will enhance this spoken corpus with markup for prosody, segmentation, and POS. The prosody markup will consist of two levels: F0 curve modeling and symbolic coding. This markup will be performed using the tools developed in MULTEXT, and some part, to be determined after evaluation of the effort involved, will be hand validated. The orthographic transcriptions will be marked for Level 2 and POS and will be hand validated. It is not within the scope of MULTEXT-East to carry out a phoneme-level segmentation of the signal. We propose instead to carry out a restricted alignment between signal and transcription, covering word boundaries and the beginnings of accented vowels. This segmental alignment is expected to be sufficient for at least a first approximation of prosody markup. The restricted alignment will be carried out for one speaker per language.
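
The minimum sizes quoted for the three parts above follow from simple arithmetic over the six languages. The Python sketch below is illustrative only and is not part of the project tools. It tallies those minimums, assuming that the stated per-language figures apply uniformly to all six languages. All function and constant names are hypothetical.

```python
from typing import List, Tuple

# The six MULTEXT-East languages named in the plan.
LANGUAGES = ["Bulgarian", "Czech", "Estonian", "Hungarian", "Romanian", "Slovenian"]

# Figures taken from the plan; the names are hypothetical.
COMPARABLE_SUBSETS_PER_LANGUAGE = 2
COMPARABLE_WORDS_PER_SUBSET = 100_000
PARALLEL_WORDS_PER_LANGUAGE = 100_000
SPEECH_PASSAGES = 40
SENTENCES_PER_PASSAGE = 5


def comparable_minimum_words() -> int:
    """Minimum size of the comparable corpus across all languages."""
    return (len(LANGUAGES)
            * COMPARABLE_SUBSETS_PER_LANGUAGE
            * COMPARABLE_WORDS_PER_SUBSET)


def parallel_minimum_words() -> int:
    """Minimum size of the parallel corpus across all languages."""
    return len(LANGUAGES) * PARALLEL_WORDS_PER_LANGUAGE


def alignment_pairs() -> List[Tuple[str, str]]:
    """Pair-wise alignments: English against each project language."""
    return [("English", language) for language in LANGUAGES]


def speech_sentences_per_speaker() -> int:
    """Sentences each speaker reads in one language."""
    return SPEECH_PASSAGES * SENTENCES_PER_PASSAGE


if __name__ == "__main__":
    print("Comparable corpus minimum:", comparable_minimum_words())
    print("Parallel corpus minimum:", parallel_minimum_words())
    print("Alignment pairs:", len(alignment_pairs()))
    print("Speech sentences per speaker:", speech_sentences_per_speaker())
```

The tally reproduces the totals given in the list headings (1,200,000 and 600,000 words) and the six English-based alignment pairs. The number of speakers is left out because the plan has not yet fixed it.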






Tomaz Erjavec
Mon May 20 13:01:13 MDT 1996