
Structure of the CES Speech Corpus

In addition to the EUROM format, the textual part of the MULTEXT-East speech corpus was also encoded in CES. For each of the seven languages, the passages were encoded as a cesDoc element consisting of a header, which describes the corpus component, and a text, which contains the passages. Each passage was encoded as a division of type block, containing a head and one paragraph. The text itself was segmented into sentences, and each sentence was assigned an identifier. The following English BLOCK O0 illustrates the resulting structure:

<div id="Sen.1" type="block" n="O0">
<head>*BLOCK: O0</head>
<p id="Sen.1.2">
<s id="Sen.1.2.1">Last week my friend had to go to the doctors to have
some injections.</s>
<s id="Sen.1.2.2">She is going to the Far East for a holiday and she 
needs to have an injection against cholera, typhoid fever, hepatitis
A, polio and tetanus.</s>
<s id="Sen.1.2.3">I think she will feel quite ill after all those.</s>
<s id="Sen.1.2.4">She is going to get them all done at once, at one 
session.</s>
<s id="Sen.1.2.5">I shan't feel sorry for her though!</s>
</p>
</div>

The complete CES-encoded MULTEXT-East speech corpus has the following tag counts:

text = 7, body = 7
div = 280, head = 280, p = 280
s = 1377
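
Counts of this kind can be reproduced with a few lines of Python. The following is only a minimal sketch: it assumes the corpus component is available as well-formed XML (original CES files are SGML-based and may need conversion first), and both the function name `count_tags` and the miniature sample document are hypothetical, not part of the corpus distribution.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature document in the shape described above.
# The header is omitted and only two of the five sentences are kept.
SAMPLE = """<text>
<body>
<div id="Sen.1" type="block" n="O0">
<head>*BLOCK: O0</head>
<p id="Sen.1.2">
<s id="Sen.1.2.1">Last week my friend had to go to the doctors to have some injections.</s>
<s id="Sen.1.2.5">I shan't feel sorry for her though!</s>
</p>
</div>
</body>
</text>"""


def count_tags(xml_string):
    """Return a Counter mapping each element name to its number of occurrences."""
    root = ET.fromstring(xml_string)
    # root.iter() walks the root element and all of its descendants.
    return Counter(element.tag for element in root.iter())


tag_counts = count_tags(SAMPLE)
for name in ("text", "body", "div", "head", "p", "s"):
    print(name, "=", tag_counts[name])
```

Run against a full corpus component instead of the sample, the same function would report how many div, head, p and s elements that component contains.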

The sentence count is worth noting: EUROM states that the passages consist of "5 thematically linked sentences". The total number of sentences should therefore be 1400 (280 passages x 5 sentences), but it is actually 1377, which is 23 fewer. This holds even for the English original, which has 196 sentences instead of 200.
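
The arithmetic behind this comparison can be checked with a short Python sketch. It uses only the figures reported in this section and assumes, as the English figure of 200 suggests, that the 280 passages are spread evenly over the seven languages; the variable names are illustrative.

```python
# Figures taken from the tag counts and the text of this section.
LANGUAGES = 7
PASSAGES_TOTAL = 280          # number of div elements
SENTENCES_PER_PASSAGE = 5     # as stated by EUROM
SENTENCES_ACTUAL = 1377       # number of s elements
ENGLISH_ACTUAL = 196          # sentences in the English original

# Assumption: every language has the same number of passages.
passages_per_language = PASSAGES_TOTAL // LANGUAGES

expected_total = PASSAGES_TOTAL * SENTENCES_PER_PASSAGE
expected_per_language = passages_per_language * SENTENCES_PER_PASSAGE

shortfall_total = expected_total - SENTENCES_ACTUAL
shortfall_english = expected_per_language - ENGLISH_ACTUAL

print("passages per language:", passages_per_language)
print("expected sentences in total:", expected_total)
print("expected sentences per language:", expected_per_language)
print("missing sentences in total:", shortfall_total)
print("missing sentences in English:", shortfall_english)
```

Under these assumptions the English original accounts for 4 of the 23 missing sentences, so the remaining 19 are distributed over the other six languages.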
