In addition to the EUROM format, the textual part of the MULTEXT-East speech corpus was also encoded in CES. For each of the seven languages, the passages were encoded as a cesDoc element, consisting of a header that describes the corpus component and a text that contains the passages. Each passage was encoded as a division of type block, containing a head and one paragraph. The text itself was segmented into sentences, and each sentence was assigned an identifier. The following English block O0 illustrates the resulting structure:
<div id="Sen.1" type="block" n="O0">
  <head>*BLOCK: O0</head>
  <p id="Sen.1.2">
    <s id="Sen.1.2.1">Last week my friend had to go to the doctors to have some injections.</s>
    <s id="Sen.1.2.2">She is going to the Far East for a holiday and she needs to have an injection against cholera, typhoid fever, hepatitis A, polio and tetanus.</s>
    <s id="Sen.1.2.3">I think she will feel quite ill after all those.</s>
    <s id="Sen.1.2.4">She is going to get them all done at once, at one session.</s>
    <s id="Sen.1.2.5">I shan't feel sorry for her though!</s>
  </p>
</div>
The complete CES-encoded MULTEXT-East speech corpus has the following tag usage:
It is interesting to note the number of sentences: EUROM states that the passages consist of "5 thematically linked sentences". With seven languages and 200 sentences expected per language, the total number of sentences should thus be 1400 (7 x 200), but it is actually lower. This holds even for the English original, which has 196 sentences instead of 200.
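The tag usage and the per-passage sentence counts discussed above can be checked mechanically. The following is a minimal Python sketch, not the procedure actually used for the corpus: it parses the English block O0 shown earlier with the standard library, counts element names, and reports blocks whose number of s elements differs from the expected five. The function names tag_usage, sentences_per_block and irregular_blocks are hypothetical, and the sketch assumes the input is a well-formed XML fragment; a complete cesDoc file with a DTD or entity references may need extra handling.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The English block O0 from the text, used here as sample input.
SAMPLE = """<div id="Sen.1" type="block" n="O0">
  <head>*BLOCK: O0</head>
  <p id="Sen.1.2">
    <s id="Sen.1.2.1">Last week my friend had to go to the doctors to have some injections.</s>
    <s id="Sen.1.2.2">She is going to the Far East for a holiday and she needs to have an injection against cholera, typhoid fever, hepatitis A, polio and tetanus.</s>
    <s id="Sen.1.2.3">I think she will feel quite ill after all those.</s>
    <s id="Sen.1.2.4">She is going to get them all done at once, at one session.</s>
    <s id="Sen.1.2.5">I shan't feel sorry for her though!</s>
  </p>
</div>"""


def tag_usage(root):
    """Count how often each element name occurs, including the root."""
    return Counter(el.tag for el in root.iter())


def sentences_per_block(root):
    """Map each block's n attribute to the number of <s> elements it holds."""
    return {
        div.get("n"): len(list(div.iter("s")))
        for div in root.iter("div")  # iter() also yields root if it is a div
        if div.get("type") == "block"
    }


def irregular_blocks(root, expected=5):
    """Return only the blocks whose sentence count differs from `expected`."""
    return {n: c for n, c in sentences_per_block(root).items() if c != expected}


root = ET.fromstring(SAMPLE)
print(dict(tag_usage(root)))
print(sentences_per_block(root))   # {'O0': 5}
print(irregular_blocks(root))      # {} (this block has exactly five sentences)
```

Run over all blocks of one language, irregular_blocks would list the passages responsible for a shortfall such as the 196 rather than 200 English sentences.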