For the MULTEXT-East speech corpus, a small sample of the English part of the EUROM1 multilingual speech database was selected. This text was translated into the six MULTEXT-East languages and, except for Bulgarian and Czech, recorded by one male speaker and digitised in accordance with the EUROM recommendations. In addition to the EUROM format, the seven texts were also encoded in CES, as cesDoc elements.
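The texts chosen comprise 40 passages, each consisting of 5 thematically linked sentences. The passages are organised into 4 blocks of 10; each passage is designated by a block letter (O, P, Q or R) followed by a digit (0-9). As an example we give below the English blocks O0 and O1: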
BLOCK: O0
Last week my friend had to go to the doctors to have some injections. She is going to the Far East for a holiday and she needs to have an injection against cholera, typhoid fever, hepatitis A, polio and tetanus. I think she will feel quite ill after all those. She is going to get them all done at once, at one session. I shan't feel sorry for her though!

BLOCK: O1
I have a problem with my water softener. The water-level is too high and the overflow keeps dripping. Could you arrange to send an engineer on Tuesday morning please? It's the only day I can manage this week. I'd be grateful if you could confirm the arrangement in writing.
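The designation scheme described above can be illustrated with a minimal Python sketch. It is not part of the corpus distribution: the function names are hypothetical, and the "BLOCK: <id>" marker format is assumed from the example passages rather than taken from the EUROM1 specification. The sentence counter is deliberately naive and serves only as a rough check of the five-sentence structure.

```python
import re
from itertools import product

# Block letters and passage digits as described in the text: 4 x 10 = 40 passages.
BLOCK_LETTERS = "OPQR"
PASSAGE_DIGITS = "0123456789"

# Assumed marker format: "BLOCK: <id>" immediately followed by the passage text.
BLOCK_MARKER = re.compile(r"BLOCK:\s*([OPQR][0-9])\s*")


def passage_ids():
    """Return all 40 passage identifiers, from O0 through R9."""
    return [letter + digit for letter, digit in product(BLOCK_LETTERS, PASSAGE_DIGITS)]


def split_passages(text):
    """Map each passage identifier to its text, splitting on BLOCK markers."""
    parts = BLOCK_MARKER.split(text)
    # With one capturing group, parts is [preamble, id1, text1, id2, text2, ...].
    return {pid: body.strip() for pid, body in zip(parts[1::2], parts[2::2])}


def count_sentences(passage):
    """Naive count: split after '.', '!' or '?' when followed by whitespace."""
    return len([s for s in re.split(r"(?<=[.!?])\s+", passage.strip()) if s])
```

Applied to the two English passages above, split_passages should yield the keys O0 and O1, and count_sentences can be used to flag any passage that does not contain the expected 5 sentences.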
The translations into the MULTEXT-East languages were 'localised', i.e. the situations described in the monologues were translated as if the speaker were describing a situation in his native land. For example, local place names were used instead of the British ones.
The texts were spoken by one native male speaker. The recording followed the EUROM guidelines as closely as possible; these are as follows:
The digitisation features of the recordings were the following: