Next: Structure of the CES Up: Multilingual Parallel Speech Corpus Previous: Multilingual Parallel Speech Corpus

Organisation of the MULTEXT-East EUROM Corpus

The corpus contains two types of files: speech sample files and their associated description files.

All the filenames have the structure: name.suffix

The name is composed of:

1. The first 2 characters are the initials of the speaker.
2. The following 2 characters are the nomenclature (number) of the passage.
3. The following 4 figures are free.

The speech sample file suffix is, for example, .pss (p = passage; s = Slovene; s = speech sample).
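The naming scheme above can be sketched as a small parser. This is only an illustration under stated assumptions: the function name `parse_corpus_filename` and the `CorpusName` type are hypothetical, the initials are assumed to be letters, the passage code is assumed to be alphanumeric, and the suffix is assumed to always have three letters beginning with `p`.

```python
import re
from typing import NamedTuple

# Assumed layout: 2 initials, 2 passage characters, 4 free digits,
# a dot, and a 3-letter suffix (p + language code + file kind).
FILENAME_RE = re.compile(
    r"^(?P<speaker>[A-Za-z]{2})"
    r"(?P<passage>[A-Za-z0-9]{2})"
    r"(?P<free>[0-9]{4})"
    r"\.(?P<suffix>[A-Za-z]{3})$"
)


class CorpusName(NamedTuple):
    """Hypothetical container for the parts of a corpus filename."""
    speaker: str   # initials of the speaker
    passage: str   # nomenclature (number) of the passage
    free: str      # the 4 free figures
    language: str  # one-letter language code, lower-cased
    kind: str      # last suffix letter, lower-cased (e.g. "s" = speech sample)


def parse_corpus_filename(filename: str) -> CorpusName:
    """Split a name.suffix filename into its documented components."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a corpus filename: {filename!r}")
    suffix = match.group("suffix").lower()
    if suffix[0] != "p":
        raise ValueError(f"suffix does not start with 'p': {filename!r}")
    return CorpusName(
        speaker=match.group("speaker"),
        passage=match.group("passage"),
        free=match.group("free"),
        language=suffix[1],
        kind=suffix[2],
    )
```

For instance, `parse_corpus_filename("FAO00079.PES")` splits the name into the speaker initials `FA`, the passage `O0` and the free figures `0079`, with language code `e` and file kind `s`.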

The following one-letter codes were used for the MULTEXT-East languages:

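The description file suffix is .pXo, with p = passage, X = the one-letter language code, and o = orthographic (compare the TYP field in the example below).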

Here is an example of an English associated file (i.e. with suffix .peo):

--------------------------------------------------------------------------
HD: V3.0
TYP: orthographic
DBN: EUROM_1
VOL: 
DIR: 
SRC: FAO00079.PES
TXF: O0.TXT
CMT: Information about the recording session
SAM: 20000
BEG: 0
END: 406271
RED: 18/Jan/91
RET: 14:15:50
REP: UCL
SNB: 2
SBF: 01
SSB: 16
RCC: 2
NCH: 2
SPI: M, 48, BRITISH
PCF: PASSAGE.DES
CMT: Information about the labelling session
EXP: 
SYS: 
DAT: 
SPA: 
CMT: Item: label start, end, input gain, min level, max level, string
LBD: 
LBR: 0, 406271, 6, -3706, 4534, Last week my friend had to go to the doctors to
EXT: have some injections. She  is going to the Far East for a holiday and she
EXT: needs to have an injection against cholera, typhoid fever, hepatitis A,
EXT: polio and tetanus. I think she will feel quite ill after all those. She is
EXT: going to get them all done at once, at one session. I shan't feel sorry
EXT: for her though!  
LB2: 0, 406271, 0, -6143, 9339
ELF: 
-----------------------------------------------------------------------------
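A description file of this shape can be read with a few lines of code. The following is a minimal sketch, not a reference implementation: the function names `parse_description` and `split_lbr` are hypothetical, every content line is assumed to have the form `LABEL: value`, and `EXT` lines are assumed to continue the value of the item immediately before them. The order of the `LBR` fields is taken from the CMT line in the example (label start, end, input gain, min level, max level, string).

```python
from typing import List, Tuple


def parse_description(text: str) -> List[Tuple[str, str]]:
    """Return (label, value) pairs in file order.

    Labels such as CMT can occur more than once, so a list is used
    rather than a dictionary. EXT lines are folded into the previous item.
    """
    items: List[Tuple[str, str]] = []
    for line in text.splitlines():
        # Split at the first colon only, so values like 14:15:50 survive.
        label, sep, value = line.partition(":")
        label = label.strip()
        if not sep or not label.isalnum():
            continue  # blank lines and dashed separator lines
        value = value.strip()
        if label == "EXT" and items:
            prev_label, prev_value = items[-1]
            items[-1] = (prev_label, f"{prev_value} {value}".strip())
        else:
            items.append((label, value))
    return items


def split_lbr(value: str) -> Tuple[int, int, int, int, int, str]:
    """Split an LBR value into its five numeric fields and the text.

    Only the first five commas are separators; the transcription
    itself may contain commas.
    """
    start, end, gain, lo, hi, string = value.split(",", 5)
    return int(start), int(end), int(gain), int(lo), int(hi), string.strip()
```

Limiting the split to five commas matters here because the orthographic string in the example contains commas of its own.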

The following are the minimal fields that must be present in a description file:

TYP, DBN, SRC, TXF, SAM, SNB, SBF, SSB, BEG, END, RED, SPI, PCF,
LBR and the EXT lines that follow it
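The list of minimal fields lends itself to a simple automatic check. The sketch below is an assumption-laden illustration: the function name `missing_fields` is hypothetical, a field counts as present as soon as its label appears at the start of a line (whether or not it has a value), and EXT is treated as an optional continuation of LBR rather than as a required field of its own.

```python
from typing import List

# Minimal fields as listed above; EXT is not checked separately because
# it is assumed to be needed only when the text overflows the LBR line.
REQUIRED_FIELDS = (
    "TYP", "DBN", "SRC", "TXF", "SAM", "SNB", "SBF",
    "SSB", "BEG", "END", "RED", "SPI", "PCF", "LBR",
)


def missing_fields(text: str) -> List[str]:
    """Return the required labels that do not occur in the file text."""
    present = set()
    for line in text.splitlines():
        label, sep, _value = line.partition(":")
        if sep:
            present.add(label.strip())
    return [field for field in REQUIRED_FIELDS if field not in present]
```

An empty result means that all minimal fields were found; otherwise the returned list names the missing labels in the order given above.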


Multext-East