
Corpus Encoding

The text corpus, along with the textual part of the speech corpus, has been encoded in SGML according to the Corpus Encoding Specification (CES) DTD. The CES DTD and its documentation can be obtained from http://www.cs.vassar.edu/CES/ . The complete MULTEXT-East CES corpus is encoded as a cesCorpus element, which comprises a header and the 26 component parts of the corpus (2 x 7 + 2 x 6), each encoded as a cesDoc element.

Each cesDoc component of the corpus is stored in a separate file. All system identifiers (i.e. filenames) are encapsulated in the MULTEXT-East catalog, which is structured according to the SGML Open Technical Resolution 9401:1997. In the catalog, the corpus as a whole and its components are given the following PUBLIC identifiers:

     -//MTE//DOCUMENT CES1//     
                             
     -//MTE//TEXT CES1 1984//EN         -//MTE//TEXT CES1 Speech//EN
     -//MTE//TEXT CES1 1984//BG         -//MTE//TEXT CES1 Speech//BG
     -//MTE//TEXT CES1 1984//CS         -//MTE//TEXT CES1 Speech//CS
     -//MTE//TEXT CES1 1984//ET         -//MTE//TEXT CES1 Speech//ET
     -//MTE//TEXT CES1 1984//HU         -//MTE//TEXT CES1 Speech//HU
     -//MTE//TEXT CES1 1984//RO         -//MTE//TEXT CES1 Speech//RO
     -//MTE//TEXT CES1 1984//SL         -//MTE//TEXT CES1 Speech//SL
                             
     -//MTE//TEXT CES1 Fiction//BG      -//MTE//TEXT CES1 News//BG  
     -//MTE//TEXT CES1 Fiction//CS      -//MTE//TEXT CES1 News//CS  
     -//MTE//TEXT CES1 Fiction//ET      -//MTE//TEXT CES1 News//ET  
     -//MTE//TEXT CES1 Fiction//HU      -//MTE//TEXT CES1 News//HU  
     -//MTE//TEXT CES1 Fiction//RO      -//MTE//TEXT CES1 News//RO  
     -//MTE//TEXT CES1 Fiction//SL      -//MTE//TEXT CES1 News//SL
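
As a rough illustration of how the identifiers above relate to the catalog, the following Python sketch rebuilds the 26 component identifiers and resolves them against a catalog fragment. It assumes catalog entries of the form PUBLIC "public id" "system id" with both parts quoted; the file names in SAMPLE_CATALOG and all function names are invented for the example and are not taken from the MULTEXT-East distribution.

```python
import re

# Language codes copied from the identifier listing above.
SEVEN_LANGS = ["EN", "BG", "CS", "ET", "HU", "RO", "SL"]  # 1984 and Speech
SIX_LANGS = ["BG", "CS", "ET", "HU", "RO", "SL"]          # Fiction and News


def component_public_ids():
    """Return the PUBLIC identifiers of all cesDoc components."""
    ids = []
    for part in ("1984", "Speech"):
        ids += [f"-//MTE//TEXT CES1 {part}//{lang}" for lang in SEVEN_LANGS]
    for part in ("Fiction", "News"):
        ids += [f"-//MTE//TEXT CES1 {part}//{lang}" for lang in SIX_LANGS]
    return ids


# Assumed entry syntax: PUBLIC "public id" "system id" (both quoted).
CATALOG_ENTRY = re.compile(r'PUBLIC\s+"([^"]+)"\s+"([^"]+)"')


def parse_catalog(text):
    """Map each PUBLIC identifier in a catalog to its system identifier."""
    return dict(CATALOG_ENTRY.findall(text))


# Hypothetical catalog fragment; these file names are illustrative only.
SAMPLE_CATALOG = '''
PUBLIC "-//MTE//TEXT CES1 1984//CS" "example/1984-cs.ces"
PUBLIC "-//MTE//TEXT CES1 News//SL" "example/news-sl.ces"
'''

catalog = parse_catalog(SAMPLE_CATALOG)
missing = [pid for pid in component_public_ids() if pid not in catalog]
```

Here component_public_ids() yields 2 x 7 + 2 x 6 = 26 identifiers, and missing lists the components that the sample catalog does not resolve.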

Each cesDoc element is also marked with the lang attribute, which gives the language of the corpus component (but note that all cesHeader elements are marked as English). For language IDs, the two-letter ISO 639 values have been used. The definitions for the CES language elements are encapsulated in the file with the PUBLIC identifier:

    ISO 639-1988//ENTITIES Languages//EN

These entities encompass the European languages; we give here the definitions for the MULTEXT-East languages and languages referred to in the MULTEXT-East corpus:

     <language id=en iso639=en>English</language>
     <language id=bg iso639=bg>Bulgarian</language>
     <language id=cs iso639=cs>Czech</language>
     <language id=et iso639=et>Estonian</language>
     <language id=hu iso639=hu>Hungarian</language>
     <language id=ro iso639=ro>Romanian</language>
     <language id=sl iso639=sl>Slovene</language>

     <language id=de iso639=de>German</language>
     <language id=fr iso639=fr>French</language>
     <language id=ge iso639=ge>German</language>
     <language id=it iso639=it>Italian</language>
     <language id=la iso639=la>Latin</language>
     <language id=lv iso639=lv>Latvian</language>
     <language id=ru iso639=ru>Russian</language>
     <language id=sp iso639=es>Spanish</language>
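
The declarations above are regular enough to be read with a small pattern. The following Python sketch is one possible way to do so; it assumes the unquoted attribute syntax shown in the listing, and the names parse_languages and SAMPLE are hypothetical.

```python
import re

# Unquoted attribute values, as in the declarations listed above.
LANGUAGE_DECL = re.compile(
    r"<language\s+id=(\w+)\s+iso639=(\w+)>([^<]+)</language>"
)


def parse_languages(sgml):
    """Return {id: (iso639 code, language name)} for each declaration."""
    return {
        lang_id: (iso, name.strip())
        for lang_id, iso, name in LANGUAGE_DECL.findall(sgml)
    }


# Two declarations copied from the listing above.
SAMPLE = """
<language id=sl iso639=sl>Slovene</language>
<language id=sp iso639=es>Spanish</language>
"""

languages = parse_languages(SAMPLE)
# Declarations whose local id differs from the ISO 639 code.
mismatched = [i for i, (iso, _) in languages.items() if i != iso]
```

A check like mismatched can be useful because the id and iso639 values are not always identical, as the Spanish entry shows.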

For language-specific character representation, the documents use SGML entities from the following entity sets:

     ISO 8879-1986//ENTITIES Added Latin 1//EN
     ISO 8879-1986//ENTITIES Added Latin 2//EN 
     ISO 8879-1986//ENTITIES Russian Cyrillic//EN
     ISO 8879-1986//ENTITIES Non Russian Cyrillic//EN

The last two are used for Bulgarian; the first two are used by all the other MULTEXT-East languages.
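
For quick inspection of entity-encoded text outside an SGML system, the entity references can be approximated in Python. This sketch relies on the fact that many ISO 8879 entity names (for example ccaron, scaron, bcy) are also HTML5 named character references, so the standard html.unescape function decodes them. It is only an approximation: full coverage of the four entity sets is not guaranteed, and a proper SGML parser with the entity files is the reliable route. The function name decode_entities is hypothetical.

```python
import html


def decode_entities(text):
    """Replace entity references that Python's HTML5 table knows about."""
    return html.unescape(text)


# Latin 2 style entities, as used for Czech.
czech = decode_entities("&Ccaron;e&scaron;tina")

# Cyrillic entities, as used for Bulgarian.
cyrillic = decode_entities("&bcy;&acy;")
```

Entity references that are not in the HTML5 table are left unchanged, which makes gaps easy to spot in the output.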

For each language, the reference to its language-specific character set entities has been encapsulated in a file with one of the following PUBLIC identifiers:

     -//MTE//ENTITIES Bulgarian//EN
     -//MTE//ENTITIES Czech//EN
     -//MTE//ENTITIES Estonian//EN
     -//MTE//ENTITIES Hungarian//EN
     -//MTE//ENTITIES Romanian//EN
     -//MTE//ENTITIES Slovene//EN

Note that, in the current version, only the complete -//MTE//DOCUMENT CES1// corpus constitutes a valid SGML document. To enable partial processing of the corpus components, a very simple mechanism has been adopted, which is likely to change in the future: each component text has an SGML prolog, which is commented out. As an example, here is the beginning of -//MTE//TEXT CES1 1984//CS:

     <!--DOCTYPE cesDoc PUBLIC "-//CES//DTD cesDoc//EN" [
       <!ENTITY % ONECOMPONENT "INCLUDE">
       <!ENTITY ISOlang PUBLIC 
                "ISO 639-1988//ENTITIES Languages//EN">
       <!ENTITY % MTEcs PUBLIC 
                "-//MTE//ENTITIES Czech//EN">
       %MTEcs;
     ]-->

To process a single text, the comment markers (--) should be removed from this prolog. Note that marked sections are used to enable this kind of separate processing: if ONECOMPONENT is set to INCLUDE, the language definitions are included in the cesDoc header. If the corpus is processed as a whole, ONECOMPONENT is set to IGNORE, and the language definitions are part of the cesCorpus header.
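
Activating the prolog can be automated. The following Python sketch shows one possible approach, assuming the prolog has exactly the shape shown above (it opens with the comment delimiter directly before DOCTYPE and closes with the comment delimiter directly after the final square bracket). The function name enable_prolog is hypothetical.

```python
import re

# A commented-out prolog starts with "<!--DOCTYPE" and ends with "]-->".
COMMENTED_PROLOG = re.compile(r"<!--(DOCTYPE\b.*?\])-->", re.DOTALL)


def enable_prolog(text):
    """Remove the comment markers around the first commented-out prolog."""
    return COMMENTED_PROLOG.sub(r"<!\1>", text, count=1)


# Shortened version of the prolog shown above, followed by the document.
SAMPLE = '''<!--DOCTYPE cesDoc PUBLIC "-//CES//DTD cesDoc//EN" [
  <!ENTITY % ONECOMPONENT "INCLUDE">
  <!ENTITY % MTEcs PUBLIC "-//MTE//ENTITIES Czech//EN">
  %MTEcs;
]-->
<cesDoc>'''

active = enable_prolog(SAMPLE)
```

Only the first match is rewritten, so ordinary comments later in the file are left alone.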

The cesDoc corpus components have been encoded at least up to CES level 1. Level 1 markup includes a TEI-like header (file, encoding, profile and revision descriptions) and universal text elements down to the level of the paragraph, e.g. textual divisions, paragraphs, titles and headings, footnotes, tables and poems. Some CES level 2 markup has also been included, e.g. quoted material (quote, q, mentioned elements), rendition information, and, to varying degrees, abbreviations, dates, names, and numbers. The parallel corpus sets are furthermore marked up for sentences. Below we give a summary of the elements used, together with the number of times they appear in the corpus.

cesdoc = 26; text = 26; group = 1; body = 28; div = 3309
p = 27993; head = 3262; byline = 1240; caption = 145; closer = 2; dateline = 220; figdesc = 115; figure = 115; list = 61; note = 52; opener = 290; poem = 90; quote = 756; table = 3
s = 65758; cell = 75; item = 350; l = 456; row = 15
q = 23947
author = 168; bibl = 168; corr = 1; docauthor = 729; ptr = 39
abbr = 6059; date = 1835; distinct = 223; foreign = 919; hi = 5328; label = 2; measure = 18; mentioned = 1419; name = 41340; num = 5238; ref = 26; sp = 251; speaker = 6; term = 6; time = 87; title = 920
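
A summary of this kind can be approximated with a simple start-tag count. The Python sketch below is not the procedure used to produce the figures above; it is a regular-expression approximation that assumes every element has an explicit start tag (SGML allows tag omission, which only a real parser handles). The function name count_elements is hypothetical.

```python
import re
from collections import Counter

# Start tags only: "<name" followed by whitespace, ">" or "/".
# End tags ("</name>") and declarations ("<!...") are not matched.
START_TAG = re.compile(r"<([A-Za-z][\w.-]*)(?=[\s>/])")


def count_elements(sgml):
    """Count start tags by lowercased element name."""
    return Counter(name.lower() for name in START_TAG.findall(sgml))


SAMPLE = (
    "<cesDoc lang=sl><text><body>"
    "<p><s>One.</s> <s>Two.</s></p>"
    "</body></text></cesDoc>"
)

counts = count_elements(SAMPLE)
```

Element names are lowercased so that cesDoc and cesdoc are counted together, matching the spelling used in the summary.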

Details of the markup of the component corpora can be found in the 'Corpus Encoding' sections of the following chapters.

For the corpus components, HTML translations of the cesDoc headers and samples of the texts are also available. These were obtained with custom-written CES-to-HTML tables, using the Fred software package (see http://www.oclc.org/fred/ ).


Multext-East