Corpus Encoding

The ``1984'' corpus has been encoded in SGML, using the Corpus Encoding Specification (CES) Document Type Definition (DTD). The CES DTD, along with its documentation, can be obtained from http://www.cs.vassar.edu/CES/ . The complete MULTEXT-East CES corpus is encoded as a <cesCorpus> document, which comprises a header and the 26 component parts of the corpus (seven of them for ``1984''), each encoded as a <cesDoc> element. The TELRI ``1984'' translations have not been included in this <cesCorpus>, but are encoded as separate <cesDoc> SGML documents, with no common header.

Each <cesDoc> document is stored in a separate file. All system identifiers (i.e. filenames) are encapsulated in the catalog, which is structured according to the SGML Open Technical Resolution 9401:1997. There the TELRI ``1984'' corpus components are given the following PUBLIC identifiers:

     -//TELRI//DOCUMENT CES1 1984//LV
     -//TELRI//DOCUMENT CES1 1984//LT
     -//TELRI//DOCUMENT CES1 1984//SH
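
A catalog of this kind maps each PUBLIC identifier to a system identifier. The following is a minimal sketch of such a lookup in Python, assuming catalog entries of the form PUBLIC "pubid" "sysid". The file names and the helper names (parse_catalog, resolve) are hypothetical and are not the ones used in the corpus distribution.

```python
import re

# Hypothetical catalog contents: the file names are placeholders,
# not the actual system identifiers used in the corpus.
CATALOG = '''
PUBLIC "-//TELRI//DOCUMENT CES1 1984//LV" "1984-lv.ces"
PUBLIC "-//TELRI//DOCUMENT CES1 1984//LT" "1984-lt.ces"
PUBLIC "-//TELRI//DOCUMENT CES1 1984//SH" "1984-sh.ces"
'''

# One PUBLIC entry per line: keyword, quoted public id, quoted system id.
ENTRY = re.compile(r'^[ \t]*PUBLIC[ \t]+"([^"]+)"[ \t]+"([^"]+)"[ \t]*$',
                   re.MULTILINE)


def parse_catalog(text):
    """Return a dict mapping PUBLIC identifiers to system identifiers."""
    return {pubid: sysid for pubid, sysid in ENTRY.findall(text)}


def resolve(catalog, pubid):
    """Return the file name for a PUBLIC identifier, or None if unmapped."""
    return catalog.get(pubid)


catalog = parse_catalog(CATALOG)
print(resolve(catalog, "-//TELRI//DOCUMENT CES1 1984//LT"))
```

A real SGML parser performs this resolution itself; the sketch only illustrates the indirection between identifiers and files.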

Each <cesDoc> element is also marked with the lang attribute, which gives the language of the corpus component (but note that all <cesHeader> elements are marked as English). For language IDs, the two-letter ISO 639 codes have been used. The definitions of the CES <language> elements are encapsulated in the file with the PUBLIC identifier:

    ISO 639-1988//ENTITIES Languages//EN

These entities encompass the European languages; we give here the definitions for the three TELRI languages:

  <language id=lv iso639=lv>Latvian;Lettish</language><!--BALTIC-->
  <language id=lt iso639=lt>Lithuanian</language>     <!--BALTIC-->
  <language id=sh iso639=sh>Serbo-Croatian</language> <!--SLAVIC-->

For language-specific character representation, the documents use SGML entities from the following entity sets:

     ISO 8879-1986//ENTITIES Added Latin 1//EN
     ISO 8879-1986//ENTITIES Added Latin 2//EN 
     ISO 8879-1986//ENTITIES Russian Cyrillic//EN
     ISO 8879-1986//ENTITIES Non Russian Cyrillic//EN

The first two are used for Latvian and Lithuanian, while Serbo-Croatian (indirectly) uses all four. The reason is that the Serbo-Croatian translation is from Serbia, where text can be written in either the Latin or the Cyrillic alphabet. The entities used in the Serbo-Croatian ``1984'' therefore provide a mapping to either the Added Latin or the Cyrillic entities.

For each language, the reference to its language-specific character set entities has been encapsulated in a file with one of the following PUBLIC identifiers:

     -//MTE//ENTITIES Latvian//EN 
     -//MTE//ENTITIES Lithuanian//EN 
     -//MTE//ENTITIES Serbian//EN

It should be noted that, in the current version of MULTEXT-East, only the complete -//MTE//DOCUMENT CES1// corpus constitutes a valid SGML document; its individual components do not. In contrast, each TELRI ``1984'' component is a valid document in its own right, i.e. it contains the SGML prolog.

To retain (future) compatibility with the MULTEXT-East ``1984'', the structure of the TELRI ``1984'' is slightly more complicated than would be necessary for stand-alone documents. As an example, we give the prolog of -//TELRI//DOCUMENT CES1 1984//LT:

     <!DOCTYPE cesDoc PUBLIC "-//CES//DTD cesDoc//EN" [
       <!ENTITY % ONECOMPONENT "INCLUDE">
       <!ENTITY ISOlang PUBLIC "ISO 639-1988//ENTITIES Languages//EN">
       <!ENTITY % MTElt PUBLIC "-//MTE//ENTITIES Lithuanian//EN">
       %MTElt;
     ]>

The prolog relies on SGML marked sections: if ONECOMPONENT is set to INCLUDE, the <language> definitions are included in the <cesDoc> header. In the (MULTEXT-East) corpus, ONECOMPONENT is set to IGNORE, and the <language> definitions are part of the <cesCorpus> header.
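
The effect of the ONECOMPONENT switch can be imitated with a short Python sketch. It handles only non-nested marked sections whose status keyword is given by a parameter entity reference, and the header fragment it processes (a <langUsage> element containing a reference to ISOlang) is a hypothetical illustration, not a quotation from the corpus headers. The function name apply_marked_sections is likewise an assumption.

```python
import re

# Matches a non-nested marked section of the form <![ %NAME; [ ... ]]>
MARKED = re.compile(r'<!\[\s*%(\w+);\s*\[(.*?)\]\]>', re.DOTALL)


def apply_marked_sections(text, params):
    """Keep or drop marked sections according to parameter entity values.

    params maps parameter entity names to "INCLUDE" or "IGNORE".
    An undeclared entity raises KeyError, since it would be an error in SGML.
    """
    def replace(match):
        status = params[match.group(1)]
        return match.group(2) if status == "INCLUDE" else ""
    return MARKED.sub(replace, text)


# Hypothetical header fragment, for illustration only.
header = ('<cesHeader>'
          '<![ %ONECOMPONENT; [<langUsage>&ISOlang;</langUsage>]]>'
          '</cesHeader>')

# Stand-alone TELRI component: language definitions stay in the header.
print(apply_marked_sections(header, {"ONECOMPONENT": "INCLUDE"}))
# Component inside the full corpus: the section is dropped.
print(apply_marked_sections(header, {"ONECOMPONENT": "IGNORE"}))
```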

All the corpus components have been encoded at least up to CES level 1. Level 1 markup includes a TEI-like header (file, encoding, profile and revision descriptions), and universal text elements down to the level of the paragraph, e.g. textual divisions, paragraphs, titles and headings, footnotes, tables and poems. Some CES level 2 markup has also been included, e.g. quoted material (<quote>, <q>) and rendition information. Finally, some CES level 3 markup is also present, namely sentence markup with <s>.

To illustrate the kinds of elements used in the ``1984'' corpus, and to show their distribution across the language components, the following table lists the contents of the <tagUsage> elements in the headers of the respective language components. For comparative purposes, the numbers are given for the complete ``1984'' corpus, i.e. for the MULTEXT-East components as well as the TELRI ones:

GI            EN    BG    CS    ET    HU    RO    SL    LV    LT    SH
abbr          38    28    23    73    38     3    26     -     -     7
date          40    40    39    18    39     -    33     -     -    39
body           1     1     1     1     1     1     1     1     1     1
div           28    28    28    28    28    28    28    28    28    28
foreign       39    29    91    93    43   430     7     -     -     7
head           1     1     1     5     5    28    29    28    28    27
hi           103   103    75   183    71   413   242   129   136   323
item           4     4     4     4     4     4     4     4     4     4
l             32    26    33    32    32    26    34    32    36    32
list           1     1     1     1     1     1     1     1     1     1
mentioned    261   256   244    44   281     -     -     -     -     -
name        1744  1704  2181  2457  1843  2157  1327   163     -  1371
note           2     8     2     2     2     3     1     1     1     1
num           52    34    48    14    10     -     -     1     -     -
p           1286  1321  1285  1289  1292  1335  1288  1332  1331  1279
poem          10     7    11    10    10     7    10    10    12    10
ptr            2     8     1     2     2     -     1     1     1     -
q           2209  1203  2208  2192  2197  2137  2260   210    40  2246
quote         35    34    36    35    35    23    35    25    38    35
s           6701  6649  6714  6658  6732  6487  6689  6690  6675  6652
title         46    41    45    29    40     1    10     1     7     4
term           -     -     2     -     -     -     -     -     -     -

Tag usage in Orwell's ``1984'' (GI = generic identifier; - = no count given)

For the purpose of alignment it was important to ensure that the gross structure of all the languages was as similar as possible. The translations were encoded taking the English digital version as the norm, with errors of alignment guiding the harmonisation. The elements that have been harmonised can be seen in the table above, as they occur the same number of times in all the languages. In particular, each of the ``1984''s has one <body>, which consists of three <div type=part n=1, 2, 3> and of one <div type=part n=appendix>. Each part, except the Appendix, is further subdivided into a number of <div type=chapter n=1, 2, ...>.
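
The harmonised elements can be picked out mechanically: they are the rows of the tag usage table whose count is identical in every language component. A minimal Python sketch, using a subset of the rows from the table (the variable and function names are illustrative):

```python
# Column order of the tag usage table.
LANGS = ["EN", "BG", "CS", "ET", "HU", "RO", "SL", "LV", "LT", "SH"]

# A subset of the rows of the tag usage table, copied from it.
TAG_USAGE = {
    "body": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "div":  [28, 28, 28, 28, 28, 28, 28, 28, 28, 28],
    "head": [1, 1, 1, 5, 5, 28, 29, 28, 28, 27],
    "item": [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
    "list": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "p":    [1286, 1321, 1285, 1289, 1292, 1335, 1288, 1332, 1331, 1279],
    "s":    [6701, 6649, 6714, 6658, 6732, 6487, 6689, 6690, 6675, 6652],
}


def harmonised(usage):
    """Return the tags that occur equally often in all components."""
    return sorted(tag for tag, counts in usage.items()
                  if len(counts) == len(LANGS) and len(set(counts)) == 1)


print(harmonised(TAG_USAGE))  # ['body', 'div', 'item', 'list']
```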

Sentence segmentation

Prior to sentence alignment, the ``1984'' texts had to be segmented into sentences. In inserting the <s> markup, the well-known problem of crossing <q> and <s> hierarchies was avoided by not marking up (problematic) <q> elements: these are indicated by the original rendering. Where <q> elements were already in place and crossed with the <s> structure, the <q> elements were split and marked with type=MI (Machine Inserted).

The automatically inserted <s> elements were then hand-validated via the process of alignment: where the <s> alignment between the English version and a translation was not one-to-one, there was a fair chance that the cause was not a difference in translation but a wrongly placed <s>. Such alignments were checked, and the <s> tags corrected where necessary.
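
The selection of alignments for manual checking can be sketched as follows. The representation of an alignment as a pair of ID lists, the sample IDs, and the function name are assumptions made for this illustration; they do not reflect the format of the alignment files actually used.

```python
def suspicious_alignments(alignments):
    """Return the alignments that are not one-to-one.

    Each alignment is a pair (english_ids, translation_ids), where both
    members are lists of <s> identifiers.
    """
    return [(src, tgt) for src, tgt in alignments
            if not (len(src) == 1 and len(tgt) == 1)]


# Illustrative data only: the IDs are made up for the example.
alignments = [
    (["Oen.1.2.2.1"], ["Osl.1.2.2.1"]),                  # 1:1, not checked
    (["Oen.1.2.2.2"], ["Osl.1.2.2.2", "Osl.1.2.2.3"]),   # 1:2, check by hand
    (["Oen.1.2.3.1", "Oen.1.2.3.2"], ["Osl.1.2.3.1"]),   # 2:1, check by hand
]

for src, tgt in suspicious_alignments(alignments):
    print(len(src), ":", len(tgt), src, tgt)
```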

ID marking

The ider program, written by Greg Priest-Dorman, was used to automatically assign unique identifiers to the following structural elements in the ``1984'' corpus:

The values of the id attribute are strings composed of the initial letter O, identifying the Orwell corpus, and the two-letter ISO 639 code identifying the language component of the corpus (en, bg, cs, et, hu, ro, sl), followed by numbers separated by periods; these give the position of the tag in the SGML tree of the document, rooted at the <body> element. The following example illustrates the ID naming scheme:

  <body lang="sl" id="Osl">
    <div id="Osl.1"  type=part n=1>
    <head>Prvi del</head>
      <div id="Osl.1.2" type=chapter n=1>
        <head>I</head>
        <p id="Osl.1.2.2">
          <s id="Osl.1.2.2.1">
            Bil je jasen, mrzel aprilski dan in ure so bile trinajst.
          </s>
...
        <quote id="Osl.1.2.17" rend="CN IT"><p id="Osl.1.2.17.1">
            <s id="Osl.1.2.17.1.1">
              <date iso8601="1984-04-04">4. april 1984</date>
            </s>
          </p></quote>
...
        <quote id="Osl.1.8.18" rend="CN IT"><poem id="Osl.1.8.18.1">
            <l id="Osl.1.8.18.1.1">Tam pod kostanjevim drevesom</l>
            <l id="Osl.1.8.18.1.2">izdala si me,</l>
            <l id="Osl.1.8.18.1.3">izdal sem te,</l>
            <l id="Osl.1.8.18.1.4">ne da bi trenila z o&ccaron;esom.</l>
          </poem></quote>
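
Because the IDs encode the position in the document tree, they can be decomposed without consulting the document itself. The following Python sketch, written against the naming scheme as described above, splits an ID into its language code and tree path and derives the ID of the parent element. The function names are illustrative and are not part of the ider program.

```python
import re

# "O", a two-letter language code, then zero or more ".<number>" steps.
ID_PATTERN = re.compile(r'^O([a-z]{2})((?:\.\d+)*)$')


def parse_id(value):
    """Split an ID into (language code, list of child positions)."""
    match = ID_PATTERN.match(value)
    if match is None:
        raise ValueError("not a corpus ID: " + repr(value))
    lang, tail = match.groups()
    # tail looks like ".1.2.17"; the first split item is always empty.
    path = [int(step) for step in tail.split(".")[1:]]
    return lang, path


def parent_id(value):
    """Return the ID of the enclosing element, or None for <body>."""
    lang, path = parse_id(value)
    if not path:
        return None
    return "O" + lang + "".join("." + str(step) for step in path[:-1])


print(parse_id("Osl.1.2.17.1.1"))   # ('sl', [1, 2, 17, 1, 1])
print(parent_id("Osl.1.2.2.1"))     # Osl.1.2.2
```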

HTML rendering

For the corpus components, HTML translations of <cesDoc> headers and samples of the texts are also available. These were obtained with custom-written CES 2 HTML translation maps, using the Fred software package (see http://www.oclc.org/fred/ ). The mapping does not preserve information explicitly, but 'renders' it. For example, <name> is represented as <b> and <respType> as '(...)'.
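
The two example mappings can be imitated with a short Python sketch. It is not the Fred translation map itself; it assumes that '(...)' means the element content placed in parentheses, and the sample input is made up for the illustration.

```python
import re


def render(text):
    """Render <name> as bold and <respType> as a parenthesised string."""
    text = re.sub(r'<name>(.*?)</name>', r'<b>\1</b>', text, flags=re.DOTALL)
    text = re.sub(r'<respType>(.*?)</respType>', r'(\1)', text,
                  flags=re.DOTALL)
    return text


# Illustrative input only, not taken from a corpus header.
print(render('<respType>encoding</respType> <name>Orwell</name>'))
# (encoding) <b>Orwell</b>
```

As in the original maps, the element names are lost in the output: the result shows how the information looks, not what it is.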


Multext-East