Next: Latvian Up: TELRI Appendix 1: Additional Previous: Overview

Corpus Encoding

The ``1984'' corpus has been encoded in SGML, using the Corpus Encoding Specification (CES) Document Type Definition (DTD). The CES DTD, along with its documentation, can be obtained from http://www.cs.vassar.edu/CES/ . The complete MULTEXT-East CES corpus is encoded as a <cesCorpus> document, which comprises a header and the 26 component parts of the corpus (seven of them for ``1984''), each encoded as a <cesDoc> element. The TELRI ``1984'' translations have not been included in this <cesCorpus>, but are encoded as separate <cesDoc> SGML documents, with no common header.

Each <cesDoc> document is stored in a separate file. All system identifiers (i.e. filenames) are encapsulated in the catalog, which is structured according to the SGML Open Technical Resolution 9401:1997. There the TELRI ``1984'' corpus components are given the following PUBLIC identifiers:

     -//TELRI//DOCUMENT CES1 1984//LV
     -//TELRI//DOCUMENT CES1 1984//LT
     -//TELRI//DOCUMENT CES1 1984//SH

Each <cesDoc> element is also marked with the lang attribute that gives the language of the corpus component (but note that all <cesHeader> elements are marked as English). For language IDs, the two letter ISO 639 values have been used. The definitions for the CES language elements are encapsulated in the file with the PUBLIC identifier:

    ISO 639-1988//ENTITIES Languages//EN

These entities encompass the European languages; we give here the definitions for the three TELRI languages:

  <language id=lv iso639=lv>Latvian;Lettish</language><!--BALTIC-->
  <language id=lt iso639=lt>Lithuanian</language>     <!--BALTIC-->
  <language id=sh iso639=sh>Serbo-Croatian</language> <!--SLAVIC-->

For (language specific) character representation, the documents use SGML entities from the following entity sets:

     ISO 8879-1986//ENTITIES Added Latin 1//EN
     ISO 8879-1986//ENTITIES Added Latin 2//EN 
     ISO 8879-1986//ENTITIES Russian Cyrillic//EN
     ISO 8879-1986//ENTITIES Non Russian Cyrillic//EN

The first two are used for Latvian and Lithuanian, while Serbo-Croatian (indirectly) uses all four. The reason is that the Serbo-Croatian translation is from Serbia, where text can be written in either the Latin or the Cyrillic alphabet. The entities used in the Serbo-Croatian ``1984'' therefore map either to the Added Latin or to the Cyrillic entities.

For each language, the references to its language-specific character set entities have been encapsulated in a file with one of the following PUBLIC identifiers:

     -//MTE//ENTITIES Latvian//EN 
     -//MTE//ENTITIES Lithuanian//EN 
     -//MTE//ENTITIES Serbian//EN

It should be noted that, in the current version of MULTEXT-East, only the complete -//MTE//DOCUMENT CES1// corpus constitutes a valid SGML document. By contrast, each TELRI ``1984'' component is a valid document in its own right, i.e. it contains the SGML prolog.

To retain (future) compatibility with the MULTEXT-East ``1984'', the structure of the TELRI ``1984'' is slightly more complicated than would be necessary for 'stand alone' documents. As an example we give the prolog of -//TELRI//DOCUMENT CES1 1984//LT:

     <!DOCTYPE cesDoc PUBLIC "-//CES//DTD cesDoc//EN" [
       <!ENTITY % ONECOMPONENT "INCLUDE">
       <!ENTITY ISOlang PUBLIC "ISO 639-1988//ENTITIES Languages//EN">
       <!ENTITY % MTElt PUBLIC "-//MTE//ENTITIES Lithuanian//EN">
       %MTElt;
     ]>

The documents make use of marked sections: if ONECOMPONENT is set to INCLUDE, the <language> definitions are included in the <cesDoc> header. In the (MULTEXT-East) corpus, ONECOMPONENT is set to IGNORE, and the <language> definitions are part of the <cesCorpus> header.

All the corpus components have been encoded at least up to CES level 1. Level 1 markup includes a TEI-like header (file, encoding, profile and revision descriptions), and universal text elements down to the level of the paragraph, e.g. textual divisions, paragraphs, titles and headings, footnotes, tables and poems. Some CES 2 level markup has also been included, e.g. quoted material (<quote>, <q>), and rendition information. Finally, some CES 3 level markup is also present, namely sentence markup with <s>.

To illustrate the kinds of elements used in the ``1984'' corpus and their distribution across the language components, the following table gives the contents of the <tagUsage> elements in the headers of the respective language components. For comparative purposes, the numbers are given for the complete ``1984'' corpus:

GI            EN    BG    CS    ET    HU    RO    SL    LV    LT    SH
abbr          38    28    23    73    38     3    26     -     -     7
date          40    40    39    18    39     -    33     -     -    39
body           1     1     1     1     1     1     1     1     1     1
div           28    28    28    28    28    28    28    28    28    28
foreign       39    29    91    93    43   430     7     -     -     7
head           1     1     1     5     5    28    29    28    28    27
hi           103   103    75   183    71   413   242   129   136   323
item           4     4     4     4     4     4     4     4     4     4
l             32    26    33    32    32    26    34    32    36    32
list           1     1     1     1     1     1     1     1     1     1
mentioned    261   256   244    44   281     -     -     -     -     -
name        1744  1704  2181  2457  1843  2157  1327   163     -  1371
note           2     8     2     2     2     3     1     1     1     1
num           52    34    48    14    10     -     -     1     -     -
p           1286  1321  1285  1289  1292  1335  1288  1332  1331  1279
poem          10     7    11    10    10     7    10    10    12    10
ptr            2     8     1     2     2     -     1     1     1     -
q           2209  1203  2208  2192  2197  2137  2260   210    40  2246
quote         35    34    36    35    35    23    35    25    38    35
s           6701  6649  6714  6658  6732  6487  6689  6690  6675  6652
title         46    41    45    29    40     1    10     1     7     4
term           -     -     2     -     -     -     -     -     -     -

(- : no count given for this element in that component)

Tag usage in Orwell's ``1984''

For the purpose of alignment it was important to ensure that the gross structure of all the languages was as similar as possible. The translations were encoded taking the English digital version as the norm, with errors of alignment guiding the harmonisation. The elements that have been harmonised can be seen in the table above, as they occur the same number of times in all the languages. In particular, each of the ``1984''s has one <body>, which consists of three <div type=part n=1, 2, 3> and of one <div type=part n=appendix>. Each part, except the Appendix, is further subdivided into a number of <div type=chapter n=1, 2, ...>.
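The harmonised elements can be read off the tag usage table mechanically: they are the rows whose counts are identical in every column. The following sketch does this for a handful of rows copied from the table; the dictionary `TAG_USAGE` and the function `harmonised` are illustrative names only, and the column order is EN, BG, CS, ET, HU, RO, SL, LV, LT, SH.

```python
# A few rows copied from the tag usage table above
# (columns: EN BG CS ET HU RO SL LV LT SH).
TAG_USAGE = {
    "body": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "div":  [28, 28, 28, 28, 28, 28, 28, 28, 28, 28],
    "head": [1, 1, 1, 5, 5, 28, 29, 28, 28, 27],
    "item": [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
    "list": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "p":    [1286, 1321, 1285, 1289, 1292, 1335, 1288, 1332, 1331, 1279],
    "poem": [10, 7, 11, 10, 10, 7, 10, 10, 12, 10],
}


def harmonised(tag_usage):
    """Return the elements that occur equally often in all components."""
    return sorted(gi for gi, counts in tag_usage.items()
                  if len(set(counts)) == 1)


if __name__ == "__main__":
    print(harmonised(TAG_USAGE))
```

For the rows included here this singles out <body>, <div>, <item> and <list>, while <head>, <p> and <poem> still differ between the language components.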

Sentence segmentation

Prior to sentence alignment, the ``1984'' had to be sentence segmented. In inserting the <s> markup, the well known problem of crossing <q> and <s> hierarchies was avoided by not marking up (problematic) <q> elements: these are indicated by the original rendering. In the cases where <q> elements were already in place, and crossed with the <s> structure, the <q> elements were split and marked by type=MI (Machine Inserted).

The automatically inserted <s> elements were then hand validated via the process of alignment: where the <s> alignment between the English version and a translation was not one-to-one, there was a fair chance that this was due not to a difference in translation but to a wrongly placed <s>. Such alignments were checked, and the <s> tags corrected where necessary.
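The selection step of this validation can be sketched as follows. The representation of an alignment as a pair of lists of <s> identifiers is an assumption made for the example, as are the function name `needs_checking` and the sample identifiers; the format actually produced by the aligner is not described here.

```python
def needs_checking(alignments):
    """Return the alignments that are not one-to-one.

    `alignments` is assumed to be a list of (source_ids, target_ids)
    pairs, each a list of <s> identifiers (hypothetical format).
    """
    return [(src, tgt) for src, tgt in alignments
            if len(src) != 1 or len(tgt) != 1]


if __name__ == "__main__":
    # Hypothetical sample: English sentences aligned with Lithuanian ones.
    sample = [
        (["Oen.1.2.2.1"], ["Olt.1.2.2.1"]),                 # 1:1, accepted
        (["Oen.1.2.2.2"], ["Olt.1.2.2.2", "Olt.1.2.2.3"]),  # 1:2, inspect
        (["Oen.1.2.2.3", "Oen.1.2.2.4"], ["Olt.1.2.2.4"]),  # 2:1, inspect
    ]
    for src, tgt in needs_checking(sample):
        print(src, "<->", tgt)
```

Only the flagged pairs need to be inspected by hand; whether a given pair reflects a translation difference or a misplaced <s> remains a human decision.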

ID marking

The ider program, written by Greg Priest-Dorman, was used to automatically assign unique identifiers to the following structural elements in the ``1984'' corpus:

<div>
<p>, <list>, <poem>
<s>, <item>, <l>

The values of the id attribute are strings composed of the initial letter O, identifying the Orwell corpus, and the two-letter ISO 639 code identifying the language component of the corpus (en, bg, cs, et, hu, ro, sl), followed by numbers separated by periods; these give the position of the tag in the SGML tree of the document, rooted at the BODY element. The following example illustrates the ID naming scheme:

  <body lang="sl" id="Osl">
    <div id="Osl.1"  type=part n=1>
    <head>Prvi del</head>
      <div id="Osl.1.2" type=chapter n=1>
        <head>I</head>
        <p id="Osl.1.2.2">
          <s id="Osl.1.2.2.1">
            Bil je jasen, mrzel aprilski dan in ure so bile trinajst.
          </s>
...
        <quote id="Osl.1.2.17" rend="CN IT"><p id="Osl.1.2.17.1">
            <s id="Osl.1.2.17.1.1">
              <date iso8601="1984-04-04">4. april 1984</date>
            </s>
          </p></quote>
...
        <quote id="Osl.1.8.18" rend="CN IT"><poem id="Osl.1.8.18.1">
            <l id="Osl.1.8.18.1.1">Tam pod kostanjevim drevesom</l>
            <l id="Osl.1.8.18.1.2">izdala si me,</l>
            <l id="Osl.1.8.18.1.3">izdal sem te,</l>
            <l id="Osl.1.8.18.1.4">ne da bi trenila z o&ccaron;esom.</l>
          </poem></quote>
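Because the identifiers encode the position in the tree, they can be taken apart again without consulting the document. The sketch below is only an illustration of the naming scheme (the names `CorpusId`, `parse_id` and `parent_id` are not part of the ider program):

```python
from typing import NamedTuple, Optional, Tuple


class CorpusId(NamedTuple):
    corpus: str            # "O" for the Orwell corpus
    lang: str              # two-letter ISO 639 code, e.g. "sl"
    path: Tuple[int, ...]  # numeric position below the body element


def parse_id(value: str) -> CorpusId:
    """Split an id such as "Osl.1.8.18.1.4" into its components."""
    head, *numbers = value.split(".")
    if len(head) != 3 or head[0] != "O":
        raise ValueError(f"unexpected id prefix: {value!r}")
    return CorpusId(head[0], head[1:], tuple(int(n) for n in numbers))


def parent_id(value: str) -> Optional[str]:
    """Return the id one level up, or None for the body id itself."""
    head, sep, _ = value.rpartition(".")
    return head if sep else None


if __name__ == "__main__":
    print(parse_id("Osl.1.8.18.1.4"))
    print(parent_id("Osl.1.2.2.1"))
```

In the example above, the parent of the sentence "Osl.1.2.2.1" obtained this way is "Osl.1.2.2", which is the id of the enclosing <p>.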

HTML rendering

For the corpus components, HTML translations of the <cesDoc> headers and of samples of the texts are also available. These were obtained with custom-written CES-to-HTML translation maps, using the Fred software package (see http://www.oclc.org/fred/). The mapping does not preserve information explicitly, but 'renders' it. For example, <name> is represented as <b> and <respType> as '(...)'.
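The difference between preserving and rendering can be illustrated with the two mappings just mentioned. The sketch below is not the Fred translation map; it is a plain regular-expression stand-in, and it reads '(...)' as the element content wrapped in parentheses, which is an assumption. The names `RENDER_RULES` and `render` and the sample strings are hypothetical.

```python
import re

# Illustrative stand-ins for two of the mappings; the real maps were
# written for Fred and are not reproduced here.
RENDER_RULES = [
    (re.compile(r"<name\b[^>]*>(.*?)</name>", re.I | re.S), r"<b>\1</b>"),
    (re.compile(r"<respType\b[^>]*>(.*?)</respType>", re.I | re.S), r"(\1)"),
]


def render(fragment: str) -> str:
    """Apply the rendering rules to a fragment of CES markup."""
    for pattern, replacement in RENDER_RULES:
        fragment = pattern.sub(replacement, fragment)
    return fragment


if __name__ == "__main__":
    # Hypothetical sample fragments, not taken from the corpus.
    print(render("<name>Winston</name>"))
    print(render("<respType>translation</respType>"))
```

After rendering, the output no longer records that the bold text was a <name>, which is the sense in which the information is rendered rather than preserved.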

Multext-East