Structure of the Corpus

To illustrate the kinds of elements used in the ``1984'' corpus and their distribution across the language components, the following table gives the contents of the <tagUsage> elements in the headers of the respective language components:

abbr          38    28    23    73    38     3    26
body           1     1     1     1     1     1     1
date          40    40    39    18    39     -    33
div           28    28    28    28    28    28    28
foreign       39    29    91    93    43   430     7
head           1     1     1     5     5    28    29
hi           103   103    75   183    71   413   242
item           4     4     4     4     4     4     4
l             32    26    33    32    32    26    34
list           1     1     1     1     1     1     1
mentioned    261   256   244    44   281     -     -
name        1744  1704  2181  2457  1843  2157  1327
note           2     8     2     2     2     3     1
num           52    34    48    14    10     -     -
p           1286  1321  1285  1289  1292  1335  1288
poem          10     7    11    10    10     7    10
ptr            2     8     1     2     2     -     1
q           2209  1203  2208  2192  2197  2137  2260
quote         35    34    36    35    35    23    35
s           6701  6649  6714  6658  6732  6487  6689
title         46    41    45    29    40     1    10
term           -     -     2     -     -     -     -

(A dash marks an empty cell.)

Tag usage in Orwell's ``1984''

For the purpose of alignment it was important to ensure that the gross structure of the seven language components was as similar as possible. The translations were encoded taking the English digital version as the norm, with alignment errors guiding the harmonisation. The harmonised elements can be seen in the table above: they occur the same number of times in all the languages. In particular, each version of ``1984'' has one <body>, which consists of three <div type=part> elements (n=1, 2, 3) and one <div type=part n=appendix>. Each part, except the Appendix, is further subdivided into a number of <div type=chapter n=1, 2, ...>.
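The harmonisation criterion (identical counts in every language component) can be checked mechanically. The following is a minimal sketch in Python, not part of the original tooling: the dictionary holds a few rows copied from the table, with None standing for an empty cell, and the function name harmonised is chosen here for illustration only.

```python
# A few rows copied from the tag-usage table; None marks an empty cell.
TAG_USAGE = {
    "body": [1, 1, 1, 1, 1, 1, 1],
    "date": [40, 40, 39, 18, 39, None, 33],
    "div": [28, 28, 28, 28, 28, 28, 28],
    "head": [1, 1, 1, 5, 5, 28, 29],
    "item": [4, 4, 4, 4, 4, 4, 4],
    "list": [1, 1, 1, 1, 1, 1, 1],
    "p": [1286, 1321, 1285, 1289, 1292, 1335, 1288],
    "poem": [10, 7, 11, 10, 10, 7, 10],
}


def harmonised(usage):
    """Return the tags whose count is present and identical in every component."""
    return sorted(
        tag
        for tag, counts in usage.items()
        if None not in counts and len(set(counts)) == 1
    )


print(harmonised(TAG_USAGE))  # ['body', 'div', 'item', 'list']
```

On these rows only the gross structural elements come out as harmonised, which agrees with the description of <body> and <div> above.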

The languages of ``1984''

In addition to the 'standard' language IDs that the lang attribute can refer to, some elements in the ``1984'' corpus have been marked up with more specific language values. These are defined in the <langUsage> element in the <cesDoc> header of the particular language components. We give the definitions below:

     <language id="ns" iso639="none">Newspeak</language>
     <language id="ns-jg" iso639="none">Newspeak official jargon</language>
     <language id="en-ck" iso639="none">British Cockney English</language>

     <language id=bg-cl    iso639=bg>Bulgarian colloquial</language>
     <language id=ns-bg    iso639=bg>Newspeak Bulgarian</language>
     <language id=ns-jg-bg iso639=bg>Newspeak official jargon Bulgarian</language>

     <language id=cs-cl    iso639=cs>Czech colloquial</language>
     <language id=ns-cs    iso639=cs>Newspeak Czech</language>
     <language id=ns-jg-cs iso639=cs>Newspeak official jargon Czech</language>

     <language id=ns-et    iso639=none>Newspeak Estonian</language>

     <language id=ns-hu    iso639=hu>Newspeak Hungarian</language>
     <language id=ns-jg-hu iso639=hu>Newspeak official jargon Hungarian</language>

     <Language id="ns-ro"  iso639="xx">Nouvorb&abreve;</language>

     <language id=ns-sl    iso639=sl>Newspeak Slovene</language>

The above definitions still need to be harmonised. They also raise some interesting questions as to the definition of 'language'.
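To see where the definitions diverge, they can be tabulated with a small script. The sketch below is an illustration under stated assumptions, not a tool used in the project: it relies on a regular expression (named LANGUAGE_RE here) that tolerates both quoted and unquoted attribute values and mixed-case tag names, and it is run on four of the definitions listed above.

```python
import re

# Matches <language id=... iso639=...>name</language>, with or without quotes
# around the attribute values, and regardless of the case of the tag name.
LANGUAGE_RE = re.compile(
    r'<language\s+id="?([\w-]+)"?\s+iso639="?(\w+)"?\s*>(.*?)</language>',
    re.IGNORECASE,
)

# Four definitions copied from the list above.
definitions = """
<language id="ns" iso639="none">Newspeak</language>
<language id=ns-et    iso639=none>Newspeak Estonian</language>
<Language id="ns-ro"  iso639="xx">Nouvorb&abreve;</language>
<language id=ns-sl    iso639=sl>Newspeak Slovene</language>
"""

rows = LANGUAGE_RE.findall(definitions)
for lang_id, iso_code, name in rows:
    print(f"{lang_id:10} {iso_code:6} {name}")

# The distinct iso639 values show the convention differing between components.
iso_values = sorted({iso_code for _, iso_code, _ in rows})
print(iso_values)  # ['none', 'sl', 'xx']
```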

Sentence segmentation

Prior to sentence alignment, the ``1984'' texts had to be segmented into sentences. This was done automatically, using a combination of the MtSeg tool and special-purpose scripts written by Greg Priest-Dorman of Vassar.

In inserting the <s> markup, the well-known problem of crossing <q> and <s> hierarchies appeared. This was solved by automatically splitting the <q> elements where necessary. The <q> elements so inserted were marked with type=MI (Machine Inserted) in the translations, and with type=broken, together with prev and next ID references, in the English original. Where <hi> elements spanned more than one <s>, they were corrected by hand.
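The idea of splitting a <q> that crosses sentence boundaries can be sketched as follows. This is not the script that was actually used, whose details are not given here; it assumes, purely for illustration, that quotes and sentences are represented as half-open character-offset spans, and the function name split_quote is hypothetical.

```python
def split_quote(quote, sentences):
    """Split a (start, end) quote span so that no piece crosses a sentence boundary.

    All spans are half-open character offsets; the sentence spans are
    assumed to be sorted and non-overlapping.
    """
    q_start, q_end = quote
    pieces = []
    for s_start, s_end in sentences:
        # Keep the part of the quote that lies inside this sentence, if any.
        start, end = max(q_start, s_start), min(q_end, s_end)
        if start < end:
            pieces.append((start, end))
    return pieces


sentences = [(0, 40), (40, 95), (95, 130)]

# A quote running over three sentences is split into three pieces.
print(split_quote((10, 110), sentences))  # [(10, 40), (40, 95), (95, 110)]

# A quote inside a single sentence comes back unchanged.
print(split_quote((45, 90), sentences))  # [(45, 90)]
```

Whenever more than one piece is returned, the extra pieces correspond to the <q> elements that would carry the type=MI or type=broken marking.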

The automatically inserted <s> elements were then validated by hand via the process of alignment: where the <s> alignment between the English version and a translation was not one-to-one, there was a fair chance that the cause was not a difference in translation but a wrongly placed <s>. Such alignments were checked, and the <s> tags corrected where necessary.
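Selecting the alignments that deserve a manual check amounts to filtering out the one-to-one links. The sketch below assumes, as an illustration only, that an alignment is stored as a pair of ID lists (English side, translation side); the function name and the sentence IDs are hypothetical and are not taken from the corpus.

```python
def suspicious_alignments(alignments):
    """Return the alignment links that are not one-to-one."""
    return [
        (source_ids, target_ids)
        for source_ids, target_ids in alignments
        if len(source_ids) != 1 or len(target_ids) != 1
    ]


# Hypothetical sentence IDs, used only to show the shape of the data.
alignments = [
    (["Oen.1.2.2.1"], ["Osl.1.2.2.1"]),                 # 1:1, accepted
    (["Oen.1.2.2.2"], ["Osl.1.2.2.2", "Osl.1.2.2.3"]),  # 1:2, to be checked
    (["Oen.1.2.3.1", "Oen.1.2.3.2"], ["Osl.1.2.3.1"]),  # 2:1, to be checked
]

for source_ids, target_ids in suspicious_alignments(alignments):
    print(f"{len(source_ids)}:{len(target_ids)}", source_ids, target_ids)
```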

ID marking

The ider program, written by Greg Priest-Dorman, was used to automatically assign unique identifiers to the following structural elements in the ``1984'' corpus:

The values of the id attribute are strings composed of the initial letter O, identifying the Orwell corpus, and the two-letter ISO 639 code identifying the language component of the corpus (en, bg, cs, et, hu, ro, sl), followed by numbers separated by periods; these give the position of the tag in the SGML tree of the document, rooted at the BODY element. The following example illustrates the ID naming scheme:

  <body lang="sl" id="Osl">
    <div id="Osl.1" type=part n=1>
    <head>Prvi del</head>
      <div id="Osl.1.2" type=chapter n=1>
        <p id="Osl.1.2.2">
          <s id="Osl.">
            Bil je jasen, mrzel aprilski dan in ure so bile trinajst.
        <quote id="Osl.1.2.17" rend="CN IT"><p id="Osl.">
            <s id="Osl.">
              <date iso8601="1984-04-04">4. april 1984</date>
        <quote id="Osl.1.8.18" rend="CN IT"><poem id="Osl.">
            <l id="Osl.">Tam pod kostanjevim drevesom</l>
            <l id="Osl.">izdala si me,</l>
            <l id="Osl.">izdal sem te,</l>
            <l id="Osl.">ne da bi trenila z o&ccaron;esom.</l>
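The naming scheme can be reproduced with a short tree walk. The following Python sketch is not the ider program: it works on a small well-formed XML fragment rather than on the SGML of the corpus, it assumes that positions are counted over all child elements (which is consistent with Osl.1.2 following the <head> in the example), and the set of tags that receive an ID, as well as the text of the chapter <head>, are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET


def assign_ids(body, lang, id_tags):
    """Set id attributes built from the path of child positions below body."""

    def walk(elem, elem_id):
        if elem.tag in id_tags:
            elem.set("id", elem_id)
        # Positions are 1-based and count every child element,
        # including children that do not receive an ID themselves.
        for position, child in enumerate(elem, start=1):
            walk(child, f"{elem_id}.{position}")

    walk(body, "O" + lang)


# A tiny XML stand-in for the start of the Slovene component;
# the chapter head text "1" is a placeholder.
doc = ET.fromstring(
    '<body lang="sl">'
    '<div type="part" n="1"><head>Prvi del</head>'
    '<div type="chapter" n="1"><head>1</head>'
    "<p><s>Bil je jasen, mrzel aprilski dan in ure so bile trinajst.</s></p>"
    "</div></div></body>"
)

assign_ids(doc, "sl", id_tags={"body", "div", "p", "s"})

for elem in doc.iter():
    if "id" in elem.attrib:
        print(elem.tag, elem.get("id"))
```

Under these assumptions the walk yields Osl for the body, Osl.1 for the part, Osl.1.2 for the chapter and Osl.1.2.2 for the paragraph, as in the example.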

