Next: English Up: Multilingual Parallel: Orwell's ``1984'' Previous: Overview

Structure of the Corpus

To illustrate the kinds of elements used in the ``1984'' corpus and their distribution across the language components, the following table gives the contents of the <tagUsage> elements in the headers of the respective language components:

GI	EN	BG	CS	ET	HU	RO	SL
abbr	38	28	23	73	38	3	26
body	1	1	1	1	1	1	1
date	40	40	39	18	39		33
div	28	28	28	28	28	28	28
foreign	39	29	91	93	43	430	7
head	1	1	1	5	5	28	29
hi	103	103	75	183	71	413	242
item	4	4	4	4	4	4	4
l	32	26	33	32	32	26	34
list	1	1	1	1	1	1	1
mentioned	261	256	244	44	281
name	1744	1704	2181	2457	1843	2157	1327
note	2	8	2	2	2	3	1
num	52	34	48	14	10
p	1286	1321	1285	1289	1292	1335	1288
poem	10	7	11	10	10	7	10
ptr	2	8	1	2	2		1
q	2209	1203	2208	2192	2197	2137	2260
quote	35	34	36	35	35	23	35
s	6701	6649	6714	6658	6732	6487	6689
title	46	41	45	29	40	1	10
term			2

Tag usage in Orwell's ``1984''

For the purpose of alignment, it was important to ensure that the gross structure of the seven language versions was as similar as possible. The translations were encoded taking the English digital version as the norm, with errors of alignment guiding the harmonisation. The harmonised elements can be identified in the table above, as they occur the same number of times in all the languages. In particular, each of the ``1984'' texts has one <body>, which consists of three <div type=part> elements (n=1, 2, 3) and one <div type=part n=appendix>. Each part, except the Appendix, is further subdivided into a number of <div type=chapter n=1, 2, ...> elements.
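The check for harmonised elements can be sketched in a few lines of Python. This is only an illustration: the counts are copied from a handful of rows of the table above (column order EN, BG, CS, ET, HU, RO, SL), and the names TAG_USAGE and harmonised are assumptions introduced here, not part of the corpus tools.

```python
# Counts copied from selected rows of the tag usage table
# (column order: EN, BG, CS, ET, HU, RO, SL).
TAG_USAGE = {
    "body": [1, 1, 1, 1, 1, 1, 1],
    "div": [28, 28, 28, 28, 28, 28, 28],
    "head": [1, 1, 1, 5, 5, 28, 29],
    "item": [4, 4, 4, 4, 4, 4, 4],
    "list": [1, 1, 1, 1, 1, 1, 1],
    "p": [1286, 1321, 1285, 1289, 1292, 1335, 1288],
    "s": [6701, 6649, 6714, 6658, 6732, 6487, 6689],
}


def harmonised(usage):
    """Return the element names whose count is identical in every language."""
    return sorted(name for name, counts in usage.items() if len(set(counts)) == 1)


print(harmonised(TAG_USAGE))  # ['body', 'div', 'item', 'list']
```

Of the rows sampled here, only the gross structural elements come out as identical across all seven components, which matches the description above.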

The languages of ``1984''

In addition to the 'standard' language IDs that the lang attribute can refer to, some elements in the ``1984'' corpus have been marked up with more specific language values. These are defined in the <langUsage> element in the <cesDoc> header of the particular language components. We give the definitions below:

     <language id="ns" iso639="none">Newspeak</language>
     <language id="ns-jg" iso639="none">Newspeak official jargon</language>
     <language id="en-ck" iso639="none">British Cockney English</language>

     <language id=bg-cl    iso639=bg>Bulgarian colloquial</language>
     <language id=ns-bg    iso639=bg>Newspeak Bulgarian</language>
     <language id=ns-jg-bg iso639=bg>Newspeak official jargon Bulgarian</language>

     <language id=cs-cl    iso639=cs>Czech colloquial</language>
     <language id=ns-cs    iso639=cs>Newspeak Czech</language>
     <language id=ns-jg-cs iso639=cs>Newspeak official jargon Czech</language>

     <language id=ns-et    iso639=none>Newspeak Estonian</language>

     <language id=ns-hu    iso639=hu>Newspeak Hungarian</language>
     <language id=ns-jg-hu iso639=hu>Newspeak official jargon Hungarian</language>

     <Language id="ns-ro"  iso639="xx">Nouvorb&abreve;</language>

     <language id=ns-sl    iso639=sl>Newspeak Slovene</language>

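Note that the above definitions still need to be harmonised: for instance, the iso639 value given for the Newspeak variants is "none" in some components, "xx" in one, and the code of the host language in others. They also raise some interesting questions as to the definition of 'language'.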

Sentence segmentation

Prior to sentence alignment, the ``1984'' texts had to be segmented into sentences. This was done automatically, using a combination of the MtSeg tool and special-purpose scripts written by Greg Priest-Dorman of Vassar.

In inserting the <s> markup, the well-known problem of crossing <q> and <s> hierarchies appeared. This was solved by automatically splitting the <q> elements where necessary. The <q> elements inserted in this way were marked with type=MI (Machine Inserted) in the translations, and with type=broken, together with prev and next ID references, in the English original. Where <hi> elements spanned more than one <s>, they were corrected by hand.
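The idea of splitting a quotation at sentence boundaries can be illustrated with a minimal sketch. It is not the script actually used for the corpus: it assumes, purely for illustration, that a quotation and the sentence boundaries are available as character offsets, and the function name split_span is hypothetical.

```python
def split_span(span, boundaries):
    """Split a (start, end) span at every boundary that falls strictly inside it."""
    start, end = span
    inner = [b for b in sorted(boundaries) if start < b < end]
    cuts = [start] + inner + [end]
    # Pair consecutive cut points into the resulting sub-spans.
    return list(zip(cuts, cuts[1:]))


# A quotation covering offsets 5..40, with sentence boundaries at 20 and 30
# falling inside it (the boundaries at 0 and 50 lie outside and are ignored).
pieces = split_span((5, 40), [0, 20, 30, 50])
print(pieces)  # [(5, 20), (20, 30), (30, 40)]
```

Each resulting piece would correspond to one <q> element nested inside a single <s>, so that the two hierarchies no longer cross.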

The automatically inserted <s> elements were then validated by hand via the process of alignment: where the <s> alignment between the English version and a translation was not one-to-one, there was a fair chance that the cause was not a difference in translation but a wrongly placed <s>. Such alignments were checked, and the <s> tags corrected where necessary.
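Selecting the alignments to check can be sketched as follows. The representation of an alignment as a pair of ID lists, the function name suspicious_alignments, and the sentence IDs in the example are all assumptions made for illustration; they do not reproduce the format of the actual alignment files.

```python
def suspicious_alignments(alignments):
    """Return the alignment links that are not one-to-one."""
    return [(src, tgt) for src, tgt in alignments
            if len(src) != 1 or len(tgt) != 1]


# Made-up links between English and Slovene sentence IDs.
links = [
    (["Oen.1.2.2.1"], ["Osl.1.2.2.1"]),                 # 1:1, not flagged
    (["Oen.1.2.3.1", "Oen.1.2.3.2"], ["Osl.1.2.3.1"]),  # 2:1, flagged
    (["Oen.1.2.3.3"], ["Osl.1.2.3.2"]),                 # 1:1, not flagged
]

for src, tgt in suspicious_alignments(links):
    print(src, "<->", tgt)
```

Only the flagged links need manual inspection, to decide whether they reflect a genuine translation difference or a misplaced <s> tag.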

ID marking

The ider program, written by Greg Priest-Dorman, was used to automatically assign unique identifiers to the following structural elements in the ``1984'' corpus:

<div>, <quote>
<p>, <list>, <poem>
<s>, <item>, <l>
<q>

The values of the id attribute are strings composed of the initial letter O, identifying the Orwell corpus, and the two-letter ISO 639 code identifying the language component of the corpus (en, bg, cs, et, hu, ro, sl), followed by numbers separated by periods; these give the position of the tag in the SGML tree of the document, rooted at the <body> element. The following example illustrates the ID naming scheme:

  <body lang="sl" id="Osl">
    <div id="Osl.1" type=part n=1>
      <head>Prvi del</head>
      <div id="Osl.1.2" type=chapter n=1>
        <head>I</head>
        <p id="Osl.1.2.2">
          <s id="Osl.1.2.2.1">
            Bil je jasen, mrzel aprilski dan in ure so bile trinajst.
          </s>
...
        <quote id="Osl.1.2.17" rend="CN IT"><p id="Osl.1.2.17.1">
            <s id="Osl.1.2.17.1.1">
              <date iso8601="1984-04-04">4. april 1984</date>
            </s>
          </p></quote>
...
        <quote id="Osl.1.8.18" rend="CN IT"><poem id="Osl.1.8.18.1">
            <l id="Osl.1.8.18.1.1">Tam pod kostanjevim drevesom</l>
            <l id="Osl.1.8.18.1.2">izdala si me,</l>
            <l id="Osl.1.8.18.1.3">izdal sem te,</l>
            <l id="Osl.1.8.18.1.4">ne da bi trenila z o&ccaron;esom.</l>
          </poem></quote>
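The naming scheme can be approximated with a short Python sketch. It is not the ider program itself: it assumes that the number in each ID is the position of the element among all element children of its parent (which is what the example above suggests, since <head> is counted but receives no ID), and it uses a small well-formed XML rendition of the sample rather than the original SGML. The names ID_TAGS, assign_ids and SAMPLE are introduced here for illustration only.

```python
import xml.etree.ElementTree as ET

# Elements that receive an id, as listed in the text above.
ID_TAGS = {"div", "quote", "p", "list", "poem", "s", "item", "l", "q"}


def assign_ids(elem, prefix):
    """Give each child of elem an id made of prefix plus its sibling position."""
    for position, child in enumerate(elem, start=1):
        child_id = f"{prefix}.{position}"
        if child.tag in ID_TAGS:
            child.set("id", child_id)
        # Positions of untagged elements (e.g. head) still count in the path.
        assign_ids(child, child_id)


SAMPLE = """<body lang="sl">
  <div type="part" n="1">
    <head>Prvi del</head>
    <div type="chapter" n="1">
      <head>I</head>
      <p><s>Bil je jasen, mrzel aprilski dan in ure so bile trinajst.</s></p>
    </div>
  </div>
</body>"""

body = ET.fromstring(SAMPLE)
body.set("id", "O" + body.get("lang"))  # corpus letter O + language code
assign_ids(body, body.get("id"))

ids = [e.get("id") for e in body.iter() if e.get("id")]
print(ids)  # ['Osl', 'Osl.1', 'Osl.1.2', 'Osl.1.2.2', 'Osl.1.2.2.1']
```

Under this assumption the sketch yields the same identifiers as the first lines of the example above.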


Multext-East