To illustrate the kinds of elements used in the ``1984'' corpus and their distribution across the language components, the following table gives the contents of the <tagUsage> elements in the headers of the respective language components:
GI | EN | BG | CS | ET | HU | RO | SL |
---|---|---|---|---|---|---|---|
abbr | 38 | 28 | 23 | 73 | 38 | 3 | 26 |
body | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
date | 40 | 40 | 39 | 18 | 39 | 33 | |
div | 28 | 28 | 28 | 28 | 28 | 28 | 28 |
foreign | 39 | 29 | 91 | 93 | 43 | 430 | 7 |
head | 1 | 1 | 1 | 5 | 5 | 28 | 29 |
hi | 103 | 103 | 75 | 183 | 71 | 413 | 242 |
item | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
l | 32 | 26 | 33 | 32 | 32 | 26 | 34 |
list | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
mentioned | 261 | 256 | 244 | 44 | 281 | | |
name | 1744 | 1704 | 2181 | 2457 | 1843 | 2157 | 1327 |
note | 2 | 8 | 2 | 2 | 2 | 3 | 1 |
num | 52 | 34 | 48 | 14 | 10 | | |
p | 1286 | 1321 | 1285 | 1289 | 1292 | 1335 | 1288 |
poem | 10 | 7 | 11 | 10 | 10 | 7 | 10 |
ptr | 2 | 8 | 1 | 2 | 2 | 1 | |
q | 2209 | 1203 | 2208 | 2192 | 2197 | 2137 | 2260 |
quote | 35 | 34 | 36 | 35 | 35 | 23 | 35 |
s | 6701 | 6649 | 6714 | 6658 | 6732 | 6487 | 6689 |
title | 46 | 41 | 45 | 29 | 40 | 1 | 10 |
term | 2 |
Tag usage in Orwell's ``1984''
For the purpose of alignment it was important to ensure that the gross structure of the seven language versions was as similar as possible. The translations were encoded taking the English digital version as the norm, with errors of alignment guiding the harmonisation. The harmonised elements can be identified in the table above: they occur the same number of times in all the languages. In particular, each of the ``1984''s has one <body>, which consists of three <div type=part n=1, 2, 3> and one <div type=part n=appendix>. Each part, except the Appendix, is further subdivided into a number of <div type=chapter n=1, 2, ...>.
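As a minimal sketch of how the harmonised elements could be picked out mechanically, the following Python fragment compares the per-language counts of each element. The counts are copied from the table above (columns EN, BG, CS, ET, HU, RO, SL) for a handful of rows; the names `TAG_USAGE` and `harmonised` are illustrative and not part of the corpus tools.

```python
# Counts copied from the tag usage table (columns EN, BG, CS, ET, HU, RO, SL).
# Only a few rows with a value for all seven components are included here.
TAG_USAGE = {
    "body": [1, 1, 1, 1, 1, 1, 1],
    "div": [28, 28, 28, 28, 28, 28, 28],
    "head": [1, 1, 1, 5, 5, 28, 29],
    "item": [4, 4, 4, 4, 4, 4, 4],
    "list": [1, 1, 1, 1, 1, 1, 1],
    "p": [1286, 1321, 1285, 1289, 1292, 1335, 1288],
    "s": [6701, 6649, 6714, 6658, 6732, 6487, 6689],
}


def harmonised(usage):
    """Return the element names whose count is identical in every component."""
    return sorted(gi for gi, counts in usage.items() if len(set(counts)) == 1)


print(harmonised(TAG_USAGE))  # ['body', 'div', 'item', 'list']
```

Elements such as <p> and <s> vary between components, which is expected: only the gross structure was harmonised, not the paragraph or sentence segmentation.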
In addition to the 'standard' language IDs that the lang attribute can refer to, some elements in the ``1984'' corpus have been marked up with more specific language values. These are defined in the <langUsage> element in the <cesDoc> header of the particular language components. We give the definitions below:
<language id="ns" iso639="none">Newspeak</language>
<language id="ns-jg" iso639="none">Newspeak official jargon</language>
<language id="en-ck" iso639="none">British Cockney English</language>
<language id=bg-cl iso639=bg>Bulgarian colloquial</language>
<language id=ns-bg iso639=bg>Newspeak Bulgarian</language>
<language id=ns-jg-bg iso639=bg>Newspeak official jargon Bulgarian</language>
<language id=cs-cl iso639=cs>Czech colloquial</language>
<language id=ns-cs iso639=cs>Newspeak Czech</language>
<language id=ns-jg-cs iso639=cs>Newspeak official jargon Czech</language>
<language id=ns-et iso639=none>Newspeak Estonian</language>
<language id=ns-hu iso639=hu>Newspeak Hungarian</language>
<language id=ns-jg-hu iso639=hu>Newspeak official jargon Hungarian</language>
<Language id="ns-ro" iso639="xx">Nouvorbă</language>
<language id=ns-sl iso639=sl>Newspeak Slovene</language>
It can be noted that the above definitions still need to be harmonised. They also raise some interesting questions as to the definition of 'language'.
Prior to sentence alignment, the ``1984'' texts had to be segmented into sentences. This was done automatically, using a combination of the MtSeg tool and special-purpose scripts written by Greg Priest-Dorman of Vassar.
In inserting the <s> markup, the well-known problem of crossing <q> and <s> hierarchies appeared. This was solved by automatically splitting the <q> elements where necessary. The <q> elements so inserted were marked with type=MI (Machine Inserted) in the translations, and with type=broken, together with prev and next ID references, in the English original. Where <hi> elements spanned more than one <s>, they were corrected by hand.
The automatically inserted <s> elements were then hand-validated via the process of alignment: where the <s> alignment between the English version and a translation was not one-to-one, there was a fair chance that this reflected not a difference in translation but a wrongly placed <s>. Such alignments were checked, and the <s> tags corrected where necessary.
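The selection of alignments for manual checking can be sketched as follows. This is only an illustration: it assumes each alignment link is a pair of lists of <s> identifiers (English side, translation side), which is a hypothetical representation rather than the format actually used, and the identifiers in `links` are made up for the example.

```python
def suspicious_alignments(alignments):
    """Return the links that are not one-to-one (e.g. 2:1, 1:2, 1:0)."""
    return [(src, tgt) for src, tgt in alignments
            if len(src) != 1 or len(tgt) != 1]


# Illustrative links only; the identifiers are invented for this sketch.
links = [
    (["Oen.1.2.2.1"], ["Osl.1.2.2.1"]),                 # 1:1, not flagged
    (["Oen.1.2.3.1", "Oen.1.2.3.2"], ["Osl.1.2.3.1"]),  # 2:1, flagged
    (["Oen.1.2.3.3"], []),                              # 1:0, flagged
]

for src, tgt in suspicious_alignments(links):
    print(f"{len(src)}:{len(tgt)}", src, tgt)
```

A flagged link is only a candidate for inspection: a human still has to decide whether it is a genuine translation difference or a misplaced <s>.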
The ider program, written by Greg Priest-Dorman, was used to automatically assign unique identifiers to the following structural elements in the ``1984'' corpus:
The values of the id attribute are strings composed of the initial letter O identifying the Orwell corpus, the two-letter ISO 639 code identifying the language component of the corpus (en, bg, cs, et, hu, ro, sl), followed by numbers separated by periods; these give the position of the tag in the SGML tree of the document, rooted at the BODY element. The following example illustrates the ID naming scheme:
<body lang="sl" id="Osl">
<div id="Osl.1" type=part n=1>
<head>Prvi del</head>
<div id="Osl.1.2" type=chapter n=1>
<head>I</head>
<p id="Osl.1.2.2">
<s id="Osl.1.2.2.1"> Bil je jasen, mrzel aprilski dan in ure so bile trinajst. </s>
...
<quote id="Osl.1.2.17" rend="CN IT"><p id="Osl.1.2.17.1">
<s id="Osl.1.2.17.1.1"> <date iso8601="1984-04-04">4. april 1984</date> </s>
</p></quote>
...
<quote id="Osl.1.8.18" rend="CN IT"><poem id="Osl.1.8.18.1">
<l id="Osl.1.8.18.1.1">Tam pod kostanjevim drevesom</l>
<l id="Osl.1.8.18.1.2">izdala si me,</l>
<l id="Osl.1.8.18.1.3">izdal sem te,</l>
<l id="Osl.1.8.18.1.4">ne da bi trenila z očesom.</l>
</poem></quote>
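The naming scheme can be approximated with a short Python sketch. It is not the ider program itself: it works on a well-formed XML rendering of the opening of the example (the corpus is SGML, with unquoted attributes), and the set `ID_TAGS` is an assumption inferred from the elements that carry an id in the example above.

```python
import xml.etree.ElementTree as ET

# Well-formed XML rendering of the start of the Slovene example (assumption:
# the real corpus is SGML and would need an SGML-aware parser).
SAMPLE = """
<body lang="sl">
  <div type="part" n="1">
    <head>Prvi del</head>
    <div type="chapter" n="1">
      <head>I</head>
      <p><s>Bil je jasen, mrzel aprilski dan in ure so bile trinajst.</s></p>
    </div>
  </div>
</body>
"""

# Assumption: the elements that receive an id, inferred from the example.
ID_TAGS = {"body", "div", "p", "s", "quote", "poem", "l"}


def assign_ids(elem, prefix):
    """Give elem the id `prefix` and recurse, numbering all children from 1."""
    if elem.tag in ID_TAGS:
        elem.set("id", prefix)
    # Every child element counts towards the position, even if it gets no id.
    for position, child in enumerate(elem, start=1):
        assign_ids(child, f"{prefix}.{position}")


root = ET.fromstring(SAMPLE.strip())
assign_ids(root, "O" + root.get("lang"))
print([e.get("id") for e in root.iter() if e.get("id")])
# ['Osl', 'Osl.1', 'Osl.1.2', 'Osl.1.2.2', 'Osl.1.2.2.1']
```

Counting every child, including the <head> elements that receive no id, is what makes the first chapter Osl.1.2 rather than Osl.1.1, in agreement with the example.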