
English

  COP project 106 MULTEXT-East ``1984'', English

Contributors: Nancy Ide (CNRS/Vassar) and Tomaz Erjavec (IJS)

Description of the Corpus

The digital source of the English original of ``1984'' was obtained from the European Corpus Initiative Multilingual Corpus 1 (ECI-MC1) CD-ROM, as the sub-component men08 of its mul08 part, which also contains Slovene, Croatian and Serbian translations. The source for the ECI-MC1 versions of ``1984'' was, in turn, the Oxford Text Archive (OTA), which had already produced an SGML encoding of the text; that encoding, however, was not TEI-compliant.

Below we reproduce the content of the editorial file of the English ``1984'' on the ECI-MC1, mul08/men08/eng18.edt:

For the OTA, all four versions were prepared by the OUCS KDEM service in 1985 for Dr David C Bennett of the School of Oriental And African Studies at London University. The OTA has copies of letters from all the publishers concerned which appear to indicate that academic use of the text is permitted. A.M. Heath (the agents for Orwell's estate) state that the purposes ``must be solely for the use of research and not for sale in any way''. We have been unable to make contact with Dr Bennett: the texts here have not been encoded or proofread in any way since they were produced (other than the English text, which was converted to an SGML like encoding by John Price-Wilkin, and subsequently automatically converted to conform to the OTA's dtd by myself and Alan Morrison.

(ECI: LB, Nov 1992)

The above would seem to indicate that the OTA digital source for the English ``1984'' has the same distribution restrictions as those imposed by the MULTEXT-East project itself, namely, that the use of the materials is free for academic purposes. This, however, does not apply to the ECI edition of the corpus, which in its copyright statement explicitly forbids further distribution of its corpus.

As computed by the Unix program wc over the whole CES-1 document, the English ``1984'' has 109423 words.
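Because the count is taken over the whole marked-up document, tags and entities are counted as words too. The following is a minimal sketch of a wc-style count, assuming that splitting on whitespace approximates `wc -w`; the function name `wc_words` and the sample string are illustrative and not part of the corpus tools.

```python
def wc_words(text: str) -> int:
    """Count whitespace-delimited tokens, roughly as `wc -w` does."""
    return len(text.split())


# Illustrative fragment in the style of the corpus markup (not the full text).
sample = "<p>\n<name type=org>Ministry of Truth</name>,\n&mdash;\n</p>"

# Tags and entities that are set off by whitespace count as tokens,
# so the figure is higher than the number of words in the running text.
print(wc_words(sample))
```

The reported figure is therefore best read as an upper bound on the number of text words, not as an exact token count.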

Structure of the Corpus

The English ``1984'' corpus body consists of four <div type=part> elements. Each part (except <div type=part n=APPENDIX>) is further subdivided into a number of <div type=chapter> elements. In the English version, only the appendix is followed by a <head>.

The text is segmented into paragraphs, with the <head>, <quote>, <note>, and <poem> elements marked up at the paragraph level.

Sub-paragraph tagging consists of <hi>, <q>, <mentioned>, <name>, and <term>. The tag <mentioned> is reserved for Newspeak words, while <term> gives the description of such Newspeak words. This is possibly an abuse of the CES, but the CES currently does not provide the <gloss> tag, which was used in the digital source for the descriptions of Newspeak words.

The first chapter is additionally marked up for <date>, <foreign>, and <num>.

Rendering information, where given, is recorded as a descriptive value (e.g. italics) of the rend attribute. Rendering has, except for capitalisation, been removed from the tag content.

The following is an example from the English ``1984'' corpus:

<p>
<name type=org>Ministry of Truth</name>, 
&mdash;  
<name type=org lang=ns>Minitrue</name>, 
in <name>Newspeak</name><ptr target=N1 rend=asterisk>
&mdash; was
startlingly different from any other object in sight. It was
an enormous pyramidal structure of glittering white
concrete, soaring up, terrace after terrace, <num>300</num> metres 
into the air. From where
<name type=person>Winston</name>
stood it was just possible to
read, picked out on its white face in elegant lettering, the
three slogans of the
<name type=org>Party</name>:
<q rend="centered caps" type=slogan>War is peace</q>
<q rend="centered caps" type=slogan>Freedom is slavery</q>
<q rend="centered caps" type=slogan>Ignorance is strength.</q>
</p>
<note place=foot id=N1>
<name>Newspeak</name> was the official language of 
<name type=place>Oceania</name>. 
For an account of its structure and etymology see Appendix.</note>
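The sample uses unquoted attribute values and an unclosed <ptr> tag, both of which a strict XML parser would reject. As a hedged sketch, Python's lenient `html.parser` module can be used to pull the <name> elements out of such a fragment. The class name `NameCollector` is hypothetical, and this approach performs no validation against the CES DTD.

```python
from html.parser import HTMLParser


class NameCollector(HTMLParser):
    """Collect (type, text) pairs for every <name> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._type = None
        self._buf = None  # None while outside a <name> element

    def handle_starttag(self, tag, attrs):
        if tag == "name":
            # attrs is a list of (attribute, value) pairs; type may be absent.
            self._type = dict(attrs).get("type")
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "name" and self._buf is not None:
            self.names.append((self._type, "".join(self._buf).strip()))
            self._buf = None


# Shortened copy of the corpus sample shown above.
sample = """<p>
<name type=org>Ministry of Truth</name>,
&mdash;
<name type=org lang=ns>Minitrue</name>,
in <name>Newspeak</name><ptr target=N1 rend=asterisk>
&mdash; was startlingly different from any other object in sight.
</p>"""

collector = NameCollector()
collector.feed(sample)
collector.close()
for name_type, text in collector.names:
    print(name_type, text)
```

A <name> without a type attribute, such as Newspeak here, is reported with the type `None`.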

Structure of the Original

The ECI version which was used as the digital source was encoded in a TEI-compliant DTD, but not proofread. For a history of its evolution, see the introduction at the beginning of this section. The following is an example from the ECI mul08/mul08a.eci:

The Ministry of Truth &mdash;  Minitrue, in Newspeak* &mdash;  was
startlingly different from any other object in sight. It was
an enormous pyramidal structure of glittering white
concrete, soaring up, terrace after terrace, 300 metres into
the air. From where Winston stood it was just possible to
read, picked out on its white face in elegant lettering, the
three slogans of the Party:

<q>
WAR IS PEACE
</q>

<q>
FREEDOM IS SLAVERY
</q>

<q>
IGNORANCE IS STRENGTH
</q>

Markup Process

The ECI version which was used as the digital source was manually converted to CES-1 compliance. Original OCR mistakes were corrected with a spelling checker. The first chapter has been additionally marked up.
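The conversion itself was done by hand, so the following is only an illustration of the mapping visible in the two samples above, not the procedure used by the project. It rewrites an ECI-style <q> block into the CES form, assuming (for this sketch only) that every <q> block is a slogan; the function name `eci_q_to_ces` is hypothetical.

```python
import re

# An ECI-style quotation: <q>, the text on its own line(s), then </q>.
Q_BLOCK = re.compile(r"<q>\s*(.*?)\s*</q>", re.DOTALL)


def eci_q_to_ces(text: str) -> str:
    """Move capitalisation into the rend attribute and tag the slogan type."""

    def repl(match: re.Match) -> str:
        # "WAR IS PEACE" -> "War is peace"; the caps are recorded in rend.
        slogan = match.group(1).capitalize()
        return f'<q rend="centered caps" type=slogan>{slogan}</q>'

    return Q_BLOCK.sub(repl, text)


eci = "<q>\nWAR IS PEACE\n</q>\n\n<q>\nFREEDOM IS SLAVERY\n</q>\n"
print(eci_q_to_ces(eci))
```

In the real text a <q> element may hold ordinary quoted speech, which is one reason a manual pass was needed to assign type and rend values.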






Tomaz Erjavec
Sat May 18 20:25:31 MDT 1996