Contributors: Nancy Ide, Greg Priest-Dorman (CNRS/Vassar), Tomaz Erjavec (IJS)
The digital source of the English original of ``1984'' was obtained from the European Corpus Initiative Multilingual Corpus 1 (ECI-MC1) CD-ROM (as the sub-component men08 of its mul08 part), which also contains Slovene, Croatian and Serbian translations. The source for the ECI-MC1 versions of ``1984'' was, in turn, the Oxford Text Archive (OTA), which had already produced an SGML encoding of the text; that encoding, however, was not TEI-compliant.
Below we reproduce the content of the editorial file of the English ``1984'' on the ECI-MC1, mul08/men08/eng18.edt:
For the OTA, all four versions were prepared by the OUCS KDEM service in 1985 for Dr David C Bennett of the School of Oriental And African Studies at London University. The OTA has copies of letters from all the publishers concerned which appear to indicate that academic use of the text is permitted. A.M. Heath (the agents for Orwell's estate) state that the purposes ``must be solely for the use of research and not for sale in any way''. We have been unable to make contact with Dr Bennett: the texts here have not been encoded or proofread in any way since they were produced (other than the English text, which was converted to an SGML like encoding by John Price-Wilkin, and subsequently automatically converted to conform to the OTA's dtd by myself and Alan Morrison.
(ECI: LB, Nov 1992)
The above would seem to indicate that the OTA digital source for the English ``1984'' has the same distribution restrictions as those imposed by the MULTEXT-East project itself, namely, that the use of the materials is free for academic purposes. This, however, does not apply to the ECI edition of the corpus, which in its copyright statement explicitly forbids further distribution of its corpus.
The English MULTEXT-East Orwell has 91619 words.
The English ``1984'' corpus body consists of four <div type=part> elements. Each part (except <div type=part n=APPENDIX>) is further subdivided into a number of <div type=chapter> elements. In the English version, only the appendix <div> is followed by a <head>.
The text is segmented into paragraphs, with the <head>, <quote>, <note>, and <poem> elements marked up at the paragraph level.
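The division structure described above can be checked mechanically. The following is a minimal Python sketch, assuming the corpus body is available as a string; the function name count_divisions and the miniature sample are hypothetical and are not taken from the corpus files. A regular expression is used rather than an XML parser because the SGML attribute values are not always quoted.

```python
import re
from collections import Counter

# Opening <div ...> tag; captures the value of its type attribute,
# whether or not the value is quoted (SGML allows type=part).
DIV_RE = re.compile(r'<div\b[^>]*\btype\s*=\s*"?([A-Za-z]+)"?[^>]*>', re.IGNORECASE)


def count_divisions(sgml_text):
    """Return a Counter mapping each div type (lower-cased) to its frequency."""
    return Counter(m.group(1).lower() for m in DIV_RE.finditer(sgml_text))


# Hypothetical miniature body, for illustration only.
sample = """
<div type=part n=1>
<div type=chapter n=1><p>...</p></div>
<div type=chapter n=2><p>...</p></div>
</div>
<div type=part n=APPENDIX>
<head>...</head>
<p>...</p>
</div>
"""

counts = count_divisions(sample)
print(counts["part"], counts["chapter"])
```

On the real corpus body, counts["part"] would be expected to equal four if the description above holds.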
Sub-paragraph tagging consists of <hi>, <q>, <mentioned>, <name>, and <term>. The tag <mentioned> is reserved for Newspeak words, while <term> gives the description of such Newspeak words. This is possibly an abuse of the CES, but the CES currently does not provide the <gloss> tag, which was used in the digital source for the descriptions of Newspeak words.
The first chapter is additionally marked up for <date>, <foreign>, and <num>.
Rendering information, where given, is recorded as a descriptive value (e.g. italics) of the rend attribute. Rendering has, except for capitalisation, been removed from the tag content.
The following is an example from the English ``1984'' corpus:
<s id="Oen.220.127.116.11"><name type=org>Ministry of Truth</name>,
<name type=org lang=ns>Minitrue</name>,
<ptr id="Oen.18.104.22.168.4" target="Oen.1.1.8" rend=asterisk>
— was startlingly different from any other object in sight.</s>
was an enormous pyramidal structure of glittering white concrete,
soaring up, terrace after terrace,
metres into the air.</s>
<s id="Oen.22.214.171.124">From where
stood it was just possible to read, picked out on its white face in
elegant lettering, the three slogans of the
<q id="Oen.126.96.36.199.3" rend="CE CA" type=slogan>War is peace</q>
<q id="Oen.188.8.131.52.4" rend="CE CA" type=slogan>Freedom is slavery</q>
<q id="Oen.184.108.40.206.5" rend="CE CA" type=slogan>Ignorance is strength.</q></s>
<note id="Oen.1.1.8" place=foot>
was the official language of
For an account of its structure and etymology see Appendix.
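The excerpt above mixes unquoted attribute values (type=org) with quoted ones (rend="CE CA"), so generic XML tooling will reject it. The following is a minimal Python sketch of how such sub-paragraph elements might be read; the function parse_elements is hypothetical, handles only elements without nested tags, and the sample reuses three elements from the excerpt with their id attributes omitted.

```python
import re

# An element with no nested tags: <tag attrs>content</tag>
ELEMENT_RE = re.compile(r'<(\w+)([^<>]*)>([^<]*)</\1>')
# One attribute: name=value, where value is either quoted or a bare token.
ATTR_RE = re.compile(r'(\w+)\s*=\s*(?:"([^"]*)"|([^\s">]+))')


def parse_elements(text):
    """Return a list of (tag, attribute dict, content) for flat elements."""
    elements = []
    for tag, raw_attrs, content in ELEMENT_RE.findall(text):
        # Exactly one of the two value groups is non-empty per attribute.
        attrs = {name: quoted or bare
                 for name, quoted, bare in ATTR_RE.findall(raw_attrs)}
        elements.append((tag, attrs, content.strip()))
    return elements


sample = (
    '<name type=org>Ministry of Truth</name>, '
    '<name type=org lang=ns>Minitrue</name> '
    '<q rend="CE CA" type=slogan>War is peace</q>'
)

elements = parse_elements(sample)
for tag, attrs, content in elements:
    print(tag, attrs, content)
```

The sketch only inventories tags and attributes; it does not validate the text against the CES DTD.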
The ECI version which was used as the digital source was encoded in a TEI-compliant DTD, but not proofread. For a history of its evolution, see the introduction at the beginning of this section. The following is an example from the ECI mul08/mul08a.eci:
<p>The Ministry of Truth — Minitrue, in Newspeak* — was
startlingly different from any other object in sight. It was
an enormous pyramidal structure of glittering white
concrete, soaring up, terrace after terrace, 300 metres into
the air. From where Winston stood it was just possible to
read, picked out on its white face in elegant lettering, the
three slogans of the Party:
<q>WAR IS PEACE </q>
<q>FREEDOM IS SLAVERY </q>
<q>IGNORANCE IS STRENGTH</q> The Ministry of Truth contained,
The ECI version which was used as the digital source was manually converted to CES-1 compliance. Original OCR mistakes were corrected with a spelling checker. The first chapter has been additionally marked up.