Contributors: Heiki-Jaan Kaalep, Viire Villandi and Heili Orav
No digital version of the Estonian translation of ``1984'' was available, so the book had to be typed in. Written permission was obtained from the publisher and the translator of ``1984'' to distribute it freely in any form, provided that no commercial benefit is sought, and to use it for research and academic purposes without restriction.
The Estonian version of ``1984'' contains 79334 words, as indicated in the header of the tagged version.
The body of the Estonian ``1984'' corpus consists of three parts, <div type=part n=1, 2, 3>, and one appendix, <div type=part n=appendix>. Each part is further subdivided into a number of chapters, <div type=chapter n=1, 2, ...>. In the Estonian version, each <div> is followed by a <head> giving the part or chapter number as it appears in the original text. Chapter numbering restarts from 1 in every part.
All the tagging was done by hand.
The text is marked up to the level of sentences: <p> and <quote>, plus the sub-paragraph element <q>, including broken <q>s. Some other sub-paragraph elements are also marked.
<body>, <div>, <head>, <item>, <l>, <list>, <note>, <p>, <poem>, <ptr>, <quote>, <text> are used so as to be consistent with the English electronic version of ``1984'' for MULTEXT-EAST v. 4; any differences are due only to differences between the English electronic version and the Estonian printed version.
<abbr>, <date>, <foreign>, <hi>, <mentioned>, <name>, <num>, <q>, <title> are used sloppily.
The rend (rendering information) attribute has in most cases been included with the appropriate tags. The possible values for the rend attribute are: CA, IT, , smallpoint, PRE ldquor, POST ldquo, PRE ldquor POST ldquo, PRE rsquor POST rsquor.
The following is an example from the Estonian ``1984'' corpus:
<ptr target=oet.N1 rend=asterisk>
— erines rabavalt kõigest muust, mida oli näha.
See oli tohutu kiiskavvalgest betoonist püramiidne ehitis, mis kerkis
seisis, seletas silm veel parajasti valgel seinal elegantses kirjas
<q id="Oet.184.108.40.206.3" rend=CA type=slogan>
Sõda on rahu
<q id="Oet.220.127.116.11.4" rend=CA type=slogan>
Vabadus on orjus
<q id="Oet.18.104.22.168.5" rend=CA type=slogan>
Teadmatus on jõud
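The example above mixes quoted attribute values (id) with unquoted ones (rend, type, target), which a strict XML parser would reject. As a minimal sketch, assuming only that start tags have the shape seen in this example, the element names and attributes could be listed as follows. The names start_tags, START_TAG and ATTRIBUTE, and the id value x.1 in the sample, are hypothetical and are not part of the corpus or of any CES tool.

```python
import re

# Assumption: a start tag is "<name attr=value ...>", where a value is
# either double-quoted or a bare token without whitespace or ">".
START_TAG = re.compile(r'<(\w+)((?:\s+[\w.-]+=(?:"[^"]*"|[^\s>]+))*)\s*>')
ATTRIBUTE = re.compile(r'([\w.-]+)=(?:"([^"]*)"|([^\s>]+))')


def start_tags(sgml):
    """Return a list of (element name, attribute dict), one per start tag."""
    found = []
    for tag in START_TAG.finditer(sgml):
        attrs = {}
        for attr in ATTRIBUTE.finditer(tag.group(2)):
            name, quoted, bare = attr.groups()
            # Exactly one of quoted/bare is set for each attribute.
            attrs[name] = quoted if quoted is not None else bare
        found.append((tag.group(1), attrs))
    return found


# Hypothetical sample modelled on the example above (the id is made up).
sample = (
    '<ptr target=oet.N1 rend=asterisk>\n'
    '<q id="x.1" rend=CA type=slogan>\n'
    'Sõda on rahu\n'
)
for name, attrs in start_tags(sample):
    print(name, attrs)
```

End tags such as </note> are deliberately skipped. rend values that contain spaces, such as PRE ldquor POST ldquo, are only captured whole by this sketch when they are quoted.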
The original was typed in using a text editor under DOS. The Estonian
characters were typed using codes from extended ASCII. Some mark-up
was already included during the typing phase. An example from the
typed-in text follows:
T=F4eministeerium - uuskeeles <note place=foot>Uuskeel oli
Okeaania ametlik keel. Selle struktuuri ja et=FCmoloogia kohta vt.
Lisa.</note> T=F4min - erines rabavalt k=F4igest muust, mida oli
n=E4ha. See oli tohutu kiiskavvalgest betoonist p=FCramiidne ehitis,
mis kerkis astanguliselt 300 meetri k=F4rgusele. Sealt, kus
Winston seisis, seletas silm veel parajasti valgel seinal
elegantses kirjas ilutsevat Partei kolme loosungit:
SO^DA ON RAHU
VABADUS ON ORJUS
TEADMATUS ON JO^UD
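In the example above, the Estonian letters appear as =XX hexadecimal escapes. The following is a minimal sketch of how such escapes could be turned back into characters. It rests on two assumptions that the text does not state: that the byte values are read as ISO 8859-1 (so that F4 is o-circumflex, FC is u-umlaut and E4 is a-umlaut), and that o-circumflex stands in for the Estonian o-tilde, as the spelling SO^DA suggests. The names decode_typed and HEX_ESCAPE are hypothetical.

```python
import re

# "=" followed by exactly two uppercase hex digits, e.g. "=F4".
HEX_ESCAPE = re.compile(r"=([0-9A-F]{2})")


def decode_typed(text, o_circumflex_as_o_tilde=True):
    """Replace =XX escapes with the character for byte XX.

    Assumption: the bytes are ISO 8859-1 (Latin-1) code points.
    Assumption: o-circumflex was typed in place of o-tilde.
    """
    decoded = HEX_ESCAPE.sub(
        lambda m: bytes([int(m.group(1), 16)]).decode("latin-1"), text
    )
    if o_circumflex_as_o_tilde:
        decoded = decoded.replace("ô", "õ").replace("Ô", "Õ")
    return decoded


line = "n=E4ha. See oli tohutu kiiskavvalgest betoonist p=FCramiidne ehitis,"
print(decode_typed(line))
```

The pattern would also match an unquoted attribute value made of two uppercase hex digits, such as rend=CA, so in this sketch tags would have to be protected before decoding text that already carries such mark-up.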
The electronic version of ``1984'' was proofread and marked up to CES1 conformance. The whole Estonian ``1984'' CES1 corpus was cross-checked with the printed edition, and the printed edition was used to insert additional (e.g. <hi>) markup. In the process, a number of typographical errors were discovered in the electronic version of ``1984''.