COP project 106 MULTEXT-East ``1984'', Estonian
Contributors: Heiki-Jaan Kaalep, Viire Villandi and Heili Orav
Since no digital version of the Estonian translation of "1984" was available, the book had to be typed in. Written permission was obtained from the publisher and the translator of "1984" to distribute it freely in any form, provided that no commercial benefit is sought, and to use it for research and academic purposes without restriction.
The Estonian version of "1984" contains 79439 words, as indicated in the header of the tagged version.
The Estonian ``1984'' corpus body consists of three <div type=part n=1, 2, 3> elements and one <div type=part n=appendix> element. Each part is further subdivided into a number of <div type=chapter n=1, 2, ...> elements. In the Estonian version, each <div> is followed by a <head> , giving the part or chapter number as it appears in the original text. Chapter numbering restarts from 1 in every part.
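The part and chapter layout described above can be checked mechanically. The following is only a minimal sketch: the function names and the sample string are hypothetical, and the regular expression assumes that the type attribute comes before n and that attribute values may or may not be quoted.

```python
import re

# Matches <div type=part n=1> as well as <div type="chapter" n="2">.
# Assumption: type precedes n; quotes around values are optional.
DIV_RE = re.compile(r'<div\s+type="?(part|chapter)"?\s+n="?([^\s">]+)"?\s*>')

def chapters_per_part(markup):
    """Map each part label to the list of chapter labels found inside it."""
    layout = {}
    current = None
    for kind, n in DIV_RE.findall(markup):
        if kind == "part":
            current = n
            layout[current] = []
        elif current is not None:
            layout[current].append(n)
    return layout

def numbering_restarts(layout):
    """True if every part numbers its chapters 1, 2, 3, ... in order."""
    return all(
        chapters == [str(i) for i in range(1, len(chapters) + 1)]
        for chapters in layout.values()
    )

# Hypothetical, heavily shortened skeleton (not taken from the corpus).
sample = (
    "<div type=part n=1> <div type=chapter n=1> </div> "
    "<div type=chapter n=2> </div> </div> "
    "<div type=part n=2> <div type=chapter n=1> </div> </div> "
    "<div type=part n=appendix> </div>"
)
layout = chapters_per_part(sample)
print(layout)
print(numbering_restarts(layout))
```

A part without chapters, such as the appendix in this sample, passes the check trivially.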
The text is segmented into paragraphs, with the <head> , <quote> , <note> , <poem> and <title> elements marked up at the paragraph level.
Sub-paragraph tagging consists of <hi> , <q> and <name> . All the tagging was done by hand.
Rendering information has in most cases been included with the appropriate tags. The possible values of the rend attribute are: caps , italic , italics , smallpoint , PRE ldquor POST ldquo , PRE rsquor POST rsquor .
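Because the set of rend values is closed, markup can be scanned for values outside it. The sketch below is illustrative only: the helper name and the sample string are hypothetical, the <hi rend="bold"> element is invented to trigger a mismatch, and only quoted attribute values are handled.

```python
import re

# The rend values listed in the text above.
KNOWN_REND = {
    "caps",
    "italic",
    "italics",
    "smallpoint",
    "PRE ldquor POST ldquo",
    "PRE rsquor POST rsquor",
}

# Matches rend="..." inside a tag; unquoted values are not handled.
REND_RE = re.compile(r'\brend="([^"]*)"')

def unknown_rend_values(markup: str) -> set:
    """Return the rend values in `markup` that are not in KNOWN_REND."""
    return {value for value in REND_RE.findall(markup) if value not in KNOWN_REND}

# Hypothetical sample; the second element is invented for illustration.
sample = '<q rend="caps" type="slogan">Sõda on rahu</q> <hi rend="bold">x</hi>'
print(unknown_rend_values(sample))
```

The list contains both italic and italics, so a check like this treats the two as distinct valid values rather than normalizing one into the other.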
The following is an example from the Estonian ``1984'' corpus:
<p> <name type="org">Tõeministeerium</name> — uuskeeles <ptr target="N1"> <foreign lang="ns">Tõmin</foreign> — erines rabavalt kõigest muust, mida oli näha. See oli tohutu kiiskavvalgest betoonist püramiidne ehitis, mis kerkis astanguliselt 300 meetri kõrgusele. Sealt, kus <name type="person">Winston</name> seisis, seletas silm veel parajasti valgel seinal elegantses kirjas ilutsevat <name type="org">Partei</name> kolme loosungit: <q rend="caps" type="slogan">Sõda on rahu</q> <q rend="caps" type="slogan">Vabadus on orjus</q> <q rend="caps" type="slogan">Teadmatus on jõud</q> </p>
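As an illustration of how the sub-paragraph tagging in this example might be used, the sketch below pulls out the <name> elements grouped by type, and the slogans marked with <q type="slogan">. It relies on regular expressions rather than an XML parser because the fragment contains an unclosed <ptr> tag. The function names are hypothetical, and the fragment is an abbreviated copy of the example above, with "..." standing for omitted text.

```python
import re
from collections import defaultdict

NAME_RE = re.compile(r'<name type="([^"]+)">(.*?)</name>', re.DOTALL)
SLOGAN_RE = re.compile(r'<q\b[^>]*\btype="slogan"[^>]*>(.*?)</q>', re.DOTALL)

def names_by_type(markup):
    """Group the text of <name> elements by their type attribute."""
    grouped = defaultdict(list)
    for name_type, text in NAME_RE.findall(markup):
        grouped[name_type].append(text)
    return dict(grouped)

def slogans(markup):
    """Return the text of every <q> element whose type is "slogan"."""
    return SLOGAN_RE.findall(markup)

# Abbreviated copy of the example paragraph.
fragment = (
    '<p> <name type="org">Tõeministeerium</name> ... '
    '<name type="person">Winston</name> ... '
    '<name type="org">Partei</name> kolme loosungit: '
    '<q rend="caps" type="slogan">Sõda on rahu</q> '
    '<q rend="caps" type="slogan">Vabadus on orjus</q> '
    '<q rend="caps" type="slogan">Teadmatus on jõud</q> </p>'
)
print(names_by_type(fragment))
print(slogans(fragment))
```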
The original was typed in using a text editor under DOS. The Estonian characters were entered using extended ASCII codes. Some markup was already included during the typing phase. An example from the original follows:
T=F4eministeerium - uuskeeles <note place=foot>Uuskeel oli Okeaania ametlik keel. Selle struktuuri ja et=FCmoloogia kohta vt. Lisa.</note> T=F4min - erines rabavalt k=F4igest muust, mida oli n=E4ha. See oli tohutu kiiskavvalgest betoonist p=FCramiidne ehitis, mis kerkis astanguliselt 300 meetri k=F4rgusele. Sealt, kus Winston seisis, seletas silm veel parajasti valgel seinal elegantses kirjas ilutsevat Partei kolme loosungit: SO^DA ON RAHU VABADUS ON ORJUS TEADMATUS ON JO^UD
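The =E4, =FC and =F4 sequences in this example look like hexadecimal byte escapes. The sketch below shows one way such escapes could be expanded. It rests on assumptions that the source does not state: that each =XX stands for a single byte, and that the byte can be read as ISO 8859-1. The function name is hypothetical.

```python
import re

# One "=" followed by two uppercase hex digits, as in "=F4" or "=FC".
ESCAPE_RE = re.compile(r"=([0-9A-F]{2})")

def decode_escapes(text, encoding="latin-1"):
    """Replace each =XX escape with the character byte XX has in `encoding`."""
    def to_char(match):
        return bytes([int(match.group(1), 16)]).decode(encoding)
    return ESCAPE_RE.sub(to_char, text)

print(decode_escapes("mida oli n=E4ha"))
print(decode_escapes("p=FCramiidne ehitis"))
# Attribute syntax such as place=foot is left alone: "fo" is not an escape.
print(decode_escapes("<note place=foot>"))
```

Under this assumption =E4 and =FC come out as ä and ü, but =F4 comes out as ô, whereas the tagged example spells the same words with õ. The sketch does not attempt to resolve that difference; the encoding parameter is there so that another single-byte code page can be tried.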
The electronic version of "1984" was proofread and marked up to CES1 conformance. The whole Estonian ``1984'' CES1 corpus was cross-checked against the printed edition, and the printed edition was used to insert additional (e.g. <hi> ) markup. In the process, a number of typographical errors were discovered in the electronic version of ``1984''.