Estonian

 COP project 106 MULTEXT-East Deliverable D2.1 F ``1984'', Estonian

Contributors: Heiki-Jaan Kaalep, Viire Villandi and Heili Orav

Description of the Corpus

No digital version of the Estonian translation of ``1984'' was available, so the book had to be typed in. Written permission was obtained from the publisher and the translator of ``1984'' to distribute the text freely in any form, provided that no commercial benefit is sought, and to use it for research and academic purposes without restrictions.

The Estonian version of ``1984'' contains 79334 words, as indicated in the header of the tagged version.

Structure of the Corpus

The body of the Estonian ``1984'' corpus consists of three <div type=part n=1, 2, 3> elements and one <div type=part n=appendix> element. Each part is further subdivided into a number of <div type=chapter n=1, 2, ...> elements. In the Estonian version, each <div> is followed by a <head> giving the part or chapter number as it appears in the original text. Chapter numbering restarts from 1 in every part.

All the tagging was done by hand.

The text is marked up down to the sentence level: <p> and <quote>, plus the sub-paragraph element <q>, including broken <q>s. Particular sub-paragraph elements are also marked in some cases.

<body>, <div>, <head>, <item>, <l>, <list>, <note>, <p>, <poem>, <ptr>, <quote>, <text> are used so as to be consistent with the English electronic version of ``1984'' for MULTEXT-EAST v. 4; any differences are due only to differences between the English electronic version and the Estonian printed version.

<abbr>, <date>, <foreign>, <hi>, <mentioned>, <name>, <num>, <q>, <title> are used less systematically.

In most cases a rendering attribute has been included with the appropriate tags. The possible values of the rend attribute are: CA, IT, smallpoint, PRE ldquor, POST ldquo, PRE ldquor POST ldquo, PRE rsquor POST rsquor.

The following is an example from the Estonian ``1984'' corpus:

<p id="Oet.1.2.7">
<s id="Oet.1.2.7.1">
<name type=org>
T&otilde;eministeerium
</name>
 &mdash;  
<name type=language>
uuskeeles
</name>
<ptr target=oet.N1 rend=asterisk>
<name type=org>
T&otilde;min
</name>
 &mdash;  erines rabavalt k&otilde;igest muust, mida oli n&auml;ha.
</s>
<s id="Oet.1.2.7.2">
See oli tohutu kiiskavvalgest betoonist p&uuml;ramiidne ehitis, mis kerkis
astanguliselt  
<num>
300
</num>
  meetri k&otilde;rgusele.
</s>
<s id="Oet.1.2.7.3">
Sealt, kus 
<name type=person>
Winston
</name>
  seisis, seletas silm veel parajasti valgel seinal elegantses kirjas
ilutsevat  
<name type=org>
Partei
</name>
 kolme loosungit: 
<q id="Oet.1.2.7.3.3" rend=CA type=slogan>
S&otilde;da on rahu
</q>
<q id="Oet.1.2.7.3.4" rend=CA type=slogan>
Vabadus on orjus
</q>
<q id="Oet.1.2.7.3.5" rend=CA type=slogan>
Teadmatus on j&otilde;ud
</q>
</s>
</p>
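
To illustrate how a fragment like the one above might be read programmatically, here is a minimal Python sketch that pulls out each <s> element with its id and reduces it to plain text. It is not part of the corpus tooling: the function name `sentences` and the `SAMPLE` string are hypothetical, and the sketch assumes that a regular expression is adequate for such small fragments (the markup is SGML with unquoted attributes and an empty <ptr>, so an XML parser would reject it) and that the character entities used here match the standard HTML named entities.

```python
import html
import re

# One sentence copied from the example above (SGML, not well-formed XML).
SAMPLE = """<s id="Oet.1.2.7.2">
See oli tohutu kiiskavvalgest betoonist p&uuml;ramiidne ehitis, mis kerkis
astanguliselt
<num>
300
</num>
  meetri k&otilde;rgusele.
</s>"""


def sentences(sgml):
    """Yield (id, plain_text) for every <s> element in the fragment."""
    for match in re.finditer(r'<s id="([^"]+)">(.*?)</s>', sgml, flags=re.S):
        # Replace inner tags such as <num> or <name type=org> with a space.
        without_tags = re.sub(r"<[^>]+>", " ", match.group(2))
        # Resolve entities such as &otilde; (assumed to be standard HTML names).
        text = html.unescape(without_tags)
        # Collapse line breaks and runs of spaces into single spaces.
        yield match.group(1), " ".join(text.split())


for sentence_id, text in sentences(SAMPLE):
    print(sentence_id, text)
```

Because tags are replaced by spaces, a tag that directly precedes punctuation would leave a stray space before it; a fuller treatment would need to handle that case.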

Structure of the Original

The original was typed in using a text editor under DOS. The Estonian characters were typed using codes from extended ASCII. Some markup was already included during the typing phase. An example from the original follows:

T=F4eministeerium - uuskeeles <note place=foot>Uuskeel oli
Okeaania ametlik keel. Selle struktuuri ja et=FCmoloogia kohta vt.
Lisa.</note> T=F4min - erines rabavalt k=F4igest muust, mida oli
n=E4ha. See oli tohutu kiiskavvalgest betoonist p=FCramiidne ehitis,
mis kerkis astanguliselt 300 meetri k=F4rgusele. Sealt, kus
Winston seisis, seletas silm veel parajasti valgel seinal
elegantses kirjas ilutsevat Partei kolme loosungit:
 
        SO^DA ON RAHU
 
        VABADUS ON ORJUS
 
        TEADMATUS ON JO^UD
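
The `=XX` sequences in the sample above look like hexadecimal character codes. The following Python sketch shows one way they could be turned back into Estonian letters. It rests on two assumptions that the source does not state: that each code is a byte value read as ISO 8859-1 (Latin-1), and that code F4 (o with circumflex in Latin-1) stands in for the Estonian o with tilde, as a comparison of `T=F4eministeerium` with `T&otilde;eministeerium` in the marked-up version suggests. The names `STAND_INS` and `decode_codes` are hypothetical.

```python
import re

# Assumption: o-circumflex was typed where Estonian needs o-tilde.
STAND_INS = {"\u00f4": "\u00f5", "\u00d4": "\u00d5"}


def decode_codes(text):
    """Replace =XX (two uppercase hex digits) with the character it encodes."""
    def replace(match):
        # Assumption: the byte value is interpreted as Latin-1.
        char = bytes([int(match.group(1), 16)]).decode("latin-1")
        return STAND_INS.get(char, char)

    # Only uppercase hex pairs are matched, so markup such as
    # <note place=foot> is left untouched.
    return re.sub(r"=([0-9A-F]{2})", replace, text)


print(decode_codes("T=F4eministeerium - uuskeeles <note place=foot>"))
```

The ASCII spellings in the capitalised slogans (`SO^DA`, `JO^UD`) are left as they are, since the sketch only handles the `=XX` form.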

Markup Process

The electronic version of ``1984'' was proofread and marked up to CES1 conformance. The whole Estonian ``1984'' CES1 corpus was cross-checked against the printed edition, and the printed edition was used to insert additional markup (e.g. <hi>). In the process, a number of typographical errors were discovered in the electronic version of ``1984''.

