
Estonian

  COP project 106 MULTEXT-East "1984", Estonian

Contributors: Heiki-Jaan Kaalep, Viire Villandi and Heili Orav

Description of the Corpus

No digital version of the Estonian translation of "1984" was available, so the book had to be typed in. Written permission was obtained from the publisher and the translator of "1984" to distribute the text freely in any form, provided that no commercial benefit is sought, and to use it for research and academic purposes without restriction.

The Estonian version of "1984" contains 79439 words, as indicated in the header of the tagged version.

Structure of the Corpus

The body of the Estonian "1984" corpus consists of three <div type=part n=1, 2, 3> elements and one <div type=part n=appendix> element. Each part is further subdivided into a number of <div type=chapter n=1, 2, ...> elements. In the Estonian version, each <div> is followed by a <head> giving the part or chapter number as it appears in the original text. Chapter numbering restarts from 1 in every part.

The text is segmented into paragraphs, with the <head>, <quote>, <note>, <poem> and <title> elements marked up at the paragraph level.

Sub-paragraph tagging consists of <hi>, <q> and <name>. All the tagging was done by hand.

A rendering attribute has in most cases been included with the appropriate tags. The possible values of the rend attribute are: caps, italic, italics, smallpoint, PRE ldquor POST ldquo, PRE rsquor POST rsquor.

The following is an example from the Estonian "1984" corpus:

<p>
<name type="org">T&otilde;eministeerium</name> &mdash; uuskeeles
<ptr target="N1">
<foreign lang="ns">T&otilde;min</foreign> &mdash;
erines rabavalt k&otilde;igest muust, mida oli n&auml;ha.
See oli tohutu kiiskavvalgest betoonist p&uuml;ramiidne ehitis,
mis kerkis astanguliselt 300 meetri k&otilde;rgusele.
Sealt, kus
<name type="person">Winston</name> seisis, seletas silm veel parajasti
valgel seinal
elegantses kirjas ilutsevat <name type="org">Partei</name> kolme loosungit:
<q rend="caps" type="slogan">S&otilde;da on rahu</q>     
<q rend="caps" type="slogan">Vabadus on orjus</q>      
<q rend="caps" type="slogan">Teadmatus on j&otilde;ud</q>
</p>
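
As a rough illustration of how the hand-made <name> tagging shown above might be consumed, the following minimal Python sketch groups the contents of <name> elements by their type attribute. It is not part of the MULTEXT-East tooling: the function name names_by_type and the shortened SAMPLE string are hypothetical, and the pattern assumes quoted attribute values exactly as they appear in the example. A regular expression is used instead of an XML parser because the sample is SGML and contains an unclosed <ptr> tag.

```python
import re
from collections import defaultdict

# Shortened excerpt of the tagged example above (hypothetical test input).
SAMPLE = (
    '<name type="org">T&otilde;eministeerium</name> &mdash; uuskeeles\n'
    '<name type="person">Winston</name> seisis, seletas silm veel\n'
    'ilutsevat <name type="org">Partei</name> kolme loosungit:\n'
)

# Assumes quoted attribute values and non-nested <name> elements,
# as in the example; other layouts would need a real SGML parser.
NAME_RE = re.compile(r'<name type="([^"]+)">(.*?)</name>', re.DOTALL)


def names_by_type(sgml):
    """Return a dict mapping each name type to the list of tagged strings."""
    grouped = defaultdict(list)
    for name_type, content in NAME_RE.findall(sgml):
        grouped[name_type].append(content)
    return dict(grouped)


if __name__ == "__main__":
    for name_type, names in names_by_type(SAMPLE).items():
        print(name_type, names)
```

On the excerpt, the org group holds the two organisation names and the person group holds Winston. Character entities such as &otilde; are left untouched.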

Structure of the Original

The original was typed in using a text editor under DOS. The Estonian characters were typed using extended ASCII codes. Some mark-up was already included during the typing phase. An example from the original follows:

T=F4eministeerium - uuskeeles <note place=foot>Uuskeel oli
Okeaania ametlik keel. Selle struktuuri ja et=FCmoloogia kohta vt.
Lisa.</note> T=F4min - erines rabavalt k=F4igest muust, mida oli
n=E4ha. See oli tohutu kiiskavvalgest betoonist p=FCramiidne ehitis,
mis kerkis astanguliselt 300 meetri k=F4rgusele. Sealt, kus
Winston seisis, seletas silm veel parajasti valgel seinal
elegantses kirjas ilutsevat Partei kolme loosungit:

        SO^DA ON RAHU

        VABADUS ON ORJUS

        TEADMATUS ON JO^UD
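
Comparing this excerpt with the tagged example earlier in the section suggests how the character codes relate to the SGML entities: =F4 corresponds to &otilde;, =FC to &uuml; and =E4 to &auml;. The Python sketch below applies that correspondence. It is only an illustration under that assumption, not the conversion actually used in the project; the table CODE_TO_ENTITY and the function codes_to_entities are hypothetical names, and only the three codes visible in the excerpt are covered.

```python
import re

# Correspondence inferred by comparing the two examples in this section;
# only the codes that occur in the excerpt are listed (an assumption).
CODE_TO_ENTITY = {
    "F4": "&otilde;",
    "FC": "&uuml;",
    "E4": "&auml;",
}

# "=" followed by exactly two hexadecimal digits.
CODE_RE = re.compile(r"=([0-9A-Fa-f]{2})")


def codes_to_entities(text):
    """Replace known =XX codes with SGML entities; leave anything else as-is."""
    def replace(match):
        return CODE_TO_ENTITY.get(match.group(1).upper(), match.group(0))
    return CODE_RE.sub(replace, text)


if __name__ == "__main__":
    print(codes_to_entities("T=F4eministeerium - uuskeeles"))
    print(codes_to_entities("<note place=foot>et=FCmoloogia</note>"))
```

Codes missing from the table are passed through unchanged, and mark-up such as <note place=foot> is not affected because "=fo" is not a pair of hexadecimal digits. The upper-case forms written as O^ in the slogans are not handled by this sketch.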

Markup Process

The electronic version of "1984" was proofread and marked up to CES1 conformance. The whole Estonian "1984" CES1 corpus was cross-checked against the printed edition, and the printed edition was used to insert additional markup (e.g. <hi>). In the process, a number of typographical errors were discovered in the electronic version of "1984".






Tomaz Erjavec
Sat May 18 20:25:31 MDT 1996