  COP project 106 MULTEXT-East ``1984'', Hungarian

Contributors: Csaba Oravecz and László Tihanyi (RIL)

Description of the Corpus

Since no digital source was available as a basis for the CES1-encoded version of the Hungarian translation of ``1984'', the book had to be typed in. Copyright issues regarding the free use of the translation for academic and research purposes have been satisfactorily settled.

The Hungarian version contains 81167 words, as indicated in the header of the encoded corpus.

Structure of the Corpus

The Hungarian corpus of ``1984'' consists of three <div type=part> elements and one <div type=appendix>. Each part is further subdivided into a number of <div type=chapter> elements. In the Hungarian version, each <div type=part> is followed by a <head>, which gives the number of the part spelled out in words, as it appears in the printed edition.

The <div> elements have the n attribute, giving the sequential number of the <div> at its level, and the id attribute, whose value consists of the prefix ORWhu followed by the numbers of the enclosing divisions separated by periods, e.g. <div type=chapter n=2 id=ORWhu.1.2> for the second chapter of the first part.
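The id scheme described above can be illustrated with a minimal Python sketch. The function names make_div_id and parse_div_id are hypothetical and are not part of the corpus tools; the sketch only assumes that an id is the prefix ORWhu followed by one or more period-separated numbers, as in the example ORWhu.1.2.

```python
import re
from typing import List

PREFIX = "ORWhu"  # id prefix given in the corpus description

# Assumed shape of an id: the prefix, then one or more ".<number>" groups.
ID_PATTERN = re.compile(r"^ORWhu((?:\.\d+)+)$")


def make_div_id(*numbers: int) -> str:
    """Build an id from division numbers, e.g. (1, 2) -> 'ORWhu.1.2'."""
    return ".".join([PREFIX] + [str(n) for n in numbers])


def parse_div_id(div_id: str) -> List[int]:
    """Split an id back into its division numbers."""
    match = ID_PATTERN.match(div_id)
    if match is None:
        raise ValueError("not a recognised div id: %r" % div_id)
    # group(1) looks like '.1.2'; drop the empty piece before the first period.
    return [int(part) for part in match.group(1).split(".")[1:]]


print(make_div_id(1, 2))
print(parse_div_id("ORWhu.1.2"))
```

How the appendix is numbered is not stated here, so the sketch makes no assumption about it.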

The text is segmented into paragraphs, with the <quote>, <note>, <poem> and <title> elements marked up at the paragraph level.

Sub-paragraph tagging is represented by <hi>, <q> and <name>. Names have been tagged by hand, and only in the first chapter. Because of this hand tagging, all the tagged names are correct, but outside the first chapter the text contains untagged names. The name tags do, however, carry the type attribute.

Rendering information has in most cases been included with the appropriate tags, with the following possible values: asterisk, italics, caps, centered caps, PRE mdash POST mdash, PRE mdash, PRE ldquor POST rdquor.
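As an illustration of how these rendering values might be read by a program, the following Python sketch splits a value into its parts. It rests on an assumption that is not stated in this description: that PRE and POST each introduce the name of a character entity rendered before or after the element, and that any other tokens form a plain style name. The function parse_rend is hypothetical.

```python
from typing import Dict


def parse_rend(value: str) -> Dict[str, str]:
    """Split a rend value into PRE/POST entity names and a plain style."""
    tokens = value.split()
    result: Dict[str, str] = {}
    plain = []
    i = 0
    while i < len(tokens):
        # Assumption: PRE/POST are always followed by one entity name.
        if tokens[i] in ("PRE", "POST") and i + 1 < len(tokens):
            result[tokens[i]] = tokens[i + 1]
            i += 2
        else:
            plain.append(tokens[i])
            i += 1
    if plain:
        result["style"] = " ".join(plain)
    return result


for rend in ("asterisk", "centered caps", "PRE mdash", "PRE ldquor POST rdquor"):
    print(rend, "->", parse_rend(rend))
```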

The following is an example from the Hungarian ``1984'' corpus:

<name type=org>Igazs&aacute;g-miniszt&eacute;rium</name> &mdash;
<name type=org lang=ns>Minigaz</name>, ahogy
&uacute;jbesz&eacute;l&uuml;l<ptr target=N1 rend=asterisk> nevezt&eacute;k
&mdash; ijeszt&odblac;en el&uuml;t&ouml;tt a k&ouml;rny&eacute;k&eacute;n
l&eacute;v&odblac; t&ouml;bbi &eacute;p&uuml;lett&odblac;l. Ragyog&oacute;
feh&eacute;r betonb&oacute;l
k&eacute;sz&uuml;lt, &oacute;ri&aacute;si, piramis alak&uacute;
&eacute;p&iacute;tm&eacute;ny volt, s h&aacute;romsz&aacute;z
m&eacute;ter magasan ny&uacute;lt fel a leveg&odblac;be. Onnan, ahol
<name type=person>Winston</name> &aacute;llt, &eacute;ppen el lehetett olvasni
<name type=org>P&aacute;rt</name>
h&aacute;rom jelmondat&aacute;t, amely d&iacute;szes
bet&udblac;kb&odblac;l volt kirakva az &eacute;p&uuml;let feh&eacute;r
<q rend="centered caps" type=slogan>
A H&Aacute;BOR&Uacute;: B&Eacute;KE
<q rend="centered caps" type=slogan>
<q rend="centered caps" type=slogan>
A TUDATLANS&Aacute;G: ER&Odblac;
<note place=foot id=N1>Az &uacute;jbesz&eacute;l
<name type=place>&Oacute;ce&aacute;nia</name>
hivatalos nyelve. Nyelvtani rendszer&eacute;nek &eacute;s
sz&oacute;kincs&eacute;nek magyar&aacute;zat&aacute;t l&aacute;sd a

Structure of the Original

The original was typed into Word for Windows 6.0, and a number of conversion programs were then written and used to convert it into ASCII. Rendition information was automatically extracted from the Word version and converted into mark-up, and then checked and supplemented by hand. The following is an example from the DOS-text version:

Az Igazság-minisztérium - Minigaz, ahogy újbeszélül*
nevezték - ijesztően elütött a környékén lévő többi épület-

* Az újbeszél Óceánia hivatalos nyelve. Nyelvtani
rendszerének és szókincsének magyarázatát lásd a

től. Ragyogó fehér betonból készült, óriási, piramis alakú
építmény volt, s háromszáz méter magasan nyúlt fel a
levegőbe. Onnan, ahol Winston állt, éppen el lehetett
olvasni a Párt három jelmondatát, amely díszes betűkből volt
kirakva az épület fehér homlokzatára:
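The encoded sample earlier in this report writes accented letters as SGML entity references (for instance &odblac; and &aacute;), whereas the DOS text holds them as 8-bit characters. The conversion programs themselves are not described here, so the following Python sketch is only an assumption about what one such step could look like. The function to_entities is hypothetical, and the table is limited to the Hungarian accented letters, using the entity names that appear in the sample plus their regular upper- and lower-case counterparts.

```python
# Hungarian accented letters and the SGML entity names used for them.
ENTITIES = {
    "á": "aacute", "é": "eacute", "í": "iacute",
    "ó": "oacute", "ö": "ouml", "ő": "odblac",
    "ú": "uacute", "ü": "uuml", "ű": "udblac",
    "Á": "Aacute", "É": "Eacute", "Í": "Iacute",
    "Ó": "Oacute", "Ö": "Ouml", "Ő": "Odblac",
    "Ú": "Uacute", "Ü": "Uuml", "Ű": "Udblac",
}


def to_entities(text: str) -> str:
    """Replace each accented letter with its entity reference."""
    return "".join(
        "&%s;" % ENTITIES[ch] if ch in ENTITIES else ch
        for ch in text
    )


print(to_entities("ijesztően elütött"))
print(to_entities("Óceánia"))
```

Characters outside the table are passed through unchanged, so punctuation such as the dashes in the sample would still need separate handling.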


Markup Process

The Word doc-file, converted into DOS text format, was the basis of the encoding. It was checked against the printed edition, and corrections and additional markup were supplied by hand. A number of errors were detected both in the electronic version and in the printed edition. Corrections to the printed edition are indicated in the header of ``1984''.

Tomaz Erjavec
Sat May 18 20:25:31 MDT 1996