 COP project 106 MULTEXT-East Deliverable D2.1 F ``1984'', Hungarian

Contributors: Csaba Oravecz and László Tihanyi (RIL)

Description of the Corpus

Because no digital source was available as a basis for the CES1-encoded version of the Hungarian translation of ``1984'', the book had to be typed in. Copyright issues regarding the free use of the translation for academic and research purposes have been satisfactorily settled.

The Hungarian version contains 81167 words, as indicated in the header of the encoded corpus.

Structure of the Corpus

The Hungarian corpus of ``1984'' consists of three <div type=part> elements and one <div type=appendix> element. Each part is further subdivided into a number of <div type=chapter> elements. In the Hungarian version, each <div type=part> start tag is followed by a <head>, which gives the number of the part written out in words, as in the printed edition.

The <div> elements carry two attributes. The n attribute gives the sequence number of the <div> at its level. The id attribute has a value prefixed with Ohu, followed by a numbering scheme that indicates the hierarchical position of the element in the SGML tree, e.g. <div id="Ohu.1.2" type=chapter n=1>. The id attribute is also specified on each element down to the sentence level.
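
The id scheme can be illustrated with a short Python sketch. The helper names make_id and parse_id are hypothetical and are not part of the project's tools; the sketch only assumes that an id is the prefix Ohu followed by dot-separated position numbers, as in the example above.

```python
PREFIX = "Ohu"  # prefix used for ids in the Hungarian corpus


def make_id(*position):
    """Build an id such as 'Ohu.1.2.7' from hierarchical position numbers."""
    return ".".join([PREFIX, *map(str, position)])


def parse_id(identifier):
    """Split an id into its prefix and a tuple of position numbers."""
    prefix, *numbers = identifier.split(".")
    if prefix != PREFIX or not numbers:
        raise ValueError(f"not a well-formed {PREFIX} id: {identifier!r}")
    # int() raises ValueError for empty or non-numeric components
    return prefix, tuple(int(n) for n in numbers)


print(make_id(1, 2, 7))
print(parse_id("Ohu.1.2.8"))
```

Parsing an id back into numbers makes it easy to check, for instance, that a paragraph id extends the id of the chapter that contains it.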

The text is segmented into paragraphs, with the <quote>, <note>, <poem> and <list> elements marked up at the paragraph level.

Sub-paragraph tagging is represented by <hi>, <q> and <name>. Frequently occurring names of people, places, organizations, products, languages, and events are marked throughout the text.

Rendering information is specified in the header of the corpus file. If not explicitly indicated therein, it is included with the appropriate tag in the rend attribute. Possible values are: asterisk, IT for italics, CA for capitals, CE CA for centered caps, PRE mdash, PRE ldquor POST rdquor.
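
As a rough illustration, the rend values listed above can be held in a lookup table. This is a sketch only: the function name describe_rend is hypothetical, and the glosses for asterisk and for the PRE/POST values are one reading of those values, not definitions taken from the corpus header.

```python
# Values as listed in the corpus description. Only the glosses for IT, CA and
# CE CA come from the description; the others are this sketch's assumptions.
REND_VALUES = {
    "asterisk": "marked with an asterisk",
    "IT": "italics",
    "CA": "capitals",
    "CE CA": "centered caps",
    "PRE mdash": "preceded by the mdash entity",
    "PRE ldquor POST rdquor": "enclosed between the ldquor and rdquor entities",
}


def describe_rend(value):
    """Return a gloss for a rend value, normalising internal whitespace first."""
    key = " ".join(value.split())
    if key not in REND_VALUES:
        raise KeyError(f"unknown rend value: {value!r}")
    return REND_VALUES[key]


print(describe_rend("CE CA"))
```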

The following is an example from the Hungarian 1984 corpus:

<p id="Ohu.1.2.7">
<s id="Ohu.">Az
<name type=org>Igazs&aacute;g-miniszt&eacute;rium</name> &mdash;
<name type=org lang=ns-hu>Minigaz</name>, ahogy
&uacute;jbesz&eacute;l&uuml;l<ptr id="Ohu."
target="Ohu.1.2.8" rend=asterisk> nevezt&eacute;k
&mdash; ijeszt&odblac;en el&uuml;t&ouml;tt a
k&ouml;rny&eacute;k&eacute;n l&eacute;v&odblac; t&ouml;bbi &eacute;p&uuml;lett&odblac;l.</s>
<s id="Ohu.">Ragyog&oacute; feh&eacute;r betonb&oacute;l
k&eacute;sz&uuml;lt, &oacute;ri&aacute;si, piramis alak&uacute;
&eacute;p&iacute;tm&eacute;ny volt, s h&aacute;romsz&aacute;z
m&eacute;ter magasan ny&uacute;lt fel a leveg&odblac;be.</s> 
<s id="Ohu.">Onnan, ahol
<name type=person>Winston</name> &aacute;llt, &eacute;ppen el lehetett
olvasni a
<name type=org>P&aacute;rt</name>
h&aacute;rom jelmondat&aacute;t, amely d&iacute;szes
bet&udblac;kb&odblac;l volt kirakva az &eacute;p&uuml;let feh&eacute;r
homlokzat&aacute;ra:
<q id="Ohu."    rend="CE CA" type=slogan>
A h&aacute;bor&uacute;: b&eacute;ke
<q id="Ohu."    rend="CE CA" type=slogan>
A szabads&aacute;g: szolgas&aacute;g
<q id="Ohu."    rend="CE CA" type=slogan>
A tudatlans&aacute;g: er&odblac;
<note id="Ohu.1.2.8"    place=foot>Az &uacute;jbesz&eacute;l
<name type=place>&Oacute;ce&aacute;nia</name>
hivatalos nyelve. Nyelvtani rendszer&eacute;nek &eacute;s
sz&oacute;kincs&eacute;nek magyar&aacute;zat&aacute;t l&aacute;sd a
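
The accented characters in the example above are written as named character entities (&odblac;, &udblac;, &eacute; and so on). A minimal Python sketch for turning them back into readable text follows. It assumes that the entity names in the sample coincide with the HTML5 named character references known to Python's standard html module, which holds for the entities shown here; it is not the project's own tooling, and decode_entities is a hypothetical name.

```python
import html


def decode_entities(text):
    """Replace named character entities (e.g. &odblac;) with Unicode characters."""
    return html.unescape(text)


sample = "ijeszt&odblac;en el&uuml;t&ouml;tt"
print(decode_entities(sample))
```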

Structure of the Original

The original was typed into Word for Windows 6.0, and several conversion programs were then written and used to convert it into ASCII. Rendition information was automatically extracted from the Word version and converted into markup, and then checked and supplemented by hand. The following is an example from the DOS-text version:

Az Igazság-minisztérium - Minigaz, ahogy újbeszélül*
nevezték - ijesztően elütött a környékén lévő többi épület-

* Az újbeszél Óceánia hivatalos nyelve. Nyelvtani
rendszerének és szókincsének magyarázatát lásd a

től. Ragyogó fehér betonból készült, óriási, piramis alakú
építmény volt, s háromszáz méter magasan nyúlt fel a
levegőbe. Onnan, ahol Winston állt, éppen el lehetett
olvasni a Párt három jelmondatát, amely díszes betűkből volt
kirakva az épület fehér homlokzatára:


Markup Process

The Word doc-file, converted into DOS text format, was the basis of the encoding. It was checked against the printed edition, and corrections and additional markup were supplied by hand. A number of errors were detected in the electronic version as well as in the printed edition. Corrections related to the latter are indicated in the header of ``1984''.

