Hungarian

 COP project 106 MULTEXT-East Deliverable D2.1 F Fiction, Hungarian

Contributors: Csaba Oravecz and Júlia Pajzs (RIL)

Description of the Corpus

The Hungarian MULTEXT-East fiction corpus consists of four excerpts from four 20th-century Hungarian novels. These are ``Számadás'' and ``Falusi krónika'' by Péter Veres, published in 1937 and 1944, respectively; ``Budai oroszlán'' by István Sőtér, published in 1978; and ``Égető Eszter'' by László Németh, published in 1956. The corpus contains approximately 50 pages from each novel.

The digital source of the encoding was the National Corpus for the Hungarian Historical Dictionary. Since the distributor of this corpus is also the Research Institute for Linguistics, no licence agreement was necessary to allow its use. Copyright for the original printed versions has been settled for three of the sources; for the fourth, a verbal agreement has been obtained and a written licence is still to be received.

The Hungarian Fiction Corpus now has 71777 words.

Structure of the Corpus

The corpus body consists of four <div type=excerpt> elements, each of which starts with a <head> giving, under <byline>, the title, the author, the publisher and the publication date of the novel from which the excerpt is taken. The number of pages appearing in the corpus is also indicated. Each <div type=excerpt> is subdivided into <div type=section> elements, corresponding to the sections denoted in the printed editions by starting the paragraph on a new page.

The <div type=section> elements have the n attribute, giving the section number.

The text is segmented into paragraphs, with the <quote> element marked up at the paragraph level.

Sub-paragraph tagging consists of <abbr>, <foreign>, <hi> and <q>. Direct speech has been marked up with <q> even where there is no typographical marking in the printed text. Names, with a type attribute, are marked up only in the sample version.

Rendering information is given in the same way as in the Hungarian CES1 version of ``1984''.

Here follows an example from the corpus:

<mentioned rend="PRE ldquor POST rdquor">magyar</mentioned>-t
l&aacute;tta benn&uuml;nk (sz&eacute;p,
de ritka volt ez az igazi nemzeti &eacute;rz&eacute;s) &eacute;s az &uuml;gy&uuml;nket
a maguk&eacute;v&aacute; tett&eacute;k. De &eacute;ppen ez lett a
baj. Mert mi ki akartunk sz&aacute;llni j&oacute;val
<name type=place>Nagyv&aacute;rad</name> el&odblac;tt s &odblac;k biztattak, hogy a 
v&aacute;rad-velencei kis&aacute;llom&aacute;son sincsen &odblac;rs&eacute;g.
</p>
<p>
Az asszonyka &eacute;s a kalauz az ajt&oacute;b&oacute;l lest&eacute;k, mi meg az
ablakon, hogy van-e vizsg&aacute;lat. S amikor behaladt a vonat &eacute;s
megl&aacute;ttuk a szuronyokat, megijedt&uuml;nk.
<q rend="PRE mdash POST mdash">Na most belement&uuml;nk
a farkasverembe </q> gondoltuk &eacute;s hamarj&aacute;ban nem is tudtuk, hogy
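
The rend values in the example above, such as "PRE mdash POST mdash", can be read mechanically. The following is a minimal Python sketch, assuming that a rend value is a whitespace-separated list in which each of the keywords PRE and POST is followed by one or more entity names; the function name parse_rend is hypothetical and is not part of the corpus tools.

```python
def parse_rend(rend):
    """Split a rend value into the entity names given after PRE and POST.

    Assumption: the value has the shape seen in the example above,
    e.g. "PRE ldquor POST rdquor". Tokens before the first keyword
    are ignored.
    """
    result = {}
    key = None
    for token in rend.split():
        if token in ("PRE", "POST"):
            # A keyword opens a new group of entity names.
            key = token
            result[key] = []
        elif key is not None:
            result[key].append(token)
    return result


def render(text, rend):
    """Wrap text in the entity references named by the rend value."""
    parts = parse_rend(rend)
    pre = "".join("&" + name + ";" for name in parts.get("PRE", []))
    post = "".join("&" + name + ";" for name in parts.get("POST", []))
    return pre + text + post
```

For instance, render("magyar", "PRE ldquor POST rdquor") puts the opening and closing quotation mark entities back around the mentioned word.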

Structure of the Original

The electronic source was an ASCII file in a not entirely SGML-conformant encoding, with Hungarian accented characters encoded in a special `Prószéky' code. Paragraph boundaries were marked, and some sub-paragraph information, such as foreign words and highlighted text, was also marked. For a sample see below:

<l></par></l>
<l><par>Amikor azta1n ala1i1rta egy-egy csapat a szerzo3de1st, akkor</l>
<l>bementek a kocsma1ba. Minden csapat a megszokott kocsma1ja1ba.</l>
<l>Az u1jvila1giak a ,,kis zsido1''-hoz, a ba1nlakosiak Zsiga1hoz,</l>
<l>ke1so3bb Spitzerhez. </par></l>
<l><par>A bandagazda felva1ltotta a nagy pe1nzeket e1s elosztotta.</l>
<l>Egy-egy forint ,,felpe1nz'' ja1rt mindenkinek. Persze, mindenki</l>
<l>rendelt legala1bb egy fe1ldecit, ez ma1r a tisztesse1g </l>
</page>
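
In the sample above, an accented vowel is written as the base vowel followed by a digit (a1, e1, i1, o3, u1), whereas the corpus itself uses SGML entities such as &aacute; and &odblac;. The following is a minimal Python sketch of such a conversion, not the tool actually used in the project. The digits 1 (acute) and 3 (double acute) are taken from the sample; the digit 2 for the umlaut does not occur in the sample and is an assumption. The function name proszeky_to_entities is hypothetical.

```python
import re

# Entity name suffix for each digit. 1 and 3 are attested in the sample;
# 2 (umlaut) is an assumption, since it does not occur there.
ENTITY_SUFFIX = {"1": "acute", "2": "uml", "3": "dblac"}

# Vowels that can carry each accent in Hungarian; other vowel and digit
# combinations are left untouched.
VALID_VOWELS = {"1": "aeiou", "2": "ou", "3": "ou"}

CODE = re.compile(r"([aeiouAEIOU])([123])")


def proszeky_to_entities(text):
    """Replace vowel+digit codes with SGML character entities."""
    def replace(match):
        vowel, digit = match.group(1), match.group(2)
        if vowel.lower() not in VALID_VOWELS[digit]:
            return match.group(0)
        return "&" + vowel + ENTITY_SUFFIX[digit] + ";"
    return CODE.sub(replace, text)
```

Applied to the word szerzo3de1st from the sample, the sketch yields szerz&odblac;d&eacute;st. It does not distinguish a code digit from a real numeral that happens to follow a vowel, so its output would still need the kind of manual checking described in the next section.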

Markup Process

The electronic source was first converted automatically into a quasi-SGML encoding. This version was then corrected and marked up by hand to CES1 conformance. Name tagging in the sample was carried out automatically, with manual correction. Since in Hungarian printing the typographical rendering of direct speech overlaps with the typographical rendering of other types of text, marking with <q> tags had to be done manually. The corpus text was also cross-checked against the printed editions.


Multext-East