

Hungarian

 COP project 106 MULTEXT-East Deliverable D2.1 F Newspapers, Hungarian

Contributors: Csaba Oravecz, Tamás Váradi and Gábor Kiss (RIL)

Description of the Corpus

The Hungarian newspaper corpus contains 205 articles from the daily newspaper ``Magyar Hírlap'', taken from the issues of Jan 25 and Jan 31, 1996. The digital source was provided by the Magyar Hírlap Publishing House Ltd. and consisted of data files with idiosyncratic encoding (where any was present) and embedded comments for the typesetters. The corpus represents a wide variety of articles, each characteristic of everyday journalism (i.e. long, essay-type articles, typically from supplements to weekend editions, were not included).

A licence agreement was obtained allowing the use of these articles for the purposes of the MULTEXT-East project.

The Hungarian Newspaper Corpus currently contains 63351 words and is still being extended.

Structure of the Corpus

The corpus body consists of two <div type=newspaper> elements, which contain 94 and 111 <div type=article> elements, respectively. In addition, a number of articles are grouped under one <div type=storylist>; in the printed paper these appeared as short snippets in a column under a common heading, each with its own head.
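The division counts given above can be checked mechanically. The following is a minimal sketch, assuming the corpus is available as a single text string and that opening tags are written as in this description (for example <div type=article>, with an unquoted attribute value). The function name and the sample string are hypothetical and are not taken from the corpus.

```python
import re
from collections import Counter

# Matches opening <div> tags whose only attribute is type, quoted or unquoted,
# e.g. <div type=article> or <div type="article">.
DIV_OPEN = re.compile(r'<div\s+type="?([A-Za-z]+)"?\s*>', re.IGNORECASE)


def count_div_types(text):
    """Return a Counter mapping each div type to its number of opening tags."""
    return Counter(m.group(1).lower() for m in DIV_OPEN.finditer(text))


# Tiny hypothetical sample, not taken from the corpus.
sample = (
    "<div type=newspaper>"
    "<div type=article><p>a</p></div>"
    "<div type=storylist><div type=article><p>b</p></div></div>"
    "</div>"
)

counts = count_div_types(sample)
print(counts["newspaper"], counts["article"], counts["storylist"])
```

A regular expression is used here rather than an XML parser because the unquoted attribute values shown in this description are valid SGML but not well-formed XML.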

The <div type=article> usually begins with one or more <head> tags, giving the headline(s), and one <byline> giving the source of the article's content (typically a news agency). This <byline> may be followed by another <head> giving an abstract of the article.

Captions to the pictures accompanying the articles, when they were represented in the digital source, were also included in the corpus. They are normally given at the beginning of <div type=article>.

Apart from the type attribute, the <div> elements have no attributes.

Document authors are included where they appeared in the original, usually at the end of the articles; they are marked up as <byline> <docAuthor>Author or Initials</docAuthor> </byline>.
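As an illustration of how this convention separates author bylines from source bylines, the sketch below collects the content of every <docAuthor> that sits directly inside a <byline>. It assumes the tags appear exactly in the form shown above; the function name and the sample string are hypothetical.

```python
import re

# A <byline> whose only content is a <docAuthor> element, as described above.
DOC_AUTHOR = re.compile(
    r"<byline>\s*<docAuthor>(.*?)</docAuthor>\s*</byline>",
    re.DOTALL | re.IGNORECASE,
)


def doc_authors(text):
    """Return the author strings, with internal whitespace normalised."""
    return [" ".join(m.group(1).split()) for m in DOC_AUTHOR.finditer(text)]


# Hypothetical sample: one author byline and one source byline.
sample = (
    "<byline>MTI</byline><p>...</p>"
    "<byline> <docAuthor>K. L.</docAuthor> </byline>"
)

authors = doc_authors(sample)
print(authors)
```

Bylines that only name a source, such as a news agency, contain no <docAuthor> and are therefore skipped.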

The text is segmented into paragraphs; the only other paragraph-level tag is <note>, which is used when the article is continued on or from another page. In this case the reference to the page is enclosed in <ref>; however, no pointer is included.

Sub-paragraph tagging consists of <abbr>, <name>, and <q>. The first two were marked up only in the Jan 25 issue: semi-automatically at first, then manually corrected and provided with the type attribute. <q> is marked up throughout the whole corpus.
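To show how the type attribute on <name> can be used, the following sketch lists (type, content) pairs for each <name> element. It assumes <name> elements are not nested and carry only the type attribute; the function name is hypothetical, and the sample string is modelled on the tagging style described here.

```python
import re

# <name type=...> elements with a quoted or unquoted type value.
NAME = re.compile(r'<name\s+type="?(\w+)"?\s*>(.*?)</name>', re.DOTALL)


def typed_names(text):
    """Return (type, content) pairs, with content whitespace normalised."""
    return [(m.group(1), " ".join(m.group(2).split())) for m in NAME.finditer(text)]


sample = (
    "<name type=place>Kaposv&aacute;rr&oacute;l</name> indulni "
    "<name type=org>IFOR-konvojok</name>"
)

pairs = typed_names(sample)
print(pairs)
```

Character entities are left untouched by this sketch; only the markup structure is inspected.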

The rend attributes on sub-paragraph tags are included in the same way as in the Hungarian version of ``1984''. No quotation marks are retained in the corpus.

Here follows an example from the corpus:

<byline><abbr>MH</abbr>-inform&aacute;ci&oacute;</byline>

<head>Az er&odblac;s havaz&aacute;s megb&eacute;n&iacute;totta a 
<name type=place>Tasz&aacute;rr&oacute;l</name>, illetve 
<name type=place>Kaposv&aacute;rr&oacute;l</name> indulni
k&eacute;sz&uuml;l&odblac; <name type=org>IFOR-konvojok</name>
mozg&aacute;s&aacute;t. Tegnap egyetlen g&eacute;pkocsiegys&eacute;g
sem tudott boszniai rendeltet&eacute;si helye fel&eacute; indulni.
</head>

<p> 
A f&odblac;v&aacute;rosi <name type=org>Rend&odblac;ri
Ezred</name> rendfenntart&oacute;i &eacute;s a <name type=org>Somogy
Megyei Rend&odblac;r-f&odblac;kapit&aacute;nys&aacute;g </name>
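The example encodes accented characters as SGML character entities (&aacute;, &odblac;, and so on). The sketch below is one way to turn such a fragment into plain Unicode text, assuming Python's standard html.unescape is acceptable for decoding: it resolves HTML5 named references, which cover the entities shown above but possibly not every entity used in the corpus. The function name is hypothetical.

```python
import html
import re

# Any SGML/HTML-style tag, opening or closing.
TAG = re.compile(r"<[^>]+>")


def to_plain_text(fragment):
    """Strip tags, decode character entities and collapse whitespace."""
    without_tags = TAG.sub("", fragment)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())


# Shortened fragment in the style of the example above.
fragment = (
    "<head>Az er&odblac;s havaz&aacute;s megb&eacute;n&iacute;totta a\n"
    "<name type=place>Tasz&aacute;rr&oacute;l</name>, illetve</head>"
)

plain = to_plain_text(fragment)
print(plain)
```

Tags are removed before decoding so that any entity standing for a literal angle bracket in the text is not mistaken for markup.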

Structure of the Original

The digital source used as the basis of the encoding was provided by Magyar Hírlap Publishing House Ltd. It consisted of data files, one file per issue. Practically all information about the layout of the text was encoded idiosyncratically, by a number of methods (special characters, line spacing, etc.). In places, differences were observed between the electronic text and the printed paper. Where possible, the printed version was taken as the basis for the corpus encoding.

An example from the original (8-bit characters are not rendered):

MH-informçciù
Az erÖs havazçs megbÄnÆtotta a
Taszçrrùl, illetve Kaposvçrrùl indulni kÄsz*lÖ
IFOR-konvojok mozgçsçt. Tegnap egyetlen gÄpkocsiegysÄg sem
tudott boszniai rendeltetÄsi helye felÄ indulni. 

A
fÖvçrosi RendÖri Ezred rendfenntartùi Äs a Somogy
Megyei RendÖr-fÖkapitçnysçg kÜzlekedÄsi rendÖrei
egÄsz nap vçrtçk, hogy a kÄt bçzisrùl felvezetÖi
felkÄrÄst kapjanak, de az - lapzçrtçnkig - elmaradt.
A
fÖutak az *tinform dÄlutçni tçjÄkoztatùja szerint
egyelÖre mindenhol jçrhatùak. A havazçs a Dunçnt£lon
a legintenzÆvebb, elsÖsorban Zala Äs Vas megyÄben okoz
kÜzlekedÄsi gondokat. E ter*letek alsùbbrend* £tjain
fÄlszÄlessÄgben szçmÆtani kell hùf£vçsokra is,
az orszçg tÜbbi £tjçra çltalçban a hùkçsçs,
latyakos felszÆn jellemzÖ. Az autùpçlyçkat
sùzzçk, de a hù nem vagy csak nagyon lassan olvad abban a
sçvban, amelyikben most ritkçbban kÜzlekednek a
jçrm*vek. Fennakadçsokra a hegyes, dombos vidÄkeken lehet
szçmÆtani. MegyÄnkÄnt mintegy 25 gÄp tisztÆtja az
utakat * tudatta az *tinform. 
Budapesten is lelassult a
kÜzlekedÄs, az utak sùzçsa nem hoz eredmÄnyt, mivel a
mÆnusz 5-7 fokos hidegben a hùkçsa rçfagy az
aszfaltra.
$$$

***

Markup Process

The original digital version was converted into an ASCII text version, extracting all available information where possible and converting it into CES1 conformant markup. However, given the rather unreliable methods of rendering layout information in the original, a laborious manual correction process was necessary to ensure conformance to the printed issues. During this process, extensive sub-paragraph marking was carried out.


Multext-East