 COP project 106 MULTEXT-East Deliverable D2.1 F Newspapers, Hungarian

Contributors: Csaba Oravecz, Tamás Váradi and Gábor Kiss (RIL)

Description of the Corpus

The Hungarian newspaper corpus contains 205 articles from the daily newspaper ``Magyar Hírlap''. The articles are taken from the Jan 25 and Jan 31, 1996, issues. The digital source was provided by the Magyar Hírlap Publishing House Ltd. and consisted of data files whose encoding, where there was any, was idiosyncratic, with embedded comments for the typesetters. The corpus represents a wide variety of articles, each characteristic of everyday journalism (i.e. long, essay-type articles, typically from supplements to weekend editions, were not included).

A licence agreement was obtained allowing the use of these articles for the purposes of the MULTEXT-East project.

The Hungarian Newspaper Corpus currently has 63351 words and is still being extended.

Structure of the Corpus

The corpus body consists of 2 <div type=newspaper> elements, which contain 94 and 111 <div type=article> elements, respectively. In addition, a number of articles are grouped under one <div type=storylist>, where they appeared as short snippets, each with its own head, in a column under a common heading.

The <div type=article> usually begins with one or more <head> tags giving the headline(s), followed by one <byline> giving the source of the article's content (typically a news agency). A further <head> may follow this <byline>, giving an abstract of the article.

Captions to the pictures accompanying the articles, when they were represented in the digital source, were also included in the corpus. They are normally given at the beginning of <div type=article>.

Apart from the type attribute, the <div> elements have no other attributes.
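
As a rough illustration of the division structure described above, the following Python sketch counts <div> elements by their type attribute. The sample text, the nesting of the storylist inside a newspaper division, and the function name are assumptions made for the example; they are not taken from the corpus files. A regular expression is used because the unquoted SGML attribute values would be rejected by an XML parser.

```python
import re
from collections import Counter

# Hypothetical miniature sample (not corpus data). The real corpus has
# 2 newspaper divisions holding 94 and 111 article divisions.
SAMPLE = """
<div type=newspaper>
<div type=article><head>Headline A</head><p>Text A</p></div>
<div type=storylist>
<div type=article><head>Headline B</head><p>Text B</p></div>
</div>
</div>
"""

# Attribute values are unquoted in the corpus examples, so the quote
# characters are treated as optional.
DIV_TYPE = re.compile(r'<div\s+type="?([A-Za-z]+)"?\s*>', re.IGNORECASE)


def count_div_types(text):
    """Return a Counter mapping each div type to its number of occurrences."""
    return Counter(m.group(1).lower() for m in DIV_TYPE.finditer(text))


counts = count_div_types(SAMPLE)
```

Run over the full corpus, a count of this kind could be checked against the figures given above (2 newspaper divisions, 205 articles in total).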

Document authors are included where they appeared in the original, usually at the end of the articles; they are marked up as <byline> <docAuthor>Author or Initials</docAuthor> </byline>.

The text is segmented into paragraphs; the only other paragraph-level tag is <note>, which is used when the article is continued on or from another page. In this case the reference to the page is enclosed in <ref>; however, no pointer is included.

Sub-paragraph tagging consists of <abbr>, <name>, and <q>. The former two were marked up only in the Jan 25 issue: semi-automatically at first, then manually corrected and provided with the type attribute. <q> is marked up throughout the whole corpus.
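
To show how the typed <name> elements might be used, here is a minimal Python sketch that lists each name together with its type. The input fragment is copied from the corpus example given in this section; the function name and the regular expression are illustrative assumptions, not part of the project's tools.

```python
import re

# Fragment copied from the corpus example in this document.
SAMPLE = (
    "<name type=place>Tasz&aacute;rr&oacute;l</name>, illetve\n"
    "<name type=place>Kaposv&aacute;rr&oacute;l</name> indulni\n"
    "k&eacute;sz&uuml;l&odblac; <name type=org>IFOR-konvojok</name>\n"
)

# The type value is unquoted in the corpus, so quotes are optional here.
# DOTALL lets a name span a line break, as some do in the corpus example.
NAME = re.compile(r'<name\s+type="?(\w+)"?\s*>(.*?)</name>', re.DOTALL)


def extract_names(text):
    """Return (type, content) pairs, with internal whitespace normalised."""
    return [(kind, " ".join(content.split())) for kind, content in NAME.findall(text)]


names = extract_names(SAMPLE)
```

The entity references are left untouched, so the content is returned exactly as encoded in the corpus.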

The rend attributes on sub-paragraph tags are included in the same way as in the Hungarian version of ``1984''. No quotation marks are retained in the corpus.

Here follows an example from the corpus:


<head>Az er&odblac;s havaz&aacute;s megb&eacute;n&iacute;totta a 
<name type=place>Tasz&aacute;rr&oacute;l</name>, illetve 
<name type=place>Kaposv&aacute;rr&oacute;l</name> indulni
k&eacute;sz&uuml;l&odblac; <name type=org>IFOR-konvojok</name>
mozg&aacute;s&aacute;t. Tegnap egyetlen g&eacute;pkocsiegys&eacute;g
sem tudott boszniai rendeltet&eacute;si helye fel&eacute; indulni.

A f&odblac;v&aacute;rosi <name type=org>Rend&odblac;ri
Ezred</name> rendfenntart&oacute;i &eacute;s a <name type=org>Somogy
Megyei Rend&odblac;r-f&odblac;kapit&aacute;nys&aacute;g </name>

Structure of the Original

The digital source used as the basis of encoding was provided by Magyar Hírlap Publishing House Ltd. It consisted of data files, one file per issue. Practically all information regarding the actual layout of the text was encoded idiosyncratically, by a number of methods (special characters, line spacing, etc.). Some differences between the electronic text and the printed paper were observed in places. Where possible, the printed version was taken as the basis for the corpus encoding.

An example from the original (8 bit characters are not rendered):

Az er÷s havazÁs megbńn∆totta a
TaszÁrrýl, illetve KaposvÁrrýl indulni kńsz*l÷
IFOR-konvojok mozgÁsÁt. Tegnap egyetlen gńpkocsiegysńg sem
tudott boszniai rendeltetńsi helye felń indulni. 

f÷vÁrosi Rend÷ri Ezred rendfenntartýi ńs a Somogy
Megyei Rend÷r-f÷kapitÁnysÁg k‹zlekedńsi rend÷rei
egńsz nap vÁrtÁk, hogy a kńt bÁzisrýl felvezet÷i
felkńrńst kapjanak, de az - lapzÁrtÁnkig - elmaradt.
f÷utak az *tinform dńlutÁni tÁjńkoztatýja szerint
egyel÷re mindenhol jÁrhatýak. A havazÁs a DunÁnt£lon
a legintenz∆vebb, els÷sorban Zala ńs Vas megyńben okoz
k‹zlekedńsi gondokat. E ter*letek alsýbbrend* £tjain
fńlszńlessńgben szÁm∆tani kell hýf£vÁsokra is,
az orszÁg t‹bbi £tjÁra ÁltalÁban a hýkÁsÁs,
latyakos felsz∆n jellemz÷. Az autýpÁlyÁkat
sýzzÁk, de a hý nem vagy csak nagyon lassan olvad abban a
sÁvban, amelyikben most ritkÁbban k‹zlekednek a
jÁrm*vek. FennakadÁsokra a hegyes, dombos vidńkeken lehet
szÁm∆tani. Megyńnkńnt mintegy 25 gńp tiszt∆tja az
utakat * tudatta az *tinform. 
Budapesten is lelassult a
k‹zlekedńs, az utak sýzÁsa nem hoz eredmńnyt, mivel a
m∆nusz 5-7 fokos hidegben a hýkÁsa rÁfagy az


Markup Process

The original digital version was converted into an ASCII text version, extracting all the information present where possible and converting it into CES1-conformant markup. However, given the rather unreliable methods of rendering layout information in the original, a laborious manual correction process was necessary to ensure conformance to the printed issues. During this process, extensive sub-paragraph markup was carried out.
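
One step of such a conversion, replacing accented letters with SGML entity references so that the result is plain ASCII, might be sketched in Python as follows. This is only an illustration under stated assumptions: the original 8-bit encoding is not documented here, so the sketch assumes the text has already been decoded to Unicode, and the mapping table and function name are hypothetical. The entity names follow those visible in the corpus example (e.g. &odblac;, &aacute;).

```python
# Assumed mapping from Hungarian accented letters to entity names of the
# kind seen in the corpus example; not taken from the project's tools.
ENTITY_NAMES = {
    "á": "aacute", "é": "eacute", "í": "iacute", "ó": "oacute",
    "ö": "ouml", "ő": "odblac", "ú": "uacute", "ü": "uuml", "ű": "udblac",
    "Á": "Aacute", "É": "Eacute", "Í": "Iacute", "Ó": "Oacute",
    "Ö": "Ouml", "Ő": "Odblac", "Ú": "Uacute", "Ü": "Uuml", "Ű": "Udblac",
}


def to_entities(text):
    """Replace each mapped accented letter with its entity reference."""
    return "".join(
        f"&{ENTITY_NAMES[ch]};" if ch in ENTITY_NAMES else ch for ch in text
    )


encoded = to_entities("Az erős havazás megbénította a")
```

Applied to the opening words of the sample article, this yields the same entity-encoded string as the first line of the corpus example. Characters outside the table are passed through unchanged, so any remaining non-ASCII character would still need separate handling.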

