Contributors: Csaba Oravecz, Tamás Váradi and Gábor Kiss (RIL)
The Hungarian newspaper corpus contains 205 articles from the daily newspaper ``Magyar Hírlap'', taken from the Jan 25 and Jan 31, 1996, issues. The digital source was provided by the Magyar Hírlap Publishing House Ltd. and consisted of data files whose encoding, where present, was idiosyncratic, and which contained embedded comments for the typesetters. The corpus represents a wide variety of articles, each characteristic of everyday journalism; long, essay-type articles, typically from supplements to weekend editions, were not included.
A licence agreement was obtained allowing the use of these articles for the purposes of the MULTEXT-East project.
The Hungarian newspaper corpus currently contains 63351 words and is still being extended.
The corpus body consists of two <div type=newspaper> elements, which contain 94 and 111 <div type=article> elements respectively. In addition, a number of articles are grouped under one <div type=storylist> where they appeared in print as short snippets, each with its own head, in a column under a common heading.
The <div type=article> usually begins with one or more <head> tags, giving the headline(s), and one <byline> naming the source of the article's content (typically a news agency). A further <head> may follow this <byline>, giving an abstract of the article.
Captions to the pictures accompanying the articles, when they were represented in the digital source, were also included in the corpus. They are normally given at the beginning of <div type=article>.
Apart from the type attribute, the <div> elements have no other attributes.
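The division structure described above can be checked mechanically. The following is a minimal sketch in Python, assuming the markup uses unquoted attribute values as shown in this document and that the standard-library html.parser is tolerant enough for it; the class name, the function name and the miniature sample are invented for illustration and are not part of the corpus tooling.

```python
from html.parser import HTMLParser


class ArticleCounter(HTMLParser):
    """Count <div type=article> elements inside each <div type=newspaper>."""

    def __init__(self):
        super().__init__()
        self.open_divs = []   # type values of the <div> elements currently open
        self.per_issue = []   # one article count per <div type=newspaper>

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        div_type = dict(attrs).get("type")
        if div_type == "newspaper":
            self.per_issue.append(0)
        elif div_type == "article" and "newspaper" in self.open_divs:
            self.per_issue[-1] += 1
        self.open_divs.append(div_type)

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.open_divs.pop()


def count_articles(sgml):
    """Return the number of articles found in each newspaper division."""
    parser = ArticleCounter()
    parser.feed(sgml)
    parser.close()
    return parser.per_issue


# Invented miniature document; the real corpus holds 94 and 111 articles.
SAMPLE = """
<div type=newspaper>
<div type=article><head>A</head><p>x</p></div>
<div type=article><head>B</head><p>y</p></div>
</div>
<div type=newspaper>
<div type=article><head>C</head><p>z</p></div>
</div>
"""

print(count_articles(SAMPLE))  # [2, 1]
```

The sketch does not model <div type=storylist>; whether articles grouped there are counted among the 94 and 111 is not stated here, so that case is left out rather than guessed.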
Document authors are included where they appeared in the original, usually at the end of the articles; they are marked up as <byline> <docAuthor>Author or Initials</docAuthor> </byline>.
The text is segmented into paragraphs; the other paragraph-level tag is <note>, which is used when the article is continued on or from another page. In this case the reference to the page is enclosed in <ref>, but no pointer is included.
Sub-paragraph tagging consists of <abbr>, <name>, and <q>. The first two were marked up only in the Jan 25 issue: they were tagged semi-automatically, then manually corrected and provided with the type attribute. <q> is marked up throughout the corpus.
The rend attributes on sub-paragraph tags are included in the same way as in the Hungarian version of ``1984''. No quotation marks are retained in the corpus.
Here follows an example from the corpus:
<head>Az erős havazás megbénította a
<name type=place>Taszárról</name>, illetve
<name type=place>Kaposvárról</name> indulni
készülő <name type=org>IFOR-konvojok</name>
mozgását. Tegnap egyetlen gépkocsiegység
sem tudott boszniai rendeltetési helye felé indulni.
A fővárosi <name type=org>Rendőri
Ezred</name> rendfenntartói és a <name type=org>Somogy
Megyei Rendőr-főkapitányság </name>
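Markup such as the example above lends itself to simple extraction of the typed names. The following Python sketch is one possible way to do this, assuming <name> elements are never nested and always carry a type attribute; the function name is hypothetical and the regular expression is an illustration, not the project's own tooling.

```python
import re
from collections import defaultdict

# Matches <name type=...>text</name>. Attribute values are unquoted in the
# corpus sample; optional quotes are tolerated here as an assumption.
NAME_RE = re.compile(
    r'<name\s+type="?(?P<type>\w+)"?\s*>(?P<text>.*?)</name>',
    re.DOTALL | re.IGNORECASE,
)


def extract_names(sgml):
    """Group the contents of <name> elements by their type attribute."""
    names = defaultdict(list)
    for match in NAME_RE.finditer(sgml):
        # Collapse line breaks and repeated spaces inside the element.
        text = " ".join(match.group("text").split())
        names[match.group("type")].append(text)
    return dict(names)


sample = """<head>Az erős havazás megbénította a
<name type=place>Taszárról</name>, illetve
<name type=place>Kaposvárról</name> indulni
készülő <name type=org>IFOR-konvojok</name>
mozgását."""

print(extract_names(sample))
# {'place': ['Taszárról', 'Kaposvárról'], 'org': ['IFOR-konvojok']}
```

Whitespace is normalised because names in the corpus can span line breaks, as the <name type=org> elements in the example do.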
The digital source used as the basis of encoding was provided by Magyar Hírlap Publishing House Ltd. It consisted of data files, one file per issue. Practically all information regarding the actual layout of the text was encoded idiosyncratically by a number of methods (special characters, line spacing, etc.). Some differences between the electronic text and the printed paper were observed in places. Where possible, the printed version was taken as the basis for the corpus encoding.
An example from the original (8 bit characters are not rendered):
Az er÷s havazÁs megbńn∆totta a
TaszÁrrýl, illetve KaposvÁrrýl indulni kńsz*l÷
IFOR-konvojok mozgÁsÁt. Tegnap egyetlen gńpkocsiegysńg sem
tudott boszniai rendeltetńsi helye felń indulni.
f÷vÁrosi Rend÷ri Ezred rendfenntartýi ńs a Somogy
Megyei Rend÷r-f÷kapitÁnysÁg k‹zlekedńsi rend÷rei
egńsz nap vÁrtÁk, hogy a kńt bÁzisrýl felvezet÷i
felkńrńst kapjanak, de az - lapzÁrtÁnkig - elmaradt.
f÷utak az *tinform dńlutÁni tÁjńkoztatýja szerint
egyel÷re mindenhol jÁrhatýak. A havazÁs a DunÁnt£lon
a legintenz∆vebb, els÷sorban Zala ńs Vas megyńben okoz
k‹zlekedńsi gondokat. E ter*letek alsýbbrend* £tjain
fńlszńlessńgben szÁm∆tani kell hýf£vÁsokra is,
az orszÁg t‹bbi £tjÁra ÁltalÁban a hýkÁsÁs,
latyakos felsz∆n jellemz÷. Az autýpÁlyÁkat
sýzzÁk, de a hý nem vagy csak nagyon lassan olvad abban a
sÁvban, amelyikben most ritkÁbban k‹zlekednek a
jÁrm*vek. FennakadÁsokra a hegyes, dombos vidńkeken lehet
szÁm∆tani. Megyńnkńnt mintegy 25 gńp tiszt∆tja az
utakat * tudatta az *tinform.
Budapesten is lelassult a
k‹zlekedńs, az utak sýzÁsa nem hoz eredmńnyt, mivel a
m∆nusz 5-7 fokos hidegben a hýkÁsa rÁfagy az
The original digital version was converted into an ASCII text version, extracting all the information present where possible and converting it into CES1-conformant markup. However, given the rather unreliable methods of rendering layout information in the original, a laborious manual correction process was necessary to ensure conformance to the printed issues. During this process, extensive sub-paragraph marking was carried out.
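The character-level part of such a conversion can be illustrated with a small Python sketch. The mapping below is not the project's conversion table: it is inferred only by aligning the first lines of the two samples in this document, it covers just the five characters whose correspondence is visible there, and it works on the characters as displayed rather than on the original 8-bit codes. It emits Unicode letters for readability; how accented letters were represented in the project's ASCII version is not stated here. The names RAW_TO_HUNGARIAN and restore_accents are hypothetical.

```python
# Character pairs inferred by aligning the corpus sample with the raw sample
# shown in this document; an assumption, not the project's actual table.
RAW_TO_HUNGARIAN = str.maketrans({
    "÷": "ő",
    "Á": "á",
    "ń": "é",
    "∆": "í",
    "ý": "ó",
})


def restore_accents(raw_line):
    """Replace the displayed raw characters with the inferred Hungarian letters."""
    return raw_line.translate(RAW_TO_HUNGARIAN)


print(restore_accents("Az er÷s havazÁs megbńn∆totta a"))
# Az erős havazás megbénította a
```

Characters such as `*` are deliberately left unmapped, since the raw sample suggests they stand for more than one original character and cannot be restored by a one-to-one table.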