COP project 106 MULTEXT-East Newspapers, Hungarian
Contributors: Csaba Oravecz, Tamás Váradi and Gábor Kiss (RIL)
The Hungarian newspaper corpus contains 205 articles from the daily newspaper ``Magyar Hírlap''. The articles are from the Jan 25 and Jan 31, 1996, issues. The digital source was provided by the Magyar Hírlap Publishing House Ltd. and consisted of data files with idiosyncratic encoding and embedded comments for the typesetters. The corpus represents a wide variety of articles, each characteristic of everyday journalism (i.e. long, essay-type articles, typically from supplements to weekend editions, were not included).
A licence agreement was obtained allowing the use of these articles for the purposes of the MULTEXT-East project.
The Hungarian Newspaper Corpus currently has 63351 words. The corpus is still being extended.
The corpus body consists of two <div type=newspaper> elements, which contain 94 and 111 <div type=article> elements, respectively. In addition, a number of articles are grouped under one <div type=storylist> , where they appeared as short snippets, each having its own head, in a column under a common heading.
The <div type=article> usually begins with one or more <head> tags, giving the headline(s), and one <byline> naming the source of the article's content (typically a news agency). After this <byline> , another <head> may follow, giving an abstract of the article.
Captions to the pictures accompanying the articles, when they were represented in the digital source, were also included in the corpus. They are normally given at the beginning of <div type=article> .
The <div> elements have no attributes other than type.
Document authors are included where they appeared in the original, usually at the end of the articles; they are marked-up as <byline> <docAuthor> Author or Initials </docAuthor> </byline> .
The text is segmented into paragraphs; the other paragraph-level tag is <note> , which is used when the article is continued on or from another page. In this case the reference to the page is given within a <ref> element. However, no pointer is included.
Sub-paragraph tagging consists of <abbr> , <name> , and <q> . The first two were marked up only in the Jan 25 issue: they were tagged semi-automatically, then manually corrected and provided with the type attribute. <q> is marked up throughout the corpus.
The rend attributes on sub-paragraph tags are included in the same way as in the Hungarian version of ``1984''. No quotation marks are retained in the corpus.
Here follows an example from the corpus:
<byline><abbr>MH</abbr>-információ</byline> <head>Az erős havazás megbénította a <name type=place>Taszárról</name>, illetve <name type=place>Kaposvárról</name> indulni készülő <name type=org>IFOR-konvojok</name> mozgását. Tegnap egyetlen gépkocsiegység sem tudott boszniai rendeltetési helye felé indulni. </head> <p> A fővárosi <name type=org>Rendőri Ezred</name> rendfenntartói és a <name type=org>Somogy Megyei Rendőr-főkapitányság </name>
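As an illustration of how the sub-paragraph markup above might be read back, the following is a minimal sketch that collects <name> elements and their type attribute from a fragment like the one shown. It assumes Python's standard html.parser is acceptable for the job because it tolerates unquoted attribute values such as type=place; the class NameCollector and the function collect_names are hypothetical helpers, not part of the corpus distribution, and a full SGML parser driven by the CES DTD would be the more rigorous choice.

```python
from html.parser import HTMLParser


class NameCollector(HTMLParser):
    """Collect (type, text) pairs for every <name> element in a fragment."""

    def __init__(self):
        super().__init__()
        self.names = []       # collected (type, text) pairs
        self._inside = False  # True while between <name> and </name>
        self._type = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "name":
            self._inside = True
            self._type = dict(attrs).get("type")
            self._buf = []

    def handle_data(self, data):
        if self._inside:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "name" and self._inside:
            # Strip stray whitespace before the closing tag.
            self.names.append((self._type, "".join(self._buf).strip()))
            self._inside = False


def collect_names(fragment):
    """Return the (type, text) pairs found in an SGML-like fragment."""
    parser = NameCollector()
    parser.feed(fragment)
    parser.close()
    return parser.names


sample = (
    "<head>Az erős havazás megbénította a "
    "<name type=place>Taszárról</name>, illetve "
    "<name type=place>Kaposvárról</name> indulni készülő "
    "<name type=org>IFOR-konvojok</name> mozgását.</head>"
)

for name_type, text in collect_names(sample):
    print(name_type, text)
```

The sketch does not handle nested <name> elements; none occur in the sample above.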
The digital source used as the basis of encoding was provided by Magyar Hírlap Publishing House Ltd. It consisted of data files, one file per issue. Practically all information regarding the actual layout of the text was encoded idiosyncratically by a number of methods (special characters, line spacing, etc.). Some differences between the electronic text and the printed paper were observed in places. Where possible, the printed version was taken as the basis for the corpus encoding.
An example from the original (8 bit characters are not rendered):
MH-informçciù Az erÖs havazçs megbÄnÆtotta a Taszçrrùl, illetve Kaposvçrrùl indulni kÄsz*lÖ IFOR-konvojok mozgçsçt. Tegnap egyetlen gÄpkocsiegysÄg sem tudott boszniai rendeltetÄsi helye felÄ indulni. A fÖvçrosi RendÖri Ezred rendfenntartùi Äs a Somogy Megyei RendÖr-fÖkapitçnysçg kÜzlekedÄsi rendÖrei egÄsz nap vçrtçk, hogy a kÄt bçzisrùl felvezetÖi felkÄrÄst kapjanak, de az - lapzçrtçnkig - elmaradt. A fÖutak az *tinform dÄlutçni tçjÄkoztatùja szerint egyelÖre mindenhol jçrhatùak. A havazçs a Dunçnt£lon a legintenzÆvebb, elsÖsorban Zala Äs Vas megyÄben okoz kÜzlekedÄsi gondokat. E ter*letek alsùbbrend* £tjain fÄlszÄlessÄgben szçmÆtani kell hùf£vçsokra is, az orszçg tÜbbi £tjçra çltalçban a hùkçsçs, latyakos felszÆn jellemzÖ. Az autùpçlyçkat sùzzçk, de a hù nem vagy csak nagyon lassan olvad abban a sçvban, amelyikben most ritkçbban kÜzlekednek a jçrm*vek. Fennakadçsokra a hegyes, dombos vidÄkeken lehet szçmÆtani. MegyÄnkÄnt mintegy 25 gÄp tisztÆtja az utakat * tudatta az *tinform. Budapesten is lelassult a kÜzlekedÄs, az utak sùzçsa nem hoz eredmÄnyt, mivel a mÆnusz 5-7 fokos hidegben a hùkçsa rçfagy az aszfaltra. $$$ ***
The original digital version was converted into an ASCII text version, extracting all information present where possible and converting it into CES1-conformant markup. However, given the rather unreliable methods of rendering layout information in the original, a laborious manual correction process was necessary to ensure conformance to the printed issues. During this process, extensive sub-paragraph marking was carried out.
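To give a rough idea of the character-level part of such a conversion, the sketch below restores a few accented letters in the unrendered sample. The table is an assumption inferred purely by aligning the opening of the unrendered sample with the encoded corpus example in this document; it is not the conversion table actually used, it covers only five characters, and it deliberately leaves out ``*'', which stands for more than one character in the sample. The names INFERRED_MAP and restore_accents are hypothetical.

```python
# Correspondences inferred by aligning the two samples in this document.
# This is an illustrative assumption, not the project's real mapping.
INFERRED_MAP = str.maketrans({
    "ç": "á",
    "ù": "ó",
    "Ä": "é",
    "Æ": "í",
    "Ö": "ő",
})


def restore_accents(text):
    """Replace the five inferred placeholder characters with Hungarian letters."""
    return text.translate(INFERRED_MAP)


print(restore_accents("MH-informçciù Az erÖs havazçs megbÄnÆtotta a Taszçrrùl"))
```

Characters outside the table pass through unchanged, so a real conversion would need the complete mapping as well as the layout heuristics described above.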