COP project 106 MULTEXT-East Newspapers, Czech
Contributors: Vladimír Petkevic, Vera Schmiedtová (UJC AVCR) and Jana Klímová (FFUK)
The Czech MULTEXT-East newspaper corpus consists of 451 articles selected from the Czech daily newspaper Lidové noviny, published between 1991 and 1994. The newspaper texts were made available under a written contract. The newspaper data entering MULTEXT-East had already been marked up automatically. This markup, which is used for the texts of the Czech National Corpus and differs from the one required by CES, was automatically transformed into something close to CES. Semiautomatic and manual validation then had to be performed to make the markup CES conformant. The bulk of the corpus consists of short articles from various areas, ranging from politics and interviews to sport. No specific guidelines were followed when choosing the articles for the MULTEXT-East corpus. The articles reflect the usual newspaper style. Under the contract with the publisher Lidové noviny, the newspaper data can be used solely for scientific and academic purposes.
As computed by the Unix program wc over the whole CES-1 document, the Czech newspaper corpus has 99975 words.
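Because wc counts every whitespace-delimited token in the file, SGML tags that stand alone (for example <p> or </p>) are included in this figure together with the running text. A minimal Python sketch of that counting rule follows; it assumes the behaviour of wc -w in the C locale, and the function name count_words is illustrative, not part of the corpus tooling.

```python
def count_words(data: bytes) -> int:
    """Count whitespace-delimited tokens, roughly as `wc -w` does.

    bytes.split() with no argument splits on runs of ASCII whitespace,
    so multi-byte UTF-8 letters never break a token.
    """
    return len(data.split())


# Hypothetical fragment: two tags and two Czech words give four tokens.
sample = "<p> Ahoj  světe </p>\n".encode("utf-8")
print(count_words(sample))
```

A tag written directly against its content, such as <abbr>LN</abbr>, forms a single token under this rule.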
The corpus body consists of 451 <div type=article> elements, each of which contains one article from the original newspaper data. The topmost segmentation is very flat. Most articles are introduced by one or more <head> elements specifying the headline(s), or by one <opener> . The article itself is composed of paragraphs. Most articles end with a <byline> giving the name of the document author. Document authors are included where they appeared in the original, usually at the end of the article; they are marked up as:
<byline> <docauthor> Author </docauthor> </byline> .
The <div> elements have the n attribute, giving the sequential number of the article, and the id attribute. Articles are not further divided into article parts by a <div> tag.
Example from the corpus:
<div complete=y type=article n=1 id=NEWS.1> <head> Zpracovatelům druhotných surovin hrozí krach </head> <p> <hi rend=bo> Úplným zastavením dovozu druhotných surovin ohrozilo ministerstvo životního prostředí podnikatele, kteří se zaměřili na ekologické technologie. Protože se u nás takřka nepodařilo zavést třídění odpadů, řada výrobců je na dovozu druhotných surovin závislá. Na přímé nebezpečí úpadku <abbr>LN</abbr> upozornila českolipská firma <name type=org>Presta</name> , zpracovávající polystyren. </hi> </p> <p> Hromadný zákaz dovozu byl motivován snahou ministerstva, aby se <abbr>ČR</abbr> nestala <q>smetištěm <name type=place>Evropy</name> </q>. Ministerští úředníci totiž dosud povolovali i dodávky odpadů, z nichž byla využitelná jen část a zbytek se hromadil na našich skládkách nebo přispěl znečištění ovzduší při spalování. Pod pláštíkem dovozu železného šrotu bylo <abbr>např.</abbr> povoleno dovážet špony, znečištěné <abbr>PCB</abbr>. Odpovědný pracovník ministerstva <abbr>ing.</abbr> <name type=person>Durdík</name> však nebyl schopen vysvětlit, proč není povolen dovoz použitého polystyrenu ze <abbr>SRN</abbr>, když se u nás zpracovává bez jediného gramu odpadu a je <q>čistší</q> než mnohá přírodní surovina. Přestože se po stížnostech <name>Presty</name> do řešení problému angažoval i premiér <name type=person>Klaus</name> , povolení ve slíbeném termínu <abbr>MŽP</abbr> nevydalo. </p>
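The attribute values in the example are unquoted, which SGML permits but XML parsers reject, so generic XML tools cannot read the files directly. The following Python sketch shows one way to pull the attributes out of the <div> start tags with regular expressions. It is an illustration under simplifying assumptions (no ">" inside attribute values, no SGML tag minimization); the function name div_attributes is hypothetical.

```python
import re

# Start tag of a <div>, capturing the raw attribute string.
DIV_START = re.compile(r"<div\b([^>]*)>", re.IGNORECASE)
# name=value pairs with a double-quoted, single-quoted or bare value.
ATTR = re.compile(r"""([\w.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""")


def div_attributes(sgml: str) -> list[dict[str, str]]:
    """Return one attribute dictionary per <div> start tag, in document order."""
    result = []
    for match in DIV_START.finditer(sgml):
        attrs = {}
        for name, dq, sq, bare in ATTR.findall(match.group(1)):
            # Exactly one of the three value groups is non-empty.
            attrs[name.lower()] = dq or sq or bare
        result.append(attrs)
    return result


sample = "<div complete=y type=article n=1 id=NEWS.1> <head> Headline </head> </div>"
print(div_attributes(sample))
```

A full SGML parser driven by the CES DTD would be the more robust choice for anything beyond a quick inspection.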
The original text for the MULTEXT-East corpus was already marked up according to the DTD used for the Czech National Corpus. A simple transformation was used to bring it close to the CES markup required for the MULTEXT-East files.
The original rendition is mostly reflected in the CES markup. In the newspapers, single quotes are used for direct speech.
The subparagraph markup is relatively detailed; the <date> , <num> , <abbr> and <q> elements are primarily used. However, no name tags are used. Many typos were corrected during markup. Finally, the whole corpus was validated with the nsgmls validator.
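To check which subparagraph elements actually occur in a given file, one could tally the start tags. The Python sketch below is an assumption-laden illustration, not the procedure used by the corpus builders: it matches start tags with a regular expression rather than parsing the SGML, and the function name tag_counts is hypothetical.

```python
import re
from collections import Counter

# A start tag begins with "<" followed by a letter; end tags ("</...") are skipped.
START_TAG = re.compile(r"<([A-Za-z][\w.-]*)")


def tag_counts(sgml: str) -> Counter:
    """Count start tags by element name, case-insensitively."""
    return Counter(name.lower() for name in START_TAG.findall(sgml))


# Short fragment in the style of the corpus example.
sample = "<p> <abbr>LN</abbr> <q>smetištěm <name type=place>Evropy</name> </q> <abbr>ČR</abbr> </p>"
for element, count in sorted(tag_counts(sample).items()):
    print(element, count)
```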