Contributors: Vladimír Petkevic, Vera Schmiedtová (UJC AVCR) and Jana Klímová (FFUK)
The Czech MULTEXT-East newspaper corpus consists of 451 articles selected from the Czech daily newspaper Lidové noviny, published in the period 1991-1994. The newspaper texts were made available under a written contract. The newspaper data arrived in MULTEXT-East already automatically marked up. This markup, which is used for the texts in the Czech National Corpus and differs from the one required by CES, was automatically transformed into something close to CES. Semi-automatic and manual validation then had to be performed to make the markup CES conformant. The bulk of the corpus consists of short articles from various areas, ranging from politics through interviews to sport. No specific guidelines were followed when choosing the articles for the MULTEXT-East corpus. The articles reflect the usual newspaper style. Under the contract with the publisher, Lidové noviny, the newspaper data can be used solely for scientific and academic purposes.
As computed by the Unix program wc over the whole CES-1 document, the Czech Newspaper corpus has 99564 words in 1210258 bytes.
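Because the counts are taken over the whole document, the word figure presumably includes markup tokens as well as running text. The following is a minimal Python sketch of how such counts could be approximated, assuming that a word is any run of bytes delimited by ASCII whitespace (as wc does in the C locale). The function name wc_counts and the sample bytes are illustrative only and are not part of the corpus distribution.

```python
def wc_counts(data: bytes) -> tuple[int, int]:
    """Return (words, bytes) roughly as `wc -w` and `wc -c` would report them."""
    # bytes.split() with no argument splits on runs of ASCII whitespace,
    # so a tag glued to a word (e.g. b"<p>text") counts as a single word.
    return len(data.split()), len(data)


# Illustrative sample only; a real run would read the corpus file in binary mode.
sample = "<p>Úplným zastavením dovozu</p>\n".encode("utf-8")
words, size = wc_counts(sample)
print(words, size)
```

Note that the byte count depends on the character encoding of the file, so the same text stored in a different encoding would give a different size.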
The corpus body consists of 451 <div type=article> elements, each of which contains one article from the original newspaper data. The topmost segmentation is very flat. Most articles are introduced by one or more <head> elements specifying the headline(s), or by one <opener>. The article itself is composed of paragraphs. Most articles end with a <byline> giving the name of the document author. Document authors are included where they appeared in the original, usually at the end of the article; they are marked up as:
<byline><docauthor>Author</docauthor></byline>.
The <div> elements have the n attribute, giving the sequential number of the article, and the id attribute. Articles are not further divided into parts by nested <div> tags.
Example from the corpus:
<div complete=y type=article n=1 id=NEWS.1>
<head>
Zpracovatelům druhotných surovin hrozí krach
</head>
<p>
<hi rend=bo>
Úplným zastavením dovozu druhotných
surovin ohrozilo ministerstvo životního
prostředí podnikatele, kteří se
zaměřili na ekologické technologie.
Protože se u nás takřka nepodařilo
zavést třídění odpadů,
řada výrobců je na dovozu druhotných
surovin závislá. Na přímé
nebezpečí úpadku
<abbr>LN</abbr>
upozornila českolipská firma
<name type=org>Presta</name>
, zpracovávající polystyren.
</hi>
</p>
<p>
Hromadný zákaz dovozu byl motivován snahou
ministerstva, aby se
<abbr>ČR</abbr>
nestala
<q>smetištěm
<name type=place>Evropy</name>
</q>.
Ministerští úředníci totiž
dosud povolovali i dodávky odpadů, z nichž byla
využitelná jen část a zbytek se hromadil
na našich skládkách nebo přispěl
znečištění ovzduší
při spalování. Pod pláštíkem
dovozu železného šrotu bylo
<abbr>např.</abbr>
povoleno dovážet špony,
znečištěné
<abbr>PCB</abbr>.
Odpovědný pracovník ministerstva
<abbr>ing.</abbr>
<name type=person>Durdík</name>
však nebyl schopen vysvětlit, proč není
povolen dovoz použitého polystyrenu ze
<abbr>SRN</abbr>,
když se u nás zpracovává bez
jediného gramu odpadu a je <q>čistší</q>
než mnohá přírodní surovina.
Přestože se po stížnostech
<name>Presty</name>
do řešení problému angažoval i
premiér
<name type=person>Klaus</name>
, povolení ve slíbeném termínu
<abbr>MŽP</abbr>
nevydalo.
</p>
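The structure shown in the example can be inspected with a few lines of Python. The sketch below is only an illustration under stated assumptions: it uses regular expressions rather than a real SGML parser, it assumes that attribute values are either unquoted (as in the example) or double-quoted, and the names SAMPLE, START_TAG, ATTR and start_tags are hypothetical helpers, not part of any corpus tooling. The sample string is an abridged fragment modelled on the example above.

```python
import re
from collections import Counter

# Abridged, illustrative fragment in the style of the corpus example.
SAMPLE = """<div complete=y type=article n=1 id=NEWS.1>
<head>
Zpracovatelům druhotných surovin hrozí krach
</head>
<p>
<abbr>LN</abbr>
upozornila českolipská firma
<name type=org>Presta</name>
</p>
<byline><docauthor>Author</docauthor></byline>
"""

# A start tag: element name followed by zero or more name=value attributes.
START_TAG = re.compile(
    r"<([A-Za-z][\w.-]*)((?:\s+[\w.-]+=(?:\"[^\"]*\"|[^\s>]+))*)\s*>"
)
# One attribute: name=value, where the value is double-quoted or unquoted.
ATTR = re.compile(r"([\w.-]+)=(?:\"([^\"]*)\"|([^\s>]+))")


def start_tags(text):
    """Yield (element name, attribute dict) for each start tag in text."""
    for match in START_TAG.finditer(text):
        attrs = {
            name: (quoted if quoted else unquoted)
            for name, quoted, unquoted in ATTR.findall(match.group(2))
        }
        yield match.group(1).lower(), attrs


# Collect the attributes of every article-level <div>.
articles = [
    attrs
    for name, attrs in start_tags(SAMPLE)
    if name == "div" and attrs.get("type") == "article"
]
# Tally how often each element is opened in the fragment.
counts = Counter(name for name, _ in start_tags(SAMPLE))

print(articles[0]["n"], articles[0]["id"])
print(counts["abbr"], counts["name"])
```

For validation, a real SGML parser run against the DTD remains the appropriate tool; a regular-expression scan like this is suitable only for quick tallies of elements and attributes.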
The original text for the MULTEXT-East corpus was already marked up according to the DTD used for the Czech National Corpus. A simple transformation was used to bring it close to the CES markup needed for the MULTEXT-East files.
The original rendition is mostly reflected in the CES markup. In the newspapers, single quotes are used for direct speech.
The subparagraph markup is relatively detailed; the <date>, <num>, <abbr> and <q> elements are primarily used. However, no name tags are used. Many typos were corrected during markup. Finally, the whole corpus was validated with the nsgmls validator.