


 COP project 106 MULTEXT-East Deliverable D2.1 F Newspapers, Czech

Contributors: Vladimír Petkevic, Vera Schmiedtová (UJC AVCR) and Jana Klímová (FFUK)

Description of the corpus

The Czech MULTEXT-East newspaper corpus consists of 451 articles from the Czech daily newspaper Lidové noviny, selected from those published in the period 1991-1994. The newspaper texts were made available under a written contract. The newspaper data coming into MULTEXT-East had already been marked up automatically. This markup, which is used for the texts in the Czech National Corpus and differs from the one required by CES, was automatically transformed into something close to CES. Semiautomatic and manual validation then had to be performed to make the markup CES conformant. The bulk of the corpus consists of short articles from various areas, ranging from politics through interviews to sport. No specific guidelines were followed when choosing the articles for the MULTEXT-East corpus. The articles reflect the usual newspaper style. Under the contract with the publisher Lidové noviny, the newspaper data can be used solely for scientific and academic purposes.

As computed by the Unix program wc over the whole CES-1 document, the Czech Newspaper corpus has 99564 words in 1210258 bytes.
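The counting behind these figures can be approximated with a few lines of Python. This is a minimal sketch, not the procedure actually used: the function name `wc_counts` is made up for illustration, and it assumes that splitting on ASCII whitespace is close enough to what `wc -w` does in the default locale.

```python
from typing import Tuple


def wc_counts(data: bytes) -> Tuple[int, int]:
    """Return (words, bytes), roughly as `wc -w` and `wc -c` would report them."""
    # bytes.split() with no argument splits on runs of ASCII whitespace,
    # so tags and entity references glued to a word count as part of that word.
    words = len(data.split())
    size = len(data)
    return words, size


# A line taken from the corpus sample in this document.
sample = b"Hromadn&yacute; z&aacute;kaz dovozu byl motivov&aacute;n snahou\n"
words, size = wc_counts(sample)
print(words, "words in", size, "bytes")
```

Because the count runs over the raw markup, tags contribute to the word total and entity references such as `&yacute;` inflate the byte total relative to the decoded text.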

Structure of the corpus

The corpus body consists of 451 <div type=article> elements, each of which contains one article from the original newspaper data. The topmost segmentation is very flat. Most articles are introduced by one or more <head> elements specifying the headline(s), or by one <opener>. The article itself is composed of paragraphs. Most articles end with a <byline> giving the name of the document author; authors are included where they appeared in the original, usually at the end of the articles.


Each <div> element carries an n attribute, giving the sequential number of the article, and an id attribute. Articles are not further divided into parts by nested <div> tags.
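Since the attribute values in the corpus are written without quotes (SGML minimization), a generic XML parser will not read these start tags directly. The following sketch shows one way to pull the n and id attributes out of a start tag with a regular expression. The helper name `parse_start_tag` is hypothetical, and the pattern assumes that unquoted values contain neither whitespace nor `>`.

```python
import re
from typing import Dict, Tuple

# name=value pairs; the value may be double-quoted, single-quoted, or unquoted.
ATTR_RE = re.compile(r"""([A-Za-z][\w.-]*)\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""")
TAG_RE = re.compile(r"<\s*([A-Za-z][\w.-]*)([^>]*)>")


def parse_start_tag(tag: str) -> Tuple[str, Dict[str, str]]:
    """Split an SGML start tag into (element name, attribute dict)."""
    match = TAG_RE.match(tag.strip())
    if match is None:
        raise ValueError("not a start tag: %r" % tag)
    name, rest = match.group(1), match.group(2)
    # Strip surrounding quotes, if any, from each value.
    attrs = {key.lower(): value.strip("\"'") for key, value in ATTR_RE.findall(rest)}
    return name.lower(), attrs


# Start tag in the form used by the corpus.
element, attrs = parse_start_tag("<div complete=y type=article n=1 id=NEWS.1>")
print(element, attrs["n"], attrs["id"])
```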

Example from the corpus:

<div complete=y type=article n=1 id=NEWS.1>

 Zpracovatel&uring;m druhotn&yacute;ch surovin hroz&iacute; krach

<hi rend=bo>

&Uacute;pln&yacute;m zastaven&iacute;m dovozu druhotn&yacute;ch
surovin ohrozilo ministerstvo &zcaron;ivotn&iacute;ho
prost&rcaron;ed&iacute; podnikatele, kte&rcaron;&iacute; se
zam&ecaron;&rcaron;ili na ekologick&eacute; technologie.
Proto&zcaron;e se u n&aacute;s tak&rcaron;ka nepoda&rcaron;ilo
zav&eacute;st t&rcaron;&iacute;d&ecaron;n&iacute; odpad&uring;,
&rcaron;ada v&yacute;robc&uring; je na dovozu druhotn&yacute;ch
surovin z&aacute;visl&aacute;. Na p&rcaron;&iacute;m&eacute;
nebezpe&ccaron;&iacute; &uacute;padku
 upozornila &ccaron;eskolipsk&aacute; firma
<name type=org>Presta</name>
, zpracov&aacute;vaj&iacute;c&iacute; polystyren.


Hromadn&yacute; z&aacute;kaz dovozu byl motivov&aacute;n snahou
ministerstva, aby se
<name type=place>Evropy</name>

Minister&scaron;t&iacute; &uacute;&rcaron;edn&iacute;ci toti&zcaron;
dosud povolovali i dod&aacute;vky odpad&uring;, z nich&zcaron; byla
vyu&zcaron;iteln&aacute; jen &ccaron;&aacute;st a zbytek se hromadil
na na&scaron;ich skl&aacute;dk&aacute;ch nebo p&rcaron;isp&ecaron;l
zne&ccaron;i&scaron;t&ecaron;n&iacute; ovzdu&scaron;&iacute;
p&rcaron;i spalov&aacute;n&iacute;. Pod pl&aacute;&scaron;t&iacute;kem
dovozu &zcaron;elezn&eacute;ho &scaron;rotu bylo
povoleno dov&aacute;&zcaron;et &scaron;pony,
 Odpov&ecaron;dn&yacute; pracovn&iacute;k ministerstva
<name type=person>Durd&iacute;k</name>
 v&scaron;ak nebyl schopen vysv&ecaron;tlit, pro&ccaron; nen&iacute;
povolen dovoz pou&zcaron;it&eacute;ho polystyrenu ze

kdy&zcaron; se u n&aacute;s zpracov&aacute;v&aacute; bez
jedin&eacute;ho gramu odpadu a je <q>&ccaron;ist&scaron;&iacute;</q>
ne&zcaron; mnoh&aacute; p&rcaron;&iacute;rodn&iacute; surovina.
P&rcaron;esto&zcaron;e se po st&iacute;&zcaron;nostech
do &rcaron;e&scaron;en&iacute; probl&eacute;mu anga&zcaron;oval i
<name type=person>Klaus</name>
, povolen&iacute; ve sl&iacute;ben&eacute;m term&iacute;nu
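
The sample above encodes Czech letters as SGML character entities (`&rcaron;`, `&uring;` and so on). For reading or further processing they can be turned into Unicode characters. The sketch below relies on Python's standard `html.unescape`, whose HTML5 entity table happens to include the entity names seen in this sample; whether it covers every entity used in the full corpus is an assumption that has not been checked here.

```python
import html


def decode_entities(text: str) -> str:
    """Replace named character entities with the Unicode characters they denote."""
    return html.unescape(text)


# Headline from the corpus sample.
headline = "Zpracovatel&uring;m druhotn&yacute;ch surovin hroz&iacute; krach"
print(decode_entities(headline))  # Zpracovatelům druhotných surovin hrozí krach
```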

Structure of the original

The original text for the MULTEXT-East corpus had already been marked up according to the DTD used for the Czech National Corpus. A simple transformation was used to bring this markup close to the CES required for the MULTEXT-East (ME) files.

The original rendition is mostly reflected in the CES markup. In the newspapers, single quotes are used for direct speech.

The markup

The subparagraph markup is relatively detailed; the <date>, <num>, <abbr> and <q> elements are the ones primarily used. However, no name tags are used. Many typos were corrected during markup. Finally, the whole corpus was validated with the nsgmls validator.
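
To see which subparagraph elements dominate in a given file, the start tags can simply be counted. This is a rough sketch under stated assumptions: the function name `count_elements` is made up, the pattern assumes that no attribute value contains `>`, and the `<date>` in the test fragment is a hypothetical illustration rather than text from the corpus.

```python
import re
from collections import Counter

# A start tag begins with "<" followed directly by a letter,
# so end tags ("</q>") and entity references are not matched.
START_TAG_RE = re.compile(r"<([A-Za-z][\w.-]*)\b[^>]*>")


def count_elements(sgml: str) -> Counter:
    """Count start tags per element name (case-insensitive)."""
    return Counter(name.lower() for name in START_TAG_RE.findall(sgml))


# The <q> part follows the corpus sample; the <date> part is hypothetical.
fragment = "je <q>&ccaron;ist&scaron;&iacute;</q> ne&zcaron; <date>1993</date>"
for name, count in sorted(count_elements(fragment).items()):
    print(name, count)
```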
