
Czech

COP project 106 MULTEXT-East Newspapers, Czech

Contributors: Vladimír Petkevic, Vera Schmiedtová (UJC AVCR) and Jana Klímová (FFUK)

Description of the corpus

The Czech MULTEXT-East newspaper corpus consists of 451 articles from the Czech daily newspaper Lidové noviny, selected from those published in the period 1991-1994. The newspaper texts were made available under a written contract. The newspaper data entering MULTEXT-East had already been marked up automatically. This markup, which is used for texts in the Czech National Corpus and differs from the one required by CES, was automatically transformed into something close to CES. Semiautomatic and manual validation then had to be performed to make the markup CES conformant. The bulk of the corpus consists of short articles from various areas, ranging from politics through interviews to sport. No specific guidelines were followed when choosing the articles for the MULTEXT-East corpus. The articles reflect the usual newspaper style. Under the contract with the publisher Lidové noviny, the newspaper data can be used solely for scientific and academic purposes.

As computed by the Unix program wc over the whole CES-1 document, the Czech newspaper corpus has 99975 words.
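Because wc counts whitespace-delimited tokens, a count taken over the whole document also includes any markup tags that stand alone between whitespace. A minimal Python sketch of that counting rule follows; the function name wc_words and the sample string are illustrative assumptions, and the sample is not the actual corpus file.

```python
def wc_words(text: str) -> int:
    """Count whitespace-delimited tokens, roughly the rule `wc -w` applies."""
    return len(text.split())


# Hypothetical three-line sample: two stand-alone tags plus three Czech words.
sample = "<p>\nHromadný zákaz dovozu\n</p>"
print(wc_words(sample))  # 5
```

The two tags count as words here, so a figure obtained this way is an upper bound on the number of running-text words.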

Structure of the corpus

The corpus body consists of 451 <div type=article> elements, each of which contains one article from the original newspaper data. The top-level segmentation is very flat. Most articles are introduced by one or more <head> elements specifying the headline(s), or by one <opener> . Each article is composed of paragraphs. Most articles end with a <byline> giving the name of the document author. Document authors are included where they appeared in the original, usually at the end of the articles; they are marked up as:

<byline> <docauthor> Author </docauthor> </byline> .

The <div> elements have the n attribute, giving the sequential number of the article, and the id attribute. Articles are not further divided into parts by <div> tags.
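The flat structure described above (one <div type=article> per article, carrying n and id, with optional <head> elements) can be walked with a tolerant tag parser. The following is only a sketch: it uses Python's html.parser, which is an HTML parser rather than an SGML parser, and relies on the fact that it accepts unquoted attribute values. The names ArticleIndexer and index_articles are hypothetical, and the sample is abridged from the corpus example in this section, with a closing </div> assumed.

```python
from html.parser import HTMLParser


class ArticleIndexer(HTMLParser):
    """Collect n, id and headline text for every <div type=article>."""

    def __init__(self):
        # convert_charrefs=True also turns entities such as &uring; into characters.
        super().__init__(convert_charrefs=True)
        self.articles = []
        self._in_head = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("type") == "article":
            self.articles.append({"n": attrs.get("n"), "id": attrs.get("id"), "heads": []})
        elif tag == "head" and self.articles:
            self._in_head = True
            self.articles[-1]["heads"].append("")

    def handle_data(self, data):
        if self._in_head:
            self.articles[-1]["heads"][-1] += data

    def handle_endtag(self, tag):
        if tag == "head" and self._in_head:
            self._in_head = False
            heads = self.articles[-1]["heads"]
            heads[-1] = heads[-1].strip()


def index_articles(sgml_text):
    """Return one dict per article found in sgml_text."""
    parser = ArticleIndexer()
    parser.feed(sgml_text)
    parser.close()
    return parser.articles


# Abridged sample; the paragraph body is shortened and </div> is assumed.
sample = """<div complete=y type=article n=1 id=NEWS.1>
<head>
 Zpracovatel&uring;m druhotn&yacute;ch surovin hroz&iacute; krach
</head>
<p>
Hromadn&yacute; z&aacute;kaz dovozu
</p>
</div>"""

articles = index_articles(sample)
for a in articles:
    print(a["n"], a["id"], len(a["heads"]))  # 1 NEWS.1 1
```

A real SGML parser driven by the CES DTD would be the more faithful choice; the sketch only illustrates how little nesting has to be tracked given the flat segmentation.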

Example from the corpus:

<div complete=y type=article n=1 id=NEWS.1>

<head>
 Zpracovatel&uring;m druhotn&yacute;ch surovin hroz&iacute; krach
</head>

<p>
<hi rend=bo>

&Uacute;pln&yacute;m zastaven&iacute;m dovozu druhotn&yacute;ch surovin ohrozilo
ministerstvo &zcaron;ivotn&iacute;ho prost&rcaron;ed&iacute; podnikatele,
kte&rcaron;&iacute; se zam&ecaron;&rcaron;ili na ekologick&eacute; technologie.
Proto&zcaron;e se u n&aacute;s tak&rcaron;ka nepoda&rcaron;ilo zav&eacute;st
t&rcaron;&iacute;d&ecaron;n&iacute; odpad&uring;, &rcaron;ada v&yacute;robc&uring; je na
dovozu druhotn&yacute;ch surovin z&aacute;visl&aacute;. Na p&rcaron;&iacute;m&eacute;
nebezpe&ccaron;&iacute; &uacute;padku
<abbr>LN</abbr>
 upozornila &ccaron;eskolipsk&aacute; firma
<name type=org>Presta</name>
, zpracov&aacute;vaj&iacute;c&iacute; polystyren.
</hi>
</p>

<p>
Hromadn&yacute; z&aacute;kaz dovozu byl motivov&aacute;n snahou ministerstva, aby
se
<abbr>&Ccaron;R</abbr>
 nestala
<q>smeti&scaron;t&ecaron;m
<name type=place>Evropy</name>
</q>.

Minister&scaron;t&iacute; &uacute;&rcaron;edn&iacute;ci toti&zcaron; dosud povolovali i
dod&aacute;vky odpad&uring;, z nich&zcaron; byla vyu&zcaron;iteln&aacute; jen
&ccaron;&aacute;st a zbytek se hromadil na na&scaron;ich skl&aacute;dk&aacute;ch nebo
p&rcaron;isp&ecaron;l zne&ccaron;i&scaron;t&ecaron;n&iacute; ovzdu&scaron;&iacute;
p&rcaron;i spalov&aacute;n&iacute;. Pod pl&aacute;&scaron;t&iacute;kem dovozu
&zcaron;elezn&eacute;ho &scaron;rotu bylo
<abbr>nap&rcaron;.</abbr>
povoleno dov&aacute;&zcaron;et &scaron;pony, zne&ccaron;i&scaron;t&ecaron;n&eacute;
<abbr>PCB</abbr>.
 Odpov&ecaron;dn&yacute; pracovn&iacute;k ministerstva
<abbr>ing.</abbr>
<name type=person>Durd&iacute;k</name>
 v&scaron;ak nebyl schopen vysv&ecaron;tlit, pro&ccaron; nen&iacute; povolen dovoz
pou&zcaron;it&eacute;ho polystyrenu ze
<abbr>SRN</abbr>,
kdy&zcaron; se u n&aacute;s zpracov&aacute;v&aacute; bez jedin&eacute;ho gramu odpadu a je
<q>&ccaron;ist&scaron;&iacute;</q> ne&zcaron; mnoh&aacute; p&rcaron;&iacute;rodn&iacute;
surovina. P&rcaron;esto&zcaron;e se po st&iacute;&zcaron;nostech
<name>Presty</name>
do &rcaron;e&scaron;en&iacute; probl&eacute;mu anga&zcaron;oval i premi&eacute;r
<name type=person>Klaus</name>
, povolen&iacute; ve sl&iacute;ben&eacute;m term&iacute;nu
<abbr>M&Zcaron;P</abbr>
 nevydalo.
</p>
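The example encodes Czech characters as named SGML entities such as &uring; and &rcaron;. As a rough illustration, the entity names that occur in this example also belong to the HTML5 entity set known to Python's html module, so html.unescape can resolve them. This is an assumption about the names shown here, not a guarantee for every entity in the corpus, and the helper name decode_entities is hypothetical.

```python
import html


def decode_entities(text: str) -> str:
    """Replace named character entities such as &uring; with Unicode characters."""
    return html.unescape(text)


# Headline taken from the corpus example.
headline = "Zpracovatel&uring;m druhotn&yacute;ch surovin hroz&iacute; krach"
decoded = decode_entities(headline)
# decoded == "Zpracovatelům druhotných surovin hrozí krach"
```

Entity names that html.unescape does not recognise are left in the text unchanged, so a remaining "&" in the output points at a name that needs a separate mapping.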

Structure of the original

The original text for the MULTEXT-East corpus was already marked up according to the DTD used for the Czech National Corpus. A simple transformation was used to bring it close to the CES markup needed for the MULTEXT-East files.

The original rendition is mostly reflected in the CES markup. Single quotes are used for direct speech in the newspapers.

The markup

The subparagraph markup is relatively detailed; the elements primarily used are <date> , <num> , <abbr> and <q> . However, no name tags are used. Many typos were corrected during markup. Finally, the whole corpus was validated with the nsgmls validator.


Tomaz Erjavec
Sat May 18 20:25:31 MDT 1996