
Czech

  COP project 106 MULTEXT-East ``1984'', Czech

Contributors: Vladimír Petkevič (UTKL) and Jana Klímová (FFUK)

Description of the Corpus

The electronic version of the Czech translation of ``1984'' was obtained by OCR at the Institute of Theoretical and Computational Linguistics, Faculty of Philosophy, Charles University, Prague.

The electronic version of the book may be used for research and academic purposes only. We do not have written permission from the publisher to use the text, but we have an oral agreement with the publisher to this effect.

The book was published by the publishing house Naše vojsko, Prague, Czech Republic, in 1991. This is the first and, so far, the only edition.

As computed by the Unix program wc over the whole CES-1 document, the Czech version of ``1984'' has 99977 words.
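Because wc was run over the whole marked-up document, the figure presumably includes tags and standalone punctuation as well as running text. The following is a minimal sketch, not the procedure used for the corpus: it approximates `wc -w` by splitting on whitespace and, for comparison, counts again after blanking out tags. The function names and the sample string are illustrative assumptions.

```python
import re


def count_words(text):
    """Count whitespace-delimited tokens, approximating `wc -w`."""
    return len(text.split())


def count_words_without_tags(text):
    """Same count after replacing every <...> tag with a space."""
    return len(re.sub(r"<[^>]*>", " ", text).split())


# A short fragment in the style of the corpus markup (illustrative only).
sample = "<name type=person>\nWinston\n</name>\n, se dala"

print(count_words(sample))
print(count_words_without_tags(sample))
```

In this fragment the start tag alone contributes two tokens, since it contains a space, and the free-standing comma is counted as a word in both cases.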

Structure of the Corpus

The Czech ``1984'' corpus body consists of three <div type=part> elements and one <div type=appendix>. Each part is further subdivided into a number of <div type=chapter> elements.

The <div> elements have an n attribute, giving the sequence number of the <div> at its level, and an id attribute, whose value consists of the prefix ORW1984 followed by the part and chapter numbers, separated by periods, e.g. <div type=chapter n=2 id=ORW1984.1.2>.
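The numbering scheme can be sketched in a few lines of Python. This is only an illustration: the helper name `div_start_tag` is hypothetical, and it assumes that the first number after the prefix is the part number and the second the chapter number, which is how the example id reads.

```python
def div_start_tag(div_type, numbers, prefix="ORW1984"):
    """Build a <div> start tag.

    `numbers` lists the division numbers from the outermost level
    inwards, e.g. [1, 2] for chapter 2 of part 1 (assumed reading).
    """
    div_id = ".".join([prefix] + [str(n) for n in numbers])
    # The n attribute is the number of the division at its own level.
    return f"<div type={div_type} n={numbers[-1]} id={div_id}>"


print(div_start_tag("part", [1]))
print(div_start_tag("chapter", [1, 2]))
```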

The text within chapters is segmented into paragraphs. Detailed sub-paragraph markup uses the <name>, <hi>, <num>, <q>, <abbr>, <foreign>, <date> and other tags. The markup was produced semi-automatically, and the text was spell-checked. In accordance with CES, only proper nouns have been tagged; adjectives derived from proper nouns, e.g. Winstonovo, have not.

The rendering information records italics, bold face and capitalisation. The entire book is marked up at the same level of detail, i.e. no part is more detailed than the rest. The following is an example from the Czech ``1984'' corpus:

<p>
<name type=org>
Ministerstvo pravdy
</name>
, v
<foreign lang=ns>
newspeaku
</foreign>
<name type=org lang=ns>
Pramini
</name>
(
<foreign lang=ns>
Newspeak
</foreign>
byl
&uacute;&rcaron;edn&iacute;m jazykem
<name type=place>
Oce&aacute;nie
</name>
.
O jej&iacute; struktu&rcaron;e a etymologii viz
<name>
Dodatek
</name>
.), se d&ecaron;siv&ecaron; li&scaron;ilo od v&scaron;ech
ostatn&iacute;ch objekt&uring; v dohledu. Byla to obrovsk&aacute;
stavba tvaru pyramidy ze
z&aacute;&rcaron;iv&ecaron; b&iacute;l&eacute;ho betonu,
kter&aacute; se terasovit&ecaron; vyp&iacute;nala do
v&yacute;&scaron;ky
<num>
300
</num>
metr&uring;. Z m&iacute;sta, kde st&aacute;l
<name type=person>
Winston
</name>
, se dala na b&iacute;l&eacute;m pr&uring;&ccaron;el&iacute;
p&rcaron;e&ccaron;&iacute;st
ozdobn&yacute;m p&iacute;smem vyveden&aacute; t&rcaron;i hesla
<name type=org>
Strany
</name>
:
<q rend="CN CA" type=slogan>
V&Aacute;LKA JE M&Iacute;R
</q>
<q rend="CN CA" type=slogan>
SVOBODA JE OTROCTV&Iacute;
</q>
<q rend="CN CA" type=slogan>
NEV&Ecaron;DOMOST JE S&Iacute;LA
</q>
</p>
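Markup of this kind can be read with a tolerant parser. The following is a minimal sketch, assuming Python's standard `html.parser`, which accepts unquoted attribute values and decodes the named character entities used in the excerpt; it is not a tool from the project. The class `NameCollector`, the function `collect_names` and the abridged sample are illustrative.

```python
from html.parser import HTMLParser


class NameCollector(HTMLParser):
    """Collect the type attribute and text of every <name> element."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &aacute; in text.
        super().__init__(convert_charrefs=True)
        self.names = []      # finished (type, text) pairs
        self._attrs = None   # attributes of the currently open <name>
        self._chunks = []    # text seen inside the open <name>

    def handle_starttag(self, tag, attrs):
        if tag == "name":
            self._attrs = dict(attrs)
            self._chunks = []

    def handle_data(self, data):
        if self._attrs is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "name" and self._attrs is not None:
            # Collapse the line breaks around the element content.
            text = " ".join("".join(self._chunks).split())
            self.names.append((self._attrs.get("type"), text))
            self._attrs = None


def collect_names(sgml):
    parser = NameCollector()
    parser.feed(sgml)
    parser.close()
    return parser.names


# Three <name> elements taken from the excerpt; the text between them is omitted.
sample = """<p>
<name type=org>
Ministerstvo pravdy
</name>
<name type=place>
Oce&aacute;nie
</name>
<name type=person>
Winston
</name>
</p>"""

for name_type, text in collect_names(sample):
    print(name_type, text)
```

An XML parser would reject the unquoted attribute values, which is why a lenient HTML parser is used here; a proper SGML parser driven by the CES DTD would be the more faithful choice.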

Markup Process

The scanned text of the Czech ``1984'' was spell-checked and marked up to conform to CES1. A few typographical errors found in the process were corrected.






Tomaz Erjavec
Sat May 18 20:25:31 MDT 1996