 COP project 106 MULTEXT-East Deliverable D2.1 F ``1984'', Czech

Contributors: Vladimír Petkevic (UTKL) and Jana Klímová (FFUK)

Description of the Corpus

The electronic version of the Czech translation of ``1984'' was produced by OCR at the Institute of Theoretical and Computational Linguistics, Faculty of Philosophy, Charles University, Prague.

The electronic version of the book may be used for research and academic purposes only. We do not hold written permission from the publisher to use the text, but we have an oral agreement with the publisher to this effect.

The book was published in 1991 by the publishing house Nase vojsko, Prague, Czech Republic. This is the first and, so far, the only edition of the book.

As computed by the Unix program wc over the whole CES-1 document, the Czech version of ``1984'' has 99180 words in 1231059 bytes.
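
For readers who want to reproduce such figures without wc, the following is a minimal Python sketch of the same kind of count. It assumes that "words" means whitespace-delimited tokens, which is how wc -w counts them, so markup tokens and entity references are counted along with the running text. The function name wc_counts and the sample bytes are illustrative only and are not part of the corpus distribution.

```python
from typing import Tuple


def wc_counts(data: bytes) -> Tuple[int, int]:
    """Return (words, bytes) in the manner of `wc -w` and `wc -c`."""
    # bytes.split() with no argument splits on runs of ASCII whitespace,
    # so a token such as "<s>Byla" or "stavba</s>" counts as one word.
    return len(data.split()), len(data)


# Illustrative sample, not an exact quotation of the corpus file.
sample = b"<s>Byla to obrovsk&aacute; stavba</s>\n"
words, size = wc_counts(sample)
print(words, size)
```

Because the count runs over the marked-up document, it should be read as a size estimate for the file rather than as the number of Czech word forms in the novel.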

Structure of the Corpus

The body of the Czech ``1984'' corpus consists of three <div type=part> elements and one <div type=appendix> element. Each part is further subdivided into a number of <div type=chapter> elements.

The <div> elements carry two attributes: n, which gives the sequential number of the <div> at its level, and id, whose value consists of the prefix Ocs followed by the division numbers separated by periods, e.g. <div type=chapter n=2 id=Ocs.1.2>.
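
The identifier scheme can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool used to build the corpus: the helper names make_id and parse_id are hypothetical, and the sketch assumes that every component after the Ocs prefix is a decimal number.

```python
PREFIX = "Ocs"  # prefix used by the Czech text, as described above


def make_id(*numbers: int) -> str:
    """Build an identifier such as Ocs.1.2 from its numeric components."""
    return ".".join([PREFIX] + [str(n) for n in numbers])


def parse_id(identifier: str) -> tuple:
    """Split an identifier into its numeric components, or raise ValueError."""
    prefix, *parts = identifier.split(".")
    if prefix != PREFIX or not parts or not all(p.isdigit() for p in parts):
        raise ValueError("not a well-formed identifier: %r" % identifier)
    return tuple(int(p) for p in parts)


print(make_id(1, 2))          # Ocs.1.2
print(parse_id("Ocs.1.1.7"))  # (1, 1, 7)
```

An identifier with no numeric part, such as a bare "Ocs.", is rejected by parse_id, which makes the function usable as a simple consistency check.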

The text within chapters is segmented into paragraphs. Detailed sub-paragraph markup was added using the <name>, <hi>, <num>, <q>, <abbr>, <foreign>, <date> and other tags. The markup was performed semi-automatically, and the text was spell-checked. In accordance with CES, only proper nouns have been tagged, while adjectives derived from proper nouns, e.g. Winstonovo, have not.
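
As an illustration of how the <name> markup can be used, the following Python sketch lists tagged names together with their type attribute. It is a rough regular-expression approach that assumes <name> elements are not nested and always have an explicit end tag; it is not a full SGML parser, and the function name names_by_type is hypothetical.

```python
import re
from collections import Counter

# Matches <name ...>text</name>; assumes no nesting and explicit end tags.
NAME_RE = re.compile(r"<name\b([^>]*)>(.*?)</name>", re.DOTALL)
# Matches type=org as well as type="org".
TYPE_RE = re.compile(r'\btype=("?)([^\s">]+)\1')


def names_by_type(sgml: str) -> list:
    """Return (type, text) pairs for every <name> element found."""
    found = []
    for attrs, text in NAME_RE.findall(sgml):
        match = TYPE_RE.search(attrs)
        name_type = match.group(2) if match else None
        found.append((name_type, " ".join(text.split())))
    return found


sample = (
    "<name type=org>Ministerstvo pravdy</name>, "
    "<name type=person>Winston</name> <name type=org>Strany</name>"
)
pairs = names_by_type(sample)
print(pairs)
print(Counter(name_type for name_type, _ in pairs))
```

The same pattern could be adapted to the other sub-paragraph tags, but a validating SGML parser is the safer choice for anything beyond quick inspection.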

Italics, bold face and capitalisation were recorded as rendering information. The entire book is marked up at the same level of detail, i.e. no part is more detailed than the rest. The following is an example from the Czech ``1984'' corpus:

<p id="Ocs.1.1.7">
<s id="Ocs."><name type=org>Ministerstvo pravdy</name>,
<name type=language>newspeaku</name>
<ptr id="Ocs." target="Ocs.1.1.8" rend=asterisk>
<name type=org lang=ns-cs>Pramini</name>
se d&ecaron;siv&ecaron; li&scaron;ilo od v&scaron;ech ostatn&iacute;ch objekt&uring; v dohledu.</s>
<s id="Ocs.">Byla to obrovsk&aacute; stavba tvaru pyramidy ze z&aacute;&rcaron;iv&ecaron; 
b&iacute;l&eacute;ho betonu, kter&aacute; se terasovit&ecaron; vyp&iacute;nala do v&yacute;&scaron;ky
metr&uring;.</s> <s id="Ocs.">Z m&iacute;sta, kde st&aacute;l
<name type=person>Winston</name>, se dala na b&iacute;l&eacute;m pr&uring;&ccaron;el&iacute;
p&rcaron;e&ccaron;&iacute;st ozdobn&yacute;m p&iacute;smem vyveden&aacute; t&rcaron;i hesla
<name type=org>Strany</name>:
<q id="Ocs." rend="CE CA" type=slogan>V&aacute;lka je m&iacute;r</q>
<q id="Ocs." rend="CE CA" type=slogan>Svoboda je otroctv&iacute;</q>
<q id="Ocs." rend="CE CA" type=slogan>Nev&ecaron;domost je s&iacute;la</q></s>
<note id="Ocs.1.1.8" place=foot>
<name type=language>Newspeak</name>
byl &uacute;&rcaron;edn&iacute;m jazykem
<name type=place>Oce&aacute;nie</name>. O jej&iacute; struktu&rcaron;e a etymologii viz Dodatek.
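
The example uses SGML character entities for accented Czech letters. The following Python sketch shows one way to turn them into Unicode characters for reading or further processing. The table covers only the entity names that occur in the excerpt above and assumes they carry their usual ISO Latin meanings; the name decode_entities is hypothetical, and unknown entities are deliberately left untouched.

```python
import re

# Only the entities seen in the excerpt; extend as needed for the full text.
ENTITIES = {
    "aacute": "á", "eacute": "é", "iacute": "í", "uacute": "ú", "yacute": "ý",
    "ccaron": "č", "ecaron": "ě", "rcaron": "ř", "scaron": "š", "uring": "ů",
}
ENTITY_RE = re.compile(r"&([A-Za-z]+);")


def decode_entities(text: str) -> str:
    """Replace known entity references; leave unknown ones as they are."""
    return ENTITY_RE.sub(lambda m: ENTITIES.get(m.group(1), m.group(0)), text)


print(decode_entities("se d&ecaron;siv&ecaron; li&scaron;ilo"))  # se děsivě lišilo
print(decode_entities("V&aacute;lka je m&iacute;r"))              # Válka je mír
```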

Markup Process

The scanned text of the Czech ``1984'' was spell-checked and marked up to CES-1 conformance. The few typographical errors encountered in the process were corrected.

