COP project 106 MULTEXT-East ``1984'', Czech
Contributors: Vladimír Petkevic (UTKL) and Jana Klímová (FFUK)
The electronic version of the Czech translation of ``1984'' was obtained by OCR in the Institute of Theoretical and Computational Linguistics at the Faculty of Philosophy, Charles University, Prague.
The electronic version of the book may be used for research and academic purposes only. We do not hold written permission from the publisher to use the text, but we have an oral agreement with the publisher to that effect.
The book was published by the publishing house Nase vojsko, Prague, Czech Republic, in 1991. This was the first edition and is so far the only one.
As computed by the Unix program wc over the whole CES-1 document, the Czech version of ``1984'' has 99977 words.
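The count above can be approximated without wc. The following is a minimal sketch, assuming that a word is any maximal run of non-whitespace characters, which is roughly how wc -w counts. Under that assumption the SGML tags in the document are counted as words too. The function name count_words and the sample string are illustrative only and are not part of the corpus tooling.

```python
def count_words(text: str) -> int:
    """Count whitespace-delimited tokens, roughly as `wc -w` does."""
    # str.split() with no argument splits on runs of whitespace and
    # ignores leading and trailing whitespace.
    return len(text.split())


# Hypothetical fragment in the style of the corpus markup.
sample = "<p> <name type=person> Winston </name> stál . </p>"
word_count = count_words(sample)
print(word_count)
```

Because tags such as `<name` and `type=person>` are separate whitespace-delimited tokens, a count taken this way is larger than the number of running words in the text itself.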
The Czech ``1984'' corpus body consists of three <div type=part> elements and one <div type=appendix> . Each part is further subdivided into a number of <div type=chapter> elements.
The <div> elements have the n attribute, giving the sequential number of the <div> at its level, and the id attribute, whose value consists of the prefix ORW1984 followed by the part and chapter numbers, separated by periods, e.g. <div type=chapter n=2 id=ORW1984.1.2> .
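The id scheme can be sketched in a few lines of Python. This is an illustration only: the helper names make_div_id and parse_div_id are hypothetical, and reading ORW1984.1.2 as "part 1, chapter 2" is an assumption based on the single example given. How the ids of the part and appendix divs are formed is not stated here, so the helpers simply accept any number of levels.

```python
import re

ID_PREFIX = "ORW1984"  # prefix given in the corpus description


def make_div_id(*numbers: int) -> str:
    """Join the prefix and the div numbers with periods."""
    return ".".join([ID_PREFIX, *map(str, numbers)])


def parse_div_id(div_id: str) -> list:
    """Return the numbers that follow the prefix in a div id."""
    match = re.fullmatch(rf"{ID_PREFIX}((?:\.\d+)+)", div_id)
    if match is None:
        raise ValueError(f"not a recognised div id: {div_id!r}")
    # group(1) looks like ".1.2"; drop the empty piece before the first period.
    return [int(n) for n in match.group(1).split(".")[1:]]


chapter_id = make_div_id(1, 2)
print(chapter_id)
print(parse_div_id(chapter_id))
```

Under the stated assumption, the last number of a chapter id equals the value of that chapter's n attribute.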
The text within chapters is segmented into paragraphs. Detailed subparagraph markup uses the <name> , <hi> , <num> , <q> , <abbr> , <foreign> , <date> and other tags. The markup was applied semiautomatically, and the text was spell-checked. In accordance with CES, only proper nouns have been tagged; adjectives derived from proper nouns, e.g. Winstonovo, have not.
Rendering information covers italics, bold face and capitalisation. The entire book is marked up at the same level of detail, i.e. no part is more detailed than the rest. The following is an example from the Czech ``1984'' corpus:
<p> <name type=org> Ministerstvo pravdy </name> , v <foreign lang=ns> newspeaku </foreign> <name type=org lang=ns> Pramini </name> ( <foreign lang=ns> Newspeak </foreign> byl úředním jazykem <name type=place> Oceánie </name> . O její struktuře a etymologii viz <name> Dodatek </name> .), se děsivě lišilo od všech ostatních objektů v dohledu. Byla to obrovská stavba tvaru pyramidy ze zářivě bílého betonu, která se terasovitě vypínala do výšky <num> 300 </num> metrů. Z místa, kde stál <name type=person> Winston </name> , se dala na bílém průčelí přečíst ozdobným písmem vyvedená tři hesla <name type=org> Strany </name> : <q rend="CN CA" type=slogan> VÁLKA JE MÍR </q> <q rend="CN CA" type=slogan> SVOBODA JE OTROCTVÍ </q> <q rend="CN CA" type=slogan> NEVĚDOMOST JE SÍLA </q> </p>
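Markup of this kind can be queried with little code. The sketch below collects the <name> elements and their type attribute from a shortened copy of the example. It uses html.parser from the Python standard library, which is an HTML parser and not a validating SGML parser; it is used here only because it tolerates unquoted attribute values. The class name NameCollector is hypothetical.

```python
from html.parser import HTMLParser


class NameCollector(HTMLParser):
    """Collect (type, text) pairs for every <name> element."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._type = None
        self._buffer = None  # None while the parser is outside <name>

    def handle_starttag(self, tag, attrs):
        if tag == "name":
            # A <name> without a type attribute yields None here.
            self._type = dict(attrs).get("type")
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "name" and self._buffer is not None:
            self.names.append((self._type, "".join(self._buffer).strip()))
            self._buffer = None


# Shortened fragment of the example paragraph.
sample = (
    "<p> <name type=org> Ministerstvo pravdy </name> , v "
    "<foreign lang=ns> newspeaku </foreign> "
    "<name type=org lang=ns> Pramini </name> , "
    "<name type=person> Winston </name> , <name> Dodatek </name> </p>"
)

collector = NameCollector()
collector.feed(sample)
collector.close()
for name_type, text in collector.names:
    print(name_type, text)
```

The same pattern extends to the other subparagraph tags, for instance by collecting <foreign> elements together with their lang attribute.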
The scanned text of the Czech ``1984'' was spell-checked and marked up to CES-1 conformance. The few typographical errors encountered in the process were corrected.