 TELRI & MULTEXT-East Deliverable D2.1 F ``1984'', Serbo-Croatian

Contributors: Cvetana Krstev, Dusko Vitas, and Tomaz Erjavec

Description of the Corpus

The digital source of the Serbo-Croatian version of ``1984'' was obtained from the Oxford Text Archive (OTA). It was prepared, together with English, Slovenian and Croatian versions, by the OUCS KDEM service in 1985 for Dr David C Bennett of the School of Oriental and African Studies at London University. The OTA has copies of letters from all the publishers concerned which appear to indicate that academic use of the text is permitted. In the OTA this version is labeled ``Serbian'', although the imprint of the first edition published by ``JUGOSLAVIJA'' states that it is ``the first edition of `1984' in Serbo-Croatian''.

The publication used for the digital source is the second edition, published in 1984 by ``Beogradski izdavačko-grafički zavod'', Belgrade, and translated by Vlada Stojiljkovic. The same translation had two previous editions by ``JUGOSLAVIJA'' in 1968 and 1977. The OTA version was obtained by OCR from the printed edition.

The digital source for the Serbo-Croatian ``1984'' has the same distribution restrictions as those imposed by the MULTEXT-East project, namely, that the use of the materials is free for academic purposes.

We currently do not have written permission to use the text; however, we do have verbal assurance from the OTA that it is acceptable to use the Serbo-Croatian translation of ``1984'' for the purposes of the TELRI project.

The Serbo-Croatian version of ``1984'' contains 89749 words, as indicated in the header of the tagged version.

Structure of the Corpus

The body of the Serbo-Croatian ``1984'' corpus consists of three <div type=part> elements (n=1, 2, 3) and one <div type=part n=appendix>. Each part is further subdivided into a number of <div type=chapter> elements (n=1, 2, ...). In the Serbo-Croatian version, each <div> opens with a <head> giving the part or chapter number. Chapter numbering restarts from 1 in every part.

The elements <body>, <div>, <head>, <item>, <l>, <list>, <note>, <p>, <poem>, <ptr>, <quote>, <text> are used so as to be consistent with the English ``1984'' for MULTEXT-East (-//MTE//TEXT CES1 1984//EN); the only differences are those due to the differences between the English electronic version and the Serbo-Croatian printed version.

Sub-paragraph tags, i.e. <abbr>, <date>, <foreign>, <mentioned>, <name>, <num>, <hi>, <q>, <title>, are used extensively, but some occurrences may have been missed. For example, names have been tagged automatically, and the tagged names hand-validated; therefore all the tagged names are correct, but the text might also contain untagged names. The name tags do not contain the type attribute. In accordance with CES, only proper nouns have been tagged, while adjectives derived from proper nouns, e.g. Vinstonov, have not.

Rendering information is given as the CES-conformant two-letter value of the rend attribute. In most cases it has been included with the appropriate tags. The values of the rend attribute are: CA, CE, CN, IT, and '*'.

Rendering, including ``all caps'' capitalisation, has been removed from the tag content.
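
As an illustration of how the rend values listed above could be checked, the following is a minimal Python sketch. It is not part of the corpus tools. It assumes that a rend attribute may hold several space-separated values (as in rend="CN CA" in the corpus sample) and that the attribute value may be quoted or unquoted; the function name and the value BO in the sample are made up for the example.

```python
import re

# The rend values listed in this document.
ALLOWED_REND = {"CA", "CE", "CN", "IT", "*"}

# Matches rend="..." (quoted) or rend=VALUE (unquoted, assumed to be possible).
REND_ATTR = re.compile(r'rend=(?:"([^"]*)"|([^\s>]+))')

def invalid_rend_values(markup):
    """Return every rend token in the markup that is not in ALLOWED_REND."""
    bad = []
    for quoted, unquoted in REND_ATTR.findall(markup):
        value = quoted or unquoted
        for token in value.split():
            if token not in ALLOWED_REND:
                bad.append(token)
    return bad

# "BO" is a made-up value, used only to show a failing case.
sample = '<q rend="CN CA" type=slogan>Rat je mir</q> <hi rend="BO">x</hi>'
print(invalid_rend_values(sample))
```

An empty result means that every rend token found belongs to the listed set.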

The mark-up is uniform across all chapters of the novel, i.e. no chapter can be distinguished as having more type information on its tags than the others.

The text has been automatically sentence segmented, and the segmentation hand-validated. Some <q> elements have been split in this process; these are marked with type=MI, for ``machine inserted''. The <body>, <div>, <quote>, <p>, <poem>, <list>, <l>, <item>, <s>, and <q> tags have been marked with the id attribute.

The following is an example from the Serbo-Croatian ``1984'' corpus:

<p id="Oshs.1.2.8"><s id="Oshs.">Ministarstvo istine &mdash; u 
<name>Novogovoru</name>: <name>Ministin </name> &mdash;
o&sx;tro se razlikovalo od svega ostalog na vidiku.</s>
<s id="Oshs.">To je
bila ogromna piramidalna gra&dx;evina od svetlucavo
belog betona koja se uzdizala, terasa za terasom, tri
stotine metara u nebo.</s>
<s id="Oshs.">Sa mesta na kome je <name>Vinston</name>
stajao mogle su se tek razabrati, ispisane elegantnim
slovima na belom zidu, tri parole Partije:
<q rend="CN CA" type=slogan>Rat je mir</q>
<q rend="CN CA" type=slogan>Sloboda je ropstvo</q>
<!-- pb n=7 -->
<q rend="CN CA" type=slogan>Nezna&nx;e je mo&cx;</q></s>

Structure of the Original

The electronic version obtained by OCR preserves all the visual layout features of the printed version of ``1984'', including line breaks, centering, capitalization, bold print, italic print, and page numbers. Serbo-Croatian characters were encoded as cy, sy, dj, lj, zy, nj, cj, dzy for č, š, đ, lj, ž, nj, ć, and dž respectively. Two codes were used for each of the corresponding upper-case letters: for instance, both CY and Cy were used for Č. The encodings for đ and dž caused problems in quite a number of words in which the same sequence stands for two separate letters: odjednom, podjednako, odjek, odjeknuti, nadjačati, odživeti, etc.

The following is an example from the OTA version:

   Ministarstvo istine - u Novogovoru: Ministin -
osytro se razlikovalo od svega ostalog na vidiku. To je
bila ogromna piramidalna gradjevina od svetlucavo
belog betona koja se uzdizala, terasa za terasom, tri
stotine metara u nebo. Sa mesta na kome je Vinston
stajao mogle su se tek razabrati, ispisane elegantnim
slovima na belom zidu, tri parole Partije:

   <1RAT JE MIR>1
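
The ambiguity of the dj and dzy codes can be illustrated with a short Python sketch that lists the words a human would need to check. This is only a sketch under simple assumptions: words are taken to be runs of ASCII letters, the function name is hypothetical, and the second sentence of the sample is invented for the example (the first word list comes from the OTA excerpt above).

```python
import re

# Codes described in this document as problematic: "dj" may be one letter
# or d followed by j, and "dzy" may be one letter or d followed by "zy".
AMBIGUOUS = re.compile(r"dzy|dj", re.IGNORECASE)

def words_needing_review(text):
    """Return the distinct words containing an ambiguous code, in first-seen order."""
    seen = []
    for word in re.findall(r"[A-Za-z]+", text):
        if AMBIGUOUS.search(word) and word not in seen:
            seen.append(word)
    return seen

# "gradjevina" occurs in the OTA excerpt; the second sentence is invented.
sample = "To je bila ogromna piramidalna gradjevina. Odjednom se cyuo odjek."
print(words_needing_review(sample))
```

In this sample the code dj is a single letter in gradjevina but two letters in Odjednom and odjek, which is why such words cannot be converted blindly.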


Markup Process

The OTA version of ``1984'' was taken as the basis for the encoding. The whole Serbo-Croatian ``1984'' CES1 corpus was cross-checked against the printed edition, and the printed edition was used to insert additional markup (e.g. <hi>). In the process, a number of typographical errors were discovered, not only in the OTA edition but also in the printed edition of the Serbo-Croatian translation of ``1984''. The OTA encodings of the Serbo-Croatian characters, cy, sy, dj, lj, zy, nj, cj, dzy, were replaced automatically by the SGML entities &cy;, &sx;, &dx;, &lx;, &zx;, &nx;, &cx;, and &dy;, and the replacements for dj and dzy were hand-validated. The replacement text for each of these entities is a character string (nj or lj), a reference to an SGML entity from the standard entity set ISOlat2 (for instance, &ccaron;), or a combination of the two (d&zcaron;), since the Latin alphabet was used for the printed edition.
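
The replacement step described above might be sketched in Python as follows. This is an illustration, not the procedure actually used for the corpus. It handles lower-case codes only, because the entities used for upper-case letters are not given here, and it expects raw OTA text that does not yet contain entity references. The overrides argument is a hypothetical stand-in for hand validation. In the second table only the replacement texts nj, lj, &ccaron; and d&zcaron; are confirmed by this document; the remaining ISOlat2 names are assumptions.

```python
import re

# OTA lower-case codes and the SGML entity names given in this document.
OTA_TO_ENTITY = {
    "cy": "cy", "sy": "sx", "dj": "dx", "lj": "lx",
    "zy": "zx", "nj": "nx", "cj": "cx", "dzy": "dy",
}

# Assumed replacement text for each entity (see the note above).
ENTITY_TEXT = {
    "cy": "&ccaron;", "sx": "&scaron;", "dx": "&dstrok;", "lx": "lj",
    "zx": "&zcaron;", "nx": "nj", "cx": "&cacute;", "dy": "d&zcaron;",
}

# Longest code first, so that "dzy" is tried before shorter codes.
CODE = re.compile("|".join(sorted(OTA_TO_ENTITY, key=len, reverse=True)))
WORD = re.compile(r"[A-Za-z]+")
ENTITY_REF = re.compile(r"&(\w+);")

def to_entities(text, overrides=None):
    """Replace OTA codes by entity references, word by word.

    `overrides` maps a word to its hand-validated form and takes precedence.
    """
    overrides = overrides or {}

    def convert(match):
        word = match.group(0)
        if word in overrides:
            return overrides[word]
        return CODE.sub(lambda m: "&" + OTA_TO_ENTITY[m.group(0)] + ";", word)

    return WORD.sub(convert, text)

def expand(text):
    """Substitute the replacement text for known entities; leave others as they are."""
    return ENTITY_REF.sub(lambda m: ENTITY_TEXT.get(m.group(1), m.group(0)), text)

print(to_entities("osytro gradjevina"))
print(to_entities("odjednom", overrides={"odjednom": "odjednom"}))
print(expand("o&sx;tro"))
```

Processing whole words makes it easy to exempt individual words such as odjednom, where dj stands for two letters, while converting the same code everywhere else.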

