 COP project 106 MULTEXT-East Deliverable D2.1 F ``1984'', Slovene

Contributors: Tomaz Erjavec (IJS) and Olga Vukovic

Description of the Corpus

The digital source of the Slovene version of ``1984'' was obtained from the European Corpus Initiative Multilingual Corpus 1 (ECI-MC1) CD-ROM, which also contains the English original, as well as a Croatian and a Serbian translation. The source for the ECI-MC1 versions of ``1984'' was, in turn, the Oxford Text Archive (OTA). For the OTA, all four versions were prepared by the OUCS KDEM service in 1985 for Dr David C Bennett of the School of Oriental and African Studies at London University. The OTA has copies of letters from all the publishers concerned which appear to indicate that academic use of the text is permitted.

The above would seem to indicate that the digital source for the MULTEXT-East Slovene ``1984'' has the same distribution restrictions as those imposed by the MULTEXT-East project itself, namely, that the use of the materials is free for academic purposes.

It is not clear what the Slovene printed version serving as the source of OTA's digital version was; however, it seems that the only translation and publication of ``1984'' in Slovene is the (out of print) one by ``Mladinska knjiga'', Ljubljana, published in 1983. Typical errors in the OTA edition further indicate that the OTA edition was obtained by OCR from the printed edition.

We currently do not have written permission to use the text; however, we do have verbal assurance from the OTA as well as from employees of Mladinska knjiga (MK) that it is acceptable to use the Slovene translation of ``1984'' for the purposes of the MULTEXT-East project.

The Slovene version of ``1984'' contains 91,619 words, as indicated in the header of the tagged version.

Structure of the Corpus

The Slovene ``1984'' corpus body consists of three <div type=part n=1, 2, 3> and of one <div type=part n=appendix>. Each part is further subdivided into a number of <div type=chapter n=1, 2, ...>. In the Slovene version, each <div> is followed by a <head>, giving the part or chapter number. Counting of chapters starts from 1 in every part.

The elements <body>, <div>, <head>, <item>, <l>, <list>, <note>, <p>, <poem>, <ptr>, <quote>, <text> are used so as to be in harmony with the English ``1984'' for MULTEXT-East; the differences are due only to the differences between the English electronic and the Slovene printed version.

Sub-paragraph tagging, i.e. <abbr>, <date>, <foreign>, <hi>, <mentioned>, <name>, <num>, <q>, <title>, is used sparingly: the elements that are tagged have been checked, but not every possible instance is tagged.

For example, names have been tagged automatically, and the tagged names hand-validated; therefore all the tagged names are correct, but the text might also contain untagged names. The name tags do not contain the type attribute. In accordance with CES, only proper nouns have been tagged, while adjectives derived from proper nouns, e.g. Winstonov, have not.

Rendering information is given as the CES conformant two-letter value of the rend attribute. In most cases it has been included with the appropriate tags, except for the default rendering of the <q> tag, a preceding &mdash;. The values for the rend attribute are: CA, CE, CN, IT, and '*'.
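The list of rend values above lends itself to a simple consistency check. The following Python sketch is illustrative only and is not part of the project's tooling: it assumes that rend values are always written in double quotes and that several codes may be combined in one attribute separated by spaces (as in rend="CN CA" in the corpus example). The function name and the <hi rend="XX"> test tag are hypothetical.

```python
import re

# The rend values listed in this document.
ALLOWED_REND = {"CA", "CE", "CN", "IT", "*"}

# Assumption: rend values are always double-quoted in the corpus files.
REND_RE = re.compile(r'\brend="([^"]*)"')


def invalid_rend_values(sgml: str) -> list:
    """Return every rend code in `sgml` that is not in ALLOWED_REND."""
    bad = []
    for match in REND_RE.finditer(sgml):
        # Assumption: multiple codes in one attribute are space-separated.
        for code in match.group(1).split():
            if code not in ALLOWED_REND:
                bad.append(code)
    return bad


# The first two tags follow the corpus example; the <hi> tag with the
# made-up code "XX" is hypothetical and only demonstrates a failure.
sample = (
    '<q id="Osl." rend="CN CA" type=slogan>Vojna je mir</q>'
    '<ptr id="Osl." n="1" rend="*" target="Osl.1.2.9">'
    '<hi rend="XX">'
)
print(invalid_rend_values(sample))
```

A check of this kind only verifies that the codes come from the documented set; it says nothing about whether a given code is appropriate for a given element.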

Rendering, including 'all caps' capitalisation, has been removed from the tag content.

The first chapter has more detailed mark-up than the rest of the novel, e.g. it includes more type information on tags.

The text has been automatically sentence segmented, and the segmentation hand-validated. Some <q> elements have been split in this process; these are marked with type=MI, for 'machine inserted'. The <body>, <div>, <quote>, <p>, <poem>, <list>, <l>, <item>, <s>, and <q> tags have been marked with the id attribute.

The following is an example from the Slovene ``1984'' corpus:

</p><p id="Osl.1.2.8">
<s id="Osl.">Ministrstvo resnice &mdash; <name>Minires</name> v
<name>Novoreku</name><ptr id="Osl." n="1" rend="*" target="Osl.1.2.9">
&mdash; se je osupljivo lo&ccaron;ilo od kateregakoli predmeta.</s>
<s id="Osl.">Bilo je velikanska
piramidasta zgradba iz ble&scaron;&ccaron;e&ccaron;e belega betona, ki
je v terasah kipela kvi&scaron;ku, tristo metrov visoko v zrak.</s>
<s id="Osl.">S
kraja, kjer je stal <name>Winston</name>, se je ravno &scaron;e dalo
prebrati tri partijske parole, ki so se v lepih &ccaron;rkah
odra&zcaron;ale z belega pro&ccaron;elja:
<q id="Osl." rend="CN CA" type=slogan>Vojna je mir</q>
<q id="Osl." rend="CN CA" type=slogan>Svoboda je su&zcaron;enjstvo</q>
<q id="Osl." rend="CN CA" type=slogan>Nevednost je mo&ccaron;</q></s>
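As an illustration of how markup like the example above might be consumed, the following Python sketch extracts the content of <name> elements and produces plain text with the character entities decoded. It is a minimal sketch under stated assumptions, not the project's own software: the entity table covers only the entities that occur in the example, tags are stripped with a simple regular expression that assumes no '>' inside attribute values, and the function names are hypothetical.

```python
import re

# Only the entities that occur in the example; a full corpus would need
# the complete entity set declared for it.
ENTITIES = {
    "&ccaron;": "\u010d",  # c with caron
    "&scaron;": "\u0161",  # s with caron
    "&zcaron;": "\u017e",  # z with caron
    "&mdash;": "\u2014",   # dash
}

NAME_RE = re.compile(r"<name>(.*?)</name>", re.DOTALL)
# Assumption: attribute values never contain '>'.
TAG_RE = re.compile(r"<[^>]+>")


def decode_entities(text: str) -> str:
    """Replace the known character entities with Unicode characters."""
    for entity, char in ENTITIES.items():
        text = text.replace(entity, char)
    return text


def tagged_names(sgml: str) -> list:
    """Return the decoded content of every <name> element."""
    return [decode_entities(name) for name in NAME_RE.findall(sgml)]


def plain_text(sgml: str) -> str:
    """Strip all tags, decode entities and normalise whitespace."""
    return " ".join(decode_entities(TAG_RE.sub("", sgml)).split())


sample = """<s id="Osl.">S
kraja, kjer je stal <name>Winston</name>, se je ravno &scaron;e dalo
prebrati tri partijske parole</s>"""

print(tagged_names(sample))
print(plain_text(sample))
```

Because only proper nouns are tagged, and not exhaustively, a list obtained this way should be read as a subset of the names in the text rather than a complete inventory.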

Structure of the Original

The ECI version which was used as the digital source was essentially the same as the OTA one, except that <p> had been automatically inserted into the text. Also, neither edition had been proofread or checked in any way. The Slovene characters č, š, ž were encoded as cy, sy, zy respectively, which caused problems for the name Syme, where ``sy'' is not an encoded character. The following is an example from the ECI version:

pojavljale brez ozadja in so bile vecyinoma nerazumljive.  Ministrstvo
resnice -- Minires v Novoreku* -- se je osupljivo locyilo od
kateregakoli predmeta. Bilo je velikanska piramidasta zgradba iz
blesycyecye belega betona, ki je v terasah kipela kvisyku, tristo
metrov visoko v zrak. S kraj a, kjerje stal Winston, seje ravno sye
dalo prebrati tri partijske parole, ki so se v lepih cyrkah odrazyale
z belega procyelja:
                              VOJNA JE MIR
                         SVOBODA JE SUZYNJSTVO
                            NEVEDNOST JE MOCY
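The digraph encoding shown in this example, and the problem it causes for Syme, can be sketched in Python as follows. This is an illustration under stated assumptions, not the conversion procedure actually used in the project: the list of protected word prefixes is hypothetical and contains only Syme, and upper-case digraphs such as CY are assumed to stand for the corresponding upper-case letter.

```python
import re

# Hypothetical exception list: words starting with these prefixes contain
# a genuine "sy" and must not be decoded. Only Syme is named in the text.
PROTECTED_PREFIXES = ("Syme",)

BASE_LETTERS = {"c": "\u010d", "s": "\u0161", "z": "\u017e"}  # c, s, z with caron

DIGRAPH_RE = re.compile(r"[csz]y", re.IGNORECASE)
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters


def _decode_digraph(match):
    """Map cy/sy/zy to the caron letter, keeping the case of the first letter."""
    first = match.group(0)[0]
    letter = BASE_LETTERS[first.lower()]
    return letter.upper() if first.isupper() else letter


def decode_word(word: str) -> str:
    """Decode one word unless it is on the protected list."""
    if word.startswith(PROTECTED_PREFIXES):
        return word
    return DIGRAPH_RE.sub(_decode_digraph, word)


def decode_text(text: str) -> str:
    """Decode every word in `text`, leaving punctuation and spacing alone."""
    return WORD_RE.sub(lambda m: decode_word(m.group(0)), text)


print(decode_text("blesycyecye belega betona"))
print(decode_text("NEVEDNOST JE MOCY"))
print(decode_text("Syme"))
```

A prefix match is used rather than an exact match so that inflected forms of a protected name are also left alone; any real conversion would still need its output proofread, as was done for this corpus.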


Markup Process

The ECI edition was taken as the basis for the encoding. It was proofread and marked up to CES1 conformance. The whole Slovene ``1984'' CES1 corpus was cross-checked with the printed edition, and the printed edition was used to insert additional (e.g. <hi>) markup. In the process, a number of typographical errors were discovered not only in the ECI edition, but also in the printed edition of the Slovene translation of ``1984''.

