Contributors: Tomaz Erjavec (IJS) and Olga Vukovic
The digital source of the Slovene version of ``1984'' was obtained from the European Corpus Initiative, Multilingual Corpus 1 CD-ROM, which also contains the English original, as well as a Croatian and a Serbian translation. The source for the ECI-MC1 versions of ``1984'' was, in turn, the Oxford Text Archive. For the OTA, all four versions were prepared by the OUCS KDEM service in 1985 for Dr David C Bennett of the School of Oriental and African Studies at London University. The OTA has copies of letters from all the publishers concerned which appear to indicate that academic use of the text is permitted.
The above would seem to indicate that the digital source for the MULTEXT-East Slovene ``1984'' has the same distribution restrictions as those imposed by the MULTEXT-East project itself, namely, that the use of the materials is free for academic purposes.
It is not clear what the Slovene printed version serving as the source of OTA's digital version was; however, it seems that the only translation and publication of ``1984'' in Slovene is the (out of print) one by ``Mladinska knjiga'', Ljubljana, published in 1983. Typical errors in the OTA edition further indicate that the OTA edition was obtained by OCR from the printed edition.
We currently do not have written permission to use the text; however, we do have verbal assurance from the OTA as well as from employees of MK that it is acceptable to use the Slovene translation of ``1984'' for the purposes of the MULTEXT-East project.
The Slovene version of ``1984'' contains 91,619 words, as indicated in the header of the tagged version.
The Slovene ``1984'' corpus body consists of three <div type=part n=1, 2, 3> elements and one <div type=part n=appendix>. Each part is further subdivided into a number of <div type=chapter n=1, 2, ...> elements. In the Slovene version, each <div> start tag is followed by a <head> giving the part or chapter number. Chapter numbering restarts from 1 in every part.
The elements <body>, <div>, <head>, <item>, <l>, <list>, <note>, <p>, <poem>, <ptr>, <quote>, <text> are used so as to be in harmony with the English ``1984'' for MULTEXT-East; the differences are due only to the differences between the English electronic and the Slovene printed version.
Sub-paragraph tags, i.e. <abbr>, <date>, <foreign>, <hi>, <mentioned>, <name>, <num>, <q>, <title>, are used sparingly.
For example, names have been tagged automatically, and the tagged names hand-validated; therefore all the tagged names are correct, but the text might also contain untagged names. The name tags do not contain the type attribute. In accordance with CES, only proper nouns have been tagged, while adjectives derived from proper nouns, e.g. Winstonov, have not.
Rendering information is given as the CES-conformant two-letter value of the rend attribute. In most cases it has been included with the appropriate tags, except for the default preceding — of the <q> tag. The values of the rend attribute are: CA, CE, CN, IT, and '*'.
Rendering, including 'all caps' capitalisation, has been removed from the tag content.
The first chapter has more detailed mark-up than the rest of the novel, e.g. it includes more type information on tags.
The text has been automatically segmented into sentences, and the segmentation hand-validated. Some <q> elements have been split in this process; these are marked with type=MI, for 'machine inserted'. The <body>, <div>, <quote>, <p>, <poem>, <list>, <l>, <item>, <s>, and <q> tags have been marked with the id attribute.
The following is an example from the Slovene ``1984'' corpus:
<p id="Osl.1.2.8">
<s id="Osl.1.2.8.1">Ministrstvo resnice — <name>Minires</name> v
<name>Novoreku</name><ptr id="Osl.1.2.8.1.3" n="1" rend="*" target="Osl.1.2.9">
— se je osupljivo ločilo od kateregakoli predmeta.</s>
<s id="Osl.1.2.8.2">Bilo je velikanska
piramidasta zgradba iz bleščeče belega betona, ki
je v terasah kipela kvišku, tristo metrov visoko v zrak.</s>
<s id="Osl.1.2.8.3">S
kraja, kjer je stal <name>Winston</name>, se je ravno še dalo
prebrati tri partijske parole, ki so se v lepih črkah
odražale z belega pročelja:
<q id="Osl.1.2.8.3.2" rend="CN CA" type=slogan>Vojna je mir</q>
<q id="Osl.1.2.8.3.3" rend="CN CA" type=slogan>Svoboda je suženjstvo</q>
<q id="Osl.1.2.8.3.4" rend="CN CA" type=slogan>Nevednost je moč</q></s>
</p>
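As an illustration of how the mark-up in the excerpt above might be inspected, the following Python sketch extracts the tagged names, the rend values of the <q> elements, and the components of an id. It uses regular expressions rather than an XML parser because the excerpt is SGML (unquoted attribute values, an unclosed <ptr>). The function names, the abbreviated sample, and the reading of an id as a prefix followed by dot-separated numbers are assumptions made for this sketch; they are not part of the corpus documentation.

```python
import re

# Abbreviated copy of the excerpt above (shortened for the sketch).
SAMPLE = '''<p id="Osl.1.2.8">
<s id="Osl.1.2.8.1">Ministrstvo resnice <name>Minires</name> v
<name>Novoreku</name><ptr id="Osl.1.2.8.1.3" n="1" rend="*" target="Osl.1.2.9">
se je osupljivo ločilo od kateregakoli predmeta.</s>
<s id="Osl.1.2.8.3">S kraja, kjer je stal <name>Winston</name>, ...
<q id="Osl.1.2.8.3.2" rend="CN CA" type=slogan>Vojna je mir</q></s>
</p>'''


def tagged_names(sgml):
    """Return the content of every <name> element, in document order."""
    return re.findall(r"<name>(.*?)</name>", sgml, flags=re.DOTALL)


def q_rend_values(sgml):
    """Map each <q> id to the list of codes found in its rend attribute."""
    pattern = re.compile(r'<q\s+id="([^"]+)"\s+rend="([^"]*)"')
    return {qid: rend.split() for qid, rend in pattern.findall(sgml)}


def split_id(element_id):
    """Split an id into its prefix and numeric fields (assumed layout)."""
    prefix, *numbers = element_id.split(".")
    return prefix, [int(n) for n in numbers]


if __name__ == "__main__":
    print(tagged_names(SAMPLE))
    print(q_rend_values(SAMPLE))
    print(split_id("Osl.1.2.8.3.2"))
```

Because untagged names may remain in the text, a list produced this way should be treated as a lower bound on the names actually present.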
The ECI version which was used as the digital source was essentially the same as the OTA one, except that <p> tags had been automatically inserted into the text. Neither edition had been proofread or checked in any way. The Slovene characters č, š, ž were encoded as cy, sy, zy respectively, which caused problems for the name Syme, where sy is not an encoded š.
The following is an example from the ECI version:
pojavljale brez ozadja in so bile vecyinoma nerazumljive. Ministrstvo
resnice -- Minires v Novoreku* -- se je osupljivo locyilo od
kateregakoli predmeta. Bilo je velikanska piramidasta zgradba iz
blesycyecye belega betona, ki je v terasah kipela kvisyku, tristo
metrov visoko v zrak. S kraj a, kjerje stal Winston, seje ravno sye
dalo prebrati tri partijske parole, ki so se v lepih cyrkah odrazyale
z belega procyelja:
VOJNA JE MIR
SVOBODA JE SUZYNJSTVO
NEVEDNOST JE MOCY
</p>
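The cy, sy, zy encoding seen in the ECI excerpt can be reversed with a plain substitution, which also shows why the name Syme is troublesome: a blind replacement would turn it into Šme. The following Python sketch is only an illustration. The function names, the exception list, and the treatment of upper case are assumptions, and nothing here is claimed to be the procedure actually used to prepare the corpus.

```python
import re

# Base letter of each two-letter code and the character it stands for.
DIGRAPHS = {"c": "č", "s": "š", "z": "ž"}

# Words in which the letter pair is genuine and must be kept
# (hypothetical list; a real one would need to be compiled by hand).
EXCEPTIONS = {"Syme"}


def restore_word(word):
    """Replace cy/sy/zy (either case) in one word, unless it is an exception."""
    if word in EXCEPTIONS:
        return word

    def repl(match):
        base = match.group(1)
        restored = DIGRAPHS[base.lower()]
        # Keep the case of the base letter: "CY" -> "Č", "cy" -> "č".
        return restored.upper() if base.isupper() else restored

    return re.sub(r"([cszCSZ])[yY]", repl, word)


def restore_text(text):
    """Apply restore_word to every run of ASCII letters in the text."""
    return re.sub(r"[A-Za-z]+", lambda m: restore_word(m.group(0)), text)


if __name__ == "__main__":
    print(restore_text("blesycyecye belega betona"))
    print(restore_text("NEVEDNOST JE MOCY"))
    print(restore_text("Syme"))
```

An exception list of this kind only covers exact word forms; inflected forms of a name would need to be listed as well, which is one reason such a conversion still calls for proofreading against the printed edition.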
The ECI edition was taken as the basis for the encoding. It was proofread and marked up to CES1 conformance. The whole Slovene ``1984'' CES1 corpus was cross-checked with the printed edition, and the printed edition was used to insert additional (e.g. <hi>) markup. In the process, a number of typographical errors were discovered not only in the ECI edition, but also in the printed edition of the Slovene translation of ``1984''.