COP project 106 MULTEXT-East ``1984'', Slovene
Contributors: Tomaz Erjavec (IJS) and Olga Vukovic (Spica International)
The digital source of the Slovene version of ``1984'' was obtained from the European Corpus Initiative, Multilingual Corpus 1 CD-ROM, which also contains the English original, as well as a Croatian and a Serbian translation. The source for the ECI-MC1 versions of ``1984'' was, in turn, the Oxford Text Archive. For the OTA, all four versions were prepared by the OUCS KDEM service in 1985 for Dr David C Bennett of the School of Oriental and African Studies at London University. The OTA has copies of letters from all the publishers concerned which appear to indicate that academic use of the text is permitted.
The above would seem to indicate that the digital source for the MULTEXT-East Slovene ``1984'' has the same distribution restrictions as those imposed by the MULTEXT-East project itself, namely, that the use of the materials is free for academic purposes.
It is not clear which printed Slovene edition served as the source of the OTA's digital version; however, it seems that the only translation and publication of ``1984'' in Slovene is the (out of print) one by ``Mladinska knjiga'', Ljubljana, published in 1983. Typical errors in the OTA edition further indicate that it was obtained by OCR from the printed edition.
We currently do not have written permission to use the text; however, we do have verbal assurance from the OTA as well as from employees of MK that it is acceptable to use the Slovene translation of ``1984'' for the purposes of the MULTEXT-East project.
As computed by the Unix program wc over the whole CES-1 document, the Slovene version of ``1984'' has 98727 words.
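As a rough illustration of what such a count measures, the following Python sketch mimics the whitespace-based tokenisation of wc -w. The function name count_words and the sample string are illustrative assumptions, not part of the corpus tooling. Because the count is taken over the whole CES-1 document, markup tokens are counted together with the words of the novel.

```python
def count_words(text: str) -> int:
    """Count whitespace-delimited tokens, approximating `wc -w`.

    Tags are not stripped, so a token such as "<name>Minires</name>"
    counts as one word and a free-standing "<p>" counts as another.
    """
    return len(text.split())


# Hypothetical sample in the style of the corpus, not an exact excerpt.
sample = "<p> Ministrstvo resnice -- <name>Minires</name> v <name>Novoreku</name> </p>"
print(count_words(sample))
```

The figure reported above should therefore be read as an upper bound on the number of running words in the text itself.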
The Slovene ``1984'' corpus body consists of three <div type=part> and of one <div type=appendix>. Each part is further subdivided into a number of <div type=chapter>. In the Slovene version, each <div> is followed by a <head>, giving the part or chapter number.
The <div> elements have the n attribute, giving the successive number of the appropriate level of the <div>, and the id attribute, whose value has the prefix sl1984 and the chapter and section numbers separated by periods, e.g. <div type=chapter n=2 id=sl1984.1.2>.
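The id scheme can be unpacked mechanically. The sketch below is a minimal illustration, assuming that a part-level <div> carries an id of the form sl1984.N and a chapter-level one sl1984.N.M; only the latter form is attested in the example above, and the id of the appendix is not covered. The names ID_PATTERN and parse_div_id are hypothetical.

```python
import re
from typing import Optional, Tuple

# Prefix "sl1984", then one or two period-separated numbers.
# The single-number (part-level) form is an assumption.
ID_PATTERN = re.compile(r"^sl1984\.(\d+)(?:\.(\d+))?$")


def parse_div_id(div_id: str) -> Tuple[int, Optional[int]]:
    """Split a <div> id into (part, chapter); chapter is None for a part."""
    match = ID_PATTERN.match(div_id)
    if match is None:
        raise ValueError(f"unexpected id format: {div_id!r}")
    part = int(match.group(1))
    chapter = int(match.group(2)) if match.group(2) is not None else None
    return part, chapter


print(parse_div_id("sl1984.1.2"))
```

For the id in the example above, the second component matches the n attribute of the chapter <div>.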
The text is segmented into paragraphs, with the <head>, <quote>, <list>, <note>, <poem> and <title> elements marked up at the paragraph level.
Sub-paragraph tagging consists of <hi>, <q> and <name>. Names have been tagged with a program, and the tagged names hand-validated; therefore all the tagged names are correct, but the text might also contain untagged names. The name tags do not contain the type attribute. In accordance with CES, only proper nouns have been tagged, while adjectives derived from proper nouns, e.g. Winstonov, have not.
Rendering information, given as the CES-conformant two-letter value of the rend attribute, has in most cases been included with the appropriate tags, except for the default preceding mdash of the <q> tag. Rendering has, except for capitalisation, been removed from the tag content.
The first chapter has more detailed mark-up than the rest of the novel, e.g. it includes more type information on tags.
The following is an example from the Slovene ``1984'' corpus:
<p> Ministrstvo resnice — <name>Minires</name> v <name>Novoreku</name><ptr n=1 rend="*" target=N1> — se je osupljivo ločilo od kateregakoli predmeta. Bilo je velikanska piramidasta zgradba iz bleščeče belega betona, ki je v terasah kipela kvišku, tristo metrov visoko v zrak. S kraja, kjer je stal <name>Winston</name>, se je ravno še dalo prebrati tri partijske parole, ki so se v lepih črkah odražale z belega pročelja: <q rend="CN CP" type=slogan>VOJNA JE MIR</q> <q rend="CN CP" type=slogan>SVOBODA JE SUŽENJSTVO</q> <q rend="CN CP" type=slogan>NEVEDNOST JE MOČ</q> </p>
The ECI version which was used as the digital source was essentially the same as the OTA one, except that <p> had been automatically inserted into the text. Also, neither edition had been proofread or checked in any way. The Slovene characters were encoded as cy, sy, zy for č, š, ž respectively, which caused problems for Syme. The following is an example from the ECI version:
pojavljale brez ozadja in so bile vecyinoma nerazumljive. Ministrstvo resnice -- Minires v Novoreku* -- se je osupljivo locyilo od kateregakoli predmeta. Bilo je velikanska piramidasta zgradba iz blesycyecye belega betona, ki je v terasah kipela kvisyku, tristo metrov visoko v zrak. S kraj a, kjerje stal Winston, seje ravno sye dalo prebrati tri partijske parole, ki so se v lepih cyrkah odrazyale z belega procyelja: VOJNA JE MIR SVOBODA JE SUZYNJSTVO NEVEDNOST JE MOCY </p>
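A naive reversal of the cy, sy, zy encoding can be sketched as follows. This is an illustration only, not the procedure actually used for the corpus; the table DIGRAPHS, the mixed-case entries in it and the function decode_naive are assumptions. The sketch also shows why a blind substitution is unsafe: a name such as Syme, where the letters s and y are both original, is corrupted by it.

```python
import re

# Digraph-to-letter table. Lower- and upper-case forms occur in the
# example above; the mixed-case forms (Cy, Sy, Zy) are an assumption.
DIGRAPHS = {
    "cy": "č", "sy": "š", "zy": "ž",
    "CY": "Č", "SY": "Š", "ZY": "Ž",
    "Cy": "Č", "Sy": "Š", "Zy": "Ž",
}
_DIGRAPH_RE = re.compile("|".join(DIGRAPHS))


def decode_naive(text: str) -> str:
    """Replace every digraph blindly, with no check for genuine 'y'."""
    return _DIGRAPH_RE.sub(lambda m: DIGRAPHS[m.group(0)], text)


print(decode_naive("blesycyecye belega betona"))  # words from the example above
print(decode_naive("Syme"))                       # a name that should stay unchanged
```

A safer conversion would need a list of exceptions for names and foreign words, or manual checking of every word in which y does not mark a diacritic.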
The ECI edition was taken as the basis for the encoding. It was proofread and marked up to CES-1 conformance. The whole Slovene ``1984'' CES-1 corpus was cross-checked with the printed edition, and the printed edition was used to insert additional (e.g. <hi>) markup. In the process, a number of typographical errors were discovered not only in the ECI edition, but also in the printed edition of the Slovene translation of ``1984''.