

  COP project 106 MULTEXT-East Fiction, Slovene

Contributors: Tomaz Erjavec (IJS) and Miro Romih (Amebis d.o.o.)

Description of the Corpus

The Slovene MULTEXT-East fiction corpus consists of the novel ``Galjot'' by Drago Jancar, which tells of the wanderings of J. Ot, from the German duchy of Neisse, through the Slovene counties and the Mediterranean in the 17th century. The novel was first published by ``Pomurska zalozba'', Murska Sobota, in 1978; the most recent edition is the third, corrected one, published by ``Mladinska knjiga'', Ljubljana, in 1984. The digital source used as the basis of the encoding was provided, via Amebis d.o.o., by the Slovene Society for Blind and Visually Impaired, where the 1984 printed edition was OCRed. This source also included a short introduction by M. Kramberger, which was not retained in the Slovene Fiction MULTEXT-East corpus.

The Slovene site obtained a license agreement allowing the use of ``Galjot'' for the purposes of the MULTEXT-East project, signed by the author, Drago Jancar.

As computed by the Unix program wc over the whole CES-1 document, the Slovene Fiction corpus has 101308 words.
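
Because the count was taken over the whole CES-1 document, it presumably includes markup tokens as well as the words of the novel. The following minimal Python sketch imitates the word-counting behaviour of wc (whitespace-delimited tokens) on a short fragment; the function name wc_words and the sample string are illustrative assumptions, not part of the project tooling.

```python
def wc_words(text: str) -> int:
    """Count whitespace-delimited tokens, the way `wc -w` counts words."""
    return len(text.split())


# Hypothetical fragment modelled on the corpus example; tags count as tokens too.
sample = '<div type=chapter n=1 id=galjot.1>\n<opener rend="IT">\nGosti skladi zraka.'

# Four tokens from the <div> tag, two from <opener>, and three text words.
print(wc_words(sample))
```

A count restricted to the text of the novel would therefore be somewhat lower than the figure reported here.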

Structure of the Corpus

The corpus body consists of 25 <div type=chapter> elements, each of which starts with a <head> giving the chapter number and an <opener> with a short expressionistic summary of the chapter. Chapters are further subdivided into <div type=section> elements, which the printed edition marks only by increased spacing; some of these divisions also begin with a <head> .

The <div type=chapter> elements have the n attribute, giving the chapter number, and the id attribute, whose value has the prefix galjot followed by a period and the chapter number, e.g. <div type=chapter n=2 id=galjot.2>
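
The naming scheme can be illustrated with a small Python sketch. The function chapter_open_tag is a hypothetical helper, and the assumption that the 25 chapters are numbered consecutively from 1 to 25 is not stated explicitly in this description.

```python
def chapter_open_tag(n: int, prefix: str = "galjot") -> str:
    """Build the opening tag of a chapter division: id = prefix + '.' + chapter number."""
    return f"<div type=chapter n={n} id={prefix}.{n}>"


# Assumption: chapters are numbered 1..25 without gaps.
tags = [chapter_open_tag(n) for n in range(1, 26)]

print(tags[1])  # the tag for chapter 2, as in the example above
```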

The text is segmented into paragraphs, with the <head> , <quote> , <opener> , <list> , <poem> and <table> elements marked-up at the paragraph level.

Sub-paragraph tagging consists of <abbr> , <foreign> , <hi> and <q> . Direct speech has been marked-up by <q> even where there is no typographical marking to that effect in the printed text.

Rendering information, given as the CES-conformant two-letter value of the rend attribute, has been included with the appropriate tags and, for mdash and capitalisation, retained in the tag content.

Example from the corpus:

<div type=chapter n=1 id=galjot.1>

<opener rend="IT">
Gosti skladi zraka. Sluz se vzpenja po stenah. Prihod iz mo&ccaron;virja.
Ku&zcaron;ni komisarji v de&zcaron;eli. Tak &ccaron;uden vstop, tak vinski

<div type=section>
Temne lise vlage so se spakovale po zidu. Zdelo se mu je, da v tej gluhi
tihoti lezejo skupaj in narazen in da s svojim neznansko po&ccaron;asnim
gibanjem oblikujejo nedolo&ccaron;ljive podobe. Spodaj je bilo okrog in okrog
mokro, zid je bil prav do &ccaron;rnega prepojen s sluzasto vodno

Structure of the Original

The OCR'ed version, which was the basis for the encoding, was an ASCII file with a relatively transparent encoding: paragraph boundaries were marked, and a few useful formatting codes were included. However, the text contained a number of recognition errors, e.g. typos and misrendered punctuation. A sample of the original is given below.

#s   #+1#-
#s2   #+Gosti skladi zraka. Sluz se vzpenja po stenah. Prihod iz mo~virja. Ku`ni
komisarji v de`eli. Tak ~uden vstop, tak vinski za~etek.#-
#s2   #+Temne lise vlage so se spakovale po zidu. Zdelo se mu je, da v tej gluhi
tihoti lezejo skupaj in narazen in da s svojim neznansko po~asnim gibanjem
oblikujejo nedolo~ljive podobe. Spodaj je bilo okrog in okrog mokro, zid je bil
prav do ~rnega prepojen s sluzasto vodno snovjo. Nevidno gibanje se je pomikalo
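
Comparing this sample with the corpus example suggests that the OCR file writes c-caron as ~ and z-caron as `, which the corpus encodes as the entities &ccaron; and &zcaron;. The Python sketch below performs only this character substitution. It is not the conversion actually used by Amebis d.o.o. or IJS; the names OCR_TO_ENTITY and ocr_to_entities are hypothetical, and only the two mappings attested in the samples are included (codes for other letters or for capitals are not shown in this description).

```python
# Mappings inferred from the two samples only; other codes are unknown here.
OCR_TO_ENTITY = {
    "~": "&ccaron;",  # e.g. mo~virja -> mo&ccaron;virja
    "`": "&zcaron;",  # e.g. Ku`ni -> Ku&zcaron;ni
}


def ocr_to_entities(text: str) -> str:
    """Replace the OCR file's ASCII stand-ins with SGML character entities."""
    return text.translate(str.maketrans(OCR_TO_ENTITY))


line = "Prihod iz mo~virja. Ku`ni komisarji v de`eli."
print(ocr_to_entities(line))
```

The formatting codes (#s, #s2, #+, #-) are left untouched by this sketch, since their exact meaning is not documented here.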

Markup Process

The OCR text was first automatically converted, by Amebis d.o.o., into a quasi-SGML encoding, and at the same time corrected for typos with a spelling checker. This version was then further corrected and marked up to CES-1 conformance by IJS. As there are often no typographical marks indicating direct speech, the <q> tags were inserted manually. The corpus text was also cross-checked against the printed edition of the novel.


Tomaz Erjavec
Sat May 18 20:25:31 MDT 1996