Contributors: Heiki-Jaan Kaalep, Leho Paldre, Heili Orav, Urve Talvik and Kadri Muischnek
The Estonian fiction corpus consists of 51 excerpts from Estonian novels and short stories from 1985. Each excerpt is approximately 2000 words long. The digital source used as the basis for the encoding was provided by the University of Tartu, as an output of the project ``Creating an Estonian text corpus''. License agreements with the authors ensure that the texts may be freely distributed in any form for academic purposes.
The Estonian fiction corpus contains 104435 words, as indicated in the header of the tagged version.
The corpus body consists of 51 <div type=excerpt> elements. Because the excerpts are taken from various places in the novels, all <div> elements below the ``excerpt'' level are more or less arbitrary; e.g. the first <div> inside an excerpt may well be <div chapter=45>.
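As a rough illustration of the structure described above, the following Python sketch counts the top-level excerpt divisions in a piece of corpus text. The function name `count_excerpts` and the sample string are hypothetical. The pattern assumes the attribute is written as in the description above (`type=excerpt`, optionally quoted), and it is a plain regular-expression check, not a real SGML parser.

```python
import re

# Matches <div type=excerpt>, with or without quotes around the value.
# Assumption: the attribute is spelled exactly as in the description above.
EXCERPT_DIV = re.compile(r"<div\s+type=[\"']?excerpt[\"']?\s*>", re.IGNORECASE)


def count_excerpts(sgml_text: str) -> int:
    """Return the number of excerpt-level <div> start tags in the text."""
    return len(EXCERPT_DIV.findall(sgml_text))


# Hypothetical miniature sample: two excerpts, one with a nested chapter div.
sample = (
    "<div type=excerpt><div chapter=45><s>A.</s></div></div>"
    '<div type="excerpt"><s>B.</s></div>'
)
print(count_excerpts(sample))
```

Nested divisions such as <div chapter=45> are deliberately not counted, since only the excerpt level is meaningful. Run over the whole corpus body, the count would be expected to equal the 51 excerpts stated above.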
The text is tagged down to the sentence level. Names, abbreviations and direct speech are also tagged.
Rendering information is included with the appropriate tags.
Example from the corpus:
<name type="org">Eesti Raamat</name>
Suurtes ja hallides teeäärsetes taludes olid elanud kulakud ja
raudsängijalgadesse kulda peitnud.</s>
Ühe talu perenaine oli aga ennast koguni sängijala külge
Mõned lagunenud sängid vedelesid veel praegugi nõgestes.</s>
The original came from the project ``Creating an Estonian text corpus'' in the form of electronic versions of 2000-word excerpts of Estonian novels and short stories, tagged to the sentence level in a TEI-like manner. The Estonian diacritics were encoded in extended ASCII. Every excerpt was a separate file.
An example of the original follows:
Viivi Luik "Seitsmes rahukevad" 1985, lk.3-8.
<s>Suurtes ja hallides tee=84=84rsetes taludes olid elanud
kulakud ja rauds=84ngijalgadesse kulda peitnud.</s>
<s>=9Ahe talu perenaine oli aga ennast koguni s=84ngijala k=81lge
<s>M"ned lagunenud s=84ngid vedelesid veel praegugi n"gestes.</s>
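The `=84`, `=81` and `=9A` sequences in the sample above can be turned back into Estonian letters with a short script. The following Python sketch is an illustration under stated assumptions, not the conversion script actually used. It assumes the sequences are hexadecimal byte escapes and that the bytes belong to DOS code page 850. The description only says "extended ASCII", but in that code page (and in code page 437) the three bytes are ä, ü and Ü, which agrees with the corpus example. The function name `decode_escapes` is hypothetical.

```python
import re

# Only escapes for bytes 0x80-0xFF are matched, so that markup such as
# <div chapter=45> is left alone.
_ESCAPE = re.compile(r"=([89A-F][0-9A-F])")


def decode_escapes(line: str, codepage: str = "cp850") -> str:
    """Replace =XX byte escapes with the character they denote.

    Assumption: the escaped bytes are in DOS code page 850; the source
    documentation does not name the code page.
    """
    def _replace(match: re.Match) -> str:
        byte_value = int(match.group(1), 16)
        return bytes([byte_value]).decode(codepage)

    return _ESCAPE.sub(_replace, line)


original = "<s>=9Ahe talu perenaine oli aga ennast koguni s=84ngijala k=81lge"
print(decode_escapes(original))
```

The double quote that appears in place of õ in the sample (for example in `M"ned`) is not a byte escape and is left untouched by this sketch.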
The TEI-like texts were automatically converted to CES-tagged versions by a script written by Leho Paldre, a student of linguistics at the University of Tartu. The result was validated by hand. The separate files were then merged into one file, which was modified to conform to CES as a single document.