Next: Hungarian Up: Multilingual Comparable 1: Fiction Previous: Czech

Estonian

COP project 106 MULTEXT-East Deliverable D2.1 F Fiction, Estonian

Contributors: Heiki-Jaan Kaalep, Leho Paldre, Heili Orav, Urve Talvik and Kadri Muischnek

Description of the Corpus

The Estonian fiction corpus consists of 51 excerpts from Estonian novels or short stories from 1985. Each excerpt is approximately 2000 words long. The digital source used as the basis for encoding was provided by the University of Tartu, as an output of the project "Creating an Estonian text corpus". License agreements with the authors ensure that the texts may be freely distributed in any form for academic purposes.

The Estonian fiction corpus contains 104435 words, as indicated in the header of the tagged version.

Structure of the Corpus

The corpus body consists of 51 <div type=excerpt> elements. Because the excerpts are taken from various places in the novels, all <div> elements below the level of "excerpt" are more or less arbitrary; e.g. the first <div> inside an excerpt may well be <div chapter=45>.

The text is tagged down to the level of sentences. Names, abbreviations and direct speech have also been tagged.

Rendering information has been included with the appropriate tags.

Example from the corpus:

<div type="excerpt">
          <byline>
            <docauthor>Viivi Luik</docauthor>
            <title>Seitsmes rahukevad</title>
            <name type="org">Eesti Raamat</name>
            <name type="place">Tallinn</name>
            <date>1985</date>
            <num type="pages">5-8</num>
          </byline>
<p>
<s>
Suurtes ja hallides tee&auml;&auml;rsetes taludes olid elanud kulakud ja
rauds&auml;ngijalgadesse kulda peitnud.</s>
<s>
&Uuml;he talu perenaine oli aga ennast koguni s&auml;ngijala k&uuml;lge
&auml;ra poonud.</s>
<s>
M&otilde;ned lagunenud s&auml;ngid vedelesid veel praegugi n&otilde;gestes.</s>
</p>
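
A fragment like the one above can be read back into plain sentences with a few lines of code. The following is a minimal sketch, not part of the MULTEXT-East tools: it assumes that sentences are delimited by bare <s> and </s> tags as in the example, and that the character entities used (such as &auml; and &Uuml;) are among those Python's standard `html.unescape` can decode. The names `SAMPLE`, `S_ELEMENT` and `extract_sentences` are hypothetical.

```python
import html
import re

# Shortened copy of the corpus example shown above.
SAMPLE = """<p>
<s>
Suurtes ja hallides tee&auml;&auml;rsetes taludes olid elanud kulakud ja
rauds&auml;ngijalgadesse kulda peitnud.</s>
<s>
&Uuml;he talu perenaine oli aga ennast koguni s&auml;ngijala k&uuml;lge
&auml;ra poonud.</s>
</p>"""

# Assumption: <s> carries no attributes and sentences are not nested.
S_ELEMENT = re.compile(r"<s>(.*?)</s>", re.DOTALL)


def extract_sentences(fragment):
    """Return the text of each <s> element with entities decoded."""
    sentences = []
    for match in S_ELEMENT.finditer(fragment):
        text = html.unescape(match.group(1))
        # Collapse the line breaks and indentation kept in the markup.
        sentences.append(" ".join(text.split()))
    return sentences


sentences = extract_sentences(SAMPLE)
```

A regular expression is enough for this flat sample; a full SGML parser would be the safer choice for the complete corpus, where elements may carry attributes or be nested.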

Structure of the Original

The original came from the project "Creating an Estonian text corpus" in the form of electronic versions of 2000-word excerpts of Estonian novels and short stories, tagged to the sentence level in a TEI-like manner. The Estonian diacritics were encoded in extended ASCII. Every excerpt was a separate file.

An example of the original follows:

Viivi Luik "Seitsmes rahukevad" 1985, lk.3-8.
<p>
        <s>Suurtes ja hallides tee=84=84rsetes taludes olid elanud
kulakud ja rauds=84ngijalgadesse kulda peitnud.</s>
<s>=9Ahe talu perenaine oli aga ennast koguni s=84ngijala k=81lge
=84ra poonud.</s>
<s>M"ned lagunenud s=84ngid vedelesid veel praegugi n"gestes.</s>
</p>

Markup Process

The TEI-like texts were automatically converted to CES-tagged versions by a script written by Leho Paldre, a linguistics student at the University of Tartu. The result was validated by hand. The separate files were then merged into one file, which was modified to form a single CES document.
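
One step of such a conversion, replacing the diacritic codes of the original with SGML character entities, can be sketched as follows. This is an illustration only, not the script used in the project: the mapping covers just the three codes whose entity equivalents can be read off the two examples in this section (=84, =81, =9A), and the names `ESCAPE_TO_ENTITY` and `escapes_to_entities` are hypothetical. The double quote that stands for &otilde; in the original sample is deliberately left out, since a quote character may also be genuine punctuation.

```python
import re

# Codes seen in the original sample, paired with the entities that
# appear at the same positions in the CES-tagged sample.
ESCAPE_TO_ENTITY = {
    "84": "&auml;",
    "81": "&uuml;",
    "9A": "&Uuml;",
}

# An equals sign followed by two upper-case hexadecimal digits.
ESCAPE = re.compile(r"=([0-9A-F]{2})")


def escapes_to_entities(line):
    """Replace known =XX codes with entities; leave unknown ones as they are."""
    def replace(match):
        return ESCAPE_TO_ENTITY.get(match.group(1), match.group(0))
    return ESCAPE.sub(replace, line)


converted = escapes_to_entities(
    "<s>=9Ahe talu perenaine oli aga ennast koguni s=84ngijala k=81lge"
)
```

Codes missing from the table pass through unchanged, so they remain visible during hand validation instead of being dropped silently.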
