Next: Hungarian Up: Multilingual Comparable 2: Newspapers Previous: Czech

Estonian

COP project 106 MULTEXT-East Deliverable D2.1 F Newspapers, Estonian

Contributors: Heiki-Jaan Kaalep, Heili Orav and Urve Talvik

Description of the Corpus

The Estonian MULTEXT-East newspaper corpus contains over 300 articles from eleven newspapers, all from 1985. The digital source used as the basis of encoding was provided by the University of Tartu, as an output of the project "Creating an Estonian text corpus". Licence agreements with the original publishers of the newspapers ensure that the texts may be freely distributed in any form for academic purposes.

The Estonian version of the newspaper corpus contains 112002 words, as indicated in the header of the tagged version.

Structure of the Corpus

The corpus body consists of 11 <div type=newspaper> elements, each of which contains several files from the original digital data, that is, several articles. Altogether there are over 300 <div type=article> elements. Besides "real" articles, the corpus also contains announcements and news from news agencies.
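
The two-level div structure described above can be checked mechanically. The following is a minimal sketch, not part of the project's tooling: it assumes the corpus body is available as a string and that the type attribute may appear with or without quotes, as it does in the samples in this section. The function name count_divs is hypothetical.

```python
import re
from collections import Counter

# Matches <div type=newspaper> as well as <div type="newspaper">.
# End tags (</div>) are not matched because of the leading "<div".
DIV_RE = re.compile(r'<div\s+type\s*=\s*"?([A-Za-z]+)"?\s*>', re.IGNORECASE)

def count_divs(sgml_text):
    """Count <div> start tags by the value of their type attribute."""
    return Counter(m.group(1).lower() for m in DIV_RE.finditer(sgml_text))

# Tiny made-up fragment, only to exercise the function.
sample = '''
<div type="newspaper">
  <div type="article"><p><s>...</s></p></div>
  <div type=article><p><s>...</s></p></div>
</div>
'''
counts = count_divs(sample)
print(counts["newspaper"], counts["article"])
```

Run over the full corpus body, the same function would be expected to report 11 newspaper divs and over 300 article divs, if the figures given above hold.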

The text is segmented up to the sentence level. Names, lists, numbers, abbreviations and direct speech are also tagged.

In most cases, the rendering information attribute (rend) has been included with the appropriate tags.

Example from the corpus:

        <div type="newspaper">
          <byline>
            <docauthor></docauthor>
            <title>&Otilde;htuleht</title>
            <name type="org">EKP KK kirjastus</name>
            <name type="place">Tallinn</name>
            <date>25/04/1985</date>
            <num type="issue">95</num>
            <num type="pages">1-4</num>
          </byline>
       <div type="article">
          <byline>
            <docauthor></docauthor>
            <title>V&otilde;tame v&auml;&auml;riliselt vastu NLKP XXVII
                   kongressi!</title>
            <num type="page">1</num>
          </byline>
      <head>
V&otilde;tame v&auml;&auml;riliselt vastu<name>
<abbr expan="N&otilde;ukogude Liidu Kommunistlik Partei">
NLKP</abbr>

XXVII
kongressi</name>
!</head>
<p>
<s>
<hi rend=bold>

Linna asutustes ja ettev&otilde;tetes tutvutakse<name>
<abbr expan="N&otilde;ukogude Liidu Kommunistlik Partei">
NLKP</abbr>
Keskkomitee aprillipleenumi</name>
materjalidega ning<name type=org>
Keskkomitee</name>
peasekret&auml;ri<name type=person>
Mihhail Gorbat&scaron;ovi</name>
<name>
ettekandega
<abbr expan="N&otilde;ukogude Liidu Kommunistlik Partei">
NLKP</abbr>
korralise,
XXVII
kongressi
kokkukutsumisest ning selle ettevalmistamise ja l&auml;biviimisega seotud
&uuml;lesannetest
</name>
.</hi>
</s></p>

Structure of the Original

The digital source used as the basis of encoding was provided by the University of Tartu. It consisted of data files, one file per article, tagged according to TEI. Each file starts with header information, followed by the text.

<!DOCTYPE TEI.2 SYSTEM 'tei2.dtd'[
<!ENTITY % TEI.general 'INCLUDE'>
<!ENTITY % TEI.analysis 'INCLUDE'>
<!ENTITY % TEI.figures 'INCLUDE'>
<!ENTITY % ISOLat1 PUBLIC "ISO  8879-1986//ENTITIES Added Latin 1//EN">
%ISOLat1;
<!ENTITY % ISOLat2 PUBLIC "ISO 8879-1986//ENTITIES Added Latin 2//EN">
%ISOLat2;
<!ENTITY % ISOnum  PUBLIC "ISO 8879-1986//ENTITIES Numeric and Special Graphic//EN">
%ISOnum;
<!ENTITY % ISOpub  PUBLIC "ISO 8879-1986//ENTITIES Publishing//EN">
%ISOpub;
<!ENTITY % ISOdia  PUBLIC "ISO 8879-1986//ENTITIES Diacritical Marks//EN">
%ISOdia;
<!ENTITY % MYent   PUBLIC "ISO 8879-1986//ENTITIES Characters Unfindable//EN">
%MYent;
]>
<tei.2 lang=ET>
<teiheader>
   <filedesc>
      <titleStmt>
         <title>TEI version of: V&otilde;tame v&auml;&auml;riliselt vastu NLKP
                XXVII kongressi!</title>
         <author></author>
         <principal>Heiki-Jaan Kaalep</principal>
         <respStmt>
            <resp>entered the text</resp>
            <name>Riina Mosna</name>
            <resp>validated with psgml</resp>
            <name>Heili Orav</name>
            <resp>validated with sgmls</resp>
            <name>Leho Paldre</name>
            <resp>finalised the header</resp>
            <name>Heiki-Jaan Kaalep</name>
         </respStmt>
      </titleStmt>
      <extent>8396 bytes</extent>
      <publicationStmt>
         <authority>T&Uuml; arvutuslingvistika uurimisgrupp</authority>
         <pubPlace>Tartu, Tiigi 78-232</pubPlace>
         <date>Fall 1995</date>
         <availability>
            <p>Available with prior consent for purposes of research only
         </availability>
      </publicationStmt>
      <sourcedesc>
         <bibl>
            <title level=a>V&otilde;tame v&auml;&auml;riliselt vastu NLKP XXVII
                   kongressi!</title>
            <title level=m>&Otilde;htuleht</title>
            <title level=s></title>
            <author></author>
            <biblScope type=issue>95</biblScope>
            <biblScope type=pages>pp. 1</biblScope>
            <imprint>
               <publisher>EKP KK Kirjastus</publisher>
               <pubPlace>Tallinn</pubPlace>
               <date>85-0-0</date>
            </imprint>
        </bibl>
      </sourcedesc>
   </filedesc>
   <encodingdesc>
        <projectdesc>
                <p>Estonian written text corpus of 1 million words
                   based on published texts from 1983-87
        </projectdesc>
        <samplingdecl>
                <p>Every text contains 1 whole unit of paper-printed text
                   (approximately of the size of 2,000 words unless the unit was smaller)
        </samplingdecl>
        <editorialdecl>
                <correction><p>No corrections</correction>
                <hyphenation><p>No hyphenated words in this electronic version.
                </hyphenation>
                <normalization><p>No normalization</normalization>
                <interpretation>
                   <p>Abbreviations are marked and their expanded form given.
                   <p>Proper named are marked with their type.
                   <p>All highlighted text is marked without interpretaton.
                   <p>Numbers are marked with numerical interpretation and type.
                   <p>Dates and time are marked with numerical interpretation.
                   <p>Lists are interpreted as clauses and items in list as phrases.
                <quotation marks=all><p></quotation>
                <segmentation>
                   <p>Up to the level of sentences</segmentation>
      </editorialdecl>
      <tagsdecl>
       <tagUsage gi=abbr occurs=4></tagUsage>
         <tagUsage gi=body occurs=1></tagUsage>
         <tagUsage gi=div0 occurs=1></tagUsage>
         <tagUsage gi=head occurs=1></tagUsage>
         <tagUsage gi=hi occurs=7></tagUsage>
         <tagUsage gi=name occurs=18></tagUsage>
         <tagUsage gi=num occurs=9></tagUsage>
         <tagUsage gi=p occurs=7></tagUsage>
         <tagUsage gi=q occurs=4></tagUsage>
         <tagUsage gi=s occurs=30></tagUsage>
         <tagUsage gi=text occurs=1></tagUsage>
      </tagsdecl>
   </encodingdesc>
   <profiledesc>
      <langUsage>
         <language id=ET>Estonian</language>
         <language id=DE>German</language>
         <language id=LA>Latin</language>
         <language id=EN>English</language>
         <language id=FR>French</language>
   </profiledesc>
</teiheader>
<text>
<body>
<div0 type=unknown><head>
V&otilde;tame v&auml;&auml;riliselt vastu<name  type=event>
<abbr expan="N&otilde;ukogude Liidu Kommunistlik Partei">
NLKP</abbr>
<num type=Roman value=" 27 ">
XXVII</num>
kongressi</name>
!</head>
<p>
<hi rend=bold>
<s>
Linna asutustes ja ettev&otilde;tetes  tutvutakse<name type=event>
<abbr expan="N&otilde;ukogude Liidu Kommunistlik   Partei">
NLKP</abbr>
Keskkomitee aprillipleenumi</name>
materjalidega ning<name type=org>
Keskkomitee</name>
peasekret&auml;ri<name type=person>
Mihhail Gorbat&scaron;ovi</name>
<name type=product>
ettekandega<hi rend="PRE laquo POST raquo">
<abbr expan="N&otilde;ukogude Liidu Kommunistlik   Partei">
NLKP</abbr>
korralise,<num type=Roman  value=" 27 ">
XXVII</num>
kongressi
kokkukutsumisest  ning selle ettevalmistamise ja l&auml;biviimisega seotud
&uuml;lesannetest</hi>
</name>
.</s>
</hi></p>
<p><hi>

Markup Process

A script to convert from TEI to CES was written by Leho Paldre, a student of Linguistics at the University of Tartu. As the DTD for CES changed several times during the first year of the project, the script had to be rewritten as many times. Cases which the script either did not cover or processed incorrectly were tagged by hand.
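
The conversion script itself is not reproduced in this document. The following is only a rough sketch of the kind of rewriting it might have involved, inferred by comparing the TEI sample and the CES sample earlier in this section: the TEI <div0 type=unknown> appears as <div type="article"> in the CES version, and names typed as event or product in TEI appear untyped in CES. The function name tei_to_ces_fragment, the set KEPT_NAME_TYPES, and the regular-expression approach are all assumptions made for illustration.

```python
import re

# Assumption: only these name types are carried over to CES; the CES sample
# shows org, person and place, while event and product appear untyped.
KEPT_NAME_TYPES = {"org", "person", "place"}

def _retag_name(match):
    """Keep the type attribute of a <name> tag only if it is in the kept set."""
    name_type = match.group(1).lower()
    if name_type in KEPT_NAME_TYPES:
        return "<name type=%s>" % name_type
    return "<name>"

def tei_to_ces_fragment(tei_text):
    """Rewrite a TEI body fragment towards the CES conventions seen in the samples."""
    # Any <div0 ...> start tag becomes an article-level div.
    out = re.sub(r"<div0\b[^>]*>", '<div type="article">', tei_text)
    out = re.sub(r"</div0\s*>", "</div>", out)
    # Drop name types that the CES sample does not show.
    out = re.sub(r'<name\s+type\s*=\s*"?(\w+)"?\s*>', _retag_name, out)
    return out

tei = ('<div0 type=unknown><head>\nvastu<name  type=event>\nNLKP</name>\n!</head>\n'
       '<p><s>ning<name type=org>\nKeskkomitee</name>\n.</s></p></div0>')
ces = tei_to_ces_fragment(tei)
print(ces)
```

A regular-expression pass like this is fragile against markup it was not written for, which is consistent with the note above that some cases had to be tagged by hand.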

Part of the sub-paragraph mark-up was deleted from the MTE version of the newspaper corpus because MTE does not need it, and validating it against a changing DTD seemed to be a waste of time and resources.
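
Which elements were deleted is not listed here. As an illustration only, the sketch below removes the start and end tags of a chosen set of elements while keeping their content. It uses num as the example because the Roman numeral in the TEI sample is wrapped in <num> while the same text in the CES sample is not. The function name unwrap_elements is hypothetical, and the pattern assumes that no attribute value contains a ">" character.

```python
import re

def unwrap_elements(sgml_text, element_names):
    """Remove start and end tags of the given elements, keeping their content."""
    if not element_names:
        return sgml_text
    names = "|".join(re.escape(name) for name in element_names)
    # Matches <num ...> and </num>; assumes attribute values contain no ">".
    tag_re = re.compile(r"</?(?:%s)\b[^>]*>" % names, re.IGNORECASE)
    return tag_re.sub("", sgml_text)

# Fragment modelled on the TEI sample in this section.
fragment = 'korralise,<num type=Roman  value=" 27 ">\nXXVII</num>\nkongressi'
unwrapped = unwrap_elements(fragment, ["num"])
print(unwrapped)
```

Because only the tags are removed, the line breaks of the source text are preserved, which matches the line layout of the CES sample.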


Multext-East