COP project 106 MULTEXT-East Newspapers, Estonian
Contributors: Heiki-Jaan Kaalep, Heili Orav and Urve Talvik
The Estonian MULTEXT-East newspaper corpus contains over 300 articles from eleven newspapers, all from 1985. The digital source used as the basis of encoding was provided by the University of Tartu, as an output of the project "Creating an Estonian text corpus". Licence agreements with the original publishers of the newspapers ensure that the texts may be freely distributed in any form for academic purposes.
The Estonian version of the newspaper corpus contains 112002 words, as indicated in the header of the tagged version.
The corpus body consists of 11 <div type=newspaper> elements, each of which contains several files from the original digital data, that is, several articles. Altogether there are over 300 <div type=article> elements. Besides "real" articles, the corpus also contains announcements and news items from news agencies.
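The division structure described above can be checked mechanically. The following is a minimal sketch, not part of the project's tooling: it assumes the corpus is available as a single string and that a regular expression is adequate for locating div start tags. The function name count_divs and the sample string are hypothetical. The pattern accepts both quoted and unquoted attribute values, since both forms appear in this document.

```python
import re
from collections import Counter

# Matches <div type=newspaper> as well as <div type="newspaper">.
# End tags (</div>) are not matched because of the leading "<div".
DIV_RE = re.compile(
    r'<div\s+type\s*=\s*["\']?([A-Za-z0-9_.-]+)["\']?\s*>',
    re.IGNORECASE,
)

def count_divs(sgml_text):
    """Return a Counter mapping each div type to its number of start tags."""
    return Counter(m.group(1).lower() for m in DIV_RE.finditer(sgml_text))

# Hypothetical miniature corpus: one newspaper holding two articles.
sample = (
    '<div type="newspaper"> <byline> <title>Õhtuleht</title> </byline> '
    '<div type=article> <p> <s> Esimene. </s></p> </div> '
    '<div type="article"> <p> <s> Teine. </s></p> </div> </div>'
)

counts = count_divs(sample)
print(counts["newspaper"], counts["article"])
```

On the full corpus, the same call would be expected to report 11 newspaper divisions and over 300 article divisions, if the figures given above hold.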
The text is segmented up to the sentence level. Names, lists, numbers, abbreviations and direct speech are also tagged.
In most cases the rendering-information attribute (rend) has been included on the appropriate tags.
Example from the corpus:
<div type="newspaper"> <byline> <docauthor></docauthor> <title>Õhtuleht</title> <name type="org">EKP KK kirjastus</name> <name type="place">Tallinn</name> <date>25/04/1985</date> <num type="issue">95</num> <num type="pages">1-4</num> </byline> <div type="article"> <byline> <docauthor></docauthor> <title>Võtame vääriliselt vastu NLKP XXVII kongressi!</title> <num type="page">1</num> </byline> <head> Võtame vääriliselt vastu<name> <abbr expan="Nõukogude Liidu Kommunistlik Partei"> NLKP</abbr> XXVII kongressi</name> !</head> <p> <s> <hi rend=bold> Linna asutustes ja ettevõtetes tutvutakse<name> <abbr expan="Nõukogude Liidu Kommunistlik Partei"> NLKP</abbr> Keskkomitee aprillipleenumi</name> materjalidega ning<name type=org> Keskkomitee</name> peasekretäri<name type=person> Mihhail Gorbatšovi</name> <name> ettekandega <abbr expan="Nõukogude Liidu Kommunistlik Partei"> NLKP</abbr> korralise, XXVII kongressi kokkukutsumisest ning selle ettevalmistamise ja läbiviimisega seotud ülesannetest </name> .</hi> </s></p>
The digital source used as the basis of encoding was provided by the University of Tartu. It consisted of data files, one file per article, tagged according to TEI. Each file starts with header information, followed by the text.
<!DOCTYPE TEI.2 SYSTEM 'tei2.dtd'[ <!ENTITY % TEI.general 'INCLUDE'> <!ENTITY % TEI.analysis 'INCLUDE'> <!ENTITY % TEI.figures 'INCLUDE'> <!ENTITY % ISOLat1 PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN"> %ISOLat1; <!ENTITY % ISOLat2 PUBLIC "ISO 8879-1986//ENTITIES Added Latin 2//EN"> %ISOLat2; <!ENTITY % ISOnum PUBLIC "ISO 8879-1986//ENTITIES Numeric and Special Graphic//EN"> %ISOnum; <!ENTITY % ISOpub PUBLIC "ISO 8879-1986//ENTITIES Publishing//EN"> %ISOpub; <!ENTITY % ISOdia PUBLIC "ISO 8879-1986//ENTITIES Diacritical Marks//EN"> %ISOdia; <!ENTITY % MYent PUBLIC "ISO 8879-1986//ENTITIES Characters Unfindable//EN"> %MYent; ]> <tei.2 lang=ET> <teiheader> <filedesc> <titleStmt> <title>TEI version of: Võtame vääriliselt vastu NLKP XXVII kongressi!</title> <author></author> <principal>Heiki-Jaan Kaalep</principal> <respStmt> <resp>entered the text</resp> <name>Riina Mosna</name> <resp>validated with psgml</resp> <name>Heili Orav</name> <resp>validated with sgmls</resp> <name>Leho Paldre</name> <resp>finalised the header</resp> <name>Heiki-Jaan Kaalep</name> </respStmt> </titleStmt> <extent>8396 bytes</extent> <publicationStmt> <authority>TÜ arvutuslingvistika uurimisgrupp</authority> <pubPlace>Tartu, Tiigi 78-232</pubPlace> <date>Fall 1995</date> <availability> <p>Available with prior consent for purposes of research only </availability> </publicationStmt> <sourcedesc> <bibl> <title level=a>Võtame vääriliselt vastu NLKP XXVII kongressi!</title> <title level=m>Õhtuleht</title> <title level=s></title> <author></author> <biblScope type=issue>95</biblScope> <biblScope type=pages>pp. 
1</biblScope> <imprint> <publisher>EKP KK Kirjastus</publisher> <pubPlace>Tallinn</pubPlace> <date>85-0-0</date> </imprint> </bibl> </sourcedesc> </filedesc> <encodingdesc> <projectdesc> <p>Estonian written text corpus of 1 million words based on published texts from 1983-87 </projectdesc> <samplingdecl> <p>Every text contains 1 whole unit of paper-printed text (approximately of the size of 2,000 words unless the unit was smaller) </samplingdecl> <editorialdecl> <correction><p>No corrections</correction> <hyphenation><p>No hyphenated words in this electronic version. </hyphenation> <normalization><p>No normalization</normalization> <interpretation> <p>Abbreviations are marked and their expanded form given. <p>Proper named are marked with their type. <p>All highlighted text is marked without interpretaton. <p>Numbers are marked with numerical interpretation and type. <p>Dates and time are marked with numerical interpretation. <p>Lists are interpreted as clauses and items in list as phrases. 
<quotation marks=all><p></quotation> <segmentation> <p>Up to the level of sentences</segmentation> </editorialdecl> <tagsdecl> <tagUsage gi=abbr occurs=4></tagUsage> <tagUsage gi=body occurs=1></tagUsage> <tagUsage gi=div0 occurs=1></tagUsage> <tagUsage gi=head occurs=1></tagUsage> <tagUsage gi=hi occurs=7></tagUsage> <tagUsage gi=name occurs=18></tagUsage> <tagUsage gi=num occurs=9></tagUsage> <tagUsage gi=p occurs=7></tagUsage> <tagUsage gi=q occurs=4></tagUsage> <tagUsage gi=s occurs=30></tagUsage> <tagUsage gi=text occurs=1></tagUsage> </tagsdecl> </encodingdesc> <profiledesc> <langUsage> <language id=ET>Estonian</language> <language id=DE>German</language> <language id=LA>Latin</language> <language id=EN>English</language> <language id=FR>French</language> </profiledesc> </teiheader> <text> <body> <div0 type=unknown><head> Võtame vääriliselt vastu<name type=event> <abbr expan="Nõukogude Liidu Kommunistlik Partei"> NLKP</abbr> <num type=Roman value=" 27 "> XXVII</num> kongressi</name> !</head> <p> <hi rend=bold> <s> Linna asutustes ja ettevõtetes tutvutakse<name type=event> <abbr expan="Nõukogude Liidu Kommunistlik Partei"> NLKP</abbr> Keskkomitee aprillipleenumi</name> materjalidega ning<name type=org> Keskkomitee</name> peasekretäri<name type=person> Mihhail Gorbatšovi</name> <name type=product> ettekandega<hi rend="PRE laquo POST raquo"> <abbr expan="Nõukogude Liidu Kommunistlik Partei"> NLKP</abbr> korralise,<num type=Roman value=" 27 "> XXVII</num> kongressi kokkukutsumisest ning selle ettevalmistamise ja läbiviimisega seotud ülesannetest</hi> </name> .</s> </hi></p> <p><hi>
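Because each source file is a header followed by the text, the two parts can be separated and the tagUsage declarations read back from the header. The sketch below is illustrative only: the function names split_header and tag_usage are hypothetical, the input string is a shortened stand-in for a real file, and regular expressions are assumed to be sufficient for this fixed header layout.

```python
import re

HEADER_END_RE = re.compile(r'</teiheader\s*>', re.IGNORECASE)
TAG_USAGE_RE = re.compile(
    r'<tagUsage\s+gi\s*=\s*"?(\w+)"?\s+occurs\s*=\s*"?(\d+)"?\s*>',
    re.IGNORECASE,
)

def split_header(sgml_text):
    """Split one article file into (header, text) at the </teiheader> end tag."""
    m = HEADER_END_RE.search(sgml_text)
    if m is None:
        raise ValueError("no </teiheader> end tag found")
    return sgml_text[:m.end()], sgml_text[m.end():]

def tag_usage(header):
    """Return {element name: declared occurrence count} from <tagUsage> tags."""
    return {gi: int(n) for gi, n in TAG_USAGE_RE.findall(header)}

# Shortened, hypothetical stand-in for one source file.
source = (
    '<tei.2 lang=ET> <teiheader> <tagsdecl> '
    '<tagUsage gi=abbr occurs=4></tagUsage> '
    '<tagUsage gi=div0 occurs=1></tagUsage> '
    '<tagUsage gi=s occurs=30></tagUsage> '
    '</tagsdecl> </teiheader> <text> <body> </body> </text>'
)

header, text = split_header(source)
usage = tag_usage(header)
print(usage)
```

The declared counts could then be compared with the tags actually found in the text part, as a rough consistency check on a file.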
A script to convert from TEI to CES1 was written by Leho Paldre, a student of linguistics at the University of Tartu. As the DTD for CES changed several times during the first year of the project, the script had to be rewritten just as often. Some cases that the script either did not cover or processed incorrectly were tagged by hand.
Part of the sub-paragraph mark-up was deleted from the MTE version of the newspaper corpus because it is not needed in MTE, and validating it against a changing DTD seemed a waste of time and resources.
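Deleting mark-up while keeping the marked text can be sketched as follows. This is not the conversion script mentioned above, whose details are not given here. It is a hypothetical illustration that assumes tags can be stripped with a regular expression. The choice of num as the element to unwrap is only an example, taken from comparing the two samples in this document, where the Roman-numeral num in the heading appears in the TEI source but not in the corpus version.

```python
import re

def unwrap(sgml_text, names):
    """Delete start and end tags of the named elements, keeping their content."""
    alternatives = "|".join(re.escape(n) for n in names)
    # After the element name, allow either ">" directly or whitespace
    # followed by attributes, so that "num" does not match "<number>".
    tag_re = re.compile(
        r'</?(?:' + alternatives + r')(?:\s[^<>]*)?>',
        re.IGNORECASE,
    )
    return tag_re.sub("", sgml_text)

before = 'korralise,<num type=Roman value=" 27 "> XXVII</num> kongressi'
after = unwrap(before, ["num"])
print(after)
```

Tags of other elements, such as name or abbr, are left untouched unless they are listed explicitly in the second argument.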