Next: Lithuanian Up: TELRI Appendix 1: Additional Previous: Corpus Encoding



 TELRI & MULTEXT-East Deliverable D2.1 F ``1984'', Latvian

Contributors: Andrejs Spektors and Tomaz Erjavec

Description of the Corpus

The digital source of the Latvian version of ``1984'' was created at the Artificial Intelligence Laboratory, University of Latvia, Riga. The Latvian translation used to create the digital source is the edition published by Avots, Riga, in 1990. The digital source was obtained from the printed edition by OCR.

The copyright on the Latvian translation of ``1984'' is held by the Latvian copyright agency ``AKKA/LAA''. We currently do not have permission to distribute this translation further.

The Latvian version of ``1984'' contains 81,956 words, as indicated in the header of the tagged version.

Structure of the Corpus

The body of the Latvian ``1984'' corpus consists of three <div type=part> elements (n=1, 2, 3) and one <div type=part n=appendix>. Each part is further subdivided into a number of <div type=chapter n=1, 2, ...>. In the Latvian version, each <div> start tag is followed by a <head> giving the part or chapter number. Chapter numbering restarts from 1 in every part.

The elements <body>, <div>, <head>, <item>, <l>, <list>, <note>, <p>, <poem>, <ptr>, <quote>, and <text> are used so as to be in harmony with the English ``1984'' for MULTEXT-East (-//MTE//TEXT CES1 1984//EN); the differences are due only to differences between the English electronic version and the Latvian printed version.

Of the possible sub-paragraph tags, only <hi>, <mentioned>, <name>, <q>, and <title> are used. The <q> tag is used only to denote quoted words, not direct speech. Rendering information is given as the CES-conformant two-letter value of the rend attribute. In most cases it has been included on the appropriate tags. The values used for the rend attribute are CA, CE, and IT.
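As a minimal sketch of how the rend values could be checked, the following Python function reports any code outside the three values listed above. It assumes that several codes may be combined in one attribute value separated by spaces, as in rend="CE CA" in the corpus excerpt in this section; the names ALLOWED_REND and check_rend are hypothetical and not part of the corpus tooling.

```python
# The three rend codes named in the text; nothing else is assumed to be valid.
ALLOWED_REND = {"CA", "CE", "IT"}


def check_rend(value):
    """Return the codes in a rend attribute value that are not allowed.

    Assumption: multiple codes are separated by whitespace.
    """
    return [code for code in value.split() if code not in ALLOWED_REND]


if __name__ == "__main__":
    print(check_rend("CE CA"))  # both codes are in the allowed set
    print(check_rend("CE BO"))  # "BO" is not one of the three listed codes
```

An empty result means every code in the value is one of CA, CE, or IT.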

The mark-up is uniform across all chapters of the novel, i.e. no chapter stands out as carrying more type information on its tags than the others.

The text has been automatically segmented into sentences, and the segmentation has been validated by hand. The <body>, <div>, <p>, <poem>, <list>, <l>, <item>, and <s> tags have been marked with the id attribute.

Page breaks from the original have been preserved as comments.

The following is an example from the Latvian ``1984'' corpus:

<p id="Olv.1.2.9"><s id="Olv.">Paties&imacr;bas ministrija
&mdash; jaunrun&amacr; to sauca par Patminu &mdash; krasi
at&scaron;&kcedil;&imacr;r&amacr;s no visiem citiem redzamiem
<s id="Olv.">T&amacr; bija milz&imacr;ga, piram&imacr;dai
l&imacr;dz&imacr;ga mirdzo&scaron;i balta betona celtne, kas
daudz&amacr;m citu par citu augst&amacr;k&amacr;m teras&emacr;m
sl&emacr;j&amacr;s debes&imacr;s tr&imacr;ssimt metrus.</s>
<s id="Olv.">No vietas, kur Vinstons st&amacr;v&emacr;ja,
var&emacr;ja salas&imacr;t t&amacr;s baltaj&amacr; fas&amacr;d&emacr;
skaistiem burtiem iecirstos tr&imacr;s partijas sauk&lcedil;us:</s>
<q rend="CE CA" type=slogan>KAR&Scaron; IR MIERS</q>
<q rend="CE CA" type=slogan>BR&Imacr;V&Imacr;BA IR VERDZ&Imacr;BA</q>
<q rend="CE CA" type=slogan>NEZIN&Amacr;&Scaron;ANA IR SP&Emacr;KS</q>

Structure of the Original

The electronic version obtained by OCR preserves most of the visual layout features of the printed version of ``1984'', including line breaks, end-of-line hyphenation, centering, capitalization, italic print, page numbers, Latvian characters, and foreign characters.
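Because line breaks and end-of-line hyphenation survive in the OCR output, words split across lines have to be rejoined before further processing. The following Python sketch shows one simple way to do this; it is an illustration only, not the procedure actually used for the corpus, and the function name dehyphenate is hypothetical. It assumes that a hyphen directly before a line break always marks a split word, which would wrongly merge genuine hyphenated compounds that happen to break at the hyphen.

```python
import re

# A letter, a hyphen, a newline, then another letter: treated as a split word.
# In Python 3, \w matches Unicode letters, so Latvian characters are covered.
SPLIT_WORD = re.compile(r"(\w)-\n(\w)")


def dehyphenate(text):
    """Join words that were hyphenated across a line break."""
    return SPLIT_WORD.sub(r"\1\2", text)


if __name__ == "__main__":
    print(dehyphenate("milzī-\nga celtne"))
```

Ordinary line breaks and hyphens inside a line are left untouched.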

Markup Process

The electronic version of ``1984'' created from the printed version by OCR was taken as the basis for the encoding. Typical OCR mistakes were corrected manually. Because this version has a layout similar to that of the printed version of ``1984'', it was possible to mark up many visual distinctions semi-automatically. The text was proofread and marked up to CES1 conformance. In the process, some typographical errors were discovered not only in the digital version but also in the printed edition of the Latvian translation of ``1984''. The transliteration of Latvian characters into SGML entities was performed automatically. A number of inconsistencies and anomalies were discovered and corrected in the process of aligning the Latvian translation with the original.
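The automatic transliteration step can be pictured with a short Python sketch. It is a minimal illustration under stated assumptions, not the tool that was actually used: the table contains only the entity names that occur in the corpus excerpt in this section, so it is deliberately incomplete, and the names ENTITIES and to_entities are hypothetical.

```python
# Character-to-entity table limited to entity names that appear in the
# excerpt from the Latvian corpus; other Latvian letters are not covered.
ENTITIES = {
    "ā": "amacr", "Ā": "Amacr",
    "ē": "emacr", "Ē": "Emacr",
    "ī": "imacr", "Ī": "Imacr",
    "š": "scaron", "Š": "Scaron",
    "ķ": "kcedil",
    "ļ": "lcedil",
    "\u2014": "mdash",  # long dash used in the running text
}


def to_entities(text):
    """Replace each character found in ENTITIES with its SGML entity."""
    return "".join(
        "&" + ENTITIES[ch] + ";" if ch in ENTITIES else ch
        for ch in text
    )


if __name__ == "__main__":
    print(to_entities("KARŠ IR MIERS"))
```

Characters that are not in the table pass through unchanged, so a fuller table would be needed before applying such a function to the whole text.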
