Contributors: Andrejs Spektors and Tomaz Erjavec
The digital source of the Latvian version of ``1984'' was created at the Artificial Intelligence Laboratory, University of Latvia, Riga. It is based on the Latvian translation published by Avots, Riga, in 1990, and was obtained by OCR from that printed edition.
The copyright of the translation of the Latvian ``1984'' is held by the Latvian copyright agency ``AKKA/LAA''. We currently do not have permission to distribute this translation further.
The Latvian version of ``1984'' contains 81,956 words, as indicated in the header of the tagged version.
The Latvian ``1984'' corpus body consists of three <div type=part n=1, 2, 3> and of one <div type=part n=appendix>. Each part is further subdivided into a number of <div type=chapter n=1, 2, ...>. In the Latvian version, each <div> is followed by a <head>, giving the part or chapter number. Counting of chapters starts from 1 in every part.
Elements <body>, <div>, <head>, <item>, <l>,
<list>, <note>, <p>, <poem>, <ptr>, <quote>,
<text> are used so as to be consistent with the English 1984
(-//MTE//TEXT CES1 1984//EN); the differences are
due only to the differences between the English electronic and
Latvian printed versions.
Of the possible sub-paragraph tags, only <hi>, <mentioned>, <name>, <q>, and <title> are used. The <q> tag is used only to denote quoted words, not direct speech. Rendering information is given as the CES-conformant two-letter value of the rend attribute, which in most cases has been included on the appropriate tags. The values used for the rend attribute are CA, CE, and IT.
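As an illustration of the rend convention, the following minimal Python sketch splits a rend value such as "CE CA" into its two-letter codes and checks them against the three values listed above. The function name parse_rend and the whitespace-separated format are assumptions based on the example in this document; the sketch only checks membership and does not interpret what each code means.

```python
# Codes listed in this documentation as the values used for rend.
ALLOWED_REND = {"CA", "CE", "IT"}


def parse_rend(value):
    """Split a rend attribute value into codes and validate them.

    Assumes codes are separated by whitespace, as in rend="CE CA".
    Returns the codes in their original order.
    """
    codes = value.split()
    unknown = [code for code in codes if code not in ALLOWED_REND]
    if unknown:
        raise ValueError("unexpected rend code(s): " + ", ".join(unknown))
    return codes


if __name__ == "__main__":
    print(parse_rend("CE CA"))
```

A check of this kind could be run over all tags carrying a rend attribute to spot values outside the documented set.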
The mark-up is uniform across all chapters of the novel, i.e. no chapter carries more type information on its tags than the others.
The text has been automatically sentence segmented, and the segmentation hand-validated. The <body>, <div>, <p>, <poem>, <list>, <l>, <item>, and <s> tags have been marked with the id attribute.
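The id values can be handled programmatically. The sketch below is a hedged illustration in Python: it assumes that an id consists of a text prefix followed by dot-separated integers (as in "Olv.1.2.9"), and that the id of a nested element extends the id of its parent by exactly one component. Both assumptions, the sample sentence id, and the helper names split_id and is_child_of are illustrative and not taken from the corpus specification.

```python
def split_id(ident):
    """Split an id such as 'Olv.1.2.9' into ('Olv', (1, 2, 9))."""
    prefix, *rest = ident.split(".")
    return prefix, tuple(int(part) for part in rest)


def is_child_of(child_id, parent_id):
    """Return True if child_id extends parent_id by one numeric component.

    This encodes the ASSUMED nesting convention, e.g. a sentence id
    'Olv.1.2.9.1' inside the paragraph 'Olv.1.2.9'.
    """
    child_prefix, child_path = split_id(child_id)
    parent_prefix, parent_path = split_id(parent_id)
    return (
        child_prefix == parent_prefix
        and len(child_path) == len(parent_path) + 1
        and child_path[: len(parent_path)] == parent_path
    )


if __name__ == "__main__":
    print(split_id("Olv.1.2.9"))
    print(is_child_of("Olv.1.2.9.1", "Olv.1.2.9"))
```

Such a check can help detect ids that were damaged during conversion, since a sentence id that does not extend its paragraph id stands out immediately.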
Page breaks from the original have been preserved as comments.
The following is an example from the Latvian ``1984'' corpus:
<p id="Olv.1.2.9"><s id="Olv.1.2.9.1">Patiesības ministrija
— jaunrunā to sauca par Patminu — krasi
atšķīrās no visiem citiem redzamiem
<s id="Olv.1.2.9.2">Tā bija milzīga, piramīdai
līdzīga mirdzoši balta betona celtne, kas
daudzām citu par citu augstākām terasēm
slējās debesīs trīssimt metrus.</s>
<s id="Olv.1.2.9.3">No vietas, kur Vinstons stāvēja,
varēja salasīt tās baltajā fasādē
skaistiem burtiem iecirstos trīs partijas saukļus:</s>
<q rend="CE CA" type=slogan>KARŠ IR MIERS</q>
<q rend="CE CA" type=slogan>BRĪVĪBA IR VERDZĪBA</q>
<q rend="CE CA" type=slogan>NEZINĀŠANA IR SPĒKS</q>
The electronic version obtained by OCR preserves most of the visual layout features of the printed version of ``1984'', including line breaks, end-of-line hyphenation, centering, capitalization, italic print, page numbers, and Latvian and foreign characters.
The electronic version of ``1984'' created from the printed version by OCR was taken as the basis for the encoding. Typical OCR mistakes were corrected manually. Because this version has a layout similar to that of the printed version, many visual distinctions could be marked up semi-automatically. It was proofread and marked up to CES1 conformance. In the process, some typographical errors were discovered not only in the digital version but also in the printed edition of the Latvian translation of ``1984''. The transliteration of Latvian characters into SGML entities was performed automatically. A number of inconsistencies and anomalies were discovered and corrected in the process of aligning the Latvian translation with the original.
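The automatic transliteration step can be illustrated with a short Python sketch. It is not the tool used for the corpus: as a stand-in for the SGML entity sets, it assumes the named character references shipped with Python's standard library (html.entities.html5), and it falls back to numeric character references for characters without a name. The function names build_entity_table and to_entities are hypothetical.

```python
from html.entities import html5


def build_entity_table():
    """Map each non-ASCII character to one entity name (without '&' and ';').

    Only single-character entities with a terminating ';' are considered.
    When several names exist for a character, the shortest (then the
    alphabetically first) is chosen so the result is deterministic.
    """
    table = {}
    for name, text in html5.items():
        if not name.endswith(";") or len(text) != 1 or ord(text) < 128:
            continue
        bare = name[:-1]
        best = table.get(text)
        if best is None or (len(bare), bare) < (len(best), best):
            table[text] = bare
    return table


ENTITY_TABLE = build_entity_table()


def to_entities(text):
    """Replace non-ASCII characters by entity references; keep ASCII as is."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in ENTITY_TABLE:
            out.append("&" + ENTITY_TABLE[ch] + ";")
        else:
            # No named entity known: use a numeric character reference.
            out.append("&#%d;" % ord(ch))
    return "".join(out)


if __name__ == "__main__":
    print(to_entities("Patiesības ministrija"))
```

Because the mapping is purely table-driven, the entity names can be swapped for those of whichever entity set the target DTD declares.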