


 TELRI & MULTEXT-East Deliverable D2.1 F ``1984'', Lithuanian

Contributors: Andrius Utka and Tomaz Erjavec

Description of the Corpus

The digital source of the Lithuanian version of ``1984'' was created at the Centre of Computational Linguistics of Vytautas Magnus University, Kaunas. It is based on the Lithuanian translation published by ``Vyturys'', Vilnius, in 1991, and was obtained by OCR from the printed edition.

The digital source for the Lithuanian ``1984'' has the same distribution restrictions as those imposed by the MULTEXT-East project, namely, that the use of the materials is free for academic purposes.

We do not currently have written permission to use the text. However, we do have verbal assurance from the translator that the Lithuanian translation of ``1984'' may be used for the purposes of the TELRI project.

The Lithuanian version of ``1984'' contains 71,210 words, as indicated in the header of the tagged version.

Structure of the Corpus

The Lithuanian ``1984'' corpus body consists of three <div type=part n=1, 2, 3> and of one <div type=part n=appendix>. Each part is further subdivided into a number of <div type=chapter n=1, 2, ...>. In the Lithuanian version, each <div> is followed by a <head>, giving the part or chapter number. Counting of chapters starts from 1 in every part.

The elements <body>, <div>, <head>, <item>, <l>, <list>, <note>, <p>, <poem>, <ptr>, <quote>, <text> are used so as to be consistent with the English 1984 for MULTEXT-East (-//MTE//TEXT CES1 1984//EN); any differences are due only to differences between the English electronic version and the Lithuanian printed version.

Sub-paragraph tags, i.e. <abbr>, <date>, <foreign>, <mentioned>, <name>, <num>, are not used, except for <hi>, <q>, and <title>, which are used inconsistently. Rendering information is given as the CES-conformant two-letter value of the rend attribute. In most cases it has been included with the appropriate tags, except for the default &mdash; preceding the <q> tag. The values of the rend attribute are: CA, CE, CN, IT, and '*'.
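A check of rend values against the list above could be sketched in Python as follows. This is only an illustration, not part of the project's tooling: the function name invalid_rend_codes is hypothetical, and the sketch assumes that several codes may be combined in one attribute value separated by spaces, as in rend="CE CA" in the corpus example.

```python
# Rend codes listed in this description of the Lithuanian corpus.
ALLOWED_REND = {"CA", "CE", "CN", "IT", "*"}


def invalid_rend_codes(rend_value):
    """Return the codes in a rend attribute value that are not in the list.

    Assumption: multiple codes are separated by whitespace.
    """
    return [code for code in rend_value.split() if code not in ALLOWED_REND]
```

An empty result means every code in the value is one of the listed ones.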

The mark-up is uniform across all chapters of the novel, i.e. no chapter can be distinguished as having more type information on its tags than the others.

The text has been automatically segmented into sentences, and the segmentation has been validated by hand. The <body>, <div>, <quote>, <p>, <poem>, <list>, <l>, <item>, <s>, and <q> tags have been marked with the id attribute.
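Whether every tag of the listed types actually carries an id attribute can be checked with a short script. The Python sketch below is an assumption-laden illustration rather than the validation procedure used for the corpus: it scans start tags with regular expressions (the SGML uses unquoted attribute values, so an XML parser would not accept it), it assumes attribute values contain no '>' character, and the name missing_ids is hypothetical.

```python
import re

# Element types that, according to this description, carry an id attribute.
ID_BEARING = {"body", "div", "quote", "p", "poem", "list", "l", "item", "s", "q"}

# A start tag: element name followed by an optional attribute string.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b([^<>]*)>")

# An id attribute with a double-quoted, single-quoted, or unquoted value.
ID_ATTR = re.compile(r"""\bid\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)


def missing_ids(sgml):
    """Return the start tags of id-bearing element types that lack an id."""
    missing = []
    for match in START_TAG.finditer(sgml):
        name, attrs = match.group(1).lower(), match.group(2)
        if name in ID_BEARING and not ID_ATTR.search(attrs):
            missing.append(match.group(0))
    return missing
```

End tags such as </s> are skipped because the pattern requires a letter directly after the opening angle bracket.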

The following is an example from the Lithuanian ``1984'' corpus:

<p id="Olt.1.2.8">
<s id="Olt.">Teisyb&edot;s ministerija&mdash;naujakalbe 
<ptr id="Olt." n="1" target=N1 rend=asterisk target= "Olt.1.2.11">
Teisybmin&edot; &mdash; akivaizd&zcaron;iai skyr&edot;si nuo vis&uogon; kit&uogon; 
aplinkini&uogon; pastat&uogon;.</s><s id="Olt.">Tai buvo giganti&scaron;kas 
piramid&edot;s pavidalo statinys, &zcaron;vilgantis baltu betonu, terasa po terasos kylantis &iogon; 
300 metr&uogon; auk&scaron;t&iogon;.</s><s id="Olt.">Vinstonas i&scaron; &ccaron;ia dar 
gal&edot;jo perskaityti grak&scaron;&ccaron;iomis raid&edot;mis baltame fone 
&scaron;vie&ccaron;ian&ccaron;ius tris partijos &scaron;&umacr;kius:
<q id="Olt." rend="CE CA" type=slogan>KARAS&mdash;TAI TAIKA</q>
<q id="Olt." rend="CE CA" type=slogan>LAISV&Edot;&mdash;TAI VERGIJA</q>
<q id="Olt." rend="CE CA" type=slogan>NE&Zcaron;INOMAS&mdash;TAI 

Structure of the Original

The electronic version obtained by OCR preserves all the visual layout features of the printed version of ``1984'', including line breaks, centering, capitalization, bold print, italic print, page numbers, Lithuanian characters and foreign characters.

Markup Process

The electronic version of ``1984'' that was created from the printed version by OCR was taken as the basis for the encoding. Typical OCR mistakes were corrected manually as well as with a spelling checker. Because this version has a layout similar to that of the printed version of ``1984'', many visual distinctions could be marked up semi-automatically. It was proofread and marked up to CES1 conformance. In the process, some typographical errors were discovered not only in the digital version, but also in the printed edition of the Lithuanian translation of ``1984''. The transliteration of Lithuanian characters into SGML entities was performed automatically. A number of inconsistencies and anomalies were discovered and corrected in the process of aligning the Lithuanian translation with the original.
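The automatic transliteration step can be illustrated with a minimal Python sketch. It assumes the input is Unicode text and that the entity names follow the ISO 8879 Latin 2 set, which is consistent with the names visible in the corpus example (&edot;, &scaron;, &zcaron;, &Edot;, &mdash;). The table and the function name to_sgml_entities are illustrative and are not the tool actually used in the project.

```python
# Lower-case Lithuanian letters and their assumed entity names.
# edot, scaron, zcaron, ccaron, uogon, iogon and umacr appear in the corpus
# example; aogon and eogon are assumed by analogy (ISO 8879 Latin 2 names).
_LOWER = {
    "ą": "aogon", "č": "ccaron", "ę": "eogon", "ė": "edot", "į": "iogon",
    "š": "scaron", "ų": "uogon", "ū": "umacr", "ž": "zcaron",
}

# Full table: upper-case letters get the same name with a capital initial.
ENTITIES = dict(_LOWER)
for _char, _name in _LOWER.items():
    ENTITIES[_char.upper()] = _name[0].upper() + _name[1:]
ENTITIES["\u2014"] = "mdash"  # em dash, written &mdash; in the corpus


def to_sgml_entities(text):
    """Replace each character found in the table by its &name; reference."""
    return "".join(
        "&" + ENTITIES[ch] + ";" if ch in ENTITIES else ch for ch in text
    )
```

Characters outside the table, including plain ASCII, pass through unchanged, so the function can be applied to text that already contains markup.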
