 COP project 106 MULTEXT-East Deliverable D2.1 F ``1984'', Bulgarian

Contributors: Ludmila Dimitrova (IMI), Lydia Sinapova (IIT) and Kiril Simov (CLPP)

Description of the Corpus

The electronic version of the Bulgarian translation of ``1984'' was created manually from the printed edition, because no electronic version of the text existed. We do not possess written permission from the publisher to use the text. Like all other texts in the MULTEXT-East corpora, it can be used for research and academic purposes only. The book was published by the publishing house ``Profizdat'', Sofia, Bulgaria, in 1989; this is the first and so far only edition of the book.

The Bulgarian version of ``1984'' has 87235 words. Microsoft Word 6.0 was used to count words. The text was spell-checked.

Structure of the Corpus

The Bulgarian ``1984'' corpus body consists of four <div type=part> elements. The <div> elements have the n attribute, giving the sequential number of the <div> at its level (except for the appendix, where n=APPENDIX). The <div> elements also have an 'ID' attribute, whose value consists of the prefix 'Obg' and the chapter numbers separated by periods, e.g. <div type=chapter n=1 id="Obg.1.2">. Chapter numbering starts from 1 in every part. Each part (except <div type=part n=APPENDIX>) is further subdivided into a number of <div type=chapter> elements. In the Bulgarian version, only the appendix is followed by a <head>.

The text is further segmented into paragraphs, with the <quote>, <note>, and <poem> elements marked up at the paragraph level.

Sub-paragraph tagging consists of <item>, <l>, <list>, <q>, <s>, <abbr>, <date>, <foreign>, <hi>, <mentioned>, <name>, <num>, <ptr>, and <title>.

The text has been segmented into sentences automatically, and the segmentation has been validated by hand.

The <q> tag is used to mark quoted dialogue. Some <q> elements have been split during sentence segmentation; these are marked with type=MI, for ``machine inserted''. For the <q> tag, the attribute 'broken=yes' is used when no sentence-terminating punctuation appears between two dialogue fragments by the same speaker (either inside the <q> itself or in the intervening text between two <q> tags). The <name> tag is used for all proper nouns and for noun phrases denoting names.

Adjectives derived from proper nouns are not tagged.

All <name> tags contain the 'type' attribute. The tag <foreign> is used only for those Newspeak words which are typographically distinguished in the printed version of the text, and for the Latin words.

The tags <body>, <div>, <item>, <l>, <list>, <note>, <p>, <poem>, <q>, <quote>, <ptr> and <s> have been automatically supplied with an "id" attribute consisting of the prefix "Obg" and successive numbers, separated by periods, showing their hierarchical position within the text.
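The identifier scheme described above can be illustrated with a short Python sketch. It is not the tool that was used to build the corpus; the tree representation, the function name assign_ids and the rule that every child is numbered by its position under its parent are assumptions made for illustration only.

```python
PREFIX = "Obg"  # prefix taken from the corpus description


def assign_ids(children, path=()):
    """Yield (id, tag) pairs for a nested list of (tag, children) tuples.

    Assumption: each element is numbered by its 1-based position among
    its siblings, and the numbers are joined with periods.
    """
    for position, (tag, grandchildren) in enumerate(children, start=1):
        current = path + (position,)
        yield ".".join([PREFIX, *map(str, current)]), tag
        yield from assign_ids(grandchildren, current)


# A hypothetical fragment: one part containing one chapter with two paragraphs.
tree = [("div", [("div", [("p", []), ("p", [])])])]

for element_id, tag in assign_ids(tree):
    print(element_id, tag)
```

Under these assumptions the second paragraph of the first chapter of the first part receives the identifier Obg.1.1.2, which has the same shape as the identifiers in the corpus.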

All tags are used consistently with the English 1984 for MULTEXT-East; any differences are due only to the differences between the English electronic text and the Bulgarian printed text. Rendering information is included within the appropriate tags where necessary as a descriptive value of the 'rend' attribute. Two rendition short-cuts are used:

The value rend="PRE mdash" (or "PRE ldquo") is used when the quoted dialogue ends with the paragraph (there is no other typographical distinction). The value rend="POST mdash" (or "POST rdquo") is used when there is no typographical distinction (except ordinary punctuation) for the beginning of the quoted dialogue. No default rendition is used.
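As a rough illustration of how the two short-cuts could be interpreted by software reading the corpus, the following Python sketch splits a short-cut value into its position and mark and re-attaches the mark to the quoted text. The function names parse_rend and apply_rend are hypothetical, and the spacing around the mark is not specified in this description, so none is added.

```python
def parse_rend(value: str) -> tuple[str, str]:
    """Split a short-cut such as 'PRE mdash' into (position, mark)."""
    parts = value.split()
    if len(parts) != 2 or parts[0] not in ("PRE", "POST"):
        # Other descriptive rend values are not short-cuts.
        raise ValueError(f"not a rendition short-cut: {value!r}")
    return parts[0], parts[1]


def apply_rend(text: str, rend: str) -> str:
    """Attach the mark as an entity before (PRE) or after (POST) the text."""
    position, mark = parse_rend(rend)
    entity = f"&{mark};"
    return entity + text if position == "PRE" else text + entity


print(apply_rend("...", "PRE mdash"))   # &mdash;...
print(apply_rend("...", "POST rdquo"))  # ...&rdquo;
```

Descriptive values that are not short-cuts raise ValueError in this sketch and would need separate handling.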

The entire book is marked up using the same level of detail, i.e. no part is more detailed than the rest.

The following is an example from the Bulgarian ``1984'' corpus:

<p id="Obg.1.1.7">
<s id="Obg.">
<name type=org>&Mcy;&icy;&ncy;&icy;&scy;&tcy;&iecy;&rcy;&scy;&tcy;&vcy;&ocy;&tcy;&ocy;
&ncy;&acy; &icy;&scy;&tcy;&icy;&ncy;&acy;&tcy;&acy;</name> &mdash;
<name type=org lang=ns-bg>&Mcy;&icy;&ncy;&icy;&pcy;&rcy;&acy;&vcy;</name> 
&pcy;&ocy; &ncy;&ocy;&vcy;&gcy;&ocy;&vcy;&ocy;&rcy;
<ptr id="Obg." target="Obg.1.1.8" n=1>
&mdash; &rcy;&yacy;&zcy;&kcy;&ocy; &scy;&iecy;
&ocy;&tcy;&lcy;&icy;&chcy;&acy;&vcy;&acy;&shcy;&iecy; &ocy;&tcy;
&pcy;&rcy;&iecy;&dcy; &ocy;&chcy;&icy;&tcy;&iecy; &mcy;&ucy;.</s>
<s id="Obg.">&Tcy;&ocy;&vcy;&acy; &bcy;&iecy;
&pcy;&icy;&rcy;&acy;&mcy;&icy;&dcy;&acy; &ocy;&tcy;
&bcy;&lcy;&iecy;&scy;&tcy;&yacy;&shchcy; &bcy;&yacy;&lcy;
&kcy;&ocy;&yacy;&tcy;&ocy; &tcy;&iecy;&rcy;&acy;&scy;&acy;
&scy;&lcy;&iecy;&dcy; &tcy;&iecy;&rcy;&acy;&scy;&acy; &scy;&iecy;
&icy;&zcy;&dcy;&icy;&gcy;&acy;&shcy;&iecy; &ncy;&acy;
&tcy;&rcy;&icy;&scy;&tcy;&acy; &mcy;&iecy;&tcy;&rcy;&acy;
<s id="Obg.">&Ocy;&tcy; &mcy;&yacy;&scy;&tcy;&ocy;&tcy;&ocy; &scy;&icy;
<name type=person>&Ucy;&icy;&ncy;&scy;&tcy;&hardcy;&ncy;</name>
&mcy;&ocy;&zhcy;&iecy;&shcy;&iecy; &dcy;&acy;
&icy;&zcy;&pcy;&icy;&scy;&acy;&ncy;&icy; &scy;
&bcy;&ucy;&kcy;&vcy;&icy; &vcy;&hardcy;&rcy;&khcy;&ucy;
&bcy;&yacy;&lcy;&acy;&tcy;&acy; &fcy;&acy;&scy;&acy;&dcy;&acy;
&tcy;&rcy;&icy;&tcy;&iecy; &lcy;&ocy;&zcy;&ucy;&ncy;&gcy;&acy;
&ncy;&acy; &pcy;&acy;&rcy;&tcy;&icy;&yacy;&tcy;&acy;:
<q id="Obg." type=slogan rend="CE CA">&Vcy;&ocy;&jcy;&ncy;&acy;&tcy;&acy; &iecy; &mcy;&icy;&rcy;</q>
<q id="Obg." type=slogan rend="CE CA">&Scy;&vcy;&ocy;&bcy;&ocy;&dcy;&acy;&tcy;&acy; &iecy; &rcy;&ocy;&bcy;&scy;&tcy;&vcy;&ocy;</q>
<q id="Obg." type=slogan rend="CE CA">&Ncy;&iecy;&vcy;&iecy;&zhcy;&iecy;&scy;&tcy;&vcy;&ocy;&tcy;&ocy; &iecy; &scy;&icy;&lcy;&acy;</q>
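The Cyrillic text in the example is written with SGML character entities such as &Mcy; and &icy;. A minimal Python sketch for turning them into readable Unicode is given below. It relies on the standard library function html.unescape and on the assumption that the HTML5 entity names coincide with the entity set used in the corpus, which holds for the names appearing in the example; the function name decode_entities is hypothetical.

```python
import html


def decode_entities(text: str) -> str:
    """Replace character entities such as &Mcy; with Unicode characters.

    Markup tags are left untouched; only entity references are decoded.
    """
    return html.unescape(text)


sample = "&Mcy;&icy;&ncy;&icy;&pcy;&rcy;&acy;&vcy;"
print(decode_entities(sample))  # Миниправ
```

Note that html.unescape also decodes entities such as &amp; and &lt;, so the result should be treated as display text rather than as markup to be parsed again.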

Structure of the Original

There was no original electronic version. The book was typed in by Ludmila Dimitrova (IMI), Lydia Sinapova (IIT) and Kiril Simov (CLPP).

During hand-validation and comparison with the English version, it was found that some English sentences had not been translated and that one paragraph had been moved to the next chapter. The original text has not been corrected.

Markup Process

The text of the Bulgarian ``1984'' was marked up during the typing process in accordance with CES1 conformance. Typographical errors encountered in the printed edition were corrected.

