
Bulgarian

COP project 106 MULTEXT-East Fiction, Bulgarian

Contributors: Institute of Mathematics, Bulgarian Academy of Sciences

Description of the Corpus

The Bulgarian MULTEXT-East fiction corpus contains the novel ``PASSION or the death of Alice'' by Emilia Dvorianova and the first four chapters of the novel ``I want, I believe, I can'' by Julia Berberyan.

The first novel was first published by ``OBSIDIAN'', Sofia, in 1995; the second by ``Abagar Holding'', Sofia, also in 1995. The Bulgarian site obtained the author's or publisher's agreement allowing the novels to be used for the purposes of the MULTEXT-East project.

``PASSION or the death of Alice'' was provided in WinWord format and ``I want, I believe, I can'' in PageMaker format. Both texts were converted to ASCII files.

The Bulgarian Fiction corpus contains 98371 words: approximately 54100 in the first part (``PASSION or the death of Alice'') and approximately 44200 in the second part (``I want, I believe, I can'').

Structure of the Corpus

The corpus body consists of 15 <div type=chapter> elements. The chapters from ``I want, I believe, I can'' start with a <head> giving the title of the chapter.

Each <div type=chapter> element has an n attribute, giving the chapter number, and an id attribute, whose value consists of the prefix Fictbg, a period, a string identifying the book, and the chapter number, for example <div type=chapter n=1 id=Fictbg.alice.1.1>.
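
The id format described above can be taken apart mechanically. The following is a minimal sketch, not part of the MULTEXT-East tools: the names ChapterId and parse_chapter_id are hypothetical, and since the example id ends in two numbers whose individual meaning is not spelled out here, the sketch returns them as an uninterpreted tuple.

```python
from typing import NamedTuple, Tuple


class ChapterId(NamedTuple):
    """Parts of an id value such as 'Fictbg.alice.1.1' (hypothetical helper)."""
    prefix: str               # corpus prefix, e.g. 'Fictbg'
    book: str                 # string identifying the book, e.g. 'alice'
    numbers: Tuple[int, ...]  # trailing numeric components, left uninterpreted


def parse_chapter_id(value: str) -> ChapterId:
    """Split an id on periods into prefix, book string and numeric parts."""
    parts = value.split(".")
    if len(parts) < 3:
        raise ValueError(f"expected prefix.book.number, got {value!r}")
    prefix, book, *rest = parts
    # int() raises ValueError if a trailing component is not numeric.
    return ChapterId(prefix, book, tuple(int(p) for p in rest))


if __name__ == "__main__":
    print(parse_chapter_id("Fictbg.alice.1.1"))
```

A check of this kind could be used to confirm that every chapter id in the corpus carries the Fictbg prefix and a known book string.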

The text is segmented into paragraphs, with the <head>, <quote>, <opener> and <poem> elements marked up at the paragraph level.

Sub-paragraph tagging consists of <abbr>, <foreign>, <hi>, <date>, <name>, <l>, <num> and <q>. Direct speech has been marked up with <q>. ``PASSION or the death of Alice'' is tagged in full at the sub-paragraph level; ``I want, I believe, I can'' is tagged only partially.

Rendering information, given as the CES-conformant two-letter value of the rend attribute, has been included with the appropriate tags and, for mdash and capitalisation, retained in the tag content.

The following is an example from the Bulgarian Fiction corpus. In the original, the characters are encoded as SGML ISO Cyrillic entities (e.g. &ocy;); to make the sample more readable, they have been automatically transliterated into Latin (e.g. &ocy; becomes o).

<p>
Poslie vlyazokh i mie porhardsi dhardzhd ot shubraka, i pak sie
usietikh razshardrdiena, dazhie promhardrmorikh nieshcho nie
shardvsiem za kazvanie, shchoto i za tozi shubrak smie sie
razpravyali, no za niego gospozhitsata bieshie nieprieklonna i nie sie
usmikhvashie, a ustnichkitie ig sie svivakha na dhardgichka:
</p>
<p>
<q rend="PRE mdash">
Da nie si go 
pipnala, lielsofto <name 
type=person>Jo</name>!</q>
</p>
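
The automatic transliteration mentioned above is not specified further. As an illustration only, the sketch below assumes one very simple scheme: each ISO Cyrillic entity of the form &NAMEcy; is replaced by NAME. This happens to produce spellings such as ``dhardzhd'' seen in the sample, but it is an assumption, not the project's documented procedure; the function name is hypothetical.

```python
import re

# Matches SGML entities whose name ends in "cy", e.g. &ocy; or &hardcy;.
# The captured group is the entity name without the trailing "cy".
CYR_ENTITY = re.compile(r"&([A-Za-z]+)cy;")


def transliterate_entities(text: str) -> str:
    """Replace each &NAMEcy; entity by NAME (assumed scheme, see lead-in)."""
    return CYR_ENTITY.sub(r"\1", text)


if __name__ == "__main__":
    print(transliterate_entities("&dcy;&hardcy;&zhcy;&dcy; &ocy;&tcy;"))
```

Entities that do not end in cy, and ordinary text, pass through unchanged.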

Structure of the Original

The versions that served as the basis for the encoding were ASCII files. The text contained a number of misrendered punctuation marks and spelling mistakes. A spell-checker was used to correct the spelling mistakes. It should be noted that the original electronic texts contained Latin letters used in place of Cyrillic ones (e.g. Latin ``e'' instead of Cyrillic ``e''). These occurrences were detected by the spell-checker, but the Latin letters had to be replaced with Cyrillic ones manually. In the following example, the Cyrillic letters have been transliterated into Latin:

Poslie vlyazokh i mie porhardsi dhardzhd ot shubraka, i pak sie
usietikh razshardrdiena, dazhie promhardrmorikh nieshcho nie
shardvsiem za kazvanie, shchoto i za tozi shubrak smie sie
razpravyali, no za niego gospozhitsata bieshie nieprieklonna i nie sie
usmikhvashie, a ustnichkitie ig sie svivakha na dhardgichka: - Da nie
si go pipnala, lielsofto Jo!
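
The Latin-for-Cyrillic substitutions described above can also be located by inspecting character scripts. The following is a rough sketch for Unicode text, not the procedure used by the Bulgarian site, which relied on a spell-checker and manual correction. The LOOKALIKES table and the function names are hypothetical, and the table covers only a few lowercase letters.

```python
import unicodedata

# Hypothetical table of Latin letters that look like Cyrillic ones.
LOOKALIKES = {
    "a": "\u0430",  # Cyrillic a
    "e": "\u0435",  # Cyrillic ie
    "o": "\u043e",  # Cyrillic o
    "p": "\u0440",  # Cyrillic er
    "c": "\u0441",  # Cyrillic es
    "x": "\u0445",  # Cyrillic ha
}


def script_of(ch):
    """Return 'LATIN', 'CYRILLIC' or None, based on the Unicode character name."""
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "CYRILLIC"):
        if name.startswith(script):
            return script
    return None


def mixed_script_words(text):
    """List whitespace-separated words that mix Latin and Cyrillic letters."""
    flagged = []
    for word in text.split():
        scripts = {script_of(ch) for ch in word} - {None}
        if scripts == {"LATIN", "CYRILLIC"}:
            flagged.append(word)
    return flagged


def fix_word(word):
    """Replace known Latin lookalikes in a flagged word by Cyrillic letters."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in word)
```

Only mixed-script words are flagged; a word written entirely in Latin letters (a foreign name, for instance) is left alone and would still need human review.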

Markup Process

The version has been marked up to CES1 conformance; all tags were inserted manually.


Tomaz Erjavec
Sat May 18 20:25:31 MDT 1996