COP project 106 MULTEXT-East Fiction, Czech
Contributors: Vladimír Petkevic (FFUK) and Vera Schmiedtová (ÚJC AV CCR)
The Czech MULTEXT-East fiction corpus is represented by a book entitled OPERA --- průvodce operní tvorbou (Opera --- the Guide through the Art of Opera). The book, written by Anna Hostomská, was first published in 1955 by SNKLU, Prague; its second edition (Praha, Svoboda 1993) was the source of the Czech fiction corpus. The publication describes the content of many operas and the lives of their authors, and therefore contains very rich lexical material. The electronic version was obtained from the publishers in the T602 format (T602 is a Czech text editor). The book may be used solely for scientific and academic purposes.
As computed by the Unix program wc over the whole CES-1 document, the Czech fiction corpus contains 94973 words in 780081 bytes.
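Because wc counts whitespace-delimited tokens, the word figure above includes markup tokens as well as running text. The following minimal Python sketch approximates the same two counts; the function name wc_counts and the sample string are illustrative assumptions, and the byte count depends on the character encoding of the file, which is not stated here.

```python
def wc_counts(data: bytes) -> tuple[int, int]:
    """Approximate `wc -w` and `wc -c` for a document held in memory."""
    # bytes.split() with no argument splits on runs of ASCII whitespace,
    # so an SGML tag such as <p> counts as one "word", as it does for wc.
    n_words = len(data.split())
    # The byte count is simply the length of the raw, encoded data.
    n_bytes = len(data)
    return n_words, n_bytes


# Hypothetical sample; UTF-8 is used here only for illustration.
sample = "<p> První českou informační knihou byla </p>".encode("utf-8")
words, size = wc_counts(sample)
print(words, size)
```

For a file on disk, the same function can be applied to the result of reading the file in binary mode.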
The corpus consists of 7 <div type=chapter> elements. The 2 introductory chapters are further divided into sections; chapters 3 through 7 contain descriptions of the most famous European opera composers, each structurally expressed as <div type=composer>. Chapter 3 contains 13 composers, chapter 4 contains 3, chapter 5 contains 15, chapter 6 contains 22 and chapter 7 contains 4. The division into chapters follows the nationality principle: opera composers belonging to the same nation are included in the same chapter. Thus, the chapters are entitled Italská opera, Francouzská opera etc. Each composer unit consists of a short introduction about the composer in question, followed by descriptions of his operas in <div type=opera>. Each composer (opera) unit is introduced by a <head> element, which is followed by the description of the composer (opera). Each opera is further divided into individual acts.
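The division hierarchy described above can be checked by tallying <div> elements by their type attribute. The sketch below is one possible way to do this in Python with a regular expression, since the unquoted attribute values are not well-formed XML; the names DIV_TYPE and count_divs and the sample fragment are assumptions made for illustration, not part of the corpus tooling. The per-chapter composer figures are taken from the description.

```python
import re
from collections import Counter

# Composer counts per chapter, as stated in the corpus description.
COMPOSERS_PER_CHAPTER = {3: 13, 4: 3, 5: 15, 6: 22, 7: 4}

# Matches an opening <div ...> tag and captures the value of its type
# attribute, whether or not the value is quoted.
DIV_TYPE = re.compile(
    r'<div\b[^>]*?\btype\s*=\s*"?([A-Za-z][\w.-]*)"?', re.IGNORECASE
)


def count_divs(sgml: str) -> Counter:
    """Count opening <div> tags by the value of their type attribute."""
    return Counter(m.group(1).lower() for m in DIV_TYPE.finditer(sgml))


# Made-up fragment that mirrors the chapter > composer > opera nesting.
sample = (
    "<div type=chapter n=3 id=OPERA.3>"
    "<div type=composer><head>Composer</head>"
    "<div type=opera><head>Opera</head></div>"
    "</div></div>"
)
print(count_divs(sample))
print(sum(COMPOSERS_PER_CHAPTER.values()))  # 57 composers in total
```

Run over the whole corpus, the counts for chapter and composer would be expected to come out as 7 and 57 if the description is accurate.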
The <div type=chapter> elements have the n attribute, giving the chapter number, and the id attribute, whose value has the prefix OPERA followed by a period and the chapter number, e.g. <div type=chapter n=3 id=OPERA.3>. The deeper levels of structure are likewise reflected by successive numbers that hierarchically follow the book identification OPERA.
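The identifier scheme can be illustrated with a short Python sketch. The chapter-level form (OPERA.3) comes from the description; the assumption that deeper levels are also separated by periods (e.g. OPERA.3.2) is ours, as are the function names make_id, parse_id and chapter_tag.

```python
BOOK_ID = "OPERA"


def make_id(*numbers: int) -> str:
    """Build an id such as OPERA.3 from the book prefix and level numbers."""
    return ".".join([BOOK_ID, *map(str, numbers)])


def parse_id(div_id: str) -> tuple[int, ...]:
    """Split an id back into its hierarchy of numbers, e.g. OPERA.3 -> (3,)."""
    prefix, *parts = div_id.split(".")
    if prefix != BOOK_ID:
        raise ValueError(f"unexpected id prefix: {prefix!r}")
    return tuple(int(p) for p in parts)


def chapter_tag(n: int) -> str:
    """Render the opening tag of chapter n in the form shown in the text."""
    return f"<div type=chapter n={n} id={make_id(n)}>"


print(chapter_tag(3))
print(parse_id("OPERA.3.2"))  # period-separated nesting is an assumption
```

Keeping the numbers as a tuple makes it easy to sort divisions in document order or to find the parent of a division by dropping the last number.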
On the lowest structural level the text is composed of paragraphs containing subparagraph-level tags, such as <q>, <name>, <num>, <abbr>, <foreign> etc.
Rendering information is given as the CES-conformant two-letter value of the rend attribute. Each dash, except within the values of other tags, is represented by the SGML entity mdash.
Example from the corpus:
<div type=chapter n=3 id=OPERA> <body lang=cs id=OPERA> <div type=chapter n=1 id=OPERA.1> <head> <hi rend=ca>ÚVOD</hi> </head> <p> Obliba operního útvaru byla u nás odedávna mimořádná, ale v poslední době se naše obecenstvo s operou sžívá stále více. Stala se nejschůdnější cestou k proniknutí do světa hudební krásy. Jsou však úskalí, která začínajícímu zájemci působí nesnáze. Patří k nim složitost operního útvaru, obtížná zejména při sledování opery poprvé viděné či dokonce, z rozhlasu jen slyšené. Odedávna proto vznikaly pomůcky k snadnějšímu pochopení průběhu děje i hudby operního díla. </p> <p> První českou informační knihou byla <q rend="PRE lsquo POST rsquo"> <name>Česká zpěvohra bratří Hornů</name> </q>, vydaná roku <date><num>1903</num></date> u <name type=org>Grossmanna a Svobody</name> v <name type=place>Praze</name>. V abecedním seřazení skladatelů uváděla stati o českých operách od Škroupova <q rend="PRE lsquo POST rsquo"> <name>Dráteníka</name> </q> až po díla provedená do roku <date><num>1902</num></date>. Ovšem jejich knížka brzy nestačila. Roku <date><num>1910</num></date> napsal <name type=person>A. Tvrdek</name> knihu, kterou pod názvem <q rend="PRE lsquo POST rsquo"> <name>Anthologie z oper</name> </q> vydal <name type=person>Emil Šolce</name> v <name type=place>Telči</name>. V ní jsou obsažena už vedle základních českých oper i ta cizí díla, která se tehdy hrála na českých jevištích. Další publikací tohoto typu pak byl <q rend="PRE lsquo POST rsquo"> <name>Svět v opeře</name> </q> ( <date><num>1934</num></date>) sepsaný <name type=person>J. Branbergem</name> a <name type=person>Z. Münzerovou</name>. Podruhé titul vyšel roku <date><num>1939</num></date> a potřetí, rozšířen i o řadu baletů na více než <num>600</num> výkladů dějů, roku <date><num>1947</num></date>. </p>
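The <name> elements in an excerpt like the one above can be collected by type with a few lines of Python. This is only a sketch under stated assumptions: it uses a regular expression rather than an SGML parser, it does not handle nested <name> elements, and the names NAME, names_by_type and the label "untyped" are ours.

```python
import re
from collections import defaultdict

# Opening <name> tag with an optional, possibly unquoted, type attribute,
# followed by the shortest run of text up to the closing </name>.
NAME = re.compile(
    r'<name(?:\s+type\s*=\s*"?(\w+)"?)?\s*>(.*?)</name>', re.DOTALL
)


def names_by_type(sgml: str) -> dict[str, list[str]]:
    """Group the contents of <name> elements by their type attribute."""
    found = defaultdict(list)
    for m in NAME.finditer(sgml):
        # Names without a type attribute are filed under "untyped".
        found[m.group(1) or "untyped"].append(m.group(2).strip())
    return dict(found)


# Fragment assembled from pieces of the corpus excerpt.
sample = (
    "vydal <name type=person>Emil Šolce</name> "
    "v <name type=place>Telči</name>. "
    '<q rend="PRE lsquo POST rsquo"> <name>Svět v opeře</name> </q>'
)
print(names_by_type(sample))
```

A list of this kind is a convenient starting point for checking the consistency of the proper-noun markup.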
The original was technically organised according to printing columns (approx. 400 columns). These column divisions were omitted in our subsequent markup. The logical structure of the book was meticulously respected. Paragraph boundaries were marked in the original, and the rendition information in the original was respected in the markup.
The electronic version was at our disposal. It was marked up semi-automatically and in great detail (the corpus contains many proper nouns, all of which have been marked with the <name> tag). Possessive adjectives derived from proper nouns have not been marked with the <name> tag. Some spelling errors were also corrected during the markup. Finally, the whole corpus was validated with the nsgmls parser.