 COP project 106 MULTEXT-East Deliverable D2.1 F Fiction, Czech

Contributors: Vladimír Petkevic (FFUK) and Vera Schmiedtová (ÚJC AV CCR)

Description of the Corpus

The Czech MULTEXT-East fiction corpus consists of a single book, OPERA -- pruvodce operní tvorbou (Opera -- A Guide to Operatic Works), by Anna Hostomská. The book was first published in 1955 by SNKLU, Prague; the source of the Czech fiction corpus is its second edition (Praha, Svoboda 1993). The publication consists largely of descriptions of the content of operas and of their composers, and therefore offers very rich lexical material. The electronic version was obtained from the publishers in T602 format (T602 is a Czech text editor). The book may be used solely for scientific and academic purposes.

As computed by the Unix program wc over the whole CES-1 document, the Czech fiction corpus contains 94965 words in 1146353 bytes.
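
Because wc counts whitespace-delimited strings, tags and character entities are included in the word figure. The following minimal sketch illustrates this; the function name and the sample string are assumptions made for illustration and are not part of the corpus tooling.

```python
def wc_counts(data: bytes) -> tuple[int, int]:
    """Return (words, bytes) in the manner of `wc -w` and `wc -c`."""
    # bytes.split() with no argument splits on runs of ASCII whitespace,
    # so a "word" is any maximal run of non-whitespace bytes.
    return len(data.split()), len(data)


# Illustrative sample only: markup and entities count towards the total.
sample = b"<p>Obliba opern&iacute;ho &uacute;tvaru</p>\n"
words, size = wc_counts(sample)
print(words, size)
```

The word count is thus a count of strings in the marked-up file, not of linguistic tokens.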

Structure of the Corpus

The corpus consists of 7 <div type=chapter> elements. The 2 introductory chapters are further divided into sections; chapters 3 through 7 describe the most famous European opera composers, each of whom is marked up as <div type=composer>. Chapter 3 contains 13 composers, chapter 4 contains 3, chapter 5 contains 15, chapter 6 contains 22 and chapter 7 contains 4. The chapters are divided by nationality: opera composers belonging to the same nation are placed in the same chapter, so the chapters are entitled Italská opera, Francouzská opera etc. Each composer unit consists of a short introduction to the composer followed by descriptions of his operas, each in a <div type=opera>. Each composer or opera division opens with a <head> element, which is followed by the description of the composer or opera. Each opera is further divided into individual acts.
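
The composer counts given above could be checked against the marked-up text with a short script. This is only a sketch: it assumes unquoted attribute values and the attribute order type, n, as in the chapter tag quoted in this document, and the function name and sample are hypothetical.

```python
import re

# Figures stated in the description above (chapter number -> composers).
EXPECTED = {"3": 13, "4": 3, "5": 15, "6": 22, "7": 4}

# Assumption: attributes are unquoted and "type" precedes "n".
DIV = re.compile(r"<div\s+type=(\w+)(?:\s+n=(\w+))?")


def composers_per_chapter(sgml: str) -> dict:
    """Count <div type=composer> elements under each <div type=chapter>."""
    counts = {}
    current = None
    for match in DIV.finditer(sgml):
        kind, n = match.group(1), match.group(2)
        if kind == "chapter":
            current = n
            counts[current] = 0
        elif kind == "composer" and current is not None:
            counts[current] += 1
    return counts


# Tiny made-up sample, not taken from the corpus.
sample = """<div type=chapter n=3 id=OPERA.3>
<div type=composer><div type=opera></div></div>
<div type=composer></div>
</div>
<div type=chapter n=4 id=OPERA.4>
<div type=composer></div>
</div>"""
print(composers_per_chapter(sample))
print(sum(EXPECTED.values()))
```

Run over the full corpus, the result restricted to chapters 3 through 7 would be expected to equal EXPECTED.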

The <div type=chapter> elements have the n attribute, giving the chapter number, and the id attribute, whose value consists of the prefix OPERA followed by a period and the chapter number, e.g. <div type=chapter n=3 id=OPERA.3>. Deeper levels of the structure are identified in the same way, by successive numbers appended hierarchically to the book identifier OPERA.
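
The identifier scheme can be sketched as follows. The chapter-level form is taken from the example above; the use of a period between the numbers at deeper levels is an assumption, and both function names are hypothetical.

```python
def div_id(*numbers: int, book: str = "OPERA") -> str:
    """Build an id from the book identifier and a path of numbers."""
    # Assumption: deeper levels are also joined with periods.
    return ".".join([book, *map(str, numbers)])


def chapter_tag(n: int) -> str:
    """Build the opening tag of chapter n in the form quoted above."""
    return f"<div type=chapter n={n} id={div_id(n)}>"


print(chapter_tag(3))
print(div_id(3, 2, 1))
```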

At the lowest structural level, the text is composed of paragraphs containing subparagraph-level tags such as <q>, <name>, <num>, <abbr>, <foreign> etc.

Rendering information is given as the CES-conformant two-letter value of the rend attribute. Every dash, except those within the values of other tags, is represented by the SGML entity mdash.
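
One way to apply such a dash convention is sketched below. It is not the procedure actually used for the corpus: it assumes that a dash is written as "--" in the input, that "within the values of other tags" means inside the angle brackets of a tag, and that no attribute value contains ">". The function name and the sample tag are made up.

```python
import re

# A tag is anything between "<" and the next ">".
TAG = re.compile(r"(<[^>]*>)")


def dashes_to_mdash(text: str, dash: str = "--") -> str:
    """Replace dashes with &mdash; in running text, leaving tags untouched."""
    parts = TAG.split(text)
    # With one capturing group, re.split returns the text segments at
    # even indices and the tags at odd indices.
    for i in range(0, len(parts), 2):
        parts[i] = parts[i].replace(dash, "&mdash;")
    return "".join(parts)


print(dashes_to_mdash('<hi rend="x--y">OPERA -- pruvodce</hi>'))
```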

Example from the corpus:

<div type=chapter n=3 id=OPERA>
<body lang=cs id=OPERA>
<div type=chapter n=1 id=OPERA.1>

<hi rend=ca>&Uacute;VOD</hi>

Obliba opern&iacute;ho &uacute;tvaru byla u n&aacute;s oded&aacute;vna
mimo&rcaron;&aacute;dn&aacute;, ale v
posledn&iacute; dob&ecaron; se na&scaron;e obecenstvo s operou
s&zcaron;&iacute;v&aacute; st&aacute;le v&iacute;ce. Stala
se nejsch&uring;dn&ecaron;j&scaron;&iacute; cestou k proniknut&iacute; do
sv&ecaron;ta hudebn&iacute; kr&aacute;sy. Jsou
v&scaron;ak &uacute;skal&iacute;, kter&aacute;
za&ccaron;&iacute;naj&iacute;c&iacute;mu z&aacute;jemci
p&uring;sob&iacute; nesn&aacute;ze. Pat&rcaron;&iacute; k
nim slo&zcaron;itost opern&iacute;ho &uacute;tvaru,
obt&iacute;&zcaron;n&aacute; zejm&eacute;na p&rcaron;i
opery poprv&eacute; vid&ecaron;n&eacute; &ccaron;i dokonce, z rozhlasu jen
sly&scaron;en&eacute;. Oded&aacute;vna
proto vznikaly pom&uring;cky k snadn&ecaron;j&scaron;&iacute;mu
pochopen&iacute; pr&uring;b&ecaron;hu d&ecaron;je i
hudby opern&iacute;ho d&iacute;la.

Prvn&iacute; &ccaron;eskou informa&ccaron;n&iacute; knihou byla
<q rend="PRE lsquo POST rsquo">
<name>&Ccaron;esk&aacute; zp&ecaron;vohra brat&rcaron;&iacute;
vydan&aacute; roku
<name type=org>Grossmanna a Svobody</name>
<name type=place>Praze</name>. V abecedn&iacute;m se&rcaron;azen&iacute;
uv&aacute;d&ecaron;la stati o &ccaron;esk&yacute;ch oper&aacute;ch od
<q rend="PRE lsquo POST rsquo">
a&zcaron; po d&iacute;la proveden&aacute; do roku
Ov&scaron;em jejich kn&iacute;&zcaron;ka brzy nesta&ccaron;ila. Roku
<name type=person>A. Tvrdek</name> knihu, kterou pod n&aacute;zvem
<q rend="PRE lsquo POST rsquo">
<name>Anthologie z oper</name>
</q> vydal
<name type=person>Emil &Scaron;olce</name> v
<name type=place>Tel&ccaron;i</name>.
V n&iacute; jsou obsa&zcaron;ena u&zcaron; vedle z&aacute;kladn&iacute;ch
&ccaron;esk&yacute;ch oper i ta ciz&iacute;
d&iacute;la, kter&aacute; se tehdy hr&aacute;la na &ccaron;esk&yacute;ch
jevi&scaron;t&iacute;ch. Dal&scaron;&iacute; publikac&iacute;
tohoto typu pak byl
<q rend="PRE lsquo POST rsquo">
<name>Sv&ecaron;t v ope&rcaron;e</name>
<name type=person>J. Branbergem</name> a
<name type=person>Z. M&uuml;nzerovou</name>.
Podruh&eacute; titul vy&scaron;el roku
a pot&rcaron;et&iacute;, roz&scaron;&iacute;&rcaron;en i o &rcaron;adu
balet&uring; na v&iacute;ce ne&zcaron;
<num>600</num> v&yacute;klad&uring; d&ecaron;j&uring;, roku
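
To read a sample such as the one above, the character entities can be converted to Unicode. The sketch below relies on Python's html.unescape, on the assumption that the entity names used in the corpus (rcaron, ecaron, uring etc.) coincide with the HTML5 named character references, which holds for the names seen in the sample; the function name is hypothetical.

```python
from html import unescape


def decode_entities(text: str) -> str:
    """Replace named character entities with Unicode characters."""
    # html.unescape resolves HTML5 named references and leaves tags as-is.
    return unescape(text)


print(decode_entities("<hi rend=ca>&Uacute;VOD</hi>"))
print(decode_entities("mimo&rcaron;&aacute;dn&aacute;"))
```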

Structure of the Original

The original was technically organised by printing columns (approx. 400 columns); these were omitted in our subsequent markup. The logical structure of the book was meticulously respected. Paragraph boundaries were marked in the original, and the rendition information in the original was preserved in the markup.

Markup Process

The electronic version was at our disposal. It was marked up semiautomatically and in great detail: the corpus contains many proper nouns, all of which have been marked with the <name> tag. Possessive adjectives derived from proper nouns have not been marked with the <name> tag. Some spelling errors were also corrected during markup. Finally, the whole corpus was validated with the nsgmls parser.

