

Romanian

COP project 106 MULTEXT-East Deliverable D2.1 F Fiction, Romanian

Contributors: Dan Tufis and Stefan Bruda (RACAI), Lidia Diaconu, Calin Diaconu (ICI)

Description of the Corpus

The Romanian MULTEXT-East fiction corpus consists of three novels by Mihai Radulescu: ``Flacari sub cruce'', ``Obreja'' and ``Testament între înger si diavol''. The first two novels were published by the ``Ramida'' Publishing House and the last one by the ``PIKA'' Publishing House. The sources also included introductions by the author or by H.H. Teodosie Snagoveanu.

The digital sources used as the basis of the encoding were provided by the ``Ramida'' Publishing House and the ``PIKA'' Publishing House, with the author's written permission.

The Romanian site obtained a license agreement allowing the use of the three novels for the purposes of the MULTEXT-East project, signed by the author, Mihai Radulescu.

As computed by the Unix program wc, the Romanian fiction corpus has 160405 words (39389, 39521 and 81495 for the individual novels).
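The figures above can be cross-checked with a few lines of Python. This is only a sketch: the function count_words is a hypothetical helper that approximates `wc -w` by counting whitespace-delimited tokens, and the text does not state whether the count was taken over the marked-up files or over plain text.

```python
def count_words(text: str) -> int:
    """Count whitespace-delimited tokens, approximating what `wc -w` reports."""
    return len(text.split())


# Per-novel word counts as quoted in the description of the corpus.
per_novel = [39389, 39521, 81495]

# The corpus total quoted in the text should equal the sum of the parts.
total = sum(per_novel)
print(total)
```

The sum of the three per-novel counts agrees with the stated corpus total of 160405.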

Structure of the Corpus

The corpus body consists of a number of <div type=part> elements, each of which starts with a <head> giving its title. Each part may be divided into <div type=chapter> elements, which also begin with a <head>.

The <div type=chapter> and <div type=part> elements carry the n attribute, giving the chapter or part number, and the id attribute, whose value is prefixed with an identifier of the novel, e.g. <div id="obreja.2.1" type="chapter">.
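As an illustration of the id scheme, the following Python sketch splits an id value into its components. It assumes, based only on the examples "obreja.1" and "obreja.2.1", that the fields are dot-separated and stand for the novel, the part number and an optional chapter number; the names DivId and parse_div_id are hypothetical and not part of the corpus tools.

```python
from typing import NamedTuple, Optional


class DivId(NamedTuple):
    novel: str              # novel prefix, e.g. "obreja"
    part: int               # assumed: second field is the part number
    chapter: Optional[int]  # assumed: third field, if present, is the chapter


def parse_div_id(value: str) -> DivId:
    """Split an id such as "obreja.2.1" into novel, part and chapter."""
    fields = value.split(".")
    if len(fields) not in (2, 3):
        raise ValueError(f"unexpected id format: {value!r}")
    novel = fields[0]
    part = int(fields[1])
    chapter = int(fields[2]) if len(fields) == 3 else None
    return DivId(novel, part, chapter)


print(parse_div_id("obreja.2.1"))
print(parse_div_id("obreja.1"))
```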

The text is segmented into paragraphs, with the <head>, <quote> and <poem> elements marked up at the paragraph level.

Sub-paragraph tagging consists of <hi> and <q>. Direct speech has been marked up with <q> even where there is no typographical marking to that effect in the printed text.

Rendering information, given as the CES-conformant two-letter value of the rend attribute, has been included with the appropriate tags and, for mdash and capitalisation, retained in the tag content.

The tag usage for the three novels is the following:

``Flacari sub cruce''

      <tagsdecl>
        <tagusage gi=body occurs=1></tagusage>
        <tagusage gi=div occurs=18></tagusage>
        <tagusage gi=head occurs=20></tagusage>
        <tagusage gi=hi occurs=110></tagusage>
        <tagusage gi=l occurs=47></tagusage>
        <tagusage gi=p occurs=817></tagusage>
        <tagusage gi=poem occurs=5></tagusage>
        <tagusage gi=q occurs=564></tagusage>
        <tagusage gi=text occurs=1></tagusage>
      </tagsdecl>

``Obreja''

      <tagsdecl>
        <tagusage gi=body occurs=1></tagusage>
        <tagusage gi=div occurs=17></tagusage>
        <tagusage gi=head occurs=32></tagusage>
        <tagusage gi=hi occurs=368></tagusage>
        <tagusage gi=l occurs=4></tagusage>
        <tagusage gi=p occurs=563></tagusage>
        <tagusage gi=poem occurs=1></tagusage>
        <tagusage gi=q occurs=821></tagusage>
        <tagusage gi=text occurs=1></tagusage>
        <tagusage gi=foreign occurs=1></tagusage>
      </tagsdecl>

``Testament între înger si diavol''

      <tagsdecl>
        <tagusage gi=body occurs=1></tagusage>
        <tagusage gi=div occurs=45></tagusage>
        <tagusage gi=head occurs=84></tagusage>
        <tagusage gi=hi occurs=338></tagusage>
        <tagusage gi=l occurs=171></tagusage>
        <tagusage gi=p occurs=1464></tagusage>
        <tagusage gi=poem occurs=12></tagusage>
        <tagusage gi=q occurs=808></tagusage>
        <tagusage gi=quote occurs=341></tagusage>
        <tagusage gi=text occurs=1></tagusage>
      </tagsdecl>
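Tag usage declarations of this kind can be derived by counting start tags. The Python sketch below shows one way to do so with a regular expression; it is not the procedure used by the project, the names tag_usage and format_tagsdecl are hypothetical, and the pattern assumes that no attribute value contains a ">" character.

```python
import re
from collections import Counter

# Matches a start tag and captures its element name; end tags ("</p>") are
# skipped because "/" is not a letter.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def tag_usage(sgml: str) -> Counter:
    """Count start tags per element name, case-insensitively."""
    return Counter(name.lower() for name in START_TAG.findall(sgml))


def format_tagsdecl(counts: Counter) -> str:
    """Render the counts in the <tagsdecl> layout used above."""
    lines = ["<tagsdecl>"]
    for gi in sorted(counts):
        lines.append(f"  <tagusage gi={gi} occurs={counts[gi]}></tagusage>")
    lines.append("</tagsdecl>")
    return "\n".join(lines)


# A tiny made-up fragment in the style of the corpus markup.
sample = (
    '<div id="obreja.1" type="part"><head> A </head>'
    "<P>x <q rend=dblq> y </q></P><P>z</P></div>"
)
counts = tag_usage(sample)
print(format_tagsdecl(counts))
```

Element names are lower-cased before counting, so the <P> tags in the sample are reported under gi=p.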

Example from the corpus:

<div id="obreja.1" type="part">
 <head> AUZI-M&Abreve;, DOAMNE! </head>
 <head> -Cuv&acirc;nt c&abreve;tre cititor- </head>

<P>
Una dintre &icirc;ntreb&abreve;rile sf&acirc;&scedil;ietoare ce apar adesea
pe buzele
neferici&tcedil;ilor care trec prin &icirc;ncerc&abreve;ri mult prea grele
pentru puterile
lor este: <q rend=dblq> De ce m-a p&abreve;r&abreve;sit Dumnezeu? </q>
&Icirc;nso&tcedil;it&abreve; de strig&abreve;te nedumerite &scedil;i disperate,
speran&tcedil;a &icirc;n interven&tcedil;ia divin&abreve;
r&abreve;m&acirc;ne cu
at&acirc;t mai mare: <q rend=dblq> Unde e&scedil;ti, Doamne?! </q> Sau,
relu&acirc;nd
cuvintele psalmistului: <q rend=dblq> Auzi-m&abreve;, Doamne! </q> </P>
<P>
Omul are nevoie permanent&abreve; de P&abreve;rintele s&abreve;u. &Icirc;i
este greu
s&abreve; &icirc;n&tcedil;eleag&abreve; &icirc;nt&acirc;rzierea ajutorului
dumnezeiesc
l&abreve;murit, ori c&abreve; el se manifest&abreve; nev&abreve;zut &scedil;i
nerecunoscut &icirc;ntru &icirc;nt&abreve;rirea puterilor d&abreve;ruite celui
&icirc;ndurerat , pentru ca acesta s&abreve;-&scedil;i poat&abreve;
r&abreve;bda chinul. </P>
<P>
Una dintre misiunile Teologiei este s&abreve; fac&abreve; de
&icirc;n&tcedil;eles
Divinitatea. Alta - mult mai apropiat&abreve; de orizonturile &icirc;nguste ale
f&abreve;pturilor umane ce suntem &scedil;i de aceea mai important&abreve;
pentru
biata noastr&abreve; neputin&tcedil;&abreve; -, alt&abreve; misiune a
Teologiei este
s&abreve;-i explice suferitorului de ce este l&abreve;sat de Dumnezeu s&abreve;
se chinuiasc&abreve;, s&abreve;-i explice lui Iov de ce i s-a
&icirc;ng&abreve;duit
satanei s&abreve;-l pun&abreve; la &icirc;ncercare. </P>
<P>
.
.
.
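The sample encodes Romanian diacritics as SGML character entities such as &abreve; and &scedil;. To read it as plain text, the entities can be resolved with Python's standard html.unescape, as sketched below. This assumes that the HTML5 named character references give the intended characters (for example, &scedil; is mapped to s with cedilla, U+015F); the helper name decode_entities is hypothetical.

```python
import html


def decode_entities(text: str) -> str:
    """Replace named character entities with the corresponding characters.

    Markup such as <q rend=dblq> is left untouched; only entity references
    like &abreve; are resolved.
    """
    return html.unescape(text)


line = "<q rend=dblq> Auzi-m&abreve;, Doamne! </q>"
decoded = decode_entities(line)
print(decoded)
```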

Structure of the Original

The original versions of the three novels, which were the basis for the encoding, were ASCII files exported from an uncommon text editor (ChiWriter) with a relatively transparent encoding: paragraph boundaries were marked, and a few useful formatting codes were included.

Markup Process

Due to the lack of printed versions, no highlighting markup has been provided, except for the "rend=dblq" marking found in the text. The text also contained a number of typographical errors, which were also present in the printed version. We corrected these errors in the fiction corpus.

