Next: Slovene Up: Multilingual Parallel: Orwell's Previous: Hungarian

Romanian

  COP project 106 MULTEXT-East ``1984'', Romanian

Contributors: Dan Tufis and Stefan Bruda (RACAI), Lidia Diaconu, Calin Diaconu (ICI)

Description of the Corpus

The Romanian version of ``1984'' was typed in from the printed book published by the ``Univers'' publishing house, in the translation by Mihnea Gafitta. The copyright status is still unresolved: the ``Univers'' publishing house held a limited copyright, which expired at the beginning of this year. The typing introduced many errors. Proofreading, done by a person other than the typist, eliminated many of them. During dictionary construction several further errors showed up and were corrected.

As computed by the Unix program wc over the whole CES-1 document, the Romanian version of ``1984'' has 104580 words.

Structure of the Corpus

The Romanian ``1984'' corpus body consists of three <div type=part> elements and one <div type=appendix> element. Each part is further subdivided into a number of <div type=chapter> elements. In the Romanian version, each <div> opens with a <head> giving the part or chapter number.

The <div> elements have the n attribute, giving the sequence number of the <div> at its level, and the id attribute, whose value consists of the prefix ro1984 followed by the part and chapter numbers separated by periods, e.g. <div type=chapter n=1 id=ro1984.1.1>.
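In the sample given later in this section, the first part carries id="ro1984.1" and its first chapter id="ro1984.1.1". The sketch below builds identifiers of that shape; div_id is a hypothetical helper, and the scheme used for the appendix is not stated here, so it is not covered.

```python
def div_id(*numbers: int, prefix: str = "ro1984") -> str:
    """Join a prefix and a sequence of level numbers with periods.

    div_id(1)    -> id of part 1
    div_id(1, 1) -> id of chapter 1 in part 1
    """
    return ".".join([prefix, *(str(n) for n in numbers)])


print(div_id(1))     # ro1984.1
print(div_id(1, 1))  # ro1984.1.1
```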

The text is segmented into paragraphs, with the <head>, <quote>, <note>, and <poem> elements marked up at the paragraph level.

Sub-paragraph tagging consists of <hi>, <q>, and <foreign>.

Rendering information, given as the CES-conformant two-letter value of the rend attribute, has in most cases been included with the appropriate tags, except for the default preceding mdash of the <q> tag.

In our markup, the mdash has been replaced by single quotes, i.e. rend="PRE lsquo POST rsquo" (preceded by a left single quote, followed by a right single quote).

The tag usage for the ``1984'' corpus is shown below.

     <tagsdecl>
        <tagusage gi=body occurs=1></tagusage>
        <tagusage gi=div occurs=28></tagusage>
        <tagusage gi=head occurs=28></tagusage>
        <tagusage gi=hi occurs=410></tagusage>
        <tagusage gi=l occurs=30></tagusage>
        <tagusage gi=note occurs=3></tagusage>
        <tagusage gi=p occurs=1278></tagusage>
        <tagusage gi=poem occurs=6></tagusage>
        <tagusage gi=q occurs=996></tagusage>
        <tagusage gi=quote occurs=186></tagusage>
        <tagusage gi=text occurs=1></tagusage>
        <tagusage gi=foreign occurs=421></tagusage>
     </tagsdecl>
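The declaration above can be read mechanically, for instance to check that the number of <div> elements equals the number of <head> elements, as the structure description implies. The following is a minimal sketch using a regular expression; tag_counts is a hypothetical helper, and the pattern assumes the unquoted attribute style shown in the declaration.

```python
import re

TAGSDECL = """
<tagsdecl>
   <tagusage gi=body occurs=1></tagusage>
   <tagusage gi=div occurs=28></tagusage>
   <tagusage gi=head occurs=28></tagusage>
   <tagusage gi=hi occurs=410></tagusage>
   <tagusage gi=l occurs=30></tagusage>
   <tagusage gi=note occurs=3></tagusage>
   <tagusage gi=p occurs=1278></tagusage>
   <tagusage gi=poem occurs=6></tagusage>
   <tagusage gi=q occurs=996></tagusage>
   <tagusage gi=quote occurs=186></tagusage>
   <tagusage gi=text occurs=1></tagusage>
   <tagusage gi=foreign occurs=421></tagusage>
</tagsdecl>
"""

# Matches one <tagusage> start tag with unquoted gi and occurs values.
TAGUSAGE = re.compile(r"<tagusage\s+gi=(\w+)\s+occurs=(\d+)>")


def tag_counts(decl: str) -> dict[str, int]:
    """Map each element name (gi) to its declared number of occurrences."""
    return {gi: int(occurs) for gi, occurs in TAGUSAGE.findall(decl)}


counts = tag_counts(TAGSDECL)
print(counts["div"] == counts["head"])  # one <head> per <div>
print(counts["p"])                      # number of paragraphs
```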

The following is an example from the Romanian ``1984'' corpus:

<text>
  <body lang="ro" id="ro1984">
    <div type=part n=1 id="ro1984.1">
      <head>
        <hi rend="CA">PARTEA &Icirc;NT&Acirc;I</hi></head>
    <div type=chapter n=1 id="ro1984.1.1">
      <head> 1 </head>

<p>&Icirc;ntr-o zi senin&abreve; &scedil;i friguroas&abreve; de
aprilie , pe c&acirc;nd ceasurile b&abreve;teau ora treisprezece ,
Winston Smith , cu b&abreve;rbia &icirc;nfundat&abreve; &icirc;n piept
pentru a sc&abreve;pa de v&acirc;ntul care-l lua pe sus , se
strecur&abreve; iute prin u&scedil;ile de sticl&abreve; ale Blocului
Victoria , de&scedil;i nu destul de repede pentru a &icirc;mpiedica un
v&acirc;rtej de praf &scedil;i nisip s&abreve; p&abreve;trund&abreve;
o dat&abreve; cu el. Holul blocului mirosea a varz&abreve;
c&abreve;lit&abreve; &scedil;i a pre&scedil;uri vechi. La unul din
capete se afla un afi&scedil; mult prea mare pentru interior , care
&icirc;nf&abreve;&tcedil;i&scedil;a figura enorm&abreve; , lat&abreve;
de peste un metru , a unui b&abreve;rbat &icirc;n jur de patruzeci
&scedil;i cinci de ani , cu o musta&tcedil;&abreve; neagr&abreve;
&scedil;i stufoas&abreve; , &scedil;i cu tr&abreve;s&abreve;turi
frumoase dar dure. Winston se &icirc;ndrept&abreve; c&abreve;tre
sc&abreve;ri. Nu avea nici un rost s&abreve; &icirc;ncerce la lift.
Chiar &scedil;i &icirc;n vremurile cele mai bune func&tcedil;iona doar
din c&acirc;nd &icirc;n c&acirc;nd , iar &icirc;n prezent curentul
electric era t&abreve;iat &icirc;n timpul zilei , ca parte
integrant&abreve; a campaniei de economisire organizat&abreve;
&icirc;n preg&abreve;tirea S&abreve;pt&abreve;m&acirc;nii Urii.
Apartamentul lui se g&abreve;sea la etajul &scedil;apte , a&scedil;a
&icirc;nc&acirc;t Winston , care avea treizeci &scedil;i nou&abreve;
de ani &scedil;i o ulcera&tcedil;ie varicoas&abreve; deasupra gleznei
drepte , o lu&abreve; pe jos , &icirc;ncet , oprindu-se de mai multe
ori s&abreve; se odihneasc&abreve;. Pe fiecare palier ,
a&scedil;ezat&abreve; fa&tcedil;&abreve; &icirc;n fa&tcedil;&abreve;
cu u&scedil;a liftului , figura cea enorm&abreve; &icirc;l privea fix
de perete. Era una din acele poze &icirc;n a&scedil;a fel realizate ,
&icirc;nc&acirc;t ochii te urm&abreve;resc din orice unghi. Textul de
dedesubt suna: </p> <quote rend="CA" type="slogan">FRATELE CEL MARE
ESTE CU OCHII PE TINE </quote>

.
.
     </div>
.
.
.
     </div>
  </body>
</text>
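The sample encodes Romanian characters as SGML entity references such as &abreve;, &scedil; and &tcedil;. The entity names used in the sample also exist as HTML5 named character references, so, as an assumption that holds for these names but is not guaranteed for every entity in the corpus, Python's standard html.unescape can turn them into Unicode for reading. decode_entities is a hypothetical helper name.

```python
import html


def decode_entities(sgml_text: str) -> str:
    """Replace named character references (e.g. &abreve;) with Unicode.

    Tags are left untouched; only entity references are decoded.
    """
    return html.unescape(sgml_text)


sample = "&Icirc;ntr-o zi senin&abreve; &scedil;i friguroas&abreve; de aprilie"
print(decode_entities(sample))
```

Note that &scedil; and &tcedil; decode to the cedilla forms (U+015F, U+0163), not to the comma-below letters (U+0219, U+021B) used in current Romanian orthography; any further normalization is a separate step.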

Structure of the Original

There was no original electronic version; the book was typed in. Apparently either the Romanian publisher or the translator took some liberty in defining paragraphs, which is why the paragraphs in the Romanian version do not always match those in the English version (ECI edition). However, this does not happen often.

Markup Process

The whole Romanian ``1984'' CES-1 corpus was cross-checked against the printed edition, and the printed edition was used to insert additional markup (e.g. <hi>). In marking the paragraphs we followed the Romanian published version.






Tomaz Erjavec
Sat May 18 20:25:31 MDT 1996