 COP project 106 MULTEXT-East Deliverable D2.1 F ``1984'', Romanian

Contributors: Dan Tufis and Stefan Bruda (RACAI), Lidia Diaconu, Calin Diaconu (ICI)

Description of the Corpus

The Romanian version of ``1984'' was typed in from the printed book published by the ``Univers'' Publishing House, in the translation by Mihnea Gafitta. The copyright status is still unclear. The ``Univers'' Publishing House held a limited copyright, which expired at the beginning of this year. The typing process introduced many errors. Proofreading, done by a person other than the typist, eliminated most of them. During the dictionary construction several further errors surfaced and were corrected.

As computed by the Unix program wc over the whole CES-1 document, the Romanian version of ``1984'' has 104580 words.
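
A minimal sketch of how such a count is obtained, assuming the usual behaviour of wc -w (every whitespace-delimited token counts as one word). The function name wc_words and the sample string are illustrative only and are not part of the corpus tools. Because the count is taken over the whole marked-up document, tags and attributes presumably contribute to the total as well.

```python
def wc_words(text: str) -> int:
    """Count whitespace-delimited tokens, in the manner of `wc -w`."""
    # str.split() with no argument splits on runs of whitespace
    # and discards empty strings, so blank lines add nothing.
    return len(text.split())


# Hypothetical two-line fragment in the style of the corpus markup.
sample = '<p id="Oro.1.2.8">\n<s id="Oro.">Era o structur&abreve;</s>'

# Markup tokens such as `<p` and `id="Oro.1.2.8">` are counted too.
sample_count = wc_words(sample)
```

On this fragment the count is 6, although only three of the tokens carry Romanian text, so the wc figure is best read as an upper bound on the number of running words.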

Structure of the Corpus

The body of the Romanian ``1984'' corpus consists of three <div type=part> elements and one <div type=appendix> element. Each part is further subdivided into a number of <div type=chapter> elements. In the Romanian version, each <div> is followed by a <head> giving the part or chapter number.

The <div> elements carry the n attribute, giving the sequence number of the <div> at its level, and the id attribute, whose value consists of the prefix ro1984 followed by the chapter and section numbers, separated by periods, e.g. <div type=part n=1 id=ro1984.1.1>.
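
The id scheme can be sketched as follows. This is only an illustration under the assumption that an id is the prefix joined to its numeric components by periods, as in the example above. The helper names div_id and div_start_tag are hypothetical and do not come from the project's tools.

```python
def div_id(*numbers: int, prefix: str = "ro1984") -> str:
    """Join the prefix and the numeric components with periods."""
    return ".".join([prefix, *map(str, numbers)])


def div_start_tag(div_type: str, n: int, numbers: tuple) -> str:
    """Build an unquoted SGML start tag in the style used in the corpus."""
    return f"<div type={div_type} n={n} id={div_id(*numbers)}>"


# Reproduces the start tag quoted in the text.
example_tag = div_start_tag("part", 1, (1, 1))
```

Here example_tag is the string <div type=part n=1 id=ro1984.1.1>, matching the example given in the text.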

The text is segmented into paragraphs, with the <head>, <quote>, <note>, and <poem> elements marked up at the paragraph level.

Sub-paragraph tagging consists of <hi>, <q>, <foreign>.

Rendering information, given as the CES-conformant two-letter value of the rend attribute, has in most cases been included with the appropriate tags, except for the default preceding mdash of the <q> tag.

In our markup, the mdash has been replaced by single quotes, i.e. rend="PRE lsquo POST rsquo" (preceded by a left single quote, followed by a right single quote).

The tag usage for the ``1984'' corpus is shown below.

        <tagusage gi=body occurs=1></tagusage>
        <tagusage gi=div occurs=28></tagusage>
        <tagusage gi=head occurs=28></tagusage>
        <tagusage gi=hi occurs=410></tagusage>
        <tagusage gi=l occurs=30></tagusage>
        <tagusage gi=note occurs=3></tagusage>
        <tagusage gi=p occurs=1278></tagusage>
        <tagusage gi=poem occurs=6></tagusage>
        <tagusage gi=q occurs=996></tagusage>
        <tagusage gi=quote occurs=186></tagusage>
        <tagusage gi=text occurs=1></tagusage>
        <tagusage gi=foreign occurs=421></tagusage>
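
A table of this kind can be derived mechanically from the marked-up text. The following is a rough sketch, not the procedure actually used for the corpus: it assumes that every start tag begins with < followed directly by the element name, and that no attribute value contains a literal < character. The names START_TAG, tag_usage and format_tagusage are illustrative.

```python
import re
from collections import Counter

# A start tag is "<" followed directly by a name; end tags ("</...")
# and entity references ("&abreve;") are not matched.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)")


def tag_usage(sgml: str) -> Counter:
    """Count start tags per element name (lower-cased)."""
    return Counter(name.lower() for name in START_TAG.findall(sgml))


def format_tagusage(counts: Counter) -> list:
    """Render the counts as <tagusage> lines, sorted by element name."""
    return [
        f"<tagusage gi={gi} occurs={n}></tagusage>"
        for gi, n in sorted(counts.items())
    ]


# Hypothetical fragment, not taken from the corpus.
fragment = '<p id="x"><s id="y">A <hi rend="IT">b</hi> c.</s> <s id="z">D.</s></p>'
usage_lines = format_tagusage(tag_usage(fragment))
```

For the fragment above this yields one line each for hi and p and a line with occurs=2 for s. A regular expression is only an approximation of SGML parsing; a full parser would be needed for documents that use tag minimization or marked sections.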

The following is an example from the Romanian ``1984'' corpus:

<p id="Oro.1.2.8"> 
<s id="Oro."><name type="org">Ministerul Adev&abreve;rului</name> -
&icirc;n <foreign lang="ns-ro">Nouvorb&abreve;</foreign>,
<foreign lang="ns-ro">Minadev</foreign> - te izbea fiindc&abreve; era
cu totul diferit de oricare alt&abreve; cl&abreve;dire care se
vedea.</s> <s id="Oro.">Era o structur&abreve;
&icirc;nalt&abreve;, imens&abreve;, de beton armat alb,
str&abreve;lucitor, &icirc;n form&abreve; de piramid&abreve;, care se
ridica, teras&abreve; dup&abreve; teras&abreve;, la trei sute de metri
de la p&abreve;m&acirc;nt.</s>
<s id="Oro.">De unde st&abreve;tea <name
type="person">Winston</name> se puteau citi foarte bine cele trei
lozinci ale <name type="org">Partidului</name>, scrise cu litere
elegante pe fa&tcedil;ada cea alb&abreve;:
<q id="Oro." rend="CA CN" type="slogan">R&Abreve;ZBOIUL ESTE PACE</q> 
<q id="Oro." rend="CA CN" type="slogan">LIBERTATEA ESTE SCLAVIE</q> 
<q id="Oro." rend="CA CN" type="slogan">IGNORAN&Tcedil;A ESTE

Structure of the Original

There was no original electronic version; the book was typed in. Apparently, either the Romanian publisher or the translator took some liberty in defining paragraphs, which is why the paragraphs in the Romanian version do not always match those in the English version (ECI edition). However, this does not happen often.

Markup Process

The whole Romanian ``1984'' CES1 corpus was cross-checked with the printed edition, and the printed edition was used to insert additional (e.g. <hi>) markup. In marking the paragraphs we followed the Romanian published version.

