 COP project 106 MULTEXT-East Deliverable D2.1 F Newspapers, Romanian

Contributors: Dan Tufis and Stefan Bruda (RACAI), Lidia Diaconu, Calin Diaconu (ICI)

Description of the Corpus

The Romanian MULTEXT-East newspaper corpus consists of 128 articles from the daily newspaper ``România Libera'', Bucharest, dated 12 Apr. 1995.

The digital source used as the basis of encoding was provided unofficially by one of the collaborators of ``România Libera'' and consisted of WORD files made by the newspaper's editor.

The Romanian site did not obtain a license agreement, only a verbal approval for the use of these articles for the purposes of the MULTEXT-East project.

As computed by the Unix program wc over the whole CES-1 document, the Romanian newspaper corpus has 26448 words.
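
Because wc counts whitespace-separated tokens, a count taken over the whole CES-1 document presumably includes markup tokens as well as running text. The sketch below imitates that kind of count in Python; the function name wc_words and the sample string are illustrative assumptions, not part of the project tooling.

```python
def wc_words(text: str) -> int:
    """Count whitespace-separated tokens, in the manner of `wc -w`."""
    # str.split() with no argument splits on any run of whitespace,
    # so tags and entity references are counted as (parts of) words.
    return len(text.split())


# Hypothetical sample, modelled on the corpus markup.
sample = "<p>Proverbele sunt un tezaur de &icirc;n&tcedil;elepciune.</p>"
print(wc_words(sample))                 # tags glued to words add nothing
print(wc_words("<div type=article>"))   # a tag with an attribute adds two
```

A tag written without internal spaces is absorbed into the neighbouring word, while a tag with attributes contributes extra tokens, so a figure obtained this way is only an approximation of the number of text words.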

Structure of the Corpus

The corpus body consists of 128 <div type=article>, each of which contains one file from the original digital data, in most cases constituting one article, and sometimes a series of articles. The articles are grouped by <div type=page>.

Each <div type=article> begins with one or more <head>. Some articles end with <byline>.

The text is segmented into paragraphs, with no other paragraph-level tagging except the tags discussed above and <sp> for interviews. Sub-paragraph tagging consists of <hi> and <q>.

The tag usage for the newspaper corpus is shown below.

        <tagusage gi=body occurs=1></tagusage>
        <tagusage gi=div occurs=138></tagusage>
        <tagusage gi=head occurs=179></tagusage>
        <tagusage gi=hi occurs=681></tagusage>
        <tagusage gi=p occurs=573></tagusage>
        <tagusage gi=q occurs=76></tagusage>
        <tagusage gi=text occurs=1></tagusage>
        <tagusage gi=byline occurs=77></tagusage>
        <tagusage gi=sp occurs=26></tagusage>
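
A tag usage listing of this kind can be derived by counting start tags per element name. The following is a minimal sketch, assuming the SGML is available as a single string; the regular expression, the function names and the sample fragment are illustrative assumptions, not the tool actually used for the corpus.

```python
import re
from collections import Counter

# Matches a start tag and captures its element name (GI). End tags begin
# with "</", and "/" is not a name-start character, so they are skipped.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)")


def tag_usage(sgml: str) -> Counter:
    """Return the number of start tags found for each element name."""
    return Counter(START_TAG.findall(sgml))


def format_tagusage(counts: Counter) -> list[str]:
    """Render the counts in the <tagusage> notation, sorted by element name."""
    return [
        f"<tagusage gi={gi} occurs={n}></tagusage>"
        for gi, n in sorted(counts.items())
    ]


# Hypothetical fragment using the elements described above.
sample = (
    "<div type=page n=1><div type=article><head>Titlu</head>"
    "<p>Text <q rend=dblq>citat</q> <hi rend=dblq>x</hi></p>"
    "<byline>Autor</byline></div></div>"
)
counts = tag_usage(sample)
for line in format_tagusage(counts):
    print(line)
```

The sketch does not validate the document; it counts only start tags, so elements whose start tags are omitted in the SGML would be missed.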

Example from the corpus:

<div type=page n=1 id="roRL12Apr.1">
<div type=article>
<head>Raport pentru stomacuri</head>
<p>Proverbele sunt un tezaur de &icirc;n&tcedil;elepciune. Nu toate
&icirc;ns&abreve;. Unele
proverbe &scedil;i-au pierdut &icirc;n&tcedil;elesul, dar continu&abreve;
s&abreve; beneficieze
de prestigiul tradi&tcedil;ional al zicerilor populare. Un proverb foarte
circulat spune c&abreve; <q rend=dblq>Prostul nu doarme de grija
altuia</q>. C&acirc;ndva, &icirc;l denun&tcedil;a pe
b&abreve;g&abreve;re&tcedil;, pe omul care nu-&scedil;i vede de treaba lui.
Cel pu&tcedil;in, a&scedil;a cred, fiindc&abreve; numai un prost poate
s&abreve; cread&abreve; c&abreve; grijile - adic&abreve; temerile, fricile,
nefericirile - altora nu ne privesc. C&abreve; singura solu&tcedil;ie,
c&acirc;nd al&tcedil;ii sunt plini de griji, e s&abreve; dormi bine.
Proverbul acesta func&tcedil;ioneaz&abreve; negativ, ca un tezaur de
suficien&tcedil;&abreve;. Foarte multe proverbe sunt reflexul unor timpuri
dominate de spaim&abreve; &scedil;i resemnare. &Scedil;i,
bine&icirc;n&tcedil;eles, de prostie. De grija altora, cei care nu dorm
sunt &icirc;ntotdeauna de&scedil;tep&tcedil;ii. Pu&tcedil;ini, mul&tcedil;i
c&acirc;&tcedil;i avem. Zicala citat&abreve; con&tcedil;ine punctul de
vedere al omului m&abreve;rginit. E izb&acirc;nda vremelnic&abreve; a <hi
rend=dblq>maselor</hi>. O form&abreve; de cinism mitoc&abreve;nesc devenit
folclor. &Icirc;n tramvaiul 34, un n&abreve;t&abreve;r&abreve;u
f&abreve;r&abreve; griji cugeta la intelectualitate. <q
rend=dblq>A&scedil;a cum m&abreve; vede&tcedil;i</q> - zicea el - <q
rend=dblq>nu m&abreve; dau pe zece profesori</q>. A devenit un obicei ca
dasc&abreve;lul s&abreve; fie unitatea de m&abreve;sur&abreve; a
importan&tcedil;ei altor profesiuni. Suntem, b&abreve;nuiesc, singura
&tcedil;ar&abreve; din lume care scoate anecdote pe seama
&icirc;nv&abreve;&tcedil;&abreve;torilor. Dac&abreve; asta se mai
&icirc;nt&acirc;mpl&abreve; &scedil;i-n alte locuri, situa&tcedil;ia e
grav&abreve;. E ne&icirc;ndoielnic c&abreve; glumele proaste despre
intelighen&tcedil;ie nu le nasc dec&acirc;t cei care, ca &scedil;i
g&acirc;nditorul public din tramvai, se socotesc mai de&scedil;tep&tcedil;i
ca oamenii cu studii. Cum s-a ajuns oare aici? Ceva s-a schimbat &icirc;n
Rom&acirc;nia. Un lucru e ca &icirc;nainte: ordinea social&abreve;.
Intelectualitatea vine tot dup&abreve; clasa muncitoare &scedil;i
&tcedil;&abreve;r&abreve;nimea muncitoare. Nimeni, de la putere, nu a avut
curajul s&abreve;
pun&abreve; treburile la punct. Raporturile puterii cu intelectualitatea
poart&abreve; pecetea
unei ostilit&abreve;&tcedil;i mocnite. Un nenorocit de parlamentar
majoritar spunea, referitor
la exodul creierelor rom&acirc;ne&scedil;ti: <q rend=dblq> Cine vrea
s&abreve; plece, e liber
s&abreve; plece. Nu &tcedil;inem pe nimeni cu sila!</q>. Ca &scedil;i
m&abreve;rginitul din
tramvai, suficientul din parlament se sim&tcedil;ea dezlegat s&abreve; se
exprime a&scedil;a
deoarece, &icirc;n cinci ani, nici pre&scedil;edintele, nici
prim-mini&scedil;trii nu au referit
niciodat&abreve; clar &scedil;i programatic la problemele
O guvernare de stomacuri pentru stomacuri, de stomacuri cu somnul gros
&scedil;i ad&acirc;nc.</p>
<byline>Tudor Octavian</byline>
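
The example encodes Romanian diacritics as SGML entity references such as &abreve; and &tcedil;. To read the text, these can be mapped to Unicode characters. The sketch below covers only the entities that occur in the example above; the table, the function name decode_entities and the choice of the cedilla code points (rather than the comma-below letters used in current Romanian orthography) are assumptions made for illustration.

```python
import re

# Entity names seen in the example, mapped to Unicode characters.
# Assumption: cedilla forms are used, following the entity names.
ENTITIES = {
    "abreve": "\u0103",  # a with breve
    "acirc": "\u00e2",   # a with circumflex
    "icirc": "\u00ee",   # i with circumflex
    "Icirc": "\u00ce",   # I with circumflex
    "scedil": "\u015f",  # s with cedilla
    "Scedil": "\u015e",  # S with cedilla
    "tcedil": "\u0163",  # t with cedilla
}

ENTITY_REF = re.compile(r"&([A-Za-z]+);")


def decode_entities(text: str) -> str:
    """Replace known entity references; leave unknown ones untouched."""
    return ENTITY_REF.sub(
        lambda m: ENTITIES.get(m.group(1), m.group(0)), text
    )


print(decode_entities("Proverbele sunt un tezaur de &icirc;n&tcedil;elepciune."))
```

Unknown references are passed through unchanged, so a fuller entity table can be added later without altering the function.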

Structure of the Original

The digital source used as the basis of encoding consisted of WORD files made by the newspaper's editor, generally one file per article.

Markup Process

Because the printed versions were not available, no highlighting markup has been provided, except for the rend=dblq marking found in the text. The text also contained a number of typographical errors, which were also present in the printed version. We corrected these errors in the newspaper corpus.

