COP project 106 MULTEXT-East Newspapers, Slovene
Contributors: Tomaz Erjavec (IJS) and Miro Romih (Amebis d.o.o.)
The Slovene MULTEXT-East newspaper corpus contains 45 articles from the daily newspaper ``Dnevnik'', Ljubljana, from the period August to October 1995. The digital source used as the basis of encoding was provided, via Amebis d.o.o., by Dnevnik, and consisted of data files made by the newspaper's editor, with idiosyncratic markup and embedded comments for the typesetters. From the large selection of articles originally obtained from Dnevnik, only the amount needed by the MULTEXT-East project was kept. Preference went to longer articles that were clearly ``journalistic'' (serial novels, for example, were discarded) and had a well-defined structure.
The Slovene site obtained a license agreement allowing the use of these articles for the purposes of the MULTEXT-East project, signed by the responsible person at Dnevnik.
As computed by the Unix program wc over the whole CES-1 document, the Slovene newspaper corpus has 107104 words.
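Because the count was taken over the whole marked-up document, it presumably includes tags and header text as well as running words. The following minimal sketch approximates what wc counts as a word (a whitespace-separated token); the function name wc_words and the sample string are illustrative only and are not part of the corpus tooling.

```python
def wc_words(text: str) -> int:
    """Count whitespace-separated tokens, roughly as `wc -w` does."""
    return len(text.split())


# Illustrative sample: the two tags are counted as tokens, just like the words.
sample = "<p> Letala nam padajo z neba </p>"
print(wc_words(sample))
```

A count restricted to running text would first have to strip the markup, so the figure above is best read as an upper bound on the number of actual words.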
The corpus body consists of 45 <div type=article> elements, each of which contains one file from the original digital data; in most cases a file constitutes one article, and sometimes a series of articles. In the current version, each article begins with an SGML comment giving the name of the original file. This is followed by an <opener>, which contains a <date> giving the date on which the file was processed. This date is therefore the same as or earlier than the date on which the publication actually appeared.
Each <div type=article> contains one or more <div type=articletext> elements holding the actual text(s) of the article; these are optionally followed by one or more <div type=frame> or <div type=figure> elements, explained below.
The <div type=articletext> usually starts with one or more <head> elements giving the headline(s), followed by one or more <div type=articlepart> elements giving the sections of the article, each of which can have its own <head>.
Articles can also have so-called ``boxes'' or ``frames'', i.e. texts set in a frame in the middle of the article, usually passages a few paragraphs long giving background material to the article. These are collected at the end of the article, each in its own <div type=frame>.
Captions to the pictures accompanying the articles were also included in the corpus. Like frames, they are given in a series at the end of the <div type=article> and have the following structure:
<div type=figure>
  <figure>
    <head>The text of the figure's caption</head>
    <figdesc>Slika izpuscena (i.e. ``picture omitted'')</figdesc>
  </figure>
</div>
The <div> elements have an n attribute, giving the sequence number of the <div> at its level, and an id attribute, whose value consists of the prefix dnv followed by the nested section numbers, separated by periods, e.g. <div type=articlepart n=6 id=dnv.33.1.6>.
The frames and figures are problematic for this numbering and id scheme, as they appear at the same level as the article texts; their ids therefore carry their own markers, fr and fg respectively, after the dnv prefix. So, for example, the first frame of article 8 has the start tag <div type=frame n=1 id=dnv.fr.8.1>.
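The id scheme can be summarised in a few lines of code. This is only a sketch of the convention as described here, not the tool actually used to build the corpus; the function name div_id and its kind parameter are hypothetical.

```python
from typing import Optional


def div_id(*numbers: int, kind: Optional[str] = None) -> str:
    """Build a <div> id: the prefix 'dnv', an optional 'fr'/'fg' marker
    for frames and figures, then the nested section numbers."""
    if kind not in (None, "fr", "fg"):
        raise ValueError("kind must be None, 'fr' or 'fg'")
    parts = ["dnv"]
    if kind is not None:
        parts.append(kind)
    parts.extend(str(n) for n in numbers)
    return ".".join(parts)


print(div_id(33, 1, 6))        # part 6 of text 1 of article 33
print(div_id(8, 1, kind="fr"))  # first frame of article 8
```

The two calls reproduce the ids quoted in the text, dnv.33.1.6 and dnv.fr.8.1.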
Names of document authors are included where they appeared in the original, usually at the end of the articles; they are marked up as <byline> <docauthor> Author </docauthor> </byline>.
The text is segmented into paragraphs, with no other paragraph-level tagging except the tags discussed above.
Sub-paragraph tagging consists of <name> and <q>. These elements were tagged where they could be automatically inferred from control codes in the original.
No rend attributes are included. The quote marks are retained in the <q> data, with a double quote consistently used as the 'top-level' quote and an apostrophe for any embedded quotes; the latter usually mark a ``so-called'' usage, which is very popular in the Slovene press.
Example from the corpus:
<div type=article n=33 id=dnv.33>
<!-- Original file: 01klolet.t -->
<opener> <date>10/8/1995</date> </opener>
<div type=articletext n=1 id=dnv.33.1>
<head>Ali je kaj narobe z varnostjo na"sega zra"cnega prometa</head>
<head>Letala nam padajo z neba</head>
<div type=articlepart n=1 id=dnv.33.1.1>
<head>Letos doslej kar dvanajst nesre"c in vsaj pet incidentov, "zivljenje so izgubili trije ljudje - Vzrokov je ve"c, v"casih tudi lahkomiselnost</head>
<p> Pilot in kopilot zgorela, Zasilni pristanek se ni posre"cil, "crna nedelja v zraku in na tleh, MAG na hrbtu, Pilot in dekle "cude"zno
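Since the attribute values in the corpus are unquoted, a strict XML parser would reject markup of this kind. As a rough illustration of how the <div> hierarchy could be listed anyway, the sketch below uses the tolerant html.parser module from the Python standard library; the class DivOutline, the function outline and the shortened fragment are assumptions made for this example and are not part of the corpus distribution.

```python
from html.parser import HTMLParser


class DivOutline(HTMLParser):
    """Collect (type, n, id) for every <div> start tag encountered."""

    def __init__(self):
        super().__init__()
        self.divs = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            a = dict(attrs)
            self.divs.append((a.get("type"), a.get("n"), a.get("id")))


def outline(sgml: str):
    parser = DivOutline()
    parser.feed(sgml)
    parser.close()
    return parser.divs


# Shortened, explicitly closed fragment modelled on the corpus example.
fragment = (
    "<div type=article n=33 id=dnv.33>"
    "<!-- Original file: 01klolet.t -->"
    "<opener><date>10/8/1995</date></opener>"
    "<div type=articletext n=1 id=dnv.33.1>"
    "<head>Letala nam padajo z neba</head>"
    "<div type=articlepart n=1 id=dnv.33.1.1>"
    "<p>Pilot in kopilot zgorela</p>"
    "</div></div></div>"
)

for div_type, n, div_id in outline(fragment):
    print(div_type, n, div_id)
```

A real SGML parser driven by the CES DTD would be the more faithful choice; the HTML parser is used here only because it tolerates unquoted attribute values.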
The digital source used as the basis of encoding was provided, via Amebis d.o.o., by Dnevnik. It consisted of data files, one file per article, made by the newspaper's editor. Each file starts with header information, followed by the text. The text itself contains idiosyncratic markup and embedded comments for the typesetters.
>beginh< user=Tone status=normal format=dnevnik format=nd creator=Tone,4/8/1995,10:26:6 editor=Tone,4/8/1995,10:26:6 editor=Sonja,4/8/1995,12:24:56 editor=Tone,10/8/1995,15:49:4 abstract=0 -so Ali je kaj narobe z varnostjo našega zra\248nega prometa Letala nam padajo z neba Letos h=904.96dd version=16 >endh< -so >12n<Ali je kaj narobe z varnostjo našega zra\248nega prometa >54a<Letala nam padajo z $ neba >14pn<Letos doslej kar dvanajst nesre\248 in vsaj pet incidentov, \236ivljenje so izgubili trije $ ljudje - Vzrokov je ve\248, v\248asih tudi lahkomiselnost Pilot in kopilot zgorela, Zasilni pristanek se ni posre\248il, \252rna nedelja v zraku in na tleh, MAG na hrbtu, Pilot in dekle \248ude\237no $ pre\236ivela, Trd pristanek lahkega letala v koruzi, Helikopter po nesre\248i pobegnil>tp< To je le nekaj \248asopisnih naslovov zadnjih dveh $
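The header block in the sample above appears to be a series of key=value fields between the >beginh< and >endh< markers, with some keys (format, editor) occurring more than once. The sketch below reads such a header under that assumption; the format is not documented here, so the field syntax, the function name parse_header and the shortened sample are all guesses made for illustration, not a description of the conversion actually performed by Amebis d.o.o.

```python
import re
from collections import defaultdict

# Assumption: the header sits between '>beginh<' and '>endh<'.
HEADER_RE = re.compile(r">beginh<(.*?)>endh<", re.DOTALL)
# Assumption: fields look like key=value and values contain no whitespace.
FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_header(raw: str) -> dict:
    """Return a dict mapping each header key to the list of its values."""
    match = HEADER_RE.search(raw)
    if match is None:
        return {}
    fields = defaultdict(list)
    for key, value in FIELD_RE.findall(match.group(1)):
        fields[key].append(value)  # repeated keys keep every value, in order
    return dict(fields)


# Shortened version of the sample header shown above.
sample = (
    ">beginh< user=Tone status=normal format=dnevnik format=nd "
    "creator=Tone,4/8/1995,10:26:6 editor=Tone,4/8/1995,10:26:6 "
    "editor=Sonja,4/8/1995,12:24:56 version=16 >endh< "
    "-so >12n<Ali je kaj narobe"
)

header = parse_header(sample)
print(header["format"])
print(header["editor"])
```

Free text inside the header, such as the headline words that follow -so in the sample, is simply skipped by this sketch.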
The original digital (and only) source was a set of files made by the newspaper's editor, with idiosyncratic markup and embedded comments for the typesetters. From the large selection of articles originally obtained from Dnevnik, the amount needed by the MULTEXT-East project was selected. After discussions on the desired CES1 structure, the files were automatically converted by Amebis d.o.o. into CES1 marked-up files, and at the same time corrected for typos with a spelling checker. This version was then additionally corrected by hand, and the header was written to produce a standalone CES1 SGML document.