TEI Header

^§file description

^§title statement

^§title

LeMonde Corpus

^§principal researcher

^§name

Špela Vintar (FF)

^§funding body

Slovene ARRS research project J6-2009-0581 "Slovene translation studies - resources and research"

^§statement of responsibility

^§name	Adriana Mezeg
^§responsibility	Acquisition of digital source, OCR correction and text alignment.

^§statement of responsibility

^§name id = ET	Tomaž Erjavec (IJS)
^§responsibility	Linguistic annotation, TEI encoding.

^§edition statement

^§edition

V1.0

^§extent

300 bi-texts, 1144 thousand words

^§publication statement

^§date	2012-05-15
^§publication place	nl.ijs.si/spook lojze.lugos.si/spook
^§availability	The corpus is available via concordancers at nl.ijs.si.

^§source description

^§bibliographic citation

^§title

Le Monde newspaper

^§encoding description

^§project description

SPOOK project: “Slovene translation studies: resources and research”.

^§editorial practice declaration

^§normalization	OCR mistakes in the text were manually corrected.
^§segmentation	The texts were manually segmented into translation units, encoded as "anonymous blocks", and then automatically segmented into sentences, words, punctuation marks and whitespace.
^§interpretation	The text has been automatically tokenised, part-of-speech tagged and lemmatised. For Slovene, the ToTrTaLe tool was used, while English was processed with TreeTagger using the Penn Treebank model. Two tags are given for each word. For Slovene, @ctag gives the reduced SPOOK tag, while @ana gives the complete JOS morphosyntactic tag. For English, @ctag gives the original TreeTagger (Penn) PoS tag, while @ana gives its mapping to its equivalent SPOOK tag.
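
The segmentation and annotation described above could be read back from the corpus roughly as follows. This is a minimal sketch, not the project's own tooling. It assumes that translation units are <ab> elements containing <s>, <w>, <pc> and <c> elements, that each <w> carries @lemma, @ctag and @ana, and that the fragment has no XML namespace. The sample fragment, the function name read_tokens and every attribute value in it are hypothetical placeholders, not data from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the element layout is an assumption based on the
# segmentation note, and all word forms and tag values are placeholders.
SAMPLE = """
<ab>
  <s>
    <w lemma="LEMMA1" ctag="CTAG1" ana="ANA1">word1</w>
    <c> </c>
    <w lemma="LEMMA2" ctag="CTAG2" ana="ANA2">word2</w>
    <pc>.</pc>
  </s>
</ab>
"""


def read_tokens(xml_text):
    """Return (form, ctag, ana) for every <w> element in the fragment."""
    root = ET.fromstring(xml_text)
    # Only <w> elements are collected; <pc> (punctuation) and
    # <c> (whitespace) are skipped.
    return [(w.text, w.get("ctag"), w.get("ana")) for w in root.iter("w")]


tokens = read_tokens(SAMPLE)
for form, ctag, ana in tokens:
    print(form, ctag, ana)
```

In a real corpus file the elements would most likely sit in the TEI namespace, so the tag name passed to iter() would need the namespace prefix.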

^§text-profile description

^§text classification

^§keywords
scheme = local

^§term

non-fiction

^§language usage

^§language
ident = sl

^§term

Slovene

^§language
ident = fr

^§term

French

^§revision description

^§change

Tomaž Erjavec: Conversion to TEI P5.

^§date

2012-05-15

^§change

Adriana Mezeg: OCR correction and text alignment.

^§date

2009-12-01