TEI Header

§file description
§title statement
§title
LeMonde Corpus
§principal researcher
§name Špela Vintar (FF)
§funding body
Slovene ARRS research project J6-2009-0581 "Slovene translation studies - resources and research"
§statement of responsibility
§name Adriana Mezeg
§responsibility
Acquisition of digital source, OCR correction and text alignment.
§statement of responsibility
§name
id = ET
Tomaž Erjavec (IJS)
§responsibility
Linguistic annotation, TEI encoding.
§edition statement
§edition V1.0
§extent 300 bi-texts, 1144 thousand words
§publication statement
§date 2012-05-15
§publication place nl.ijs.si/spook lojze.lugos.si/spook
§availability

The corpus is available via concordancers at nl.ijs.si.

§source description
§bibliographic citation
§title
Le Monde newspaper
§encoding description
§project description

SPOOK project: “Slovene translation studies: resources and research”.

§editorial practice declaration
§normalization

OCR mistakes in the text were manually corrected.

§segmentation

The texts are manually segmented into translation units, encoded as "anonymous blocks", and then automatically segmented into sentences, words, punctuation marks and whitespace.
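
The nesting described above can be illustrated with a minimal sketch. It assumes that translation units are TEI `ab` elements containing `s` (sentence) elements, which in turn contain `w` (word) and `pc` (punctuation) elements. The use of `pc`, the `xml:id` values and the sample content are hypothetical and may differ from the actual corpus encoding.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment: element choice (especially <pc>), ids and
# content are illustrative assumptions, not taken from the corpus.
SAMPLE = """
<body xmlns="http://www.tei-c.org/ns/1.0">
  <ab xml:id="tu.1">
    <s><w>A</w> <w>b</w><pc>.</pc></s>
    <s><w>C</w><pc>!</pc></s>
  </ab>
  <ab xml:id="tu.2">
    <s><w>D</w><pc>.</pc></s>
  </ab>
</body>
"""


def count_units(xml_text):
    """Count translation units, sentences, words and punctuation marks."""
    root = ET.fromstring(xml_text.strip())

    def count(name):
        # Elements are matched by their namespace-qualified name.
        return sum(1 for _ in root.iter(f"{{{TEI_NS}}}{name}"))

    return {
        "translation_units": count("ab"),
        "sentences": count("s"),
        "words": count("w"),
        "punctuation": count("pc"),
    }


if __name__ == "__main__":
    print(count_units(SAMPLE))
```

Counting by element name keeps the sketch independent of how deeply the units are nested inside each text.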

§interpretation

The text has been automatically tokenised, part-of-speech tagged and lemmatised. For Slovene, the ToTrTaLe tool was used, while English was processed with TreeTagger using the Penn Treebank model. Two tags are given for each word. For Slovene, @ctag gives the reduced SPOOK tag, while @ana gives the complete JOS morphosyntactic tag. For English, @ctag gives the original TreeTagger (Penn) PoS tag, while @ana gives its mapping to the equivalent SPOOK tag.
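
As a rough illustration of the two-tag scheme, the sketch below reads @ctag and @ana from word elements. It assumes words are encoded as TEI `w` elements that carry these attributes. The `lemma` attribute and all attribute values are hypothetical placeholders, not real SPOOK, JOS or TreeTagger tags.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment: the attribute values are placeholders and the
# @lemma attribute is an assumption about how lemmas are stored.
SAMPLE = """
<s xmlns="http://www.tei-c.org/ns/1.0">
  <w lemma="LEMMA-1" ctag="CTAG-1" ana="ANA-1">word1</w>
  <w lemma="LEMMA-2" ctag="CTAG-2" ana="ANA-2">word2</w>
</s>
"""


def read_tags(xml_text):
    """Return (form, lemma, ctag, ana) for every word element."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for w in root.iter(f"{{{TEI_NS}}}w"):
        # .get() returns None when an attribute is absent.
        rows.append((w.text, w.get("lemma"), w.get("ctag"), w.get("ana")))
    return rows


if __name__ == "__main__":
    for form, lemma, ctag, ana in read_tags(SAMPLE):
        print(form, lemma, ctag, ana, sep="\t")
```

Returning both tags side by side makes it easy to check how the tags in @ctag correspond to those in @ana.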

§text-profile description
§text classification
§keywords
scheme = local
§term
non-fiction
§language usage
§language
ident = sl
§term
Slovene
§language
ident = fr
§term
French
§revision description
§change Tomaž Erjavec: Conversion to TEI P5.
§date 2012-05-15
§change Adriana Mezeg: OCR correction and text alignment.
§date 2009-12-01