TEI Header

§file description
§title statement
§title
TRANS5 English-Slovene parallel corpus
§principal researcher
§name Tomaž Erjavec
§address
Department of Knowledge Technologies
Jožef Stefan Institute
Jamova cesta 39
SI-1000 Ljubljana
Slovenia
§principal researcher
§name Špela Vintar
§address
Dept. of Translation
Faculty of Arts
University of Ljubljana
Aškerčeva 2
SI-1000 Ljubljana
Slovenia
§edition statement
§edition 0.1
§extent 104 bi-texts, 2.75 million words
§publication statement
§distributor
§address
Department of Knowledge Technologies
Jožef Stefan Institute
Jamova cesta 39
SI-1000 Ljubljana
Slovenia
§publication place http://nl.ijs.si/trans/
§availability

This corpus is freely accessible via a concordancer. To download the corpus, please contact one of the principal researchers and explain the intended use.

§date 2012-11-01
§source description
§citation list
http://nl.ijs.si/elan/
http://nl.ijs.si/~spela/trans-index.html
http://langtech.jrc.ec.europa.eu/ECDC-TM.html
§bibliographic citation
§title
IJS-ELAN Slovene-English Parallel Corpus
§publisher IJS
§date 2003
§bibliographic citation
§title
TRANS Slovene-English Parallel Corpus
§publisher IJS
§date 2005
§bibliographic citation
§title
JRC ECDC Translation Memory: English-Slovene pair
§publisher JRC
§date 2012
§encoding description
§project description

The purpose of the corpus is to provide as large a manually sentence-aligned, linguistically annotated parallel corpus as possible. The corpus can serve as a source of word and phrase translations, or as a training and testing set for the development of multilingual language technologies.

§editorial practice declaration
§segmentation

The texts are manually segmented into translation units, encoded as "anonymous blocks", and then automatically segmented into sentences, words, punctuation marks and whitespace.
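
The nesting described above can be illustrated with a minimal Python sketch. It assumes the conventional TEI element names ab (anonymous block), s (sentence), w (word), pc (punctuation) and c (whitespace); this header does not name the elements, so they are assumptions, and the sample block, its identifier and the function name summarise_block are invented for illustration. Real corpus files are likely to declare the TEI namespace, which the sketch omits.

```python
import xml.etree.ElementTree as ET

# Hypothetical translation unit; not taken from the corpus.
SAMPLE_AB = (
    '<ab xml:id="tu.1" xml:lang="en">'
    "<s><w>Hello</w><pc>,</pc><c> </c><w>world</w><pc>.</pc></s>"
    "<c> </c>"
    "<s><w>Bye</w><pc>.</pc></s>"
    "</ab>"
)

TOKEN_TAGS = ("w", "pc", "c")  # assumed element names for tokens


def summarise_block(ab_xml):
    """Return the sentence strings and per-element counts of one block."""
    ab = ET.fromstring(ab_xml)
    sentences = []
    for s in ab.iter("s"):
        # Concatenate the text of the token children in document order.
        sentences.append(
            "".join(tok.text or "" for tok in s if tok.tag in TOKEN_TAGS)
        )
    counts = {tag: sum(1 for _ in ab.iter(tag)) for tag in ("s",) + TOKEN_TAGS}
    return sentences, counts


sentences, counts = summarise_block(SAMPLE_AB)
print(sentences)
print(counts)
```

Because whitespace is encoded explicitly, the original sentence text can be rebuilt by plain concatenation, without guessing where spaces belong.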

§interpretation

The text has been automatically tokenised, part-of-speech tagged and lemmatised. For Slovene, the ToTrTaLe tool was used, while English was processed with TreeTagger using the Penn Treebank model. Two tags are given for each word. For Slovene, @ctag gives the reduced SPOOK tag, while @ana gives the complete JOS morphosyntactic tag. For English, @ctag gives the original TreeTagger (Penn) PoS tag, while @ana gives the equivalent SPOOK tag to which it is mapped.
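
As a rough illustration of how the two tags might be read, the Python sketch below extracts @ctag and @ana from word elements. It assumes words are encoded as w elements carrying a lemma attribute, which this header does not state. The sample sentence is invented, the @ana values are placeholders rather than real SPOOK tags, and the function name word_tags is hypothetical. Real corpus files are likely to declare the TEI namespace, which the sketch omits.

```python
import xml.etree.ElementTree as ET

# Hypothetical English sentence; the ana values are placeholders,
# not actual SPOOK tags from the corpus.
SAMPLE_S = (
    '<s xml:lang="en">'
    '<w lemma="corpus" ctag="NN" ana="#PLACEHOLDER-1">corpus</w>'
    "<c> </c>"
    '<w lemma="be" ctag="VBZ" ana="#PLACEHOLDER-2">is</w>'
    "<pc>.</pc>"
    "</s>"
)


def word_tags(sentence_xml):
    """Return (form, lemma, ctag, ana) for every word in a sentence."""
    root = ET.fromstring(sentence_xml)
    return [
        (w.text, w.get("lemma"), w.get("ctag"), w.get("ana"))
        for w in root.iter("w")  # "w" is an assumed element name
    ]


for form, lemma, ctag, ana in word_tags(SAMPLE_S):
    print(form, lemma, ctag, ana)
```

Keeping both attributes on each word lets a query use the coarser @ctag for English and Slovene alike, or fall back to @ana when the fuller Slovene morphosyntactic description is needed.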

§text-profile description
§language usage
§language
ident = sl
§term
Slovene
§language
ident = en
§term
English
§revision description
§change Tomaž Erjavec: First version of corpus, corpus header.
§date 2012-11-01