Slovene-English Parallel Corpus

The IJS-ELAN corpus contains 1 million words from 15 parallel Slovene-English / English-Slovene texts. The corpus is meant to facilitate the development of language technology and studies in bilingual terminology extraction, primarily for the Slovene language. The corpus is sentence-aligned, tokenised, lemmatised and PoS-tagged. The corpus is currently at Version 3, produced in 2012. Its tagging and tagset are described in the SPOOK Morphosyntactic Specifications. It is encoded in XML according to the TEI P5 Guidelines.
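As a rough illustration of what working with such an encoding can look like, the sketch below reads sentences and their annotated tokens from a small TEI-style fragment using the Python standard library. The fragment is invented for this example and is not taken from the corpus. The element and attribute names (s, w, pc, lemma, ana, xml:id) follow general TEI P5 usage and are assumptions about the corpus files, as are the tag values shown; check the actual corpus files and the SPOOK specifications before relying on them.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: invented for illustration, NOT copied from IJS-ELAN.
# The lemma/ana attribute names and the tag values are assumptions.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <s xml:id="en.1">
      <w lemma="the" ana="#Dd">The</w>
      <w lemma="corpus" ana="#Ncns">corpus</w>
      <w lemma="be" ana="#Va">is</w>
      <w lemma="align" ana="#Vm">aligned</w>
      <pc>.</pc>
    </s>
  </body></text>
</TEI>"""


def read_sentences(xml_text):
    """Return a list of (sentence_id, [(form, lemma, tag), ...]) pairs."""
    root = ET.fromstring(xml_text)
    sentences = []
    # Every <s> element is treated as one sentence.
    for s in root.iter(f"{{{TEI_NS}}}s"):
        # Only <w> elements are collected; punctuation (<pc>) is skipped.
        tokens = [
            (w.text, w.get("lemma"), w.get("ana"))
            for w in s.iter(f"{{{TEI_NS}}}w")
        ]
        sentences.append((s.get(XML_ID), tokens))
    return sentences


if __name__ == "__main__":
    for sent_id, tokens in read_sentences(SAMPLE):
        print(sent_id, tokens)
```

The same function could be applied to the text of a downloaded corpus file, provided its markup matches the assumptions above.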

You can use the on-line concordancer to search the IJS-ELAN corpus, which is part of the TRANS5 parallel corpus.

The corpus is also freely available for download, but please acknowledge its use in publications by citing the paper below:

The IJS-ELAN corpus was produced at the Dept. of Knowledge Technologies, Jožef Stefan Institute. Thanks go to Špela Vintar, Roman Maurer and Andrej Skubic for acquiring and aligning portions of the corpus, and to the company Amebis, d.o.o. for lexically annotating the first version of the corpus. The compilation of the corpus was partially financed by subcontract to the EU MLIS 121 project ELAN, by subcontract to the Copernicus Joint Project CONCEDE, and by the grant MZT L2-0461-0106 from the Ministry of Science and Technology of Slovenia.

Page, last updated 2013-01-04, et