Natural language server
Department of Knowledge Technologies
Jožef Stefan Institute



jos1M corpus

The jos1M corpus contains 1,000,000 words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene, as it contains partially hand-validated morphosyntactic descriptions and lemmas. The corpus has the following annotations:

the texts in the corpus are annotated with their bibliographic data and assigned text types from the FIDA taxonomy
the texts consist of complete sampled paragraphs, which are segmented into sentences, and these into words, punctuation and spaces
each word is assigned a morphosyntactic description and a lemma
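
The annotation layers listed above (paragraphs, sentences, and tokens carrying a lemma and a morphosyntactic description) can be read with a few lines of Python. The following is only a minimal sketch: the element names (p, s, w, c, S), the attribute names (lemma, msd), the sample sentence, and the tag values TAG1 to TAG3 are all assumptions made for illustration and are not taken from the jos1M files, so they should be checked against the TEI corpus header before use.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. Element names, attribute names and the TAG* values
# are illustrative assumptions, not the actual jos1M encoding.
SAMPLE = """<text>
  <p>
    <s>
      <w lemma="biti" msd="TAG1">Je</w><S/>
      <w lemma="lep" msd="TAG2">lep</w><S/>
      <w lemma="dan" msd="TAG3">dan</w>
      <c>.</c>
    </s>
  </p>
</text>"""


def read_tokens(xml_text):
    """Return (paragraph, sentence, form, lemma, msd) tuples.

    Words carry a lemma and an MSD; punctuation carries neither,
    and space markers are skipped.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for p_idx, para in enumerate(root.iter("p"), start=1):
        for s_idx, sent in enumerate(para.iter("s"), start=1):
            for tok in sent:
                if tok.tag == "w":
                    rows.append((p_idx, s_idx, tok.text,
                                 tok.get("lemma"), tok.get("msd")))
                elif tok.tag == "c":
                    rows.append((p_idx, s_idx, tok.text, None, None))
    return rows


if __name__ == "__main__":
    for row in read_tokens(SAMPLE):
        print(row)
```

Flattening the hierarchy into one tuple per token keeps the paragraph and sentence indices available while giving a tagger a simple sequence of forms, lemmas and descriptions to train on.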

Documentation of the linguistic annotation (cf. also the Bibliography):

Morphosyntax: Specifications and Annotators' Guidelines (in Slovene)

Download jos1M V1.1:

TEI corpus header in HTML:
- English
- Slovene
The corpus is available for download from the CLARIN.SI repository under the permanent URL:
http://hdl.handle.net/11356/1037

The manual annotation of the jos1M corpus was supported by the project BMT.