Natural language server
Department of Knowledge Technologies
Jožef Stefan Institute



jos100k corpus V2.0

The jos100k corpus contains 100,000 words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a reference annotated corpus of Slovene: its manually validated annotations cover three levels of linguistic description. The corpus has the following annotations:

the texts in the corpus are annotated with their bibliographic data and given text-types from the FIDA taxonomy
the texts contain complete sampled paragraphs, which are composed of sentences, which in turn consist of words, punctuation and spaces
words are assigned a morphosyntactic description and a lemma
sentences are annotated with syntactic dependency relations
all occurrences of the 100 most frequent nouns are annotated with their concept (synset id) from the Slovene WordNet, sloWNet.
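
The nesting described in the list above (paragraphs, then sentences, then words with a lemma and morphosyntactic description) can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions: the element names (p, s, w, c, S), the attribute names (lemma, msd) and the TEI namespace are assumed here for illustration and should be checked against the TEI corpus header and the distributed files; the sample fragment and its tag values are invented and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Assumption: the corpus files use the standard TEI P5 namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment written for this example; element names,
# attribute names and tag values are illustrative assumptions.
SAMPLE = """<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <p>
      <s>
        <w lemma="to" msd="Pd-nsn">To</w>
        <S/>
        <w lemma="biti" msd="Va-r3s-n">je</w>
        <S/>
        <w lemma="primer" msd="Ncmsn">primer</w>
        <c>.</c>
      </s>
    </p>
  </body>
</text>"""


def read_sentences(xml_text):
    """Return a list of sentences, each a list of (form, lemma, msd) tuples.

    Only word elements are collected; punctuation and space elements
    are skipped.
    """
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter(f"{{{TEI_NS}}}s"):
        tokens = [
            (w.text, w.get("lemma"), w.get("msd"))
            for w in s.iter(f"{{{TEI_NS}}}w")
        ]
        sentences.append(tokens)
    return sentences


if __name__ == "__main__":
    for sentence in read_sentences(SAMPLE):
        for form, lemma, msd in sentence:
            print(form, lemma, msd, sep="\t")
```

For a downloaded file, the same function could be given the file contents read as a string, provided the assumed element and attribute names match the actual encoding.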

Documentation of the linguistic annotation (cf. also the Bibliography):

Morphosyntax: Specifications and Annotators' Guidelines (in Slovene)
Syntax: Annotators' Guidelines (in Slovene)
Semantics: Annotators' Guidelines and Appendix with examples (in Slovene)

Download jos100k V2.0:

TEI corpus header in HTML:
- English
- Slovene
The corpus is available for download from http://nl.ijs.si/jos/download/