jos100k corpus V2.0
The jos100k corpus contains 100,000 words of paragraphs sampled from the FidaPLUS corpus. It is meant to serve as a reference annotated corpus of Slovene: its manually validated annotations cover three levels of linguistic description. The corpus has the following annotations:
- the texts in the corpus are annotated with their bibliographic data and assigned text types from the FIDA taxonomy
- the texts contain complete sampled paragraphs; these are composed of sentences, which in turn consist of words, punctuation and spaces
- words are assigned a morphosyntactic description and a lemma
- sentences are annotated with syntactic dependency relations
- all occurrences of the 100 most frequent nouns are annotated with their concept (synset id) from the Slovene WordNet, sloWNet.
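The nesting described above (paragraphs, then sentences, then words and punctuation, with a lemma and morphosyntactic description on each word) could be read roughly as in the following Python sketch. The element names (text, p, s, w, pc), the attribute names (lemma, msd), the sample sentence and its MSD values are illustrative assumptions, not taken from the corpus itself; the actual encoding is defined in the TEI corpus header and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names, attribute names and values are
# assumptions made for illustration, not copied from jos100k.
SAMPLE = """
<text>
  <p>
    <s>
      <w lemma="danes" msd="Rgp">Danes</w>
      <w lemma="biti" msd="Va-r3s-n">je</w>
      <w lemma="lep" msd="Agpmsnn">lep</w>
      <w lemma="dan" msd="Ncmsn">dan</w>
      <pc>.</pc>
    </s>
  </p>
</text>
"""

def read_sentences(xml_text):
    """Return one list of (form, lemma, msd) tuples per sentence.

    Punctuation tokens carry no lemma or msd, so those fields are None.
    """
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        tokens = []
        for el in s:
            if el.tag == "w":
                tokens.append((el.text, el.get("lemma"), el.get("msd")))
            elif el.tag == "pc":
                tokens.append((el.text, None, None))
        sentences.append(tokens)
    return sentences

if __name__ == "__main__":
    for sentence in read_sentences(SAMPLE):
        for form, lemma, msd in sentence:
            print(form, lemma, msd)
```

If the real files use XML namespaces or different tag names, the tag comparisons in read_sentences would need to be adjusted accordingly.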
Documentation of the linguistic annotation (cf. also the Bibliography):
- Morphosyntax: Specifications and Annotators' Guidelines (in Slovene)
- Syntax: Annotators' Guidelines (in Slovene)
- Semantics: Annotators' Guidelines and Appendix with examples (in Slovene)
Download jos100k V2.0:
- TEI corpus header in HTML:
- The corpus is available for download from http://nl.ijs.si/jos/download/