jos1M corpus
The jos1M corpus contains 1,000,000 words of sampled paragraphs from the FidaPLUS corpus. It is meant to serve as a training corpus for word-level tagging of Slovene, as it contains partially hand-validated morphosyntactic descriptions and lemmas. The corpus has the following annotations:
- the texts in the corpus are annotated with their bibliographic data and given text-types from the FIDA taxonomy
- the texts consist of complete sampled paragraphs, which are segmented into sentences, and these into words, punctuation and spaces
- each word is assigned a morphosyntactic description and a lemma
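The annotation layers listed above (paragraphs, sentences, words with a morphosyntactic description and lemma) can be pictured with a small sketch. This is only an illustration under stated assumptions: the element names `p`, `s`, `w`, `c`, the attribute names `lemma` and `msd`, the sample sentence, and the `MSD-1` style values are hypothetical placeholders, not taken from the corpus; the actual encoding is described in the TEI corpus header.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element and attribute names are assumptions,
# and the msd values are placeholders, not real JOS tags.
SAMPLE = """
<text>
  <p>
    <s>
      <w lemma="ta" msd="MSD-1">To</w>
      <w lemma="biti" msd="MSD-2">je</w>
      <w lemma="test" msd="MSD-3">test</w>
      <c>.</c>
    </s>
  </p>
</text>
"""

def read_tokens(xml_text):
    """Return (paragraph no., sentence no., word form, lemma, msd) for each word."""
    root = ET.fromstring(xml_text)
    rows = []
    for p_idx, p in enumerate(root.iter("p"), start=1):
        for s_idx, s in enumerate(p.iter("s"), start=1):
            # Only <w> elements carry a lemma and msd; punctuation is skipped.
            for w in s.iter("w"):
                rows.append((p_idx, s_idx, w.text, w.get("lemma"), w.get("msd")))
    return rows

rows = read_tokens(SAMPLE)
for row in rows:
    print(row)
```

A flat list of (word form, lemma, description) rows like this is the shape that a word-level tagger would typically be trained on; the downloaded files may need namespace handling that this sketch leaves out.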
Documentation of the linguistic annotation (cf. also the Bibliography):
- Morphosyntax: Specifications and Annotators' Guidelines (in Slovene)
Download jos1M V1.1:
- TEI corpus header in HTML:
- The corpus is available for download from the CLARIN.SI repository under the permanent URL:
http://hdl.handle.net/11356/1037
The manual annotation of the jos1M corpus was supported by the project BMT.