Project JOS:
Linguistic Annotation of Slovene
The JOS project developed annotated corpora of Slovene and associated resources meant to facilitate the development of Human Language Technologies for the Slovene language. The main results are the JOS morphosyntactic specifications (tagset definition), two annotated corpora, and two Web services. The developed resources are available under Creative Commons licences.
Web Services
JOS annotated corpora: jos100k and jos1M
The JOS corpora contain sampled paragraphs from the FidaPLUS corpus. The jos100k corpus contains 100,000 words with manually validated linguistic annotations: lemmas, morphosyntactic descriptions, dependency syntactic relations, and WordNet IDs on selected nouns. The jos1M corpus contains 1 million words with partially hand-validated lemmas and morphosyntactic descriptions.
The corpora are available as XML and as derived tabular files, with the TEI header also provided in HTML. The XML schema is based on TEI P5. The tabular files are smaller and probably easier to use directly, but they do not contain all the information from the XML. The tabular files are available with the linguistic categories in both Slovene and English.
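As a rough illustration of working with TEI P5-style XML, the sketch below pulls word forms, lemmas and morphosyntactic tags out of a small fragment. The fragment is hypothetical and not copied from the JOS corpora: the use of `w` elements with `lemma` and `msd` attributes, and the placeholder tag values, are assumptions, so check the actual schema and TEI header before relying on any of these names.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment: element and attribute names (w, lemma, msd) and the
# tag values MSD-A / MSD-B are illustrative assumptions, not JOS data.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p><s>
    <w lemma="biti" msd="MSD-A">je</w>
    <w lemma="dan" msd="MSD-B">dan</w>
    <c>.</c>
  </s></p></body></text>
</TEI>"""


def extract_tokens(xml_text):
    """Return (form, lemma, msd) triples for every word element."""
    root = ET.fromstring(xml_text)
    tokens = []
    # Iterate over namespaced <w> elements in document order;
    # punctuation elements such as <c> are skipped.
    for w in root.iter(f"{{{TEI_NS}}}w"):
        tokens.append((w.text, w.get("lemma"), w.get("msd")))
    return tokens


if __name__ == "__main__":
    for form, lemma, msd in extract_tokens(SAMPLE):
        print(form, lemma, msd, sep="\t")
```

Printing the triples tab-separated gives a rough idea of how a tabular view can be derived from the XML, although the real tabular files may use a different column layout.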
The TEI headers (corpus metadata) are also available in HTML format, in English (jos100k & jos1M) and in Slovene. The headers contain, inter alia, the bibliographic description of all the texts in the corpus, the FIDA text-type taxonomy, the JOS morphosyntactic library, comprising the morphosyntactic features and tagset, and a list of the syntactic dependencies with short descriptions.
The corpora are available under the Creative Commons Attribution-Noncommercial 3.0 licence, meaning that you are free to use them for any non-commercial purpose, provided that you give the original authors credit. In scientific publications this means citing the relevant publication or publications listed in the Publications section of this page.
Publications
Tomaž Erjavec, Darja Fišer, Simon Krek, Nina Ledinek: The JOS Linguistically Tagged Corpus of Slovene. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Malta, 2010.
Darja Fišer, Tomaž Erjavec: sloWNet: Construction and Corpus Annotation. Proceedings of the Fifth International Conference of the Global WordNet Association (GWC'10), Mumbai, 2010.
Nina Ledinek, Tomaž Erjavec: Odvisnostno površinskoskladenjsko označevanje slovenščine: specifikacije in označeni korpusi. Zbornik Simpozija Obdobja: Infrastruktura slovenščine in slovenistike, Ljubljana, 2009.
Tomaž Erjavec, Simon Krek: Oblikoskladenjske specifikacije in označeni korpusi JOS. Zbornik Šeste konference Jezikovne tehnologije, Ljubljana, 2008.
Tomaž Erjavec, Simon Krek: The JOS Morphosyntactically Tagged Corpus of Slovene. Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, 2008.
Further links
- Project SSJ: Communication in Slovene
- CLARIN.SI infrastructure
- MULTEXT-East language resources
- sloWNet
- Text Encoding Initiative and TEI P5