Slovene Dependency Treebank

The Slovene Dependency Treebank (SDT) project built a small, syntactically annotated corpus of Slovene texts. The corpus was annotated with dependency analyses, taking the Prague Dependency Treebank as the model. SDT is annotated with Analytic Tree Structures and contains part of the morphosyntactically annotated Slovene component of the parallel MULTEXT-East corpus, i.e. the first third of the Slovene translation of the novel "1984" by G. Orwell, containing 30,000 words.

SDT took part in the CoNLL-X Shared Task: Multi-lingual Dependency Parsing. The data for this shared task, including Slovene, is available via LDC and ELRA:

LDC: LDC2015T12
ELRA: ELRA-W0087

SDT alone can also be downloaded from http://nl.ijs.si/sdt/data/. Two versions of SDT are offered there: the data used for CoNLL-X, and a somewhat more recent release that fixes some annotation errors and provides the treebank encoded in TEI P4 as well as in the derived CoNLL tabular format. More information about the current version of SDT is given in its TEI header.
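For readers who want to work with the tabular release, the following is a minimal Python sketch of reading CoNLL-X style data. It assumes the standard CoNLL-X layout: one token per line, tab-separated columns in the order ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL (followed by PHEAD and PDEPREL), and a blank line between sentences. The names Token, read_conllx and roots are illustrative and not part of any SDT tooling, and the sample sentence with its tags and labels is invented for the example rather than taken from the treebank.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One row of a CoNLL-X file (first eight columns only)."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int      # 0 means the token attaches to the artificial root
    deprel: str


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of Token; blank lines separate sentences."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              cols[4], cols[5], int(cols[6]), cols[7]))
    if sentence:  # the file may not end with a blank line
        yield sentence


def roots(sentence: List[Token]) -> List[Token]:
    """Return the tokens attached directly to the artificial root."""
    return [tok for tok in sentence if tok.head == 0]


# Invented sample rows (hypothetical tags and labels, for illustration only).
ROWS = [
    ["1", "Winston", "Winston", "Noun", "Noun", "_", "2", "Sb", "_", "_"],
    ["2", "spi", "spati", "Verb", "Verb", "_", "0", "Pred", "_", "_"],
    ["3", ".", ".", "PUNC", "PUNC", "_", "0", "AuxK", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS) + "\n"

if __name__ == "__main__":
    for sent in read_conllx(SAMPLE.splitlines()):
        for tok in sent:
            print(tok.id, tok.form, "->", tok.head, tok.deprel)
```

To read a downloaded file instead of the sample, pass an open file handle to read_conllx, for example open(path, encoding="utf-8"), where path is wherever you saved the data. UTF-8 is an assumption here; check the encoding of the release you use.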

If you report on your research involving SDT in a published paper, please cite the first reference below.

In subsequent work we switched to a simpler, local annotation format. Treebanks annotated in this format are available from the JOS project (jos100k, with 100,000 words) and the SSJ project (ssj500k, with 250,000 words treebanked). More recently, we moved to the Universal Dependencies framework, where you can find the Slovene UD treebank (derived from ssj500k).

Tree samples

References

Sašo Džeroski, Tomaž Erjavec, Nina Ledinek, Petr Pajas, Zdenek Žabokrtský, Andreja Žele:
Towards a Slovene Dependency Treebank.
In Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC'06, 24-26 May 2006, Genoa.

Nina Ledinek, Andreja Žele:
Building of the Slovene Dependency Treebank Corpus According to the Prague Dependency Treebank Corpus.
Conference "Grammar and Corpus", Prague, 23-25 November 2005.

Nina Ledinek:
Površinskoskladenjsko označevanje korpusa Slovene Dependency Treebank (s poudarkom na predikatu).
(Surface syntactic annotation of the Slovene Dependency Treebank (with focus on the predicate)).
B.A. thesis, University of Ljubljana, 2005.

Further links

Last change 2020-06-22, et