JOS
ToTaLe text analyser for Slovene texts
Note that this is a very old service, and better web tools for tagging Slovene exist,
in particular ReLDIanno.
Here you can tokenise, tag and lemmatise Slovene texts.
The tags (morphosyntactic descriptions, MSDs) follow the
JOS morphosyntactic specifications
and can be shown either in Slovene (e.g. Gp-g = glagol pomožni pogojnik)
or English (e.g. Va-c = Verb auxiliary conditional).
The output file is in "vertical" format, suitable for use in
SketchEngine and CWB.
Each line is either an XML tag (<doc>, <p>, <s> and
</s>, </p>, </doc>) or an annotated token.
Token lines are tab-separated and contain 1) the token, 2)
the lemma (base form) of the word, and 3) the
MSD tag.
For punctuation, the MSD and lemma fields are identical to the token.
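As an illustration of the vertical format described above, the following is a minimal Python sketch of a reader for such output. It is not part of ToTaLe: the names Token and parse_vertical are hypothetical, and the sample text is made up for the example. It assumes that structural lines contain no tab character and that every token line has exactly three tab-separated fields.

```python
from typing import Iterable, Iterator, NamedTuple, Union


class Token(NamedTuple):
    """One annotated token line: word form, lemma and MSD tag."""
    form: str
    lemma: str
    msd: str


def parse_vertical(lines: Iterable[str]) -> Iterator[Union[str, Token]]:
    """Yield structural tags as plain strings and token lines as Token tuples."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip empty lines
        # Assumption: structural tags such as <s> or </p> never contain a tab,
        # which also keeps a punctuation token like "<" from being misread.
        if "\t" not in line and line.startswith("<") and line.endswith(">"):
            yield line
            continue
        form, lemma, msd = line.split("\t")  # raises ValueError if not 3 fields
        yield Token(form, lemma, msd)


# Made-up sample in the described layout (English MSDs).
sample = "<doc>\n<p>\n<s>\nbi\tbiti\tVa-c\n.\t.\t.\n</s>\n</p>\n</doc>\n"

items = list(parse_vertical(sample.splitlines()))
tokens = [item for item in items if isinstance(item, Token)]
```

Here tokens holds only the annotated token lines, while items also keeps the structural tags in their original order, so sentence and paragraph boundaries can still be recovered.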
The MSDs can be converted into various other formats with the JOS MSD
conversion tables.
Documentation
- ws-jos-ljubljana-06.txt:
Grammatical relations file (using Slovene MSDs), needed for compiling Word Sketches in
SketchEngine.
The grammatical relations for JOS MSDs are based on the relations
developed for the FidaPLUS corpus by Simon Krek and described in
Simon Krek, Adam Kilgarriff:
Slovene Word Sketches.
Proceedings of the 5th Slovenian and 1st International Language Technologies Conference 2006,
Jožef Stefan Institute, Ljubljana, Slovenia.
- Old paper describing ToTaLe:
Tomaž Erjavec, Camelia Ignat, Bruno Pouliquen, Ralf Steinberger.
Massive multi-lingual corpus
compilation: Acquis Communautaire and totale.
In Proceedings of the 2nd Language & Technology Conference, April 21-23, 2005, Poznan, Poland, pp. 32-36.
- Papers describing the training set for ToTaLe used in this service:
Tomaž Erjavec, Simon Krek:
The JOS Morphosyntactically Tagged Corpus of Slovene.
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech 2008.
Tomaž Erjavec, Darja Fišer, Simon Krek, Nina Ledinek.
The JOS Linguistically Tagged Corpus of Slovene.
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Malta, 2010.
Related services
Page last updated 2022-05-14, et