JOS ToTaLe text analyser for Slovene texts

Here you can Tokenise, Tag and Lemmatise Slovene texts. The tags (morphosyntactic descriptions, MSDs) follow the JOS morphosyntactic specifications and can be shown either in Slovene (e.g. Gp-g = glagol pomožni pogojnik) or in English (e.g. Va-c = Verb auxiliary conditional). The output file is in "vertical" format, suitable for use in SketchEngine and CWB. Each line is either an XML tag (<doc>, <p>, <s> and </s>, </p>, </doc>) or an annotated token. Token lines are tab-separated and contain 1) the token, 2) the lemma (base form) of the word, and 3) the MSD tag. For punctuation, the lemma and MSD fields are identical to the token. The MSDs can be converted into various other formats with the JOS MSD conversion tables.
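As an illustration of the vertical format described above, here is a minimal Python sketch that reads such output and groups the tokens into sentences. It is not part of the service: the names Token and parse_vertical are hypothetical, and the sample text is hand-written for illustration (using the "bi" / Va-c example from above), not actual ToTaLe output. It assumes that structural lines contain no tab and that every token line has exactly three tab-separated fields.

```python
from typing import Iterator, List, NamedTuple


class Token(NamedTuple):
    """One annotated token line: token, lemma, MSD (hypothetical helper type)."""
    form: str
    lemma: str
    msd: str


def parse_vertical(text: str) -> Iterator[List[Token]]:
    """Yield one list of Token objects per <s>...</s> block."""
    sentence: List[Token] = []
    for line in text.splitlines():
        if not line:
            continue  # skip empty lines
        # Structural XML lines (<doc>, <p>, <s>, </s>, ...) carry no tab,
        # whereas a punctuation token such as "<" would still have three fields.
        if "\t" not in line and line.startswith("<") and line.endswith(">"):
            if line == "</s>" and sentence:
                yield sentence
                sentence = []
            continue
        # Raises ValueError if the line does not have exactly three fields.
        form, lemma, msd = line.split("\t")
        sentence.append(Token(form, lemma, msd))
    if sentence:
        # Tokens after the last </s> (or input without sentence tags).
        yield sentence


# Hand-written sample with English MSDs; for punctuation all fields are equal.
SAMPLE = "<doc>\n<p>\n<s>\nbi\tbiti\tVa-c\n.\t.\t.\n</s>\n</p>\n</doc>\n"

for sent in parse_vertical(SAMPLE):
    for tok in sent:
        print(tok.form, tok.lemma, tok.msd, sep=" | ")
```

The check for a missing tab is a design choice made for this sketch: it keeps punctuation tokens that look like markup from being mistaken for structural tags.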

The analysis can return either the result or compressed files, with the MSDs in Slovene or English.

Type or paste text into the window below:

or upload plain UTF-8 text files (.txt, or .zip, .rar, .tgz):

Note: uploaded files are archived and may be used as a basis for further research.



Page last updated 2013-04-02, et