JOS morphosyntactic specifications for Slovene

1. Background

These specifications define morphosyntactic properties and their mapping to morphosyntactic descriptions (MSDs) appropriate for tagging word tokens in Slovene texts. The recommendation is compatible with the MULTEXT-East V4 specifications for Slovene. The previous version of those specifications, MULTEXT-East V3, was used, among other things, to annotate the Fida and FidaPLUS reference corpora of Slovene.

The MULTEXT(-East) morphosyntactic specifications are based on work by EAGLES and set out the grammar and vocabulary of valid morphosyntactic descriptions (MSDs). The specifications determine what, for each language, is a valid MSD and what it means. For instance, they can define that a particular MSD is equivalent to the feature structure
Noun, Type = common, Gender = masculine, Number = singular, Case = accusative, Animate = no

It should be noted that the Slovene MULTEXT-East tagset differs substantially from the tagsets of inflectionally less rich languages, such as the majority of Western European ones. In Slovene, as in other Slavic languages, words can be marked with a large number of features, and the Slovene MULTEXT-East tagset is correspondingly large, with about 1,900 tags.

While it was felt that the formal basis and principles of the Slovene tagset of MULTEXT-East V3 were adequate, if at times not perfect, for Slovene, a number of details were considered problematic, e.g. certain attributes or their values, the allowed combinations of attribute-values, and the lexical assignment of MSDs to particular words or word groups. Another problematic aspect, this one of the MULTEXT-East V3 specifications as a whole, is the ordering of the attributes in the MSD string: as the specifications cover a large number of quite varied languages, language-specific attributes (or those added to the specifications at a later date) wind up at the end of the string, leading to unwieldy MSD strings. It would be better if an individual language had the freedom to reorder attributes, as long as the mapping to the feature-structure representation was maintained.
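
The point about attribute ordering can be illustrated with a short Python sketch. It is not part of the specifications: the two orderings, the one-letter codes, the use of "-" for an unfilled position, and the function names are all assumptions chosen for the example. It shows that a shared ordering with unused slots and a shorter language-specific ordering can encode the same feature structure.

```python
# Assumed one-letter codes for the attributes used in this example.
CODES = {
    "Type": {"common": "c"},
    "Gender": {"masculine": "m"},
    "Number": {"singular": "s"},
    "Case": {"accusative": "a"},
    "Animate": {"no": "n", "yes": "y"},
}

# Hypothetical shared ordering with slots this language never fills,
# and a hypothetical language-specific ordering without them.
SHARED_ORDER = ["Type", "Gender", "Number", "Case",
                "Definiteness", "Clitic", "Animate"]
LOCAL_ORDER = ["Type", "Gender", "Number", "Case", "Animate"]


def encode(features, order, category="N"):
    """Build a positional string; '-' marks an attribute with no value."""
    codes = [CODES[a][features[a]] if a in features else "-" for a in order]
    return category + "".join(codes).rstrip("-")


def decode(msd, order):
    """Recover the attribute-value pairs from a positional string."""
    features = {}
    for attribute, code in zip(order, msd[1:]):
        if code != "-":
            reverse = {c: v for v, c in CODES[attribute].items()}
            features[attribute] = reverse[code]
    return features


features = {"Type": "common", "Gender": "masculine", "Number": "singular",
            "Case": "accusative", "Animate": "no"}
shared = encode(features, SHARED_ORDER)
local = encode(features, LOCAL_ORDER)
print(shared, local)
# Both strings decode to the same feature structure.
print(decode(shared, SHARED_ORDER) == decode(local, LOCAL_ORDER))
```

The strings differ only in layout; what has to stay fixed is the per-language table that says which attribute each position stands for.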

These are the reasons why new morphosyntactic specifications were developed for JOS, which will hopefully be able to serve as a standard morphosyntactic tagset for Slovene. To this end, the choices made in MULTEXT-East were re-examined, and the tagset was compared and contrasted with other annotation schemes for Slovene, in particular the one used in the LC-Star corpus and the "Nova beseda" tagset, which differs from the previous two in its fundamental design, i.e. it does not use positional attributes and is very closely tied to traditional Slovene grammars. Tagsets of related languages were also studied to compare best practices, in particular the Prague tagset used e.g. in the Czech National Corpus and the Prague Dependency Treebank.

The resulting JOS specification is compatible with MULTEXT-East V4 but the procedure to convert between the FidaPLUS / MULTEXT-East V3 corpus MSDs and those of JOS / MULTEXT-East V4 is non-trivial because the mapping has to take into account not only the MSDs but, in general, also the word-form or its lemma.

The specifications are available in both Slovene and English. This holds not only for the text of the specifications but also for the names of the categories, their attributes and values. With the specifications and the attendant XSLT scripts it is thus possible to translate a Slovene MSD into the English MSD that corresponds to the same feature structure, which enables simple translation between the MSDs in English and Slovene.
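
As a rough illustration of such a translation, the Python sketch below maps a noun MSD between the two languages position by position. It does not reproduce the XSLT scripts mentioned above. The code pairs in `NOUN_CODE_PAIRS`, the sample strings, and the function name `translate_msd` are assumptions made for the example, and only the noun category is covered.

```python
# Hypothetical (English, Slovene) code pairs for the noun category, one list
# per position in the MSD string; position 0 is the category code itself.
NOUN_CODE_PAIRS = [
    [("N", "S")],                           # category: Noun
    [("c", "o"), ("p", "l")],               # Type: common, proper
    [("m", "m"), ("f", "z"), ("n", "s")],   # Gender
    [("s", "e"), ("d", "d"), ("p", "m")],   # Number
    [("n", "i"), ("g", "r"), ("d", "d"),
     ("a", "t"), ("l", "m"), ("i", "o")],   # Case
    [("n", "n"), ("y", "d")],               # Animate: no, yes
]


def translate_msd(msd, to_english=True):
    """Translate a noun MSD between the assumed Slovene and English codes."""
    src, dst = (1, 0) if to_english else (0, 1)
    if len(msd) > len(NOUN_CODE_PAIRS):
        raise ValueError(f"too many positions in {msd!r}")
    out = []
    for pairs, code in zip(NOUN_CODE_PAIRS, msd):
        if code == "-":  # assumed marker for an unfilled position
            out.append(code)
            continue
        mapping = {pair[src]: pair[dst] for pair in pairs}
        if code not in mapping:
            raise ValueError(f"unknown code {code!r} in {msd!r}")
        out.append(mapping[code])
    return "".join(out)


print(translate_msd("Sometn"))                    # Slovene to English
print(translate_msd("Ncmsan", to_english=False))  # English to Slovene
```

Because the same letter can stand for different values in different positions, the lookup has to be done per position rather than with a single global letter table.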

The rest of these specifications is structured as follows. The second part of the specifications defines the Categories (these for the most part correspond to Parts-of-Speech) and for each defines its attributes and their values. The attributes have a fixed ordering, and their values a one-letter code, which enables the translation from the format of morphosyntactic descriptions to that of feature structures. In addition to the table with the attribute-value definitions, each category section also contains the complete list of valid MSD codes, together with examples of usage. The third part then gives synoptic lists of categories, attributes, values and MSDs. The usage of the latter is illustrated with frequencies and examples from corpora. The specifications also have two appendices, which are, however, not translated into English. The first appendix describes, by category, the changes that have been made to the MULTEXT-East specifications to arrive at the JOS ones. The second appendix gives an overview of other related recommendations for morphosyntactic annotation.
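
The translation from a positional MSD to a feature structure can be sketched in Python as follows. This is a minimal illustration rather than part of the specifications: the attribute order follows the noun example in the Background section, while the one-letter codes in `NOUN_TABLE`, the "-" placeholder, the function names, and the sample string `Ncmsan` are assumptions; the authoritative codes are those given in the category tables.

```python
# Hypothetical excerpt of a category table: attributes in their fixed
# order, each with an assumed mapping from one-letter code to value.
NOUN_TABLE = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
    ("Animate", {"n": "no", "y": "yes"}),
]
CATEGORIES = {"N": ("Noun", NOUN_TABLE)}


def decode_msd(msd):
    """Expand a positional MSD string into (category, [(attribute, value)])."""
    name, table = CATEGORIES[msd[0]]
    if len(msd) - 1 > len(table):
        raise ValueError(f"too many positions in {msd!r}")
    features = []
    for (attribute, codes), code in zip(table, msd[1:]):
        if code == "-":  # assumed placeholder for a non-applicable attribute
            continue
        if code not in codes:
            raise ValueError(f"invalid code {code!r} for {attribute}")
        features.append((attribute, codes[code]))
    return name, features


def format_features(name, features):
    """Render the feature structure in the attribute = value notation."""
    return ", ".join([name] + [f"{a} = {v}" for a, v in features])


print(format_features(*decode_msd("Ncmsan")))
```

A code that is not listed for its position raises an error, which mirrors the role of the per-category lists of valid MSDs, although a full validity check would also need those lists.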

Tomaž Erjavec, Simon Krek, Špela Arhar, Darja Fišer, Nina Ledinek, Amanda Saksida, Breda Sivec, Blaž Trebar. Date: 2010-03-07
This work is licensed under the Creative Commons Attribution 3.0 Slovenia licence.