These specifications define morphosyntactic properties and their mapping to morphosyntactic descriptions (MSDs) appropriate for tagging word tokens in Slovene texts. The recommendation is compatible with the MULTEXT-East V4 specifications for the Slovene language. The previous version of the MULTEXT-East specifications, V3, was used, among other things, for the annotation of the Fida and FidaPLUS reference corpora of Slovene.
These are the reasons why new morphosyntactic specifications were developed for JOS, in the hope that they can serve as a standard morphosyntactic tagset for Slovene. To this end, the choices made in MULTEXT-East were re-examined, and the tagset was compared and contrasted with other annotation schemes for Slovene. These were, in particular, the scheme used in the LC-Star corpus and the "Nova beseda" tagset, which differs from the previous two in its fundamental design: it does not use positional attributes and is very closely tied to traditional Slovene grammars. Tagsets of related languages were also studied to compare best practices, in particular the Prague tagset used, for example, in the Czech National Corpus and the Prague Dependency Treebank.
The resulting JOS specification is compatible with MULTEXT-East V4. However, the procedure for converting between the FidaPLUS / MULTEXT-East V3 corpus MSDs and those of JOS / MULTEXT-East V4 is non-trivial, because the mapping has to take into account not only the MSDs but, in general, also the word-form or its lemma.
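A minimal Python sketch of why such a conversion cannot be a plain MSD-to-MSD table: a rule keyed on the old MSD together with the lemma is tried first, and the MSD-only rule is used as a fallback. The function name, the two tables, and all tags and lemmas shown (`OLD-A`, `NEW-A1`, `lemma-x`, and so on) are hypothetical placeholders, not actual FidaPLUS or JOS codes.

```python
def convert_msd(old_msd, lemma, by_msd_and_lemma, by_msd):
    """Map an old MSD to a new one, letting the lemma refine the result.

    Both tables are hypothetical illustrations of a conversion resource.
    """
    # A rule that mentions the lemma is more specific, so it is tried first.
    specific = by_msd_and_lemma.get((old_msd, lemma))
    if specific is not None:
        return specific
    # Otherwise fall back to the rule that depends on the MSD alone.
    if old_msd not in by_msd:
        raise KeyError(f"no conversion rule for MSD {old_msd!r}")
    return by_msd[old_msd]


# Placeholder tables: the tags and lemmas are invented for this example.
BY_MSD_AND_LEMMA = {("OLD-A", "lemma-x"): "NEW-A2"}
BY_MSD = {"OLD-A": "NEW-A1", "OLD-B": "NEW-B1"}

print(convert_msd("OLD-A", "lemma-x", BY_MSD_AND_LEMMA, BY_MSD))
print(convert_msd("OLD-A", "lemma-y", BY_MSD_AND_LEMMA, BY_MSD))
```

The same old MSD yields two different new MSDs here, depending only on the lemma, which is the situation the paragraph above describes. A rule keyed on the word-form instead of the lemma would follow the same pattern.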
The rest of these specifications is structured as follows. The second part defines the categories (which for the most part correspond to parts of speech) and, for each category, its attributes and their values. The attributes have a fixed ordering, and each value has a one-letter code, which enables translation from the format of morphosyntactic descriptions to that of feature structures. In addition to the table of attribute-value definitions, each category section also contains the complete list of valid MSD codes, together with examples of usage. The third part gives synoptic lists of categories, attributes, values and MSDs; the usage of the MSDs is illustrated with frequencies and examples from corpora. The specifications also have two appendices, which are, however, not translated into English. The first appendix describes, by category, the changes that have been made to the MULTEXT-East specifications to arrive at the JOS ones. The second appendix gives an overview of other related recommendations for morphosyntactic annotation.
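The translation from a positional MSD to a feature structure can be sketched in Python as follows. The attribute table below is a small illustrative excerpt for nouns in the MULTEXT-East style, given here as an assumption; the normative attributes, their order, and their codes are those defined in the category tables of the second part. The function name `msd_to_features` and the treatment of `-` as "attribute not applicable" are likewise assumptions made for this sketch.

```python
# Illustrative excerpt only: the normative table is in the category definitions.
CATEGORIES = {
    "N": ("Noun", [
        ("Type", {"c": "common", "p": "proper"}),
        ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
        ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
        ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
                  "a": "accusative", "l": "locative", "i": "instrumental"}),
    ]),
}


def msd_to_features(msd):
    """Decode a positional MSD string into an attribute-value dictionary."""
    if not msd:
        raise ValueError("empty MSD")
    if msd[0] not in CATEGORIES:
        raise ValueError(f"unknown category code {msd[0]!r}")
    category, attributes = CATEGORIES[msd[0]]
    codes = msd[1:]
    if len(codes) > len(attributes):
        raise ValueError(f"MSD {msd!r} has too many positions")
    features = {"Category": category}
    # Position i of the MSD always refers to attribute i of the category.
    for (name, values), code in zip(attributes, codes):
        if code == "-":  # assumed marker for a non-applicable attribute
            continue
        if code not in values:
            raise ValueError(f"invalid code {code!r} for attribute {name}")
        features[name] = values[code]
    return features


print(msd_to_features("Ncmsn"))
```

Because the ordering of attributes is fixed, the position of a letter alone identifies the attribute, so the same letter (for example `n`) can stand for different values in different positions without ambiguity.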