Up: Contents Next: 2. Definitions of Morphosyntactic Categories
These specifications define morphosyntactic properties and their mapping to morphosyntactic descriptions (MSDs) appropriate for tagging word tokens in Slovene texts. The recommendation is based on the MULTEXT-East specifications for the Slovene language, which have been used, among other things, for the annotation of the Fida and FidaPLUS reference corpora of Slovene.
These are the reasons why new morphosyntactic specifications were developed for JOS, which will hopefully be able to serve as a standard morphosyntactic tagset for Slovene. To this end, the choices made in MULTEXT-East were re-examined, and the tagset was compared and contrasted with other annotation schemes for Slovene. These are, in particular, the one used in the LC-Star corpus and the "Nova beseda" tagset, which differs from the previous two in its fundamental design: it does not use positional attributes and is very closely tied to traditional Slovene grammars. Tagsets of related languages were also studied to compare best practices, in particular the Prague tagset used, e.g., in the Czech National Corpus and the Prague Dependency Treebank.
The resulting JOS specification is still an application of the MULTEXT-East principles, but the procedure to convert between the FidaPLUS/MULTEXT-East corpus MSDs and those of JOS is non-trivial because the mapping has to take into account not only the MSDs but, in general, also the word-form or its lemma.
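To illustrate why such a conversion cannot be a plain MSD-to-MSD table, the following minimal sketch looks up the most specific rule first: one keyed on the MSD and the word-form, then one keyed on the MSD and the lemma, and only then one keyed on the MSD alone. All codes, word-forms, lemmas and the function name `convert_msd` are hypothetical placeholders; they are not taken from the MULTEXT-East or JOS specifications, and the actual conversion procedure may be organised differently.

```python
# Hypothetical rule tables. The MSD codes, forms and lemmas are placeholders,
# NOT real MULTEXT-East or JOS values.
FORM_RULES = {("OLD1", "form-a"): "NEW1a"}    # depends on MSD + word-form
LEMMA_RULES = {("OLD1", "lemma-a"): "NEW1b"}  # depends on MSD + lemma
MSD_RULES = {"OLD1": "NEW1", "OLD2": "NEW2"}  # depends on MSD only


def convert_msd(old_msd, word_form, lemma):
    """Return the new MSD, preferring the most specific matching rule."""
    form_key = (old_msd, word_form.lower())
    if form_key in FORM_RULES:
        return FORM_RULES[form_key]

    lemma_key = (old_msd, lemma.lower())
    if lemma_key in LEMMA_RULES:
        return LEMMA_RULES[lemma_key]

    if old_msd in MSD_RULES:
        return MSD_RULES[old_msd]

    raise KeyError(f"no conversion rule for MSD {old_msd!r}")
```

In this sketch the same source MSD (`OLD1`) can yield three different target MSDs depending on the token, which is the situation that makes the mapping non-trivial.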
The rest of these specifications is structured as follows. The second part defines the Categories (these for the most part correspond to Parts-of-Speech) and, for each category, its attributes and their values. The attributes have a fixed ordering, and each value has a one-letter code, which enables the translation from the format of morphosyntactic descriptions to that of feature structures. In addition to the table with the attribute-value definitions, each category section also contains the complete list of valid MSD codes, together with examples of usage. The third part then gives synoptic lists of categories, attributes, values and MSDs. The usage of the MSDs is illustrated with frequencies and examples from corpora. The specifications also have two appendices, which have not yet been translated into English. The first appendix describes, by category, the changes that have been made to the MULTEXT-East specifications to arrive at the JOS ones. The second appendix gives an overview of other related recommendations for morphosyntactic annotation.
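The translation from a positional MSD to a feature structure, which the fixed attribute ordering and one-letter value codes make possible, can be sketched as follows. The attribute table below is a small illustrative fragment for one category, written in the general MULTEXT-East style; it is an assumption for the sake of the example and should not be read as the actual JOS definition. The function name `msd_to_features` and the use of `-` to mark a non-applicable position are likewise assumptions.

```python
# Illustrative fragment only: one category with four ordered attributes.
# The real attribute inventory is defined in the category sections of the
# specifications and may differ from this table.
CATEGORIES = {
    "N": ("Noun", [
        ("Type", {"c": "common", "p": "proper"}),
        ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
        ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
        ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
                  "a": "accusative", "l": "locative", "i": "instrumental"}),
    ]),
}


def msd_to_features(msd):
    """Decode a positional MSD string into an attribute-value dictionary."""
    if not msd:
        raise ValueError("empty MSD")
    category_code, value_codes = msd[0], msd[1:]
    if category_code not in CATEGORIES:
        raise ValueError(f"unknown category code {category_code!r}")

    category_name, attributes = CATEGORIES[category_code]
    if len(value_codes) > len(attributes):
        raise ValueError(f"too many positions in MSD {msd!r}")

    features = {"Category": category_name}
    # Position i of the MSD (after the category letter) belongs to attribute i.
    for (attribute, values), code in zip(attributes, value_codes):
        if code == "-":  # assumed marker for a non-applicable attribute
            continue
        if code not in values:
            raise ValueError(f"invalid code {code!r} for {attribute}")
        features[attribute] = values[code]
    return features
```

Under these assumptions, `msd_to_features("Ncmsn")` yields a dictionary with the category Noun and the values common, masculine, singular and nominative, each read off from its position in the string.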