SPOOK morphosyntactic specifications

1. Background


The SPOOK morphosyntactic specifications define common morphosyntactic features and harmonised tagsets for Slovene, English, French, German and Italian. The specifications are based on the MULTEXT-East specifications, which cover 16 languages but, of the languages above, only Slovene. The SPOOK specifications cover the four Western European languages that are part of the SPOOK annotated parallel corpus. The SPOOK tagsets for these languages take as their starting point the tagsets used by the language models of TreeTagger, which was used to tag the Western European languages of the SPOOK corpus. The SPOOK specifications for Slovene have a special status, as the Slovene part of the corpus was annotated with the ToTrTaLe program, which uses the tagset defined in the JOS morphosyntactic specifications; these are identical to the MULTEXT-East specifications for Slovene. Because all the other languages use significantly smaller tagsets, and because the corpus and specifications contain both the TreeTagger tagsets and their mapping to the MULTEXT-East format, we also defined a mapping to the coarse-grained tagset for Slovene defined for the IMP project, which is developing language resources for historical Slovene. The IMP morphosyntactic specifications reduce the JOS tagset to lexical morphosyntactic features and define only 32 tags, instead of the 1900 tags defined in JOS.
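The idea of mapping one tagset onto another can be sketched as a simple table lookup. The following is a minimal illustration only: the tag pairs in `EXAMPLE_MAPPING`, the function name `map_tags` and the fallback tag `"X"` are assumptions made for this sketch and are not taken from the SPOOK mapping tables, which should be consulted for the actual correspondences.

```python
# Hypothetical mapping from TreeTagger-style tags to MULTEXT-East-style tags.
# These pairs are illustrative assumptions, NOT the SPOOK mapping.
EXAMPLE_MAPPING = {
    "NN": "Ncns",   # assumed: common noun, singular
    "NNS": "Ncnp",  # assumed: common noun, plural
}


def map_tags(tagged_tokens, mapping, unknown="X"):
    """Return (token, source_tag, target_tag) triples.

    Source tags missing from the mapping receive the `unknown` tag
    (the value "X" is an assumption of this sketch).
    """
    return [(token, tag, mapping.get(tag, unknown)) for token, tag in tagged_tokens]


if __name__ == "__main__":
    sample = [("house", "NN"), ("houses", "NNS"), ("the", "DT")]
    for token, source_tag, target_tag in map_tags(sample, EXAMPLE_MAPPING):
        print(token, source_tag, target_tag)
```

Keeping the source tag next to the mapped tag mirrors the statement above that the corpus and specifications contain both the original tagsets and their mapping.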

In these specifications the most useful part is probably the list of morphosyntactic tags with their mapping to the TreeTagger tagsets and examples of usage. They are available for the following languages:
Tomaž Erjavec, Dept. of Knowledge Technologies, Jožef Stefan Institute. Date: 2012-05-22
This work is licensed under the Creative Commons licence Attribution-ShareAlike 3.0.