MULTEXT-East Morphosyntactic Specifications



The purpose of this document is to provide harmonised lexical specifications for the following languages or language variants: Albanian, Bulgarian, Chechen, Czech, Damaskini, Estonian, English, Georgian, Hungarian, Macedonian, Persian, Polish, Romanian, Russian, Ukrainian, Serbo-Croatian macrolanguage, Slovak, Slovene, Resian, and Torlak.

These specifications are based on the proposals for lexicon specifications presented in the MULTEXT D1-6-1B Deliverable of the MULTEXT Project [mt:D161B] and in the Eagles documents of the Lexicon sub-group on Morphosyntactic annotation [eagles:morphsyn], [eagles:morphana].

These specifications were first produced within the MULTEXT-East project as project report D11F and, in slightly modified form, were made available on the TELRI CD-ROM "A Compendium of Multilingual Resources" [telri:CD], [lrec98:mtelex].

The second version, the so-called "Concede edition" [elsnews01:v2], [mte:nlprs], added a new language, Croatian, and extended the common tables with additional attributes and values for Romanian and Slovene. The tables for Slovene were also localised. The format of the specification was improved by converting the master LaTeX document to Latin-2 encoding. Finally, the common tables were additionally made available in XML, as TEI feature libraries <fLib>, one for each PoS.

In Version 3 [mte:slav] some minor errors found in the Concede edition were fixed; this version also added two more specifications: for Serbian, and for a dialect of Slovene, Resian.

In Version 4, the "MONDILEX edition" [V4], the specifications were re-cast in XML, in TEI P5. This offers many possibilities for improvement, such as automatically producing derived tables. Six new languages were also added: Russian, Macedonian and Persian, and, thanks to the support of the EU MONDILEX project, Slovak, Polish and Ukrainian.

In Version 5, a new category "Z" was added for punctuation; new values of the Residual Type have been added for CMC texts; two new languages, Bosnian and Chechen, have been added; and the Croatian and Slovene specifications have been changed. However, this version never made it beyond the "DRAFT" status.

In the present Version 6, the TEI encoding has been updated, as have the conversion scripts, and the maintenance of the specifications has moved to GitHub. Furthermore, the Croatian, Serbian, and Bosnian specifications have been removed and replaced by specifications for the Serbo-Croatian macrolanguage, which are meant to cover Croatian, Serbian, Bosnian and Montenegrin. Last but not least, the Macedonian specifications have been updated, and specifications have been added for the Albanian and Georgian languages, the Torlak dialect of Serbian, and the "Damaskini" diachronic corpus of Balkan Slavic texts from the 16th to the 19th century.

Pisa, October 1995

Ljubljana, December 1997 (V1)

Ljubljana, March 2001 (V2)

Ljubljana, May 2004 (V3)

Ljubljana, May 2010 (V4)

Ljubljana, June 2016 (V5)

Erlangen, November 2018 (V6 draft)

Ljubljana, August 2021 (V6, current version)

Date: 2022-06-24
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.