MULTEXT-East Morphosyntactic Specifications, Version 5 (draft)

Foreword


The purpose of this document is to provide harmonised lexical specifications for eighteen languages: Bosnian, Bulgarian, Chechen, Croatian, Czech, Estonian, English, Hungarian, Macedonian, Persian, Polish, Romanian, Russian, Ukrainian, Serbian, Slovak, Slovene, and the Resian dialect of Slovene.

These specifications are based on the proposals for lexicon specifications presented in the MULTEXT D1-6-1B Deliverable of the MULTEXT Project [mt:D161B] and in the Eagles documents of the Lexicon sub-group on Morphosyntactic annotation [eagles:morphsyn], [eagles:morphana].

These specifications were first produced within the MULTEXT-East project as project report D11F, and were then made available, in slightly modified form, on the TELRI CD-ROM "A Compendium of Multilingual Resources" [telri:CD], [lrec98:mtelex].

The second version, the so-called "Concede edition" [elsnews01:v2], [mte:nlprs], added a new language, Croatian, and extended the common tables with further attributes and values for Romanian and Slovene. The tables for Slovene were also localised. The format of the specification was improved by converting the master LaTeX document to Latin-2 encoding. Finally, the common tables were additionally made available in XML, as TEI feature libraries <fLib>, one for each PoS.

In Version 3 [mte:slav] some minor errors found in the Concede edition were fixed; this version also added two more specifications: for Serbian, and for a dialect of Slovene, Resian.

In Version 4, the "MONDILEX edition" [V4], the specifications have been re-cast in XML, in TEI P5. This offers many possibilities for improvement, such as automatically producing derived tables. Six new languages were also added: Russian, Macedonian and Persian, and, thanks to the support of the EU MONDILEX project, Slovak, Polish and Ukrainian.

In the present Version 5, currently in DRAFT status: a new category "Z" has been added for punctuation; new values of the Residual Type have been added for CMC texts; two new languages, Bosnian and Chechen, have been added; and the Croatian and Slovene specifications have been changed.

Pisa, October 1995

Ljubljana, December 1997

Ljubljana, March 2001

Ljubljana, May 2004

Ljubljana, May 2010

Ljubljana, June 2016

Date: 2016-06-20
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.