
Multext-East work and approach

The Multext-East partners analysed and evaluated the set of Multext specifications with respect to their languages; where the scheme proved insufficient, the Eagles inventory was consulted. Where necessary, new distinctions for language-specific information were proposed.

The work proceeded cyclically: first, the partners circulated the results of their evaluation work as applications, which were in turn evaluated by Pisa and later by Ljubljana. These applications formed the basis for the proposed set of Central & Eastern specifications.

The nucleus of common features, already isolated within Multext, proved applicable to the Central & Eastern languages, but selecting the distinctions needed to encode the information peculiar to the Multext-East languages, and harmonising the proposals, required much further work. Some of the features proposed by the partners in their first applications were dropped because they pertained to a level that is not purely morphosyntactic (e.g. Transitivity); other distinctions were left out of the set because they were considered too fine-grained.

The resulting set was circulated again among the partners for new cycles of revision and re-application, until the specifications were considered acceptable for all the language groups and stable enough for the first version (IM1) of this deliverable.

However, there were still significant problems with the IM1 version of the deliverable, in particular:

different attributes or values were used to describe the same phenomena in different categories or languages (e.g. 1st and 1 for first person);
the three Slavic languages in particular at times described the same phenomena in different ways;
language-independent aspects of the specifications (e.g. the 'form' of numerals: digit, roman, etc.) were treated differently for different languages;
the ordering of attributes in some categories was sub-optimal, often necessitating long morphosyntactic descriptions;
the attributes and values used different 'punctuation' (e.g. full art, Modific.Type, Pron-Form, SubType);
the common tables were formatted differently for different categories.

Therefore it was decided to produce, for milestone M, version 2 of the deliverable. This harmonisation led to better motivated and, on average, more compact morphosyntactic descriptions for the Multext-East languages. The formalisation of the tables and descriptions had an added benefit: a simple Perl program (mtems-expand) was written which, working directly with the common tables of morphosyntactic descriptions in this report, can either expand or validate lexical morphosyntactic descriptions. This program was used to validate the word-form lexica of the project, thus ensuring that all the morphosyntactic descriptions in the lexica of the particular languages are well-formed. Another Perl program, mtems-split, was also written; again working on the common tables, it produces language-specific tables. These tables were circulated to the partners, thus ensuring that the language-specific sections do in fact reflect the common tables.
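The idea of expanding or validating a description against the common tables can be illustrated with a minimal sketch. It is written in Python rather than Perl and is not the mtems-expand program itself. It assumes a positional encoding in which the first character names the category and each following character is a value code for the attribute at that position, with a hyphen marking an unspecified attribute. The table contents, the function names expand_msd and is_well_formed, and the codes used are all hypothetical.

```python
# Toy "common table" for a single category. The attribute names, value codes
# and positional layout are illustrative assumptions, not the project's tables.
COMMON_TABLES = {
    "N": [  # Noun
        ("Type", {"c": "common", "p": "proper"}),
        ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
        ("Number", {"s": "singular", "p": "plural"}),
    ],
}

NOT_APPLICABLE = "-"  # assumed marker for an unspecified attribute


def expand_msd(msd, tables=COMMON_TABLES):
    """Expand a compact description into an attribute -> value mapping.

    Raises ValueError when the description does not conform to the tables,
    so the same function can also serve as a validator.
    """
    if not msd:
        raise ValueError("empty description")
    category, codes = msd[0], msd[1:]
    if category not in tables:
        raise ValueError(f"unknown category {category!r}")
    attributes = tables[category]
    if len(codes) > len(attributes):
        raise ValueError(f"too many positions in {msd!r}")
    expanded = {}
    for (name, values), code in zip(attributes, codes):
        if code == NOT_APPLICABLE:
            continue  # attribute left unspecified at this position
        if code not in values:
            raise ValueError(f"invalid code {code!r} for {name} in {msd!r}")
        expanded[name] = values[code]
    return expanded


def is_well_formed(msd, tables=COMMON_TABLES):
    """Return True if the description can be expanded without error."""
    try:
        expand_msd(msd, tables)
    except ValueError:
        return False
    return True
```

Under these assumptions, checking a word-form lexicon amounts to calling is_well_formed on the description of every entry and reporting those that fail. Driving both expansion and validation from the same tables is what keeps the lexica and the report consistent with each other.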

In the final stage of the project, the report was again revised. Tagsets were developed for some of the languages. In the tagging of the Multext-East corpus, it was found that the initial lexica still contained errors or non-optimal choices; this led to a revised set of lexica, and this in turn to new language applications of the morphosyntactic tables. Finally, the English morphosyntactic specifications were added to the tables. This Final report incorporates these modifications and additions.

