
MULTEXT-East work and approach

The MULTEXT-East partners analysed and evaluated the set of MULTEXT specifications with respect to their languages; where the scheme proved insufficient, the EAGLES inventory was consulted. Where necessary, new distinctions for language-specific information were proposed.

The work proceeded cyclically: first, the partners circulated the results of their evaluation work as applications, which were in turn evaluated by Pisa, the coordinator. These applications formed the basis for the proposal of a set of Central & Eastern specifications, which was designed bottom-up.

The nucleus of common features, already isolated within MULTEXT, proved to be readily applicable to the Central & Eastern languages; formulating the distinctions needed to encode the morphosyntactic information peculiar to them, however, required much harmonisation work. Some of the features proposed by the partners in their first applications were dropped because they pertained to a level that is not purely morphosyntactic (e.g. Transitivity); other distinctions were left out of the set because they were considered too fine.

The resulting set was again circulated among the partners for new cycles of revision and re-application, until the specifications were considered acceptable for all the language groups and stable enough for the first version (IM1) of this deliverable.

However, there were still significant problems with this first version of the deliverable, in particular:

the three Slavic languages sometimes described the same phenomena in different ways;
language-independent aspects of the specifications (e.g. the 'form' of numerals: digit, roman, etc.) were treated differently for different languages;
different attributes or values were used to describe the same phenomena in different categories or languages (e.g. 1st and 1 for first person);
the ordering of attributes in some categories was sub-optimal, necessitating long morphosyntactic descriptions, i.e. descriptions with long strings of '-';
the attributes and values used different 'punctuation' (e.g. full art, Modific.Type, Pron-Form, SubType);
the common tables were formatted differently for different categories.
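
The attribute-ordering problem in the list above can be illustrated with a small sketch. It assumes a positional encoding in which each attribute occupies a fixed slot after the category letter, an unspecified attribute is written as '-', and trailing '-' characters may be dropped. The attribute names, value codes and the encode helper are hypothetical and are not taken from the specifications.

```python
def encode(category, order, features):
    """Build a positional description: the category letter followed by
    one character per attribute in `order` ('-' when unspecified).
    Trailing '-' characters are dropped (an assumption of this sketch)."""
    slots = [features.get(attribute, "-") for attribute in order]
    return (category + "".join(slots)).rstrip("-")


# Hypothetical features of one word form: only two attributes are specified.
features = {"type": "c", "case": "n"}

# Same attributes, two orderings (both invented for illustration).
poor_order = ["type", "definiteness", "animacy", "clitic", "case"]
better_order = ["type", "case", "definiteness", "animacy", "clitic"]

poor = encode("N", poor_order, features)      # "Nc---n"
better = encode("N", better_order, features)  # "Ncn"
print(poor, better)
```

Placing rarely specified attributes last lets them fall away as trailing '-', which is one way a reordering can shorten descriptions without changing the information they carry.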

Therefore it was decided to produce, for milestone M, version 2 of the deliverable; this effort was led by the Ljubljana site. This harmonisation led to better motivated and, on average, more compact morphosyntactic descriptions for the MULTEXT-East languages, and the formalisation of the tables and descriptions had an added benefit: a simple Perl program (mtems-expand) was written which, working directly with the common tables of morphosyntactic descriptions in this report, can either expand or validate lexical morphosyntactic descriptions. This program was used to validate the word-form lexica of the project, thus ensuring that all the morphosyntactic descriptions in the lexica of the particular languages are well-formed. Another Perl program, mtems-split, was also written; again working on the common tables, it produces language-specific tables. These tables were circulated to the partners, thus ensuring that the language-specific sections do in fact reflect the common tables.
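
What expansion, validation and splitting against a common table might look like can be sketched as follows. This is an illustrative Python sketch only, not the mtems-expand or mtems-split programs, whose table format and interface are not described here. The table layout, the attribute names and value codes, the language labels "A" and "B", and the function names expand, is_valid and split are all hypothetical.

```python
# Hypothetical common table, invented for illustration only.
# category code -> (category name, ordered attribute slots)
# each slot: (attribute name, {value code: value name}, languages using it)
COMMON_TABLE = {
    "N": ("Noun", [
        ("Type", {"c": "common", "p": "proper"}, {"A", "B"}),
        ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}, {"A", "B"}),
        ("Number", {"s": "singular", "p": "plural"}, {"A", "B"}),
        ("Animate", {"n": "no", "y": "yes"}, {"A"}),
    ]),
}


def expand(msd):
    """Return the description as (attribute, value) pairs.
    Raises ValueError if it is not well-formed against COMMON_TABLE."""
    if not msd or msd[0] not in COMMON_TABLE:
        raise ValueError(f"unknown category in {msd!r}")
    name, slots = COMMON_TABLE[msd[0]]
    codes = msd[1:]
    if len(codes) > len(slots):
        raise ValueError(f"too many positions in {msd!r}")
    pairs = [("Category", name)]
    for code, (attribute, values, _languages) in zip(codes, slots):
        if code == "-":  # attribute left unspecified
            continue
        if code not in values:
            raise ValueError(f"bad value {code!r} for {attribute} in {msd!r}")
        pairs.append((attribute, values[code]))
    return pairs


def is_valid(msd):
    """Validation is expansion that only reports success or failure."""
    try:
        expand(msd)
    except ValueError:
        return False
    return True


def split(language):
    """Derive a language-specific table: keep only the attributes
    that the common table marks as used by `language`."""
    return {
        category: (name, [(attribute, values)
                          for attribute, values, languages in slots
                          if language in languages])
        for category, (name, slots) in COMMON_TABLE.items()
    }


print(expand("Nc-p"))
print(is_valid("Nxms"))
print([attribute for attribute, _ in split("B")["N"][1]])
```

Deriving both the checks and the per-language tables from one table means there is a single place to correct when the specifications change, which is the benefit the paragraph above attributes to formalising the common tables.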



Tomaz Erjavec
Wed Oct 16 12:08:36 MDT 1996