
Subsections

Background

Multext-East / Concede Morphosyntactic specifications V.2-- Introduction

Description of the proposal

The proposal has been prepared in the Multext table format, which displays the specifications (as sets of attribute-values, see below for further details about the notation), with their respective codes used to mark them in the lexicons. Two types of features are distinguished:

(i): the minimal core features
These are shared by most of the languages and appear at the top of the list of features; they are delineated by a row of asterisks (*). We tried to keep this set common to all the Multext and Multext-East languages. In this way, the information encoded in the lexical lists of the Central and Eastern European languages and of the six original Multext languages remains, to a certain extent, comparable. A side-effect is that certain values are present in the tables although none of the Multext-East languages makes use of them.
(ii): the Multext-East language-specific features
These are shared by the Multext-East languages, but differ from the Multext ones. The formulation of this set has been, as already mentioned, highly delicate, due also to the fact that many language-specific values were presented in the applications and sometimes the same (or 'similar enough') morphosyntactic phenomena were referred to with two different attribute or value names. The phase of recognition and harmonisation of the semantics of some attributes, values and naming conventions has, hence, required much effort.

This representation, with the concrete applications which display and exemplify the attributes and values and provide their internal constraints and relationships, makes the proposal self-explanatory. Other groups can easily test the specifications on their language, simply by following the method of the applications. The possibility of incorporating idiosyncratic classes and distinctions after the common core features makes the proposal relatively adaptable and flexible, without compromising compatibility.

Lexical lists

The specifications presented here have been used in the encoding of word-form lexica of the languages. These lexica contain one entry per line, where an entry has the following form:

word-form <TAB> lemma <TAB> morphosyntactic description

This report describes the morphosyntactic descriptions used in the lexica and also in the annotated multilingual corpus.
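As an illustration of the entry format, the following is a minimal sketch of reading one lexicon line in Python. The names `LexiconEntry` and `parse_entry` and the sample line are invented for this example and are not part of the specifications; the only assumption taken from the text is that an entry consists of exactly three TAB-separated fields.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    """One line of a word-form lexicon (hypothetical helper type)."""
    word_form: str
    lemma: str
    msd: str  # morphosyntactic description, e.g. "Ncms"


def parse_entry(line: str) -> LexiconEntry:
    """Split one lexicon line into word-form, lemma and MSD."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(
            f"expected 3 TAB-separated fields, got {len(fields)}: {line!r}"
        )
    return LexiconEntry(*fields)


# Invented sample line, for illustration only.
entry = parse_entry("houses\thouse\tNcnp\n")
print(entry.word_form, entry.lemma, entry.msd)
```

Rejecting lines that do not have exactly three fields is a design choice of this sketch; a real lexicon reader might prefer to log and skip such lines.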

Notation

In Multext the notation has been chosen following current practices for NLP, where information is represented in attribute-value formalisms, and following the idea that it should also be self-informative for human understanding. At the same time, a relatively compact encoding was maintained. The notation format has the following main characteristics:

attributes are marked by positions;
values are represented by a single character;
a special marker reflects the non applicability of a given attribute.

These characteristics make the proposed lexical notation similar to attribute-value pairs used in unification-based formalisms (see the Multext D1-6-1B Deliverable for further details).

The linear strings of characters representing the morphosyntactic descriptions are constructed following the philosophy of the Intermediate Format proposed in the Eagles Corpus proposal (Leech and Wilson, 1994), i.e. of having agreed symbols in predefined and fixed positions: the positions of a string of characters are numbered 0, 1, 2, etc. in the following way:

a.: the agreed character at position 0 encodes part-of-speech;
b.: each character at positions 1, 2, ..., n encodes the value of one attribute (person, gender, number, etc.);
c.: if an attribute does not apply, the corresponding position in the string contains a special marker, the hyphen ('-').

Example: Ncms- (Noun, common, masculine, singular, nocase)
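The positional scheme can be sketched in a few lines of Python. The attribute table below is a hypothetical, partial table written for this example only (the actual attributes, their order and their value codes are those given in the common tables), and `decode_msd` is an invented helper name.

```python
# Hypothetical, partial table for illustration only: position 1..n of a
# Noun description, each with a few single-character value codes.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive"}),
]

# Position 0 selects the category and thereby the attribute list.
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Map a positional code such as 'Ncms-' to attribute-value pairs."""
    category_name, attributes = CATEGORIES[msd[0]]
    decoded = {"Category": category_name}
    # Pair each attribute with the character at its position; the hyphen
    # marks an attribute that does not apply and is decoded as None.
    for (name, values), code in zip(attributes, msd[1:]):
        decoded[name] = None if code == "-" else values[code]
    return decoded


print(decode_msd("Ncms-"))
```

With this table, `Ncms-` decodes to a common, masculine, singular Noun whose Case is not applicable.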

This notation adopts the Eagles Intermediate Format with a small revision: the Intermediate Format encodes information by means of digits, while in Multext characters of a mnemonic nature are preferred.

The marker '-' has a special semantics and it means 'not-applicable'. As stated above, its function is to keep the relationship established between attributes and values. It is used in the following cases:

(a): not relevant to a particular language, e.g. Gender to Estonian.
(b): not applicable to a particular combination of attribute-values, i.e. although the attribute is used by a category in a given language, it does not apply to a particular subclass of the category; e.g., Person applies to Pronouns, but not to the Type demonstrative.
(c): not applicable to a particular lexical item, i.e. although the attribute applies to the rest of its paradigm; e.g., Gender in the paradigm of English Personal Pronouns applies only to the 3rd person (I, you vs. she, he).

Finally, it should be noted that in the lexica trailing hyphens have been omitted, as this often leads to a more compact encoding. Hence codes like Ncms- are written as Ncms.
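The omission of trailing hyphens, and its reversal, can be sketched as follows. The function names are invented for this example, and the number of attributes defined for a category (four in the usage below) is an assumption that would in practice be read from the tables for the language in question.

```python
def strip_trailing_hyphens(msd: str) -> str:
    """Compact form used in the lexica: drop hyphens at the end only."""
    return msd.rstrip("-")


def pad_msd(msd: str, n_attributes: int) -> str:
    """Restore the full-length form: 1 category position + n attributes."""
    return msd.ljust(1 + n_attributes, "-")


print(strip_trailing_hyphens("Ncms-"))   # compact form
print(pad_msd("Ncms", n_attributes=4))   # full form, assuming 4 attributes
```

Only trailing hyphens are removed; a hyphen inside the string must stay, because dropping it would shift the following values to the wrong positions.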

Organization of the language-specific chapters

The applications of the proposal to the eight individual languages are then given in Chapter II. The language application parts present the lexicon specifications category by category and are structured as follows:

(a): a chapter describing the features and values pertinent to a given category in the form of tables with examples from the language in question; these tables have been mostly automatically generated from the common tables. The language specific tables, however, do often have significant additional information, e.g. examples or localisation of names.
(b): a chapter providing the allowed combinations of values for the particular language, which is, in some cases, supplemented by (c) a list of examples from the lexicon or corpus.

Two different strategies of displaying combinations have been followed:

(i): all the admitted combinations are provided together with an example. This has the disadvantage of producing long lists, given the high number of features and values to combine, but the advantage of providing only the legal combinations, with the relevant constraints on the application of some features or values in the presence of other features or values or combinations of them (see e.g. Gender in Personal Pronouns).
(ii): a mathematical expression (essentially a regular expression) describing the combinations is provided, which can subsequently be used to generate all combinations, possibly including some which are not valid.
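Strategy (ii) can be illustrated with a small Python sketch. The expression used here, `N[cp][mfn][sp]`, and the helper name `expand` are invented for this example and are not taken from any of the language chapters; the sketch also assumes the simplest case, in which the expression is just one set of admissible characters per position.

```python
import itertools
import re


def expand(positions):
    """Generate every string that takes one character from each position."""
    return ["".join(chars) for chars in itertools.product(*positions)]


# Hypothetical expression N[cp][mfn][sp], written as one character set per
# position: category, then Type, Gender and Number.
positions = ["N", "cp", "mfn", "sp"]
codes = expand(positions)

# The same expression as a regular expression, used here as a cross-check.
pattern = re.compile(r"N[cp][mfn][sp]")

print(len(codes), codes[:3])
```

As the text notes, such a generated list over-approximates: it contains every combination the expression allows, whether or not the combination actually occurs in the language.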


Tomaz Erjavec
4/1/2001