Multext-East / Concede Morphosyntactic specifications V.2-- Introduction
The proposal has been prepared in the Multext table format, which
displays the specifications (as sets of attribute-values, see below
for further details about the notation), with their respective codes
used to mark them in the lexicons. Two types of features are
distinguished:
- (i) the minimal core features
These are shared by most of the languages and appear at the top of
the list of features; they are delineated by a row of asterisks
(*). We tried to keep this set common to all the Multext and
Multext-East languages. In this way, the lexical information encoded
for the Central and Eastern European languages remains, to a certain
extent, comparable with that of the six original Multext languages.
A side effect is that certain values are present in the tables
although none of the Multext-East languages makes use of them.
- (ii) the Multext-East language-specific features
These are shared by the Multext-East languages, but differ from the
Multext ones. The formulation of this set has been, as already
mentioned, highly delicate, partly because many language-specific
values were introduced in the applications and partly because the
same (or 'similar enough') morphosyntactic phenomena were sometimes
referred to by two different attribute or value names. Recognising
and harmonising the semantics of some attributes, values and naming
conventions has therefore required much effort.
This representation, with the concrete applications which display and
exemplify the attributes and values and provide their internal
constraints and relationships, makes the proposal self-explanatory.
Other groups can easily test the specifications on their language,
simply by following the method of the applications. The possibility
of incorporating idiosyncratic classes and distinctions after the
common core features makes the proposal relatively adaptable and
flexible, without compromising compatibility.
The specifications presented here have been used in the encoding of
word-form lexica of the languages. These lexica contain one entry per
line, where an entry has the following form:
word-form TAB lemma TAB morphosyntactic description
This report describes the morphosyntactic descriptions used in the
lexica and also in the annotated multilingual corpus.
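The entry format above can be read with a few lines of code. The following is a minimal sketch, not part of the Multext-East tools: the names LexiconEntry and parse_entry are hypothetical, and the sample line uses placeholder text for the word-form and lemma rather than a real lexicon entry.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    """One lexicon line: word-form, lemma, morphosyntactic description."""
    word_form: str
    lemma: str
    msd: str


def parse_entry(line: str) -> LexiconEntry:
    """Split one lexicon line into its three TAB-separated fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 TAB-separated fields, got {len(fields)}")
    return LexiconEntry(*fields)


# Placeholder line; "Ncms" is the example description used in this report.
entry = parse_entry("wordform\tlemma\tNcms\n")
print(entry.msd)
```

Rejecting lines that do not have exactly three fields is a design choice of this sketch; a real lexicon reader might prefer to log and skip such lines.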
In Multext the notation has been chosen following current practices
for NLP, where information is represented in attribute-value
formalisms and following the idea that it should also be
self-informative for human understanding. At the same time, a
relatively compact encoding was maintained. The notation format has
the following main characteristics:
- attributes are marked by positions;
- values are represented by a single character;
- a special marker reflects the non-applicability of a given
attribute.
These characteristics make the proposed lexical notation similar to
the attribute-value pairs used in unification-based formalisms (see
the Multext D1-6-1B Deliverable for further details).
The linear strings of characters representing the morphosyntactic
descriptions follow the philosophy of the Intermediate Format of the
Eagles Corpus proposal (Leech and Wilson, 1994), i.e. agreed symbols
appear in predefined and fixed positions. The positions of a string
of characters are numbered 0, 1, 2, etc., and are used in the
following way:
- a. the agreed character at position 0 encodes part-of-speech;
- b. each character at positions 1, 2, ..., n encodes the value of
one attribute (person, gender, number, etc.);
- c. if an attribute does not apply, the corresponding position
in the string contains a special marker, the hyphen ('-').
Example: Ncms- (Noun, common, masculine, singular, nocase)
This notation adopts the Eagles Intermediate Format with a small
revision: the Intermediate Format encodes information by means of
digits, while in Multext characters of a mnemonic nature are
preferred.
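The positional scheme can be illustrated by decoding the example string. This is only a sketch under stated assumptions: the table below is hypothetical and contains just the attributes and values that appear in the example Ncms- above, the full tables being those of the common specifications; the names NOUN_ATTRIBUTES, CATEGORIES and decode_msd are introduced here for illustration.

```python
# Illustrative table only: it covers just the values seen in the
# example "Ncms-", not the full noun table of the specifications.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common"}),
    ("Gender", {"m": "masculine"}),
    ("Number", {"s": "singular"}),
    ("Case", {}),
]
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}

NOT_APPLICABLE = "-"


def decode_msd(msd: str) -> dict:
    """Map each position of a description to its attribute and value."""
    # Position 0 encodes the part-of-speech.
    category, attributes = CATEGORIES[msd[0]]
    decoded = {"Category": category}
    # Positions 1..n each encode the value of one attribute.
    for position, (name, values) in enumerate(attributes, start=1):
        # A position missing from the string is treated like a hyphen.
        code = msd[position] if position < len(msd) else NOT_APPLICABLE
        decoded[name] = None if code == NOT_APPLICABLE else values[code]
    return decoded


print(decode_msd("Ncms-"))
```

In this sketch a non-applicable attribute is represented by None, so the decoded form keeps one key per position even when no value is present.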
The marker '-' has a special meaning: 'not applicable'. As stated
above, its function is to preserve the correspondence between string
positions and attributes when an attribute has no value. It is used
in the following cases:
- (a) the attribute is not relevant to a particular language, e.g.
Gender in Estonian;
- (b) the attribute is not applicable to a particular combination of
attribute-values, i.e. although the attribute is used by a category
in a given language, it does not apply to a particular subclass of
the category; e.g. Person applies to Pronouns, but not to the Type
demonstrative;
- (c) the attribute is not applicable to a particular lexical item,
although it applies to the rest of its paradigm; e.g. Gender in the
paradigm of English Personal Pronouns applies only to the 3rd person
(I, you vs. she, he).
Finally, it should be noted that in the lexica trailing hyphens have
been omitted, as this often leads to a more compact encoding. Hence
codes like Ncms- are written as Ncms.
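The omission of trailing hyphens can be sketched as a pair of small helpers. The function names compact and expand are hypothetical, and the length passed to expand is an assumption: the number of positions for a category has to come from the tables for that category.

```python
def compact(msd: str) -> str:
    """Drop trailing hyphens; hyphens inside the string are kept."""
    return msd.rstrip("-")


def expand(msd: str, length: int) -> str:
    """Pad a compact description with hyphens to a fixed length."""
    return msd.ljust(length, "-")


print(compact("Ncms-"))    # trailing hyphen removed
print(expand("Ncms", 5))   # padded back to five positions
```

Only trailing hyphens are removed, because an internal hyphen still marks the position of a non-applicable attribute and removing it would shift the values that follow.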
The applications of the proposal to the eight individual languages
are given in Chapter II. The language application parts present the
lexicon specifications category by category and are structured as
follows:
- (a) a chapter describing the features and values pertinent to a
given category in the form of tables with examples from the language
in question; these tables have been mostly automatically generated
from the common tables. The language-specific tables, however, often
contain significant additional information, e.g. examples or
localisation of names.
- (b) a chapter providing the allowed combinations of values for
the particular language, which is, in some cases, supplemented by
(c) a list of examples from the lexicon or corpus.
Two different strategies of displaying combinations have been
followed:
- (i) all the admitted combinations are provided together with an
example. This has the disadvantage of producing long lists, given the
high number of features and values to combine, but the advantage of
listing only the legal combinations, and so of making explicit the
constraints on the use of some features or values in the presence of
other features, values or combinations of them (see e.g. Gender in
Personal Pronouns).
- (ii) a mathematical expression (essentially a regular expression)
describing the combinations is provided, which can subsequently be
used to generate all combinations, possibly including some which are
not valid.
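The second strategy can be pictured as expanding a compact description of the allowed codes at each position. The sketch below does not reproduce the actual expression syntax of the language chapters, which is not given here; it assumes a simplified, hypothetical form in which each position lists its candidate codes, and the codes themselves ("c", "m", "f", "s", "p") serve only as placeholders. The function name expand_combinations is likewise hypothetical.

```python
from itertools import product


def expand_combinations(category: str, positions: list[str]) -> list[str]:
    """Generate every string that takes one code from each position."""
    return [category + "".join(codes) for codes in product(*positions)]


# One candidate code at position 1, two at position 2, two at position 3.
candidates = expand_combinations("N", ["c", "mf", "sp"])
print(candidates)
```

Because every code is combined with every other, the result may contain strings that the language does not actually allow, which is the over-generation noted for this strategy; the exhaustive lists of the first strategy avoid it at the cost of length.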
Tomaz Erjavec
4/1/2001