The work carried out in this task aims to formulate harmonized
specifications
and to propose
a notation for the lexica and the tagsets
to be contributed by each language group
involved in the MULTEXT Project.
MULTEXT's general aim is to develop tools for
corpus annotation which contribute to the standardization of this kind
of work in both
academic and industrial environments. These tools will
be provided with resources from six different languages to ensure their
validity. Resources used
to feed the tools are, among others, lexical lists
for the six
languages, containing the necessary information to run the tools.
Tools that will use lexica are mainly those which perform
morphological analysis and generation, and lexical lookup tools. MULTEXT
proposes to deliver a morphological tool together with
basic
morphological rules and a number of base form entries, duly
coded with respect to the rules. The morphological tool is intended
to expand
these base forms into word-form lists, with corresponding
morphosyntactic information. These word-forms will,
in turn, be used for the tagger,
provided that a correspondence between the morphosyntactic
information and the tags to be used by the tagger is defined. The
morphological tool must guarantee the extensibility of the MULTEXT
tools, as it is intended to be used by end-users to enlarge the lexical
material handled by the tools. It is also expected that
morphological analysis will be able
to ``guess" at least the
category of
unknown words and, where possible, their morphosyntactic features.
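The pipeline described above (base forms coded against rules, expansion into word-forms with morphosyntactic information, a mapping to tagger tags, and a guess for unknown words) can be pictured with the following minimal sketch. It is purely illustrative: the rule codes, feature labels, tag names and function names are hypothetical placeholders and are not taken from the MULTEXT specifications or tools.

```python
# Illustrative sketch only. All rule codes, feature labels and tags below are
# hypothetical placeholders, not the MULTEXT notation.

# Hypothetical inflection rules: paradigm code -> list of (suffix, features).
RULES = {
    "N1": [("", ("Noun", "singular")), ("s", ("Noun", "plural"))],
    "V1": [("", ("Verb", "base")), ("s", ("Verb", "3sg")), ("ed", ("Verb", "past"))],
}

# Hypothetical base-form lexicon: each base form is coded with a paradigm.
BASE_FORMS = {"cat": "N1", "walk": "V1"}

# Hypothetical correspondence between morphosyntactic information and tags.
TAG_MAP = {
    ("Noun", "singular"): "NN",
    ("Noun", "plural"): "NNS",
    ("Verb", "base"): "VB",
    ("Verb", "3sg"): "VBZ",
    ("Verb", "past"): "VBD",
}


def expand(base_forms, rules):
    """Expand base forms into (word_form, base_form, features) entries."""
    entries = []
    for base, paradigm in base_forms.items():
        for suffix, features in rules[paradigm]:
            entries.append((base + suffix, base, features))
    return entries


def tag_for(features, tag_map):
    """Look up the tagger tag that corresponds to a feature bundle."""
    return tag_map[features]


def guess_category(word, rules):
    """Guess candidate categories of an unknown word from its suffix."""
    categories = set()
    for paradigm_rules in rules.values():
        for suffix, features in paradigm_rules:
            # Empty suffixes match everything, so they carry no evidence.
            if suffix and word.endswith(suffix):
                categories.add(features[0])
    return sorted(categories)


WORD_FORMS = expand(BASE_FORMS, RULES)
```

In this sketch an ambiguous ending simply yields several candidate categories; a real guesser would presumably rank or filter them, which is outside the scope of this illustration.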
Within MULTEXT, therefore,
``lexical list" refers to a list of forms
with related information. It covers both base-form lexica, coded in
such a way as to feed
the morphological tool, and word-form lexica, containing
the information relevant for corpus annotation purposes.
At the first workpackage coordinators' meeting held
in Paris, and as also reported in D1.6.1. (September 1994), it was
agreed that in
view of the urgent need for lexical lists for the creation of
the tools, lexical lists of word-forms
in a particular format could already be supplied
in the first phase,
leaving for the second phase the development of
base-form
morphological lexica, the input for the morphological tool. These word-form
lexical
lists were generated from the resources
already available at
the different sites. Further work will be done to ensure
complete mappability between the results of the morphological tool
and the formalism proposed for lexical lists.
The present report
is mainly devoted to the definition of the information associated with
the word-form lists, from now on referred to as
``lexical descriptions".
We provide here
the notation to be used in the lists corresponding to each
language to describe a given word-form. Considerable effort has been devoted
to ensuring compatibility between the three different types of
information to be associated with a given word: morphological
information, morphosyntactic lexical description and TAG label.
The present report is divided into four sections: