The classification of lexical items relies on the old tradition of Greek and Latin grammar. Distinguishing the so-called ``Parts-of-Speech" of different words is well known to be a crucial task, but the traditional distinction is neither accurate nor universal. Lyons (1981, p.109), for instance, warns the reader about this: ``It is important to realize, however, that the traditional list of ten or so parts of speech is very heterogeneous in composition and reflects, in many of the details of the definitions that accompany it, specific features of the grammatical structure of Greek and Latin that are far from being universal. Furthermore, the definitions themselves are often logically defective. Some of them are circular; and most of them combine inflectional, syntactic and semantic criteria which yield conflicting results when they are applied to a wide range of particular instances in several languages. ... Like most of the definitions in traditional grammar, they rely heavily upon the good sense and tolerance of those who apply and interpret them."
These difficulties in defining word classes have been the concern
of many linguists and greatly affect computational applications,
as one cannot expect from machines the sort of ``good sense and
tolerance" asked for in
applying current classifications. On the other
hand, the tools MULTEXT is going to develop will be used by humans
sharing similar linguistic backgrounds.
It is, therefore, imperative that MULTEXT make these tools user-friendly.
It should not be forgotten that
the output of corpus annotation, MULTEXT's main goal, as
well as the internal codification used for this purpose, should be
easily understandable by the expected end-users of its products.
MULTEXT tools will be accompanied by
data for demonstration and validation purposes. Since the tools are
in the public domain, however, one should expect that end-users,
being free to experiment with them, will incorporate
their own classes and
distinctions. It must therefore be ensured that users can
take the supplied data as
guidelines illustrating the functionalities and behaviour of the tools
(the MULTEXT-EAST project shows the importance of this
consideration).
With this aim, MULTEXT proposes to address classification problems
by joining forces with the EAGLES initiative
(MULTEXT T.A. 1993, p.10), which approaches them by
highlighting ``the area of common ground and some aspects of
discrepancy between the different systems for classifying
morphological units, in order to provide, after testing with respect
to all EC languages, the possibility of elaborating common consensual
guidelines for morphosyntactic encoding in lexica and corpora"
(``Synopsis and Comparison of Morphosyntactic Phenomena encoded in
Lexicons and in Corpora. A Common Proposal and Applications to
European Languages",
Monachini and Calzolari, Oct. 1994, p.12).
In EAGLES, a bottom-up procedure has been followed, looking at existing practices in a large number of lexical and textual projects world-wide (both in lexical specifications and in corpus tagsets). This has made it possible to highlight the large core of commonalities between large lexical and textual projects with respect to the morphosyntactic phenomena described. The procedure adopted within EAGLES was, in fact:
Thus, the EAGLES proposal (in the already mentioned
EAGLES reports ``Synopsis and Comparison
of Morphosyntactic Phenomena encoded in Lexicons and in Corpora" and
the ``Morphosyntactic Annotation",
Leech and Wilson, 1994) - which is also at the
basis of, or is mappable to,
the lexical and corpus specifications of the LRE projects
DELIS, RENOS, CRATER and MECOLB, MLAP project PAROLE, and the French
project GRACE -
has been the starting point of Task 1.6 within
MULTEXT.
The partners have been asked to:
Reports on the evaluation of the EAGLES specifications have been
contributed by the partners involved in this MULTEXT task, and
comments, suggestions and critical remarks are being taken into
account in the EAGLES proposal
which is being revised accordingly. MULTEXT Task 1.6 can be
seen as the largest contribution, together with DELIS, to the testing,
refining and revising of the EAGLES proposal. An example of this
interaction is the major revision of the EAGLES proposal concerning
the originally proposed Pronoun/Determiner category, which is now split
into two different categories.
Experience shows that consensus building is a slow process, because of the different interests to be reconciled. Considerations of ``re-usability" of existing material, as well as theoretical and application-oriented arguments, have been raised in discussions under this task and should also be taken into account when evaluating its progress and results. The leading ideas behind the final decisions have been described above and will be examined in detail in the following subsections. They can be summarized as MULTEXT's strong commitment to the standardization and harmonization of the lexical encoding initiatives now active in Europe, with the aim of sharing public domain resources.