Next: Lexical descriptions and corpus Up: Multext D1.6.1 B Previous: Introduction

Background Considerations

The classification of lexical items relies on the old tradition of Greek and Latin grammar. Drawing what is normally referred to as the ``Parts-of-Speech" distinction between words is well known to be a crucial task, but the traditional distinctions are neither accurate nor universal. Lyons (1981, p.109), for instance, warns the reader about this: ``It is important to realize, however, that the traditional list of ten or so parts of speech is very heterogeneous in composition and reflects, in many of the details of the definitions that accompany it, specific features of the grammatical structure of Greek and Latin that are far from being universal. Furthermore, the definitions themselves are often logically defective. Some of them are circular; and most of them combine inflectional, syntactic and semantic criteria which yield conflicting results when they are applied to a wide range of particular instances in several languages. ... Like most of the definitions in traditional grammar, they rely heavily upon the good sense and tolerance of those who apply and interpret them."

These difficulties in defining word classes have been the concern of many linguists and greatly affect computational applications, as one cannot expect from machines the sort of ``good sense and tolerance" asked for in applying current classifications. On the other hand, the tools MULTEXT is going to develop will be used by humans sharing similar linguistic backgrounds. It is, therefore, imperative that MULTEXT make these tools user-friendly. It should not be forgotten that the output of corpus annotation, its main goal, as well as the internal codification used for this purpose, should be easily understandable by the expected end-users of its products. MULTEXT tools will be associated with data for demonstration and validation purposes but, since they are public domain tools, one should expect that end-users, being free to experiment with them, will incorporate their own classes and distinctions. It must be ensured that users can take the supplied data as guidelines showing the functionalities and behaviour of the tools (the MULTEXT-EAST project evidences the importance of this consideration).

With this aim, MULTEXT proposes to address classification problems by joining forces with the EAGLES initiative (MULTEXT T.A. 1993, p.10), which tackles them by highlighting ``the area of common ground and some aspects of discrepancy between the different systems for classifying morphological units, in order to provide, after testing with respect to all EC languages, the possibility of elaborating common consensual guidelines for morphosyntactic encoding in lexica and corpora" (``Synopsis and Comparison of Morphosyntactic Phenomena encoded in Lexicons and in Corpora. A Common Proposal and Applications to European Languages", Monachini and Calzolari, Oct. 1994, p.12).

In EAGLES, a bottom-up procedure has been followed, looking at existing practices in a large number of lexical and textual projects world-wide (both in lexical specifications and in corpus tagsets). This made it possible to highlight the large core of commonalities between large lexical and textual projects with respect to the morphosyntactic phenomena described. The procedure adopted within EAGLES was, in fact:

to survey a number of encoding practices for morphosyntactic description in lexica (mainly MULTILEX and GENELEX, which in turn, are based on many different lexica for many European languages), and in corpora (i.e. the NERC consensual nucleus of morphosyntactic information encoded by the most well-known existing tagging practices and the preliminary scheme proposed by the EAGLES Corpus working groups) with the aim of finding a consensus from their comparison;
to work in close cooperation between the groups on linguistic annotation of text corpora and on morphosyntactic description in computational lexica, with the aim of working out a compatible set of distinctions;
to first test the proposal by applying it to the European languages.

Thus, the EAGLES proposal has been the starting point of Task 1.6 within MULTEXT. The proposal is set out in the already mentioned EAGLES report ``Synopsis and Comparison of Morphosyntactic Phenomena encoded in Lexicons and in Corpora" and in ``Morphosyntactic Annotation" (Leech and Wilson, 1994); it is also at the basis of, or is mappable to, the lexical and corpus specifications of the LRE projects DELIS, RENOS, CRATER and MECOLB, the MLAP project PAROLE, and the French project GRACE.
The partners have been asked to:

evaluate whether the features and values presented in the tables for each PoS at Level 1, i.e. the recommended features, suit their respective languages and their established practice (an example of the PoS tables used within EAGLES is given below);

[IMAGE ]
add features and values needed at the language specific level.
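
The two tasks given to the partners can be pictured with a minimal sketch. It assumes that a PoS table is a mapping from attributes to sets of permitted values, with a Level 1 (recommended) layer and an optional language-specific layer. The class `PosTable`, its method names, and the Noun attributes and values shown are hypothetical illustrations introduced here; they are not the actual EAGLES table, which is not reproduced in this text.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class PosTable:
    """Hypothetical attribute-value table for one part of speech."""
    pos: str
    recommended: Dict[str, Set[str]]  # Level 1: recommended features
    language_specific: Dict[str, Set[str]] = field(default_factory=dict)

    def allowed(self) -> Dict[str, Set[str]]:
        """Merge the recommended and language-specific layers."""
        merged = {attr: set(vals) for attr, vals in self.recommended.items()}
        for attr, vals in self.language_specific.items():
            merged.setdefault(attr, set()).update(vals)
        return merged

    def unsupported(self, description: Dict[str, str]) -> Dict[str, str]:
        """Return the attribute-value pairs the table cannot express."""
        allowed = self.allowed()
        return {attr: val for attr, val in description.items()
                if val not in allowed.get(attr, set())}


# Illustrative values only (an assumption, not the EAGLES Noun table).
noun = PosTable(
    pos="Noun",
    recommended={
        "type": {"common", "proper"},
        "gender": {"masculine", "feminine", "neuter"},
        "number": {"singular", "plural"},
    },
)

# Task 1: check whether the recommended features suit a language's practice.
entry = {"type": "common", "gender": "feminine", "number": "dual"}
gaps = noun.unsupported(entry)

# Task 2: add the missing value at the language-specific level.
for attr, val in gaps.items():
    noun.language_specific.setdefault(attr, set()).add(val)
remaining = noun.unsupported(entry)
```

Keeping the two layers apart means that the shared recommended core stays untouched while each language records its own additions, which is one plausible way of reading the division of labour described above.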

Reports on the evaluation of the EAGLES specifications have been contributed by the partners involved in this MULTEXT task, and their comments, suggestions and critical remarks are being taken into account in the EAGLES proposal, which is being revised accordingly. MULTEXT Task 1.6 can be seen as the largest contribution, together with DELIS, to the testing, refining and revising of the EAGLES proposal. An example of this interaction was a major revision of the EAGLES proposal: the Pronoun/Determiner category originally proposed is now split into two different categories.

Experience shows that consensus building is a slow process, because of the different interests that have to be reconciled. Considerations of ``re-usability" of existing material, as well as theoretical and application-oriented arguments, have been raised in discussions under this task and should also be taken into account when evaluating its progress and results. The ideas guiding the final decisions have been described above and will be examined in detail in the following subsections. They can be summarized as MULTEXT's strong commitment to the standardization and harmonization of the lexical encoding initiatives now active in Europe, with the aim of sharing public domain resources.


Multext