The notation format proposed
to represent lexical descriptions consists of
linear strings of characters representing the morphosyntactic information to
be associated with word-forms. The string is constructed following the
philosophy of the Intermediate Format proposed in the EAGLES Corpus proposal
(Leech and Wilson, 1994), i.e. of having agreed symbols in predefined and
fixed positions: the positions of a string of characters are numbered 0,
1, 2, etc. in the following way:
a. the agreed character at position 0 encodes part-of-speech;
b. each character at positions 1, 2, ..., n encodes the value of one attribute
(person, gender, number, etc.);
c. if an attribute does not apply, the corresponding position in the string
contains a special marker, in our case `-' (hyphen).
Example: Ncms- (noun,common,masculine,singular,nocase)
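The positional scheme above can be sketched in a few lines of Python. This is only an illustration: the names `decode_noun` and `NOUN_ATTRIBUTES` are assumptions made for the example, and the codes are transcribed from the noun attribute/value table of this document.

```python
# Illustrative sketch only: decode_noun and NOUN_ATTRIBUTES are hypothetical
# names; the codes are copied from the "Nouns (N)" table of this document.
NOT_APPLICABLE = "-"

# One entry per string position 1..4, in positional order.
NOUN_ATTRIBUTES = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative", "a": "accusative"}),
]


def decode_noun(description):
    """Turn a noun description such as 'Ncms-' into attribute-value pairs."""
    if len(description) != 1 + len(NOUN_ATTRIBUTES) or description[0] != "N":
        raise ValueError("not a noun description: %r" % description)
    features = {"cat": "noun"}
    for code, (attribute, values) in zip(description[1:], NOUN_ATTRIBUTES):
        if code != NOT_APPLICABLE:  # the hyphen marks a non-applicable attribute
            features[attribute] = values[code]
    return features


print(decode_noun("Ncms-"))
```

For ``Ncms-" this yields the pairs cat=noun, type=common, gender=masculine, number=singular, with no entry for case.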
This notation adopts the EAGLES Intermediate Format with a small
revision: the Intermediate Format encodes information by
means of digits, while in MULTEXT characters of a mnemonic nature
are preferred.
It is worth noting here that this representation is proposed for
word-form lists which will be used for a specific application, i.e.
corpus annotation. We have
foreseen these lexical descriptions as containing a full description of
lexical items. As
noted above, the sets of tags, to be used properly for automatic
corpus annotation tools, are expected to contain less information.
These lexical descriptions can be seen as notational variants of the feature-based notation in the form of attribute-value pairs. In fact, the string notation proposed, e.g.

Ex.: Ncms- (noun,common,masculine,singular,nocase)

is completely synonymous to a feature-structure representation:

Ex.: {cat=noun, type=common, gender=masculine, number=singular, case=none}

or

{cat=noun, type=common, gender=masculine, number=singular}

The above feature structures are often also represented as follows:

+-                   -+
| Cat:    Noun        |
| Type:   common      |
| Gender: masculine   |
| Number: singular    |
+-                   -+
Formal characteristics relevant for our applications have been
kept.
Use of
position in the string to
encode attributes makes no restrictions on the set of
characters to be used as values. It could then be inferred that, if we
wanted to keep the formal characteristic of order independent notation,
we would have to make sure that the characters meant to represent
attribute-values
are not ambiguous. As attributes and values are linked by
positional criteria, the need of a special marker for void
attribute-value pairs is evident if we want to keep descriptions
coherent. Thus, the ``Ncms-"
style can be viewed as a short-hand notation
convenient for some users and straightforwardly mappable to the
information used in unification-based attribute-value pairs
formalisms.
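The mapping in the other direction, from attribute-value pairs to the short-hand string, might look as follows. It is a minimal sketch under the assumption that an absent attribute and an explicit value ``none" are both written as the hyphen; `encode_noun` and `NOUN_CODES` are hypothetical names, and the codes again come from the noun table of this document.

```python
# Illustrative sketch only: encode_noun and NOUN_CODES are hypothetical names.
NOT_APPLICABLE = "-"

# Attributes in positional order, each with its value-to-code mapping.
NOUN_CODES = [
    ("type", {"common": "c", "proper": "p"}),
    ("gender", {"masculine": "m", "feminine": "f", "neuter": "n"}),
    ("number", {"singular": "s", "plural": "p"}),
    ("case", {"nominative": "n", "genitive": "g", "dative": "d", "accusative": "a"}),
]


def encode_noun(features):
    """Write a noun feature structure as a positional string."""
    chars = ["N"]
    for attribute, codes in NOUN_CODES:
        # Assumption: a missing attribute is treated like the value "none".
        value = features.get(attribute, "none")
        chars.append(NOT_APPLICABLE if value == "none" else codes[value])
    return "".join(chars)


with_case = {"cat": "noun", "type": "common", "gender": "masculine",
             "number": "singular", "case": "none"}
without_case = {"cat": "noun", "type": "common", "gender": "masculine",
                "number": "singular"}
print(encode_noun(with_case), encode_noun(without_case))
```

Both feature structures produce the same string, which is the sense in which the two notations are variants of one another.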
When comparing MULTEXT lexical description representation format with other notations one must keep in mind that they are intended to describe word-forms, and are used in very large lexical lists which contain word-forms. It seems to us relevant to comment on this point because, although it can be justified (and we will do so below) that the same formal operations can be declared in both styles, there is little evidence for justifying the need of operations such as negation and disjunction of features and values when applying them to tagged word-forms as a result of corpus annotation.
We call this marker `not-applicable', and, as stated above, its function
is just to preserve the positional relationship established between
attributes and values. It has also been proposed to use the
`not-applicable' marker to encode a feature which does not apply to a
particular language; however, this decision is still under discussion
due to the facts reported in section
``Comparison of attributes/values used by languages". The marker might
be used in the following cases:
a. not
applicable given a particular combination of attributes/values, i.e.
although the attribute applies to the category in a given language, it
does not apply to a particular subclass of the category.
b. not applicable to a particular lexical item, although the attribute
applies to the rest of its paradigm.
Example: in the description of pronouns, for personal pronouns the grammatical person is to be encoded, but for demonstrative pronouns it is avoided; in this case '-' is applied following (a). On the other hand, gender cannot be informative for some personal pronouns, but it is still relevant for other personal pronouns; the application of `-' follows (b):
Pd-ms   "Este"      Pronoun, demonstrative, masculine, singular.
Pp1-s   "Yo"        Pronoun, personal, first, singular.
Pp1mp   "Nosotros"  Pronoun, personal, first, masculine, plural.
Pp1fp   "Nosotras"  Pronoun, personal, first, feminine, plural.

Their uses are clearly not equivalent, but meaningful differences would only occur in highly typed theories of lexical description. To illustrate this point, let us consider the following type system for pronouns:
TYPES    SUBTYPES       ATTRIBUTES  VALUES
Pronoun                 gender      masculine feminine
                        number      singular plural
         Demonstrative
         Personal       person      1 2 3

For this system, gender and number attributes belong to the set of features which describe all pronouns. Person will only belong to the set of features which describe personal pronouns - in addition to gender and to number. Applied to this type system, case (a) would mean that the attribute-value pair does not belong to the set of features which describe a subtype, while (b) would mean indeterminacy of a given word-form (which could be expressed as a disjunction of all the values for the particular attribute or leaving a void for the value, being open to unification; this choice mainly depends on the purpose of the description, e.g. syntactic parsing).
|phon este           |
|cat    |gender masc||
| 'dem' |number sing||

|phon yo             |
|cat    |gender []  ||
| 'pers'|number sing||
|       |person 1   ||
In simpler flat type systems where distinctions are made only for the
generic type ``pronoun", both cases a. and b. will be
treated by unification mechanisms in the same way.
From the conversion point of view, we have to be concerned with the
output of the MULTEXT morphological tool, as it will be the source of
word-form lexical lists. The
Mmorph tool does not incorporate a highly
hierarchical typing system and thus no problems are expected in
converting Mmorph output into lexical descriptions of the proposed
format, if desired. The
results from applying the Mmorph tool
will probably (it strongly depends on implementation
strategies) be the following:
1. a non present attribute in the description attached to the
word-form;
2. a disjunction expression, i.e. {gender=masc|fem};
3. encoded as a third possible value, i.e. {gender=none}.
The simplest case for converting would be the third one, as then automatic non-intelligent conversion is possible. In the first two cases the conversion routine will have to make some inferences on type declarations. It is also expected that, when converting from other lexical sources, special conversion routines will have to be used. As seen above, the conversion from the ``Ncms" lexical description notation into another unification-based format will only be difficult if the target formalism is a highly typed system. If this is not the case, the ``not-applicable" marker will have to be converted into a special value or into nothing, leaving it open. For conversion into a highly typed system it might be useful to have cases (a) and (b) marked by different characters, in order to guide an intelligent conversion routine to the desired results.
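The three possible shapes of the source description listed above could be handled by a conversion routine along the following lines. This is a hedged sketch, not the Mmorph output format: the dictionary representation, the value names and the function names are all assumptions, and treating a disjunction of every value as `not-applicable' is one possible policy among others.

```python
# Illustrative sketch only: the feature dictionaries below are a hypothetical
# stand-in for morphological tool output, not its real format.
NOT_APPLICABLE = "-"

# Positions 1..4 as used in the pronoun examples of this section.
PRONOUN_ATTRIBUTES = [
    ("type", {"personal": "p", "demonstrative": "d"}),
    ("person", {"1": "1", "2": "2", "3": "3"}),
    ("gender", {"masc": "m", "fem": "f"}),
    ("number", {"sing": "s", "plur": "p"}),
]


def value_to_code(features, attribute, codes):
    value = features.get(attribute)
    # Case 1 (attribute absent) and case 3 (explicit value "none").
    if value is None or value == "none":
        return NOT_APPLICABLE
    # Case 2: a disjunction such as "masc|fem".
    alternatives = set(value.split("|"))
    if len(alternatives) > 1:
        # Assumption: only a disjunction of all values becomes the marker.
        if alternatives == set(codes):
            return NOT_APPLICABLE
        raise ValueError("partial disjunction not handled: %r" % value)
    return codes[value]


def to_description(features):
    """Build a pronoun description string from a feature dictionary."""
    return "P" + "".join(
        value_to_code(features, attribute, codes)
        for attribute, codes in PRONOUN_ATTRIBUTES
    )


print(to_description({"type": "demonstrative", "gender": "masc", "number": "sing"}))
print(to_description({"type": "personal", "person": "1",
                      "gender": "masc|fem", "number": "sing"}))
```

Note that the absent person of the demonstrative (case a) and the indeterminate gender of the personal pronoun (case b) end up with the same character, which is exactly the information a highly typed target system would need to have kept apart.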
The tags (see the examples below) used to exemplify issues and
problems to be dealt with in the mapping between lexical descriptions
and corpus tags, come from the tagsets proposed in the
language-specific applications of four of the MULTEXT partners.
These tagsets (which differ from one another, because
they were constructed on the basis of tagging practices already
used by the partners) should be considered as a preliminary proposal
to be discussed for harmonization
and refined after experimentation with the MULTEXT
tagger.
Mapping of these lexical descriptions into corpus tags has also
been taken into
account. It is also considered desirable to see whether under-informative
corpus tags can be directly mappable to the lexical descriptions each one
subsumes.
Decisions about corpus tags are language dependent. The information to
be encoded depends on the ability of a given tool to disambiguate
between different potential lexical descriptions for a
given word-form. We have already mentioned the key concepts to be
applied for defining sets of corpus tags in the preceding sections.
Therefore one can first assume that the mapping from lexical
descriptions onto corpus tags can be done with conversion tables which
relate two different items: corpus tags and lexical descriptions. These
tables are likely to be modified many times in the course of the
project, based on experimentation with the disambiguation tool.
An example of such mappings is:
Lex.spec.  TAG  Definition
Pp1msa-    P1S  Personal pronoun, first person, masc. sing. accusative
Px1msa-    P1S  Reflexive pronoun, first person, masc. sing. accusative
Pp1fsa-    P1S  Personal pronoun, first person, fem. sing. accusative
Px1fsa-    P1S  Reflexive pronoun, first person, fem. sing. accusative
Pp1msd-    P1S  Personal pronoun, first person, masc. sing. dative
Px1msd-    P1S  Reflexive pronoun, first person, masc. sing. dative
Pp1fsd-    P1S  Personal pronoun, first person, fem. sing. dative
Px1fsd-    P1S  Reflexive pronoun, first person, fem. sing. dative

All these lexical descriptions correspond to the Spanish form ``me". For this word-form the tag P1S, which conflates all the possible lexical descriptions, has been chosen on the assumption that an automatic tool would have problems selecting the correct analysis among all the lexical descriptions. The correct analysis of this word-form would require syntactic analysis.
The mapping from the lexical descriptions to the corpus tags should be
applicative, that is, ``each lexical description should map to one and
only one corpus tag, while it is not possible to do
the reverse" due to the
limitations of current tagging
techniques. The situation where corpus tags
are more precise than a lexical description (i.e. one lexical tag
corresponds to more than one corpus tag) should be, in principle,
avoided.
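A conversion table of this kind, together with a check of the applicative principle, might be sketched as follows. The data are the eight descriptions of ``me" given in the example table; the function names are hypothetical.

```python
# Illustrative sketch only: function names are hypothetical; the pairs are
# the (lexical description, corpus tag) rows of the Spanish "me" example.
from collections import defaultdict

CONVERSION_TABLE = [
    ("Pp1msa-", "P1S"), ("Px1msa-", "P1S"),
    ("Pp1fsa-", "P1S"), ("Px1fsa-", "P1S"),
    ("Pp1msd-", "P1S"), ("Px1msd-", "P1S"),
    ("Pp1fsd-", "P1S"), ("Px1fsd-", "P1S"),
]


def applicative_violations(table):
    """Return the lexical descriptions that map to more than one tag."""
    tags = defaultdict(set)
    for description, tag in table:
        tags[description].add(tag)
    return {d: t for d, t in tags.items() if len(t) > 1}


def descriptions_by_tag(table):
    """Invert the table: one tag may cover many lexical descriptions."""
    result = defaultdict(list)
    for description, tag in table:
        result[tag].append(description)
    return dict(result)


print(applicative_violations(CONVERSION_TABLE))
print(len(descriptions_by_tag(CONVERSION_TABLE)["P1S"]))
```

An empty result from `applicative_violations` means every description has exactly one tag, while the inverse lookup shows why the reverse direction is not a function: the single tag P1S covers eight descriptions.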
In order to avoid redundancy in the conversion tables and to make tag optimization work easier, it has been proposed to study the possibility of having intermediate representations which prepare the conflation of information and which facilitate automatic mapping from lexical descriptions onto tags. This intermediate internal notation makes use of ``regular expressions" which incorporate operators in order to sum up the information referred by different lexical descriptions and conflated in a given tag. For the example given above, the resulting regular expression may incorporate two operators: ``match any" (.), ``list" ([]) - other possible operators proposed are ``disjunction" | and negation .
P[px]1.s[ad]-   P1S

However, the application of such regular expressions is still being studied as its use conveys some requirements on the conflation of lexical descriptions and on the construction of corpus tags. An example will illustrate the issues to be taken into account. For Spanish, first and third person of some tenses are homographs. This can be taken into account when conflating information:
Verbal paradigm              regular exp.  TAG
cantaba, comía, venía        Vmii[13]s-    VMIIS
cantaría, comería, vendría   Vmcs[13]s-    VMCSS
cante, coma, venga           Vmsp[13]s-    VMSPS
cantara, comiera, viniera    Vmsi[13]s-    VMSIS
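The pronoun pattern and one of the Spanish verb patterns above can be tried out directly, since the ``match any" and ``list" operators happen to coincide with the syntax of Python's `re` module. That the project's notation is meant to be read this way is an assumption of this sketch, and `tag_for` is a hypothetical name.

```python
# Illustrative sketch only: assumes the document's "." and "[]" operators
# behave like their counterparts in Python's re module.
import re

PATTERNS = {
    "P[px]1.s[ad]-": "P1S",
    "Vmii[13]s-": "VMIIS",
}


def tag_for(description, patterns=PATTERNS):
    """Return the single tag whose pattern matches the whole description."""
    matches = [tag for pattern, tag in patterns.items()
               if re.fullmatch(pattern, description)]
    if len(matches) != 1:
        raise LookupError("%d tags match %r" % (len(matches), description))
    return matches[0]


print(tag_for("Px1fsd-"))
print(tag_for("Vmii3s-"))
```

Requiring exactly one match is a simple way of enforcing the applicative principle at lookup time: a description matched by no pattern, or by the patterns of two tags, is reported instead of being silently tagged.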
For Italian, the conflation of information on homographs also in the verbal paradigm may cause problems to the applicative principle mentioned above:
Verbal paradigm  lex.descr.  regular exp.            TAG
premiate         Vmip2p-     Vm([ims]p2p-)|(ps-pf)   VMP2IMCPP
                 Vmmp2p-
                 Vmsp2p-
                 Vmps-pf
leggete          Vmip2p-     Vm[im]p2p-              VMP2IMP
                 Vmmp2p-
leggiate         Vmsp2p-     Vmsp2p-                 VP2CP
lette            Vmps-pf     Vmps-ps                 VFPPR

As can be seen, if we use tags such as the ones above which are based on the principle ``one graphical form - one tag", there is a violation of the applicative principle, i.e. the same lexical description will correspond to two different tags, because of different conflation clusters.
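The violation can be made explicit by collecting, for every lexical description in the Italian table, the tags of all word-forms that carry it. The sketch below uses only the word-forms, descriptions and tags as read from that table; the function name is hypothetical.

```python
# Illustrative sketch only: data transcribed from the Italian example table;
# tags_per_description is a hypothetical helper name.
from collections import defaultdict

# word-form -> (its lexical descriptions, the single tag assigned to the form)
ITALIAN_FORMS = {
    "premiate": (["Vmip2p-", "Vmmp2p-", "Vmsp2p-", "Vmps-pf"], "VMP2IMCPP"),
    "leggete": (["Vmip2p-", "Vmmp2p-"], "VMP2IMP"),
    "leggiate": (["Vmsp2p-"], "VP2CP"),
    "lette": (["Vmps-pf"], "VFPPR"),
}


def tags_per_description(forms):
    """Map each lexical description to the set of tags it ends up under."""
    tags = defaultdict(set)
    for descriptions, tag in forms.values():
        for description in descriptions:
            tags[description].add(tag)
    return dict(tags)


violations = {d: sorted(t)
              for d, t in tags_per_description(ITALIAN_FORMS).items()
              if len(t) > 1}
for description in sorted(violations):
    print(description, violations[description])
```

With these four word-forms every description appears under two tags, for instance Vmsp2p- under both VMP2IMCPP and VP2CP, so no conversion table from descriptions to tags can be a function.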
In general,
it is observed
that the use of operators in regular expressions
results in a form of marking the information which is not
going to be expressed in the corpus tag. Thus, tags would have to contain
less information than the regular expression and hence than the lexical
description.
Another issue to be considered is the following. Having tags with little lexical information, as in the following French example, may lead to another problematic issue in cases where such regular expressions are also used in helping to recover all possible lexical information from a given ``under-specified" corpus tag. The mapping from the regular expression onto lexical descriptions will also have to take into account the word-form in order to reject possible descriptions which do not correspond to the tagged word-forms. Below are some examples from the proposed verbal tags and regular expressions:
TAG   Regular expression   Lexical descriptions  Possible word-forms
VM1P  Vm[iscm][pifs]1p--   Vmip1p--              venons
                           Vmii1p--              venions
                           Vmif1p--              viendrons
                           ...                   ....

Let us suppose that the word ``venons" is tagged as ``VM1P". If we want to know which lexical descriptions the tag can refer to, expanding the information contained in the regular expression will also give lexical descriptions which do not correspond to the word ``venons", but to other words. Regular expressions can only map a given tag for a word-form into all possible lexical descriptions for such a word-form if the information conflated only reflects ambiguities due to homography. Only with this criterion for defining tags will all the possible lexical descriptions subsumed by the corpus tag and expressed in the regular expression be true of a given tagged word.
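The over-generation described here is easy to reproduce. The sketch below expands the French pattern into every description it covers and then filters the result against a word-form list; the `expand` helper only understands literal characters and ``list" brackets, and the three-entry lexicon is a hypothetical excerpt built from the rows of the table above.

```python
# Illustrative sketch only: expand handles literals and [..] lists, nothing
# else; LEXICON is a hypothetical excerpt based on the French example rows.
from itertools import product

LEXICON = {
    "venons": {"Vmip1p--"},
    "venions": {"Vmii1p--"},
    "viendrons": {"Vmif1p--"},
}


def expand(pattern):
    """List every lexical description covered by a simple pattern."""
    slots = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            end = pattern.index("]", i)
            slots.append(pattern[i + 1:end])  # alternatives for this position
            i = end + 1
        else:
            slots.append(pattern[i])          # a single fixed character
            i += 1
    return ["".join(chars) for chars in product(*slots)]


def descriptions_for(word_form, pattern, lexicon=LEXICON):
    """Keep only the expanded descriptions attested for the word-form."""
    return [d for d in expand(pattern) if d in lexicon.get(word_form, set())]


candidates = expand("Vm[iscm][pifs]1p--")
print(len(candidates))
print(descriptions_for("venons", "Vm[iscm][pifs]1p--"))
```

The pattern alone yields sixteen candidate descriptions, of which only Vmip1p-- survives once the word-form ``venons" is taken into account.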
If the criterion for conflating information is limited to homograph ambiguities, we see - as in the following example - that all possible lexical descriptions expanded from the regular expression are true of a given word-form.
TAG     Regular expression  Lexical descriptions  Possible word-forms
VSXICP  Vm(sp.s)|(ip2s)-    Vmip2s-               ami
                            Vmsp1s-               ami
                            Vmsp2s-               ami
                            Vmsp3s-               ami

As mentioned in the section ``Comparison of Attributes/values used by languages", the application of the proposed operators in regular expressions for avoiding redundancy, in some cases, is not needed if lexical expressions already encode the possibility of having, for a given word-form, more than one possible lexical description. This is the case with the proposed values ``common" for gender, ``invariant" for number (in Italian), or ``object" for case (in French pronouns).
Almost all the languages treated in MULTEXT have nouns, adjectives, determiners (among others) which have the same word-form both for feminine and masculine agreement. The Italian group has proposed a value for gender named ``common" which avoids having to write two different entries with the same word-form, but with different lexical descriptions. In fact, this use of a special value advances the possible use of proposed operators in the regular expression.
word-form   lexical description  regular expression  TAG
insegnante  Nccs-                Nccs-               NNS

could also be expressed as:
word-form   lexical description  regular expression  TAG
insegnante  Ncms-                Nc[mf]s- or Nc.s-   NNS
            Ncfs-                or Nc(m|f)s

The need, as well as the consequences, for the mapping between lexical descriptions and corpus tags, of the regular expressions must still be regulated. It should be noted that regular expressions can be regarded as a convenient way to map the lexical descriptions to the corpus tags since, in many cases, the information in the lexicon is more precise than the information we can/want to have in the corpus tag set. Such a mapping still seems very interesting because there are many corpus tag systems, even for the same language, which makes it extremely difficult to relate the one to the other. Regular expressions could act as a common reference for the different systems to make comparison easy. Besides, regular expressions could make translations between the lexical description and corpus tags easier and enable the automatic generation of conversion tables.
The categories listed below with the relevant attributes and values are
based on EAGLES documents and are the results of a first testing based
on a proposal made by Veronis et al. 1994 for lexical specifications in
MULTEXT.
As has already been mentioned in the section ``Background
considerations", proposing features for describing lexical items
of different languages, with the aim of defining a set which can be
said to be ``common" to all of them, is a complex task. The underlying
philosophy for this task has therefore been to lead the different groups
towards a pragmatic solution in which a ``harmonized" set of features
could be reached.
The groups have first worked out
their lexical descriptions taking as input
EAGLES and Veronis et al. (1994) documents. The very general criterion
was to encode those proposed features which were considered relevant for
the language in question. Therefore MULTEXT also followed EAGLES
bottom-up methodology in trying to define extensively the features
``used" in the lexical descriptions for each group language,
as this procedure will
make evident the features commonly used. After this phase, whose result
can now be seen in the section ``Comparison of attribute/values used by
the groups", a new phase is envisaged in order to accommodate language-specific
considerations into a general model to be used by MULTEXT. This
accommodation must take into account extensibility to other languages and
also application motivated arguments, as well as internal coherence.
For this new phase more specific criteria would be desirable with
respect to the addition of new features to the EAGLES Level-1 set. The
intended result is a ``harmonized" set of features which properly describe
lexical items of the different languages.
Following the general aim of the project, these harmonized
specifications - and the related resources - will contribute to the
standardization of the corpus annotation work. They are supposed to
serve as a user oriented additional characteristic of our tool package
in the sense that end-users will have a common ground for inspecting
and understanding the resources and tool results independently
to a
large extent of the language. This common set of features will also be
a common ground to perform comparisons of different annotation tool
results, because, as mentioned in the previous section, the existence
of many lexical description systems is causing nowadays a problem for
comparing results.
Therefore the categories and features listed below are the
common reference for the work done by the
different groups. Further discussion on this first proposal is to be
found in the section ``Comparison of the attributes/values used by the
groups" which is in turn to define criteria for changing this first
proposal.
Tables of categories
=============== ====
Part-of-Speech  Code
=============== ====
Noun            N
Verb            V
Adjective       A
Pronoun         P
Determiner      D    (for those who do not have a separate category
Article         T     for Articles, these are included in Determiner)
Adverb          R
Adposition      S
Conjunction     C
Numeral         M
Interjection    I
Unique          U
Residual        X
Abbreviation    Y
=============== ====

Each character at positions 1, 2, etc. encodes the value of one attribute (person, gender, number, etc.), according to the tables given below.

2.2.2 Attribute/value tables
----------------------------

Abbreviations used:
P    Position (starts with 0 for encoding PoS values)
ATT  Attribute name
VAL  Value
C    Code

1. Nouns (N)
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           common         c
                 proper         p
- -------------- -------------- -
2 Gender         masculine      m
                 feminine       f
                 neuter         n
- -------------- -------------- -
3 Number         singular       s
                 plural         p
- -------------- -------------- -
4 Case           nominative     n
                 genitive       g
                 dative         d
                 accusative     a
= ============== ============== =

2. Verbs (V)
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           main           m
                 auxiliary      a
                 modal          o
- -------------- -------------- -
2 Mood/VForm     indicative     i
                 subjunctive    s
                 imperative     m
                 conditional    c
                 infinitive     n
                 participle     p
                 gerund         g
                 supine         s
                 base           b
- -------------- -------------- -
3 Tense          present        p
                 imperfect      i
                 future         f
                 past           s
- -------------- -------------- -
4 Person         first          1
                 second         2
                 third          3
- -------------- -------------- -
5 Number         singular       s
                 plural         p
- -------------- -------------- -
6 Gender         masculine      m
                 feminine       f
                 neuter         n
= ============== ============== =

3. Adjectives (A)
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           qualificative  f
                 ordinal        o
                 cardinal       c
                 indefinite     i
                 possessive     s
- -------------- -------------- -
2 Degree         positive       p
                 comparative    c
                 superlative    s
- -------------- -------------- -
3 Gender         masculine      m
                 feminine       f
                 neuter         n
- -------------- -------------- -
4 Number         singular       s
                 plural         p
- -------------- -------------- -
5 Case           nominative     n
                 genitive       g
                 dative         d
                 accusative     a
= ============== ============== =

4. Pronouns (P)
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           personal       p
                 demonstrative  d
                 indefinite     i
                 possessive     s
                 interrogative  t
                 relative       r
                 exclamative    e
                 reflexive      x
                 reciprocal     l
- -------------- -------------- -
2 Person         first          1
                 second         2
                 third          3
- -------------- -------------- -
3 Gender         masculine      m
                 feminine       f
                 neuter         n
- -------------- -------------- -
4 Number         singular       s
                 plural         p
- -------------- -------------- -
5 Case           nominative     n
                 genitive       g
                 dative         d
                 accusative     a
                 oblique        o
                 object         j
- -------------- -------------- -
6 Possessor      singular       s
                 plural         p
= ============== ============== =

5. Determiners (D)
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           demonstrative  d
                 indefinite     i
                 possessive     s
                 interrogative  t
- -------------- -------------- -
2 Person         first          1
                 second         2
                 third          3
- -------------- -------------- -
3 Gender         masculine      m
                 feminine       f
                 neuter         n
- -------------- -------------- -
4 Number         singular       s
                 plural         p
- -------------- -------------- -
5 Case           nominative     n
                 genitive       g
                 dative         d
                 accusative     a
                 oblique        o
- -------------- -------------- -
6 Possessor      singular       s
                 plural         p
= ============== ============== =

6. Articles (T)
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           definite       d
                 indefinite     i
- -------------- -------------- -
2 Gender         masculine      m
                 feminine       f
                 neuter         n
- -------------- -------------- -
3 Number         singular       s
                 plural         p
- -------------- -------------- -
4 Case           nominative     n
                 genitive       g
                 dative         d
                 accusative     a
= ============== ============== =

7. Adverbs (R)
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           general        g
                 particle       p
- -------------- -------------- -
2 Degree         positive       p
                 comparative    c
                 superlative    s
= ============== ============== =

8. Adpositions (S)
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           preposition    p
                 postposition   t
                 circumposition c
- -------------- -------------- -
2 Formation      simple         s
                 compound       c
= ============== ============== =

9. Conjunctions (C)
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           coordinating   c
                 subordinating  s
= ============== ============== =

10. Numerals (M)
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           cardinal       c
                 ordinal        o
- -------------- -------------- -
2 Gender         masculine      m
                 feminine       f
                 neuter         n
- -------------- -------------- -
3 Number         singular       s
                 plural         p
- -------------- -------------- -
5 Case           nominative     n
                 genitive       g
                 dative         d
                 accusative     a
= ============== ============== =

11. Interjections (I)

12. Unique membership class (U)

13. Residual (X)

14. Abbreviations (Y)
The following tables reflect the attributes and values used for
lexical description in MULTEXT. They take into account input supplied
by different groups which the reader can find further detailed in
specific language application annexes (see section 5).
It is worth noting that the
tables reflect both
features used by five groups using lexical descriptions
as proposed in previous versions of this document and also
features for Dutch
resulting from morphological generation with the
``mmorph" tool. The
comparison is certainly
of help in order to
have a clear picture of the level of consensus
reached with respect to harmonization when elaborating lexical lists.
It has already been mentioned in previous sections that the project is
working towards defining criteria for the application of EAGLES
guidelines on standardization of lexical resources for easy
re-usability.
Looking at the tables below some very general issues arise with
respect to the application criteria of the general tables to the
particular languages and the interpretation of the general guidelines.
Until now groups have been working on the assumption that they must
encode recommended Level-1 EAGLES features if they are relevant
to their languages. The possibility of adding new
values and new attribute/value pairs was also foreseen
if recommended features
were not enough to describe lexical items with
fine-granularity of
lexical descriptions. It was also
found useful in view of
supplying lexical material
to be used by other tools than the MULTEXT ones. This
openness has led to a number of incoherencies with respect to
application criteria which we summarize
in the points below. A decision with respect to general criteria for
application must be reached in the
next phase. Hence, here the
issues concerning harmonization which arise from comparing application
sections follow.
There is an unbalanced treatment of features considered as ``general"
for the different categories. The presence of a particular attribute
seems to be mainly justified for two reasons:
- representative in most of the studied languages;
- linguistic tradition.
We see in the comparative tables that a particular language is allowed
to add a new
attribute because of its relevance for the lexical description of a
given category (i.e.
when the language items belonging to that category are
inflected or marked with respect to it). The most evident case is the
proposal made in order to encode Possessor-gender (among other
features) for Pronouns and Determiners. It is obvious that this
feature cannot be used by languages which do not have different forms
regarding this particular distinction.
On the other hand, note that ``case" as a feature recommended as
``general" for describing Nouns, Adjectives, Pronouns, Determiners,
Articles and Numerals, is in fact used only for Pronouns by most of
the languages, and only German can apply it for the rest of
the categories.
What we mean by ``unbalanced treatment" has to do with the fact that
features being used by just a few languages, or even just one, receive
different treatment when considering them ``general" or ``language
specific".
Also arising from the possibility of adding language specific
attributes and values where relevant for a given description, the
procedure followed in this task has shown that it has not
been easy to reach a consensus in order to harmonize a number of
specific features
and values considered by a given language.
One case of such proliferation of features is seen, in fact,
when considering the
comparative table for Determiners. One of the groups suggests having
language specific types to refer to ``definite article" and
``indefinite article", while other groups prefer to have a general
type
``article" and other attributes, i.e. Quantification or Definiteness to
encode this distinction at a lower level. The particular
features suggested by the groups which, in our opinion, could be adapted
to the EAGLES model will be discussed during the next phase.
Because of this openness with respect to
adding attributes and values, we
would like to point out the case for the values ``common" and
``invariant" added by the Italian descriptions to all nominal inflected
categories for the attributes ``gender" and ``number" respectively
(where a disjunction of values could be used instead). It
is a fact that most of the languages could easily adopt this value for
the forms which are identical for masculine/feminine, singular/plural
agreement features, but this issue has certainly to be clarified and
further discussed.
Probably a decision with respect to the ``fine
granularity" of lexical descriptions should be devised. In fact there
is another example in another category of the same strategy, that
is to conflate in a new value for a given attribute a homography
which causes explosion of entries. The
French group has suggested conflating
accusative/dative values for pronominal case into ``object" as
a generic value. The new division proposed would also apply to other
romance languages but it might compromise the ``fine granularity"
tendency the project aimed at for lexical descriptions.
There can also be observed a certain confrontation of two different
traditions when some
groups propose to add a new attribute to characterise an
element while others propose to add a new value to label a new class
under a general,
already available attribute such as ``type". To add a new
attribute would correspond to the unification based grammar practice,
and a label for a class would correspond to the so called ``taxonomic"
theories. We see an example of this confrontation in the proposal of
having an attribute/value ``wh" for marking relative particles in
different categories: pronouns, determiners, adverbs. EAGLES level-1
seems to prefer separating relatives with a different value for the
attribute ``type" of pronouns and determiners. Marking as an additional
feature the relative characteristics of a given pronominal would help
for instance to specifically characterise items such as the
English ``whose"
or Spanish ``cuyo, cuya, cuyos, cuyas" which are normally described as
Possessive relative pronouns. Under the current classification a decision
must be made either
to put them under the Possessive or the Relative value of
the attribute ``type". It has also to be mentioned that no special
treatment can be made for relative adverbs which are not taken as a
separate class under Adverb type in the EAGLES proposal. Thus, from the
comparison made, it is worth mentioning that a new
attribute ``wh" for adverbs or, as suggested by the German group, a new
value for interrogative - and also for relatives -
adverbs should be devised.
As we have seen, the
EAGLES recommendations lack in some cases the desired
fine-grained distinctions which groups working in MULTEXT consider
desirable for our applications. Another example of this case is raised
by German and English. The groups dealing with these languages - and it
could also be applied to the rest of the
languages - have suggested a
specific value for comparative conjunctions. This addition seems
reasonable under the argument that it is an important feature with
respect to distributional criteria and can be of great importance for
tagging purposes. Again, some guidelines must be defined for considering
the addition of features not contained in EAGLES level-1, but it is
worth noting that several of the features added for language specific
reasons could be considered as applicable to the rest of the
languages.
We recommend a new round of discussions
on the new features suggested in
specific language applications, to see whether they can be of use in
our concrete application and are applicable to the rest of the
languages. Once
this discussion has led to conclusions, the approved features must be
included in the general model. Besides linguistic considerations,
having an agreed set of general features is of great concern for the
chosen notation style in lexical descriptions, since the string notation
relies on every attribute having a predefined and fixed position.
There must be regulations on how the other groups encode
language-specific attributes, or on how these are differentiated from
those of the general
model. This is especially relevant if general conversion routines are to
be developed. For the sake of theoretical coherence, the treatment
given to these features must also take into account the above-mentioned
``unbalanced treatment of features". Some other doubts remain in
connection with theoretical coherence and the applied nature of the
lexica to be supplied. We mention only one of them, to
illustrate the kind of issue which must be taken into account in the
next phase. It has to do with the agreement features of person, number
and gender: are they to be encoded with respect to grammatical agreement
or with respect to semantic differentiations? As it is now, following
the EAGLES recommendations, it seems that only semantic considerations
are taken into account; e.g. the Possessive-person
of determiners is taken as
the ``possessor person" for most of the languages, which in fact does
not trigger agreement.
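A general conversion routine of the kind mentioned above could be
sketched as follows. This is only an illustration under stated
assumptions: the table `OVERRIDES` and the function `to_agreed` are
hypothetical names, the English code `v` for main verbs is read from the
Verbs comparison table, and the attribution of the deviating code `m` for
modal verbs to the English group is an assumption that the table does not
make certain.

```python
# (group, part of speech, position) -> {group-specific code: agreed code}
# 'v' -> 'm' (main) is taken from the Verbs table; 'm' -> 'o' (modal) is
# ASSUMED here to belong to the English group as well.
OVERRIDES = {
    ("EN", "V", 1): {"v": "m", "m": "o"},
}

def to_agreed(tag, group):
    """Rewrite a group's positional tag with the codes of the general model."""
    if not tag:
        raise ValueError("empty tag")
    pos = tag[0]
    out = [pos]
    for position, code in enumerate(tag[1:], start=1):
        mapping = OVERRIDES.get((group, pos, position), {})
        # Each character is looked up exactly once, so 'v' -> 'm' is never
        # fed back into 'm' -> 'o'.
        out.append(mapping.get(code, code))
    return "".join(out)

print(to_agreed("Vvip3s", "EN"))  # Vmip3s
print(to_agreed("Vmip3s", "EN"))  # Voip3s
print(to_agreed("Vmip3s", "IT"))  # Vmip3s
```

A routine of this shape only works if every deviation from the agreed
codes is declared per group, per category and per position, which is the
kind of regulation called for here.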
A decision must be taken with respect to these cases, and more specific
guidelines must be established for the further development of lexical
descriptions. It seems from the comparison made that the general
criterion ``relevant for your language" is not enough. New guidelines
must also take the application side into account.
Comparison tables
Abbreviations used:
P   = Position (starts with 0 for encoding PoS values)
ATT = Attribute name
VAL = Value
C   = Code
x   = value marked by a given group (any character other than x means
      that a given 'language group' codes, in their application, the
      relevant value with that character, not using the agreed one).

The code column is left empty for the language-specific
attributes/values of Dutch: they are attested among the attributes and
values of the Dutch implementation of Mmorph, where they are not
represented by means of single codes.

1. Nouns (N)

Features used by the groups: IT DE ES FR NL EN
==========================================================
P  ATT                  VAL                 C
==========================================================
1  Type                 common              c x x x x x x
                        proper              p x x x x x x
----------------------------------------------------------
2  Gender               masculine           m x x x x x
                        feminine            f x x x x x
                        neuter              n x x
                 l-s.   common              c x
                 l-s.   De                  x
                 l-s.   Het                 x
                 l-s.   None                x
----------------------------------------------------------
3  Number               singular            s x x x x x x
                        plural              p x x x x x x
                 l-s.   invariant           n x
----------------------------------------------------------
4  Case                 nominative          n x
                        genitive            g x
                        dative              d x
                        accusative          a x
==========================================================
5  Sem-gender           M                   x
                        F                   x
                        N                   x
----------------------------------------------------------

2. Verbs (V)

Features used by the groups: IT DE ES FR NL EN
==========================================================
P  ATT                  VAL                 C
==========================================================
1  Type                 main                m x x x x x v
                        auxiliary           a x x x x x x
                        modal               o x x m
                 l-s.   copula              x
                 l-s.   impersonal          x
----------------------------------------------------------
2  Mood/VForm           indicative          i x x x x
                        subjunctive         s x x x x
                        imperative          m x x x x
                        conditional         c x x x
                        infinitive          n x x x x x
                        participle          p x x x x
                        gerund              g x x
                        supine              s
                        base                b x
                 l-s.   inf. + particle     u x
                 l-s.   ImPart              x
                 l-s.   Past participle
                 l-s.   Present participle
                 l-s.   PerfPart            x
                 l-s.   Fin                 x
----------------------------------------------------------
3  Tense                present             p x x x x x x
                        imperfect           i x x x x
                        future              f x x x
                        past                s x x x x x
----------------------------------------------------------
4  Person               first               1 x x x x x x
                        second              2 x x x x x x
                        third               3 x x x x x x
----------------------------------------------------------
5  Number               singular            s x x x x x x
                        plural              p x x x x x x
----------------------------------------------------------
6  Gender               masculine           m x x x
                        feminine            f x x x
                        neuter              n
                 l-s.   common              c x
==========================================================
7  Clitic        l-s.   no                  n x x
                        yes                 y x x
----------------------------------------------------------
8  Clitic        l-s.   both                t x
                        accusative          a x
                        dative              d x
----------------------------------------------------------

3. Adjectives (A)

Features used by the groups: IT DE ES FR NL EN
==========================================================
P  ATT                  VAL                 C
==========================================================
1  Type                 qualificative       f x x x
                        ordinal             o x x
                        cardinal            c x x
                        indefinite          i x
                        possessive          s x x x
                 l-s.   part1               1 x
                        part2               2 x
----------------------------------------------------------
2  Degree               positive            p x x x x x x
                        comparative         c x x x x x x
                        superlative         s x x x x x
----------------------------------------------------------
3  Gender               masculine           m x x x x
                        feminine            f x x x x
                        neuter              n x
                 l-s.   common              c x
----------------------------------------------------------
4  Number               singular            s x x x x
                        plural              p x x x x
                 l-spc. invariant           n x
----------------------------------------------------------
5  Case                 nominative          n x
                        genitive            g x
                        dative              d x
                        accusative          a x
==========================================================
6  Position      l-spc. attributive         a x
                        predicative         p x
----------------------------------------------------------

4. Pronouns (P)

Features used by the groups: IT DE ES FR NL EN
==========================================================
P  ATT                  VAL                 C
==========================================================
1  Type                 personal            p x x x x x x
                        demonstrative       d x x x x x
                        indefinite          i x x x x
                        possessive          s x x x x
                        interrogative       t x x x x x
                        relative            r x x x x x
                        exclamative         e x
                        reflexive           x x x x
                        reciprocal          l x
                 l-s.   general             g x
                 l-s.   quantificational    x
----------------------------------------------------------
2  Person               first               1 x x x x x x
                        second              2 x x x x x x
                        third               3 x x x x x x
----------------------------------------------------------
3  Gender               masculine           m x x x x
                        feminine            f x x x x
                        neuter              n x x x
                 l-s.   common              c x
----------------------------------------------------------
4  Number               singular            s x x x x x x
                        plural              p x x x x x x
                 l-s.   invariant           n x
----------------------------------------------------------
5  Case                 nominative          n x x x
                        genitive            g x
                        dative              d x x
                        accusative          a x x
                        oblique             o x x
                        object              j x
                 l-s.   1                   x
                 l-s.   4                   x
----------------------------------------------------------
6  Possessor            singular            s x x x
                        plural              p x x x
==========================================================
7  Wh                   Not-wh              n x
                        Relative            r x
                        Int                 q x
----------------------------------------------------------
8  Poss-person          First               1 x
                        Second              2 x
                        Third               3 x
----------------------------------------------------------
9  Poss-gender          Masculine           m x
                        Feminine            f x
                        Neuter              n x
----------------------------------------------------------
10 Sem-gender           M                   x
                        F                   x
                        N                   x
----------------------------------------------------------

5. Determiners (D)

Features used by the groups: IT DE ES FR NL EN
==========================================================
P  ATT                  VAL                 C
==========================================================
1  Type                 demonstrative       d x x x x x
                        indefinite          i x x x x
                        possessive          s x x x x x x
                        interrogative       t x x x x
                        exclamative         e x
                        relative            r x
                        article             a x x
                 l-s.   Def-article         t x
                 l-s.   Indef-article       a x
                 l-s.   General             g x
                 l-s.   quantificational    x
----------------------------------------------------------
2  Person               first               1 x x x x
                        second              2 x x x x
                        third               3 x x x x
----------------------------------------------------------
3  Gender               masculine           m x x x x x
                        feminine            f x x x x x
                        neuter              n x x x
                 l-s    common              c x
----------------------------------------------------------
4  Number               singular            s x x x x x x
                        plural              p x x x x x x
                 l-s    invariant           n x
----------------------------------------------------------
5  Case                 nominative          n x
                        genitive            g x
                        dative              d x
                        accusative          a x
                        oblique             o
----------------------------------------------------------
6  Possessor            singular            s x x
                        plural              p x x
==========================================================
7  Quantif./or          definite            d x x
   Defness              indefinite          i x x
----------------------------------------------------------
8  Wh                   Not-wh              n x
                        Relative            r x
                        Int/Ecl             q x
----------------------------------------------------------
9  Poss-person          First               1 x
                        Second              2 x
                        Third               3 x
----------------------------------------------------------
10 Poss-gender          Masculine           m x
                        Feminine            f x
                        Neuter              n x
----------------------------------------------------------

6. Articles (T)

Features used by the groups: IT DE ES FR NL EN
==========================================================
P  ATT                  VAL                 C
==========================================================
1  Type                 definite            d x x x
                        indefinite          i x x x
----------------------------------------------------------
2  Gender               masculine           m x x x
                        feminine            f x x x
                        neuter              n x x x
                 l-s.   common              c x
----------------------------------------------------------
3  Number               singular            s x x x
                        plural              p x x x
----------------------------------------------------------
4  Case                 nominative          n x
                        genitive            g x
                        dative              d x
                        accusative          a x
==========================================================

7. Adverbs (R)

Features used by the groups: IT DE ES FR NL EN
==========================================================
P  ATT                  VAL                 C
==========================================================
1  Type                 general             g x x
                        particle            p x x
                 l-s.   degree              d x
                 l-s.   interrogative       i x
                 l-s.   conjunction         c x
                 l-s.   modal               m x
                 l-s.   pronom              p x
                 l-s.   temporal            t x
                 l-s.   place               l x
----------------------------------------------------------
2  Degree               positive            p x x x x x
                        comparative         c x x x x
                        superlative         s x x x x
                 l-s.   negative            n x
==========================================================
3  Function             mod                 x
                        spe                 x
----------------------------------------------------------
4  Wh-ness              interrogative       q x
                        relative            r x
                        no                  n x
----------------------------------------------------------

8. Adpositions (S)

Features used by the groups: IT DE ES FR NL EN
==========================================================
P  ATT                  VAL                 C
==========================================================
1  Type                 preposition         p x x x x x x
                        postposition        t x x x
                        circumposition      c x
                 l-s.   part1               a x
                 l-s.   part2               z x
----------------------------------------------------------
2  Formation            simple              s x x x
                        compound            c x x
==========================================================
3  Gender               masculine           m x
                        feminine            f x
                        common              c x
----------------------------------------------------------
4  Number               singular            s x
                        plural              p x
----------------------------------------------------------

9. Conjunctions (C)

Features used by the groups: IT DE ES FR NL EN
==========================================================
P  ATT                  VAL                 C
==========================================================
1  Type                 coordinating        c x x x x x x
                        subordinating       s x x x x x x
                 l-spc. compar              v x x
                 l-spc. infinitive          i x
                 l-spc. part1               a x
                 l-spc. part2               z x
==========================================================
2  ctype                finite              f x
                        that                t x
                        subjunctive         s x
----------------------------------------------------------
3  coord-posit.         initial             i x
                        non-initial         n x
----------------------------------------------------------

10. Numerals (M)

Features used by the groups: IT DE ES FR NL EN
==========================================================
P  ATT                  VAL                 C
==========================================================
1  Type                 cardinal            c x x x x
                        ordinal             o x x x
----------------------------------------------------------
2  Gender               masculine           m x x x
                        feminine            f x x x
                        neuter              n
----------------------------------------------------------
3  Number               singular            s x x x
                        plural              p x x x
----------------------------------------------------------
5  Case                 nominative          n
                        genitive            g
                        dative              d
                        accusative          a
==========================================================

Categories used by the groups: IT DE ES FR NL EN
11. Interjections (I)             x x x x x
12. Unique membership class (U)
13. Residual (X)                  x x x
14. Particle (Q)                  x
15. Punctuation (F)               x x
16. Abbreviations (Y)             x
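
As an illustration of how the agreed values in table 1 relate to the
string notation, the following Python sketch decodes a noun tag into
attribute-value pairs and rejects codes that are not among the agreed
ones. It is a sketch only: the names `NOUN_SPEC` and `decode_noun` are
hypothetical, and the language-specific values (e.g. common gender,
invariant number, Sem-gender) are deliberately left out.

```python
# Agreed noun attributes and codes, positions 1-4 of table 1 (Nouns).
NOUN_SPEC = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative", "a": "accusative"}),
]

def decode_noun(tag):
    """Turn a positional noun tag into a dict of attribute-value pairs."""
    if len(tag) != 1 + len(NOUN_SPEC) or tag[0] != "N":
        raise ValueError("not a five-position noun tag: %r" % tag)
    features = {"cat": "noun"}
    for (attr, values), code in zip(NOUN_SPEC, tag[1:]):
        if code == "-":          # attribute does not apply
            continue
        if code not in values:   # not an agreed code for this position
            raise ValueError("code %r not agreed for %s" % (code, attr))
        features[attr] = values[code]
    return features

print(decode_noun("Ncms-"))
# {'cat': 'noun', 'type': 'common', 'gender': 'masculine', 'number': 'singular'}
```

A tag that uses a language-specific code, such as a common-gender noun,
would be rejected by this sketch until that code is added to the general
model or declared separately.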