Lexical descriptions and corpus tags

  This section has been elaborated by Vladimír Petkevic, Faculty of Philosophy, Charles University, Prague.

As in Multext, Multext-East distinguishes the morphosyntactic information contained in the word-form lexica from the corpus tags. Five of the seven partners (Bulgarian, English, Estonian, Hungarian, Romanian) developed a set of corpus tags to be used for automatic disambiguation. For Czech and Slovene no special set of corpus tags was developed, i.e. the corpus was annotated directly with the lexical morphosyntactic descriptions.

The tagsets for these five languages are less discriminating than their morphosyntactic descriptions, i.e. for each language there is a many-to-one mapping from the set of morphosyntactic descriptions to the set of corpus tags. In other words, a corpus tag is, as in Multext, seen as a kind of underspecified lexical morphosyntactic description.
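
The many-to-one relation can be pictured with a small Python sketch. The descriptions and tags below are invented for illustration only and are not taken from any of the project's mapping tables; the names MSD_TO_TAG and group_by_tag are likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping, for illustration only: neither the descriptions
# nor the corpus tags are taken from the project's tables.
MSD_TO_TAG = {
    "Ncmsn": "NN",
    "Ncmsg": "NN",
    "Ncfsn": "NN",
    "Ncmpn": "NNS",
    "Vmip3s": "VFIN",
    "Vmip3p": "VFIN",
}


def group_by_tag(msd_to_tag):
    """Invert the mapping: corpus tag -> sorted list of descriptions it covers."""
    groups = defaultdict(list)
    for msd, tag in msd_to_tag.items():
        groups[tag].append(msd)
    return {tag: sorted(msds) for tag, msds in groups.items()}


groups = group_by_tag(MSD_TO_TAG)
for tag, msds in sorted(groups.items()):
    print(tag, "<-", ", ".join(msds))
```

Because the mapping is a function from descriptions to tags, every description receives exactly one tag, while one tag may stand for several descriptions. This is the sense in which a corpus tag is underspecified.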

As detailed in (Bel, Calzolari and Monachini, eds. 1995), there are two reasons for distinguishing corpus tags from morphosyntactic descriptions:

1. The probabilistic taggers used for disambiguation can take account only of the local context of a word, so some morphological distinctions reflected in the morphosyntactic descriptions cannot be resolved by them. Resolving these fine distinctions would require much more complex tools, which are well beyond the scope of the project.
2. The corpora developed within the project are too small for the tagger(s) to be trained properly to distinguish rare configurations. The much smaller tagsets thus make better use of the limited training material.

The following table gives the "measure of collapsing" of the morphosyntactic descriptions into the corpus tags for all the languages involved:

Language    Number of lex. specifications  Number of corpus tags
==========  =============================  =====================
English                               131                     44
Romanian                              661                     79
Slovene                              2044                      -
Czech                                1316                      -
Bulgarian                             323                    119
Estonian                              760                     76
Hungarian                             618                    101
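
As a rough way to read the table, the average number of lexical specifications per corpus tag can be computed for each language. The sketch below uses only the counts given above; the ratio itself is an illustrative measure introduced here, not one defined by the project, and the names COUNTS and collapse_ratio are hypothetical.

```python
# Counts copied from the table above; None marks a language for which
# no separate corpus tagset was developed.
COUNTS = {
    "English": (131, 44),
    "Romanian": (661, 79),
    "Slovene": (2044, None),
    "Czech": (1316, None),
    "Bulgarian": (323, 119),
    "Estonian": (760, 76),
    "Hungarian": (618, 101),
}


def collapse_ratio(n_specs, n_tags):
    """Average number of lexical specifications per corpus tag."""
    if n_tags is None:
        return None
    return n_specs / n_tags


ratios = {lang: collapse_ratio(*counts) for lang, counts in COUNTS.items()}
for lang, ratio in ratios.items():
    shown = "-" if ratio is None else format(ratio, ".1f")
    print(f"{lang:<10} {shown}")
```

By this measure the collapsing ranges from about 2.7 specifications per tag for Bulgarian to 10.0 for Estonian.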

As the choice of the corpus tagset is fully language-dependent, no attempt has been made to reach a common tagset for all the languages involved in the project (one also has to consider that the languages represent four language families: Germanic, Finno-Ugric, Romance and Slavic). Thus, each of the five languages with corpus tags has its own distinct tagset. For each language, the mapping of the set of lexical specifications to the corpus tagset is reflected in a separate table (tbl.tag.corpus.xx, where xx is the country's code).

In addition to the corpus tags related to lexical morphosyntactic specifications, there are also corpus tags for punctuation, which are described in the subsequent section.

