As in Multext, Multext-East distinguishes the morphosyntactic information contained in the word-form lexica from the corpus tags. Five of the seven partners (Bulgarian, English, Estonian, Hungarian, Romanian) developed a set of corpus tags to be used for automatic disambiguation. For Czech and Slovene no special set of corpus tags was developed, i.e. the corpus was annotated directly with the lexical morphosyntactic descriptions.
The tagsets for these five languages are less discriminating than their morphosyntactic descriptions, i.e. for each language there is a many-to-one mapping from the set of morphosyntactic descriptions to the set of corpus tags. In other words, a corpus tag is, as in Multext, seen as a kind of underspecified lexical morphosyntactic description.
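The many-to-one mapping can be illustrated with a minimal Python sketch. The descriptions and tags in the table below are hypothetical placeholders chosen for illustration; they are not taken from any of the project's actual tagsets, and the function names are likewise assumptions.

```python
# Hypothetical mapping from lexical morphosyntactic descriptions to corpus tags.
# All strings are illustrative only; real mappings are language-specific.
MSD_TO_TAG = {
    "Ncms": "NN",    # e.g. common noun, masculine, singular
    "Ncfs": "NN",    # e.g. common noun, feminine, singular
    "Ncmp": "NNS",   # e.g. common noun, masculine, plural
    "Ncfp": "NNS",   # e.g. common noun, feminine, plural
    "Vmip3s": "VB",  # e.g. main verb, indicative, present, 3rd person singular
}


def corpus_tag(msd):
    """Return the corpus tag that a lexical description collapses into."""
    return MSD_TO_TAG[msd]


def msds_for_tag(tag):
    """Return, in sorted order, every lexical description covered by a tag."""
    return sorted(m for m, t in MSD_TO_TAG.items() if t == tag)


if __name__ == "__main__":
    print(corpus_tag("Ncms"))
    print(msds_for_tag("NNS"))
```

Because several descriptions share one tag, the mapping cannot be inverted to a single description, which is the sense in which a corpus tag is underspecified.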
As detailed in (Bel, Calzolari and Monachini, eds. 1995), there are two reasons for distinguishing corpus tags from morphosyntactic descriptions:
The following table gives the "measure of collapsing" of the morphosyntactic descriptions into the corpus tags for all the languages involved:

Language     Number of lex. specifications   Number of corpus tags
English      131                             44
Romanian     661                             79
Slovene      2044                            -
Czech        1316                            -
Bulgarian    323                             119
Estonian     760                             76
Hungarian    618                             101
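One way to read these figures is as a ratio of lexical specifications per corpus tag. The sketch below computes that ratio from the counts in the table; the ratio itself and the names COUNTS and collapse_ratio are assumptions introduced for illustration and are not defined by the project.

```python
# Counts copied from the table: (lexical specifications, corpus tags).
# None marks languages for which no separate corpus tagset was developed.
COUNTS = {
    "English": (131, 44),
    "Romanian": (661, 79),
    "Slovene": (2044, None),
    "Czech": (1316, None),
    "Bulgarian": (323, 119),
    "Estonian": (760, 76),
    "Hungarian": (618, 101),
}


def collapse_ratio(n_specs, n_tags):
    """Average number of lexical specifications per corpus tag, or None."""
    if n_tags is None:
        return None
    return n_specs / n_tags


if __name__ == "__main__":
    for language, (n_specs, n_tags) in COUNTS.items():
        ratio = collapse_ratio(n_specs, n_tags)
        shown = "-" if ratio is None else f"{ratio:.1f}"
        print(f"{language:<10} {shown}")
```

A higher ratio means that more lexical distinctions are merged into each corpus tag on average; it says nothing about which distinctions are merged.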
As the choice of the corpus tagset is fully language-dependent, no attempt has been made to reach a common tagset for all the languages involved in the project (one must also consider that the languages represent four language families: Germanic, Finno-Ugric, Romance and Slavic). Thus, each of the five languages with corpus tags has a distinct tagset. For each language, the mapping of the set of lexical specifications to the corpus tagset is given in a separate table (tbl.tag.corpus.xx, where xx is the country's code).
In addition to corpus tags related to lexical morphosyntactic specifications, there are also corpus tags for punctuation, which are described in the subsequent section.