
Lexical lists

The proposed specifications, together with their codes, will be used to encode the word-form lists. These lists are the resources required by the tool that will perform the automatic tagging of the corpus.

Note that the lexicons supplied by each MULTEXT-East partner will have the following form:

Word-form, lemma, morphosyntactic information, TAG
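As an illustration only, one entry of this form could be read as sketched below. The comma separator, the optional fourth field, the class and function names, and the sample values are all assumptions made for the example; the specification itself fixes only the order of the four parts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LexiconEntry:
    word_form: str
    lemma: str
    msd: str             # morphosyntactic information, kept as an opaque string
    tag: Optional[str]   # corpus tag; None when the field is absent or empty


def parse_entry(line: str, sep: str = ",") -> LexiconEntry:
    """Split one lexicon line into its three or four fields.

    Assumes `sep` never occurs inside a field (hypothetical convention).
    """
    fields = [f.strip() for f in line.split(sep)]
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 fields, got {len(fields)}: {line!r}")
    tag = fields[3] if len(fields) == 4 and fields[3] else None
    return LexiconEntry(fields[0], fields[1], fields[2], tag)


# Hypothetical sample line; the morphosyntactic code shown is invented.
entry = parse_entry("walks, walk, verb-present-3sg")
print(entry.word_form, entry.lemma, entry.msd, entry.tag)
```

Treating the tag as optional lets the same reader handle lexicons delivered before and after tags are assigned.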

The TAG part will be provided in a second phase of the project. Our experience with the six original MULTEXT languages showed that it is not possible to specify identical tagsets across languages, even for languages within the same family. The need for idiosyncratic, language-specific tagsets has also been confirmed within PAROLE-MLAP. The linguistic properties represented in different tagsets can be made comparable and harmonized only by defining them in terms of the specifications contained in the lexicons, i.e. by relating the tagset to the lexicon. In this way the specifications, agreed and harmonized across languages, make different physical tagsets compatible and mappable onto one another. In other words, the lexical specifications serve as a common platform across languages, a sort of "interface" that permits different tagsets "to speak" to each other.

This is the philosophy that guided the tagset mapping exercise within EAGLES (Teufel 1995), in which two different tagsets are mapped via the lexicon: the lexicon specifications are modelled in a typed hierarchy, the semantics of each physical tag in the two tagsets is defined in terms of those specifications by means of Prolog rules, and the mapping is then performed automatically by a tool. The same is being done for the mapping of two different Italian tagsets.
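The idea of mapping two tagsets through shared lexical specifications can be sketched in a few lines. This is not the EAGLES tool: it replaces the typed hierarchy and Prolog rules with flat Python dictionaries, and the tag names, feature names, and the `map_tag` helper are hypothetical, chosen only to show the principle.

```python
from typing import Dict, FrozenSet, List

# One lexical specification, written as a set of attribute=value strings.
Spec = FrozenSet[str]


def spec(*features: str) -> Spec:
    return frozenset(features)


# Hypothetical tagsets: each physical tag is defined by the lexical
# specifications it covers (invented tags and features).
TAGSET_A: Dict[str, List[Spec]] = {
    "NN": [spec("cat=noun", "num=sg")],
    "NNS": [spec("cat=noun", "num=pl")],
}
TAGSET_B: Dict[str, List[Spec]] = {
    "N": [spec("cat=noun", "num=sg"), spec("cat=noun", "num=pl")],
    "V": [spec("cat=verb")],
}


def map_tag(tag: str,
            source: Dict[str, List[Spec]],
            target: Dict[str, List[Spec]]) -> List[str]:
    """Return the target tags that share at least one specification with `tag`."""
    covered = set(source[tag])
    return sorted(t for t, specs in target.items() if covered & set(specs))


print(map_tag("NN", TAGSET_A, TAGSET_B))  # ['N']
print(map_tag("N", TAGSET_B, TAGSET_A))   # ['NN', 'NNS']
```

Because both tagsets are defined against the same specifications, neither needs to know about the other; a coarse tag simply maps to several finer ones, and vice versa.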

Tomaz Erjavec
Wed Oct 16 12:08:36 MDT 1996