Next: Language specific applications Up: Multext D1.6.1 B Previous: Notation

Comments on labels for corpus tags

Current encoding practices use widely different naming conventions for corpus tags. Different sets of labels can be found even for a single language: for example, SUBSMS, SBMS, NCMS, Nms, etc. can all represent "Common noun, masc. sing." in different systems for the very same language.

As already mentioned, corpus tags have been found to be strongly tied to the tool and to the language. Each language will therefore have its own set, based on different considerations. However, it was considered helpful to suggest some naming conventions for the sake of harmonization. The following is an attempt by the French partners to give simple general guidelines for achieving a coherent naming convention within the project.

Corpus tags should be all upper case, in order to distinguish them from lexical descriptions.
The part of speech should be encoded as the first character, using the same letter as the corresponding category in the lexical descriptions.
Where possible, each character after the first should encode one attribute value, in order, using the same letter as in the lexical descriptions but in upper case.
Where values are ambiguous or merged, a new character should be used, one that is not already in the set of possible values for that attribute.
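
The guidelines above can be illustrated with a minimal Python sketch. It assumes a positional lexical description such as "Ncms" (an illustrative input, not taken from this section) whose first character is the part of speech. The function name corpus_tag and the parameters merged and value_sets are hypothetical, and the merged gender value "x" is invented purely for the example.

```python
def corpus_tag(lexical_description, merged=None, value_sets=None):
    """Derive a corpus tag from a positional lexical description.

    lexical_description: e.g. "Ncms"; position 0 is the part of speech.
    merged: hypothetical {position: {lexical_value: new_character}} for
        values that are merged or ambiguous in the corpus tagset.
    value_sets: hypothetical {position: set of lexical values}, used to
        check that a new character is not already a value at that position.
    """
    merged = merged or {}
    value_sets = value_sets or {}
    tag = []
    for position, value in enumerate(lexical_description):
        replacement = merged.get(position, {}).get(value)
        if replacement is None:
            # Same letter as in the lexical description, but upper case.
            tag.append(value.upper())
            continue
        # A new character must not collide with an existing value.
        taken = {v.upper() for v in value_sets.get(position, set())}
        if replacement.upper() in taken:
            raise ValueError(
                f"{replacement!r} is already a value at position {position}"
            )
        tag.append(replacement.upper())
    return "".join(tag)


# Plain case: every character is simply upper-cased.
plain = corpus_tag("Ncms")

# Hypothetical merge of masculine and feminine into one new value "x".
gender_values = {2: {"m", "f", "n"}}
merged_gender = corpus_tag(
    "Ncms", merged={2: {"m": "x", "f": "x"}}, value_sets=gender_values
)
```

Under these assumptions, plain is "NCMS" and merged_gender is "NCXS". Choosing an existing value such as "f" as the merged character raises an error, which mirrors the last guideline.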

This is not a formal system, and it may lead to ambiguities. Arriving at a final set of tags requires thorough testing, since only experimentation will show how a given set behaves. Decisions about whether special devices for automatic conversion are needed and useful are also expected to affect the concrete tags chosen for a language. The tags proposed by each group so far must therefore be considered tentative until the end of the experimentation phase.


Multext