Next: Evaluation Up: Probabilistic Tagging in Previous: Language Resources

A Complexity Metric for Tagging Experiments

The performance of a tagger is usually measured as the percentage of correct tag assignments. While this sounds plausible at first, the figure alone says little about the quality of the tagger, because several parameters that influence performance are not reflected in a single percentage. The complexity of the text is proposed here as an additional qualifying parameter that puts the percentage score into perspective.

One simple measure is the average number of tags per word, i.e. the sum of the numbers of possible tags over all words, divided by the number of words. A text with a score of 1.0 is therefore trivial to tag, as each word has only one possible tag. A sample score for an English text is 1.92, while the sample Romanian scores are 1.69 (Orwell's 1984) and 1.72 (Plato's Republic). These scores were computed ignoring punctuation. The larger the score, the higher the ambiguity of the text and the more difficult it is to tag. The complexity of a text to be tagged depends mostly on the tagset used by the tagger.
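The measure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the toy lexicon and its tag labels, and the punctuation set are all hypothetical and are not taken from the tagger or the lexicons used in the experiments.

```python
# Assumed punctuation set; the original experiments may have used a different one.
PUNCTUATION = frozenset(".,;:!?\"'()")


def average_tags_per_word(tokens, lexicon):
    """Return the average number of possible tags per word.

    tokens  -- list of token strings (punctuation tokens are skipped)
    lexicon -- dict mapping a lower-cased word form to its set of possible tags
    A word missing from the lexicon raises KeyError rather than being guessed.
    """
    words = [t for t in tokens if t not in PUNCTUATION]
    if not words:
        raise ValueError("no words to score")
    total_tags = sum(len(lexicon[w.lower()]) for w in words)
    return total_tags / len(words)


# Hypothetical toy lexicon with illustrative tag labels.
toy_lexicon = {
    "the": {"DET"},
    "old": {"ADJ"},
    "can": {"NOUN", "VERB", "AUX"},
    "rusts": {"NOUN", "VERB"},
}

toy_tokens = ["The", "old", "can", "rusts", "."]
toy_score = average_tags_per_word(toy_tokens, toy_lexicon)
print(toy_score)  # (1 + 1 + 3 + 2) / 4 = 1.75
```

A score of exactly 1.0 would mean that every word in the text has a single entry in its tag set, which corresponds to the trivial case mentioned above.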

In evaluating the complexity of the Romanian texts, we should note that it was reduced by some encoding decisions which are described in detail in D. Tufis et al., "Corpora and Corpus-Based Morpho-Lexical Processing". For instance, to account for case syncretism in Romanian, the Nominative and Accusative were conflated into the Direct case, while the Genitive and Dative were conflated into the Oblique case. Considering the tags for Nouns, Adjectives, Pronouns, Determiners and Articles, for which the Case information has been preserved (as opposed to Numerals, Prepositions and Abbreviations, where this distinction has been dropped), one finds a large difference in the complexity of the same text depending on whether the case syncretism is expanded or the syncretic cases are conflated. In Plato's Republic, the number of Case-relevant tags that the tagger would have to differentiate is 138585 with the syncretism expanded, but only 24425 with the syncretic cases conflated.

Obviously, the cardinality of the tagset with the Case syncretism expanded would be much larger: instead of the current 79 codes, the tagset would contain almost 500 codes.

Another important decision was to eliminate the neuter gender, since in the singular all neuters behave like masculines and in the plural like feminines. With only these two encoding decisions, the complexity of the texts decreased significantly, down to the manageable values mentioned above.
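The effect of the two encoding decisions can be illustrated with a small sketch. The tag representation used here, a (part of speech, gender, number, case) tuple with labels such as "Neut" or "Dir", is a hypothetical stand-in chosen for readability; it is not the actual code format of the 79-code tagset.

```python
# Case conflation as described in the text: Nominative/Accusative -> Direct,
# Genitive/Dative -> Oblique. The label strings themselves are assumptions.
CASE_MAP = {"Nom": "Dir", "Acc": "Dir", "Gen": "Obl", "Dat": "Obl"}


def conflate(tag):
    """Map an expanded (pos, gender, number, case) tag to the reduced tagset."""
    pos, gender, number, case = tag
    # Neuter is eliminated: masculine in the singular, feminine in the plural.
    if gender == "Neut":
        gender = "Masc" if number == "Sg" else "Fem"
    return (pos, gender, number, CASE_MAP.get(case, case))


# Hypothetical candidate tags for one ambiguous neuter singular noun form.
expanded_tags = {
    ("Noun", "Neut", "Sg", "Nom"),
    ("Noun", "Neut", "Sg", "Acc"),
    ("Noun", "Neut", "Sg", "Gen"),
    ("Noun", "Neut", "Sg", "Dat"),
}
reduced_tags = {conflate(t) for t in expanded_tags}

print(len(expanded_tags), "candidate tags before conflation")
print(len(reduced_tags), "candidate tags after conflation")
```

In this toy case the four expanded candidates collapse into two, one Direct and one Oblique, which is the kind of reduction in ambiguity that the encoding decisions aim for.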



Multext-East