The tagger was evaluated twice on about 23,000 words, namely the opening passages of 1984 and Republic, which had not been included in the training data. A third test consisted of tagging the full training corpus with the language resources derived from the entire data set. The results are highly encouraging, with correct assignments ranging between 97% and 98%.
As shown in the previous sections, the tagger output is a tabular format with each word on a line, followed by an ordered list of ``[tag:probability]'' pairs. The accuracy figures mentioned above were measured considering only the tag with the highest probability.
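For example, an output line for a two-way ambiguous word might look as follows (the tags and probabilities here are invented for illustration):

    mare  [ASN:0.71] [NSRN:0.29]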
Another option is 2-best tagging (a variant of K-best tagging; see Brill, Computational Linguistics, 1995), that is, preserving in the output the two best-rated tags when the difference between their probabilities is below a specific threshold. We set this threshold to 1, and out of the 641 errors in the first 23,643 words of 1984, 126 were removed, raising the accuracy from 97.288% to 97.821%. The price was an increase in the average number of tags per word from 1 to 1.04: out of the 23,643 tagged words, only 969 received two tags.
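A minimal sketch of this selection rule, assuming the per-word candidates arrive sorted by decreasing probability (the function name and data layout are our own, the tags and probabilities in the example are invented, and since the text does not specify the scale on which the threshold of 1 is applied, the threshold is left as a parameter):

    def two_best(ranked, threshold):
        """Keep the top-ranked tag, plus the runner-up when the
        difference between their probabilities is below `threshold`.
        `ranked` is a list of (tag, probability) pairs sorted by
        decreasing probability."""
        kept = [ranked[0]]
        if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < threshold:
            kept.append(ranked[1])
        return kept

    # Illustrative call: the runner-up is close enough, so both survive.
    print(two_best([("VMIS", 0.58), ("VMP", 0.40), ("NSRN", 0.02)], 0.2))
    # -> [('VMIS', 0.58), ('VMP', 0.40)]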
By analysing the 3085 tagging errors over the entire text of Orwell's 1984, we discovered that 189 of the errors attributed to the tagger were in fact human errors in the hand-tagged training corpus. In all these cases the tags assigned by the tagger were the correct ones, so the actual accuracy for this experiment is even better than the one reported above.
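In raw counts (a quick check on the figures above), this correction means

\[
3085 - 189 = 2896 \;\;\text{genuine tagging errors}, \qquad \frac{189}{3085} \approx 6.1\%,
\]

i.e. roughly one reported error in sixteen was actually a mistake in the gold-standard annotation.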
Several other, genuine, errors suggested further modifications of the tagset; some of these modifications are discussed below. We ran the tagger with several tagsets and, in those cases where it was definitely wrong (that is, the probability of the chosen tag was much larger than the probability of the correct one), we modified by hand the MSD and the corresponding tag. This happened, for instance, with the wordforms si and si-, which are listed in the lexicon as both conjunction and adverb. Since they were very hard to classify correctly, the two interpretations were merged into one tag (CVR).

Another example is given by the word fi, initially marked as an aspectual particle, but rarely identified correctly by the tagger. Since this is a very frequent word in Romanian, its ``contribution'' to the error list was significant (4.28% of all the errors). We re-tagged it as an infinitive (as most grammar books would do), and all these errors were pruned.

A further modification of the tagset consisted in distinguishing among the particles (Qn, Qs, Qf) initially conflated into one tag (Q). This distinction eliminated several disambiguation errors (such as confusions between auxiliary and main usages of some verbs). Along the same lines, a distinction (absent from the initial tagset) was introduced between proper nouns and common nouns. This distinction helped diminish tagging errors for the word lui, which, when it precedes a proper noun, is always a genitival article (not a pronoun, which is by far its most probable tag).

Another problem was caused by the initial tag ASRN, which was meant for adjectives, singular, direct case, indefinite. By analysing all the MSDs that were mapped onto this tag, we noticed that they identified only feminine adjectives, although we did not mean to preserve the gender distinction in our tagset. This tag was in most cases mistakenly preferred to the more general tag ASN (adjectives, singular, indefinite). Again, by inspecting the MSDs mapped onto the ASN tag, we noticed only masculine, indefinite adjectives in the singular, which are undetermined for Case. Therefore, a natural decision was to conflate the ASRN and ASN tags. This decision eliminated tagging errors for those few (but very commonly used) adjectives whose feminine singular indefinite forms in the direct Case are identical to the masculine forms (for instance ``mare'' - big).
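In practice, revisions like these amount to editing the MSD-to-tagset mapping table and re-running the tagger. A minimal sketch of such a remapping pass, assuming the mapping is kept as a simple dictionary (the MSD strings and table entries below are illustrative, not the actual mapping used):

    # Illustrative fragment of an MSD -> corpus-tag mapping.
    msd_to_tag = {
        "Afpfsrn": "ASRN",  # adjective, feminine, singular, direct, indefinite
        "Afpms-n": "ASN",   # adjective, masculine, singular, indefinite
        "Qn": "Q",          # negative particle
        "Qs": "Q",          # subjunctive particle
        "Qf": "Q",          # infinitival particle
    }

    conflations = {"ASRN": "ASN"}       # merge ASRN into the more general ASN
    fine_grained = {"Qn", "Qs", "Qf"}   # undo the particle conflation

    revised = {}
    for msd, tag in msd_to_tag.items():
        if msd in fine_grained:
            revised[msd] = msd          # Qn, Qs, Qf become tags themselves
        else:
            revised[msd] = conflations.get(tag, tag)

    # revised == {"Afpfsrn": "ASN", "Afpms-n": "ASN",
    #             "Qn": "Qn", "Qs": "Qs", "Qf": "Qf"}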