
Conclusion and Further Work

It has been shown that it is quite easy to adapt a language-dependent probabilistic tagger to work with data from other languages as well. Owing to the way the resource files are created, the training process takes very little time. A client/server architecture provides an ideal framework for programs that require linguistic data, and the client implementation in Java makes it possible to run the tagger on a wide variety of platforms.
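As a rough illustration of this client/server setup, the following is a minimal sketch of a Java client that sends a tokenised sentence to a tagger server and reads back the tagged result. The host, port, and line-oriented protocol are assumptions made for illustration only; the actual server interface may differ.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class TaggerClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical server address; adjust to the real tagger server.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2000;

        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {

            // Send one tokenised sentence per line; read back word/tag pairs.
            out.println("Acesta este un exemplu .");
            String tagged = in.readLine();
            System.out.println(tagged);
        }
    }
}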

With a comparatively small amount of training data, it is possible to achieve very good results.

The empirical methodology we described for deriving a convenient tagset (i.e., one informative enough yet manageable for the tagger) from a large set of morpho-syntactic description codes proved successful.

An old misconception, namely that highly inflected languages are doomed to poor performance when processed with probabilistic methods, has been shown to be wrong. In fact, we believe that a highly inflected language has a better chance of high tagging accuracy than other languages. The reason is that inflected words are less ambiguous than their base forms, and in many cases they are simply unambiguous. These unambiguous words act as tagging islands, clues for the rest of the words in the text, so the number of possibilities to be considered when searching for the most probable tag assignment is significantly reduced.
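The effect of such tagging islands can be made concrete with a small back-of-the-envelope computation: the number of candidate tag sequences for a sentence is the product of the per-word ambiguity class sizes, so every unambiguous word contributes a factor of one while each ambiguous word multiplies the count. The ambiguity class sizes in the sketch below are invented purely for illustration.

import java.util.List;

public class TaggingIslands {
    // Product of per-word ambiguity class sizes = number of tag sequences.
    static long candidateSequences(List<Integer> ambiguityClassSizes) {
        long total = 1;
        for (int size : ambiguityClassSizes) {
            total *= size;
        }
        return total;
    }

    public static void main(String[] args) {
        // A weakly inflected rendering: several 2-3 way ambiguous tokens.
        List<Integer> baseForms = List.of(3, 2, 3, 2, 2);
        // A highly inflected rendering: inflection disambiguates most tokens.
        List<Integer> inflectedForms = List.of(1, 2, 1, 1, 2);

        System.out.println("Base forms:      " + candidateSequences(baseForms) + " sequences");
        System.out.println("Inflected forms: " + candidateSequences(inflectedForms) + " sequences");
    }
}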

We plan further experiments with several other tagsets, and we are currently investigating the possibility of developing a postprocessor (essentially a dictionary lookup and string pattern matching program) for the tagger, able to recover the MSD corpus from a given tag corpus (see the sketch below).
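The following is a minimal sketch of the kind of postprocessor envisaged here: a dictionary lookup that maps a word form and its reduced tag back to a full morpho-syntactic description (MSD). The dictionary entries and the tag and MSD codes are illustrative placeholders; the real program would also apply string pattern matching for words missing from the dictionary.

import java.util.HashMap;
import java.util.Map;

public class MsdRecovery {
    // (wordform, reduced tag) -> full MSD code; contents are illustrative only.
    private final Map<String, String> dictionary = new HashMap<>();

    MsdRecovery() {
        dictionary.put("casa|Nc", "Ncfsry");      // hypothetical entry
        dictionary.put("frumoasa|Af", "Afpfsry"); // hypothetical entry
    }

    String recover(String word, String tag) {
        String msd = dictionary.get(word + "|" + tag);
        if (msd != null) {
            return msd;
        }
        // Fallback: string pattern matching on word endings could go here.
        return tag; // give back at least the reduced tag
    }

    public static void main(String[] args) {
        MsdRecovery rec = new MsdRecovery();
        System.out.println(rec.recover("casa", "Nc"));       // -> Ncfsry
        System.out.println(rec.recover("necunoscut", "Af")); // falls back to Af
    }
}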


