Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical Processing

5. Conclusions

This paper described the results of constructing TEI-conformant encoded corpora in a multilingual environment, a wide-coverage lexicon and a large training corpus. Using several counts and statistics, we showed that the training corpus covers most of the morpho-syntactic classes defined for Romanian, thus offering a solid basis for further projects on automatic disambiguation and inductive natural language processing. We have also shown that in Romanian (as in other highly inflectional languages) intra-categorial ambiguity is far more significant than inter-categorial ambiguity. Consequently, the tagsets used for statistical disambiguation as developed for Western languages (especially those used for English) are not appropriate, and further trial-and-error experiments are needed to design useful tagsets and tagging procedures.
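To make the distinction concrete, the following minimal Python sketch classifies the ambiguity of a word form from its set of MSDs. It relies on the MULTEXT convention that the first character of an MSD encodes the category. The lexicon entries, the form names and the function names are hypothetical placeholders chosen for illustration; they are not taken from our lexicon or tools.

```python
from collections import defaultdict

# Hypothetical toy lexicon: word form -> set of MSD codes.
# Forms and codes are illustrative placeholders only.
TOY_LEXICON = {
    "form_a": {"Ncfsrn", "Ncfson"},            # one category, two readings
    "form_b": {"Ncmsrn", "Vmip3s"},            # two categories, one reading each
    "form_c": {"Afpms-n"},                     # unambiguous
    "form_d": {"Ncmsrn", "Ncmson", "Vmip3s"},  # both kinds of ambiguity
}


def classify_ambiguity(msds):
    """Return (inter_categorial, intra_categorial) flags for one word form."""
    by_category = defaultdict(set)
    for msd in msds:
        # Assumption: the first character of an MSD is its category code.
        by_category[msd[0]].add(msd)
    inter = len(by_category) > 1
    intra = any(len(group) > 1 for group in by_category.values())
    return inter, intra


def ambiguity_report(lexicon):
    """Map every word form to its (inter, intra) ambiguity flags."""
    return {form: classify_ambiguity(msds) for form, msds in lexicon.items()}


REPORT = ambiguity_report(TOY_LEXICON)
```

A tagset that keeps only the category resolves nothing for a form like `form_a`, whose readings all fall within one category; this is why coarse, category-oriented tagsets are a poor fit when intra-categorial ambiguity dominates.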

Based on the MSD ambiguity classes we identified, we plan to investigate the possibility of constructing language-specific tagsets in a principled way. Given that the morpho-syntactic specification of the lexicon was done in a multilingual environment (7 languages), the prospect of developing language-specific tagsets with a common methodology, starting from these common morpho-syntactic specifications, is challenging. At present, a semi-automatic procedure generates several tagsets from the MSDs, according to criteria specified by the designer. This is a first step towards tagset design. Since modifying the training corpus is not desirable (it is annotated at the MSD level, that is, at the finest granularity provided by the lexicon), the automatic tagging procedure should be able to consult an external resource file mapping the MSDs onto an experimental corpus tagset. This way, one should be able to experiment with several tagsets without modifying the training corpus, changing only the small mapping file.
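The idea of an external mapping file can be sketched as follows. This is only an illustration under stated assumptions, not the procedure we actually use: the tab-separated file format, the criterion of keeping selected attribute positions of an MSD, the toy corpus and all function names are hypothetical.

```python
def project_msd(msd, keep_positions):
    """Derive a coarser tag by keeping only selected MSD character positions."""
    return "".join(msd[i] for i in keep_positions if i < len(msd))


def derive_mapping(msds, keep_positions):
    """Build an MSD -> tag mapping from a designer-chosen set of positions."""
    return {msd: project_msd(msd, keep_positions) for msd in msds}


def format_mapping(mapping):
    """Serialize a mapping as 'MSD<TAB>TAG' lines (assumed file format)."""
    return "\n".join(f"{msd}\t{tag}" for msd, tag in sorted(mapping.items()))


def parse_mapping(text):
    """Parse 'MSD<TAB>TAG' lines, skipping blank lines and '#' comments."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        msd, tag = line.split("\t")
        mapping[msd] = tag
    return mapping


def retag(corpus, mapping):
    """Return a retagged view of the corpus; the MSD-level corpus is untouched."""
    return [(word, mapping[msd]) for word, msd in corpus]


# Hypothetical MSD-annotated corpus fragment: (word, MSD) pairs.
CORPUS = [("w1", "Ncfsrn"), ("w2", "Vmip3s"), ("w3", "Ncfson")]
MSDS = {msd for _, msd in CORPUS}

# Two experimental tagsets derived from the same MSDs.
COARSE = derive_mapping(MSDS, keep_positions=(0, 1))
FINER = derive_mapping(MSDS, keep_positions=(0, 1, 3))

# Round trip through the textual mapping-file format.
COARSE_FROM_FILE = parse_mapping(format_mapping(COARSE))
```

Calling `retag(CORPUS, COARSE)` and `retag(CORPUS, FINER)` yields two differently tagged views of the same corpus, while `CORPUS` itself stays annotated at the MSD level. An MSD missing from the mapping raises a `KeyError`, which makes gaps in an experimental tagset visible early.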

Note: At the time this article went to print, some extremely good results were obtained in the automatic tagging of Romanian texts. The tagger, a brand new version of the Birmingham tagger authored by Oliver Manson, was trained on about 180,000 hand-tagged words extracted from "1984" and "Republic" and tested on about 50,000 unseen words from the same books. The tagset contained 82 tags (including 9 tags for punctuation), and the accuracy of the tagging process was well beyond our expectations, namely 95.5% for "Republic" and 97.1% for "1984".

To our knowledge, these are the best results obtained so far for an Eastern European language.

Details on the tagger, the tagset design and other experimental data are presented in [16].


  1. J. SINCLAIR, Corpus Typology, EAGLES DOCUMENT EAG-CWG-IR-2, Version of October 1994.
  2. W. TEUBERT, Language Resources for Language Technology, in Recent Advances in Romanian Language Technology, D. Tufiş, P. Anderson (eds.), Edit. Academiei Române, Bucureşti, 1997.
  3. ANA-MARIA BARBU, D. TUFIŞ, C. DIACONU, LIDIA DIACONU, A Plea for Corpus-based Linguistics, in Language and Technology, D. Tufiş (ed.), Edit. Academiei Române, Bucureşti, 1996 (in Romanian).
  4. N. IDE, J. VERONIS, Corpus Encoding Standard, MULTEXT/EAGLES Report, http://www.lpl.univ-aix.fr/projects/multext/CES/CES1.html, 1995.
  5. D. TUFIŞ, Şt. BRUDA, Structure Markup in CES and Preliminary Statistics on Romanian Translation of Plato's "Republic", in TELRI Newsletter, no. 5, April 1997.
  6. C.M. SPERBERG-MCQUEEN, L. BURNARD, Text Encoding Initiative: Guidelines for Electronic Text Encoding and Interchange, Chicago and Oxford Press, 1994.
  7. N. IDE, J. VERONIS (eds.), Text Encoding Initiative: Background and Context, Dordrecht, Kluwer Academic Publishers, 1995.
  8. M. MONACHINI, N. CALZOLARI (eds.), Synopsis and Comparison of Morphosyntactic Phenomena Encoded in Lexicons and Corpora. A Common Proposal and Applications to European Languages, EAGLES Report, October 1994.
  9. N. CALZOLARI, M. MONACHINI (eds.), Common Specifications and Notation for Lexicon Encoding and Preliminary Proposal for the Tagsets, MULTEXT Report, March 1995.
  10. T. ERJAVEC, N. IDE, D. TUFIŞ, Development of Common Lexical Specifications for Six Eastern European Languages and Their Application to Stochastic Part of Speech Tagging, in Proceedings of ACH/ALLC'97 Conference, Kingston, Ontario, Canada, June 3-7, 1997.
  11. T. ERJAVEC, N. IDE, V. PETKEVIC, J. VERONIS, MULTEXT-EAST: Multilingual Text Tools and Corpora for Central and Eastern European Languages, in Proceedings of the Trans European Language Resource Infrastructure First Conference, Tihany, September 1996.
  12. D. TUFIŞ, Morpho-lexical processing in a multilingual environment, Tutorial Notes at EUROLAN'97 Summer School on Corpus Linguistics, Tuşnad, July 1997, http://www.infoiasi.ro/eurolan/annl and also http://www.racai.ro/~tufis.
  13. D. TUFIŞ, A Unification-Based Description of Gender-Agreement in Romanian, (forthcoming).
  14. D. FARKAS, D. ZEC, Agreement and Pronominal Reference, in Guglielmo Cinque and Giuliana Giusti (eds.), Advances in Romanian Linguistics, John Benjamins Publishing Company, Amsterdam-Philadelphia, 1995.
  15. D. TUFIŞ, A Generic Platform for Developing Language Resources and Applications, in Proceedings of the Trans European Language Resource Infrastructure Second Conference, Kaunas, April 1997.
  16. O. MANSON, D. TUFIŞ, Probabilistic Tagging in a Multilingual Environment: Making an English Tagger Understand Romanian, in Proceedings of the Third TELRI International Seminar, Tuscan World Center, Montecatini, October 1997.

