Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical
Processing
5. Conclusions
This paper described the results of constructing TEI-conformant
encoded corpora in a multilingual environment, a wide-coverage
lexicon, and a large training corpus. Counts and statistics showed
that the training corpus covers most of the morpho-syntactic
classes defined for Romanian, thus offering a solid basis for
further projects on automatic disambiguation and inductive natural
language processing. We have also shown that, because in
Romanian (as in other highly inflectional languages) intra-categorial
ambiguity is far more significant than inter-categorial
ambiguity, the tagsets developed for statistical disambiguation of
Western languages (especially those used for English)
are not appropriate, and further trial-and-error experiments are
needed to design useful tagsets and tagging procedures.
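The distinction between the two kinds of ambiguity can be made concrete
with a small sketch. It assumes that the lexicon is available as
(wordform, MSD) pairs and that the first character of an MSD encodes the
grammatical category, as in the MULTEXT-style specifications. The
function name, the placeholder wordforms and the MSD strings below are
illustrative assumptions, not data from the lexicon described here.

```python
from collections import defaultdict

# Toy entries: placeholder wordforms with illustrative MSD strings.
# They are NOT taken from the actual Romanian lexicon.
TOY_LEXICON = [
    ("w1", "Ncfsrn"), ("w1", "Ncfpry"),                      # same category
    ("w2", "Ncmsrn"), ("w2", "Vmip3s"),                      # two categories
    ("w3", "Afpms-n"),                                       # one reading
    ("w4", "Vmip3s"), ("w4", "Vmip3p"), ("w4", "Ncfsrn"),    # both kinds
]

def ambiguity_profile(lexicon):
    """Count wordforms by ambiguity type.

    A wordform is inter-categorially ambiguous if its MSDs span more
    than one category (first MSD character, by assumption), and
    intra-categorially ambiguous if at least two of its MSDs share a
    category. A wordform can be counted under both headings.
    """
    msds_by_form = defaultdict(set)
    for form, msd in lexicon:
        msds_by_form[form].add(msd)

    counts = {"unambiguous": 0, "intra": 0, "inter": 0}
    for msds in msds_by_form.values():
        if len(msds) == 1:
            counts["unambiguous"] += 1
            continue
        categories = {msd[0] for msd in msds}
        if len(categories) > 1:
            counts["inter"] += 1
        # More MSDs than categories means some category holds 2+ MSDs.
        if len(msds) > len(categories):
            counts["intra"] += 1
    return counts

PROFILE = ambiguity_profile(TOY_LEXICON)
```

Run over a real lexicon, a profile of this kind would show how much of
the ambiguity a coarse, category-only tagset leaves unresolved.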
Based on the MSD ambiguity classes we identified, we plan to
investigate whether language-specific tagsets can be constructed in a
principled way. Given that the morpho-syntactic
specification of the lexicon was done in a multilingual environment
(7 languages), the prospect of developing language-specific
tagsets with a common methodology, starting from these common
morpho-syntactic specifications, is challenging. At present, a
semi-automatic procedure generates several tagsets from the MSDs,
according to criteria specified by the designer. This
is a first step towards tagset design. Since modifying the
training corpus is not desirable (it is annotated at the MSD
level, that is, at the finest granularity provided by the lexicon),
the automatic tagging procedure should be able to consult an
external resource file mapping the MSDs onto an experimental corpus
tagset. This way, one can experiment with several
tagsets without modifying the training corpus, changing only the small
mapping file.
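The mapping-file idea can be sketched as follows. This is a minimal
illustration under stated assumptions, not the procedure actually used:
the tab-separated file format, the function names, and the criterion of
keeping selected MSD character positions are all hypothetical, and the
corpus tokens and MSD strings are placeholders.

```python
# Placeholder MSD-annotated corpus; it is never modified below.
CORPUS = [("w1", "Ncfsrn"), ("w2", "Vmip3s"), ("w3", "Ncfpry")]

def project(msd, keep):
    """Reduce an MSD to the character positions in `keep`
    (one hypothetical designer criterion)."""
    return "".join(msd[i] for i in keep if i < len(msd))

def build_mapping(msds, keep):
    """Derive an MSD -> tag mapping for a candidate tagset."""
    return {msd: project(msd, keep) for msd in msds}

def dump_mapping(mapping):
    """Serialise the mapping as 'MSD<TAB>tag' lines (assumed format)."""
    return "".join(f"{msd}\t{tag}\n" for msd, tag in sorted(mapping.items()))

def load_mapping(text):
    """Parse 'MSD<TAB>tag' lines, skipping blank lines and '#' comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        msd, tag = line.split("\t")
        mapping[msd] = tag
    return mapping

def retag(corpus, mapping):
    """Return (token, tag) pairs; an unmapped MSD raises KeyError."""
    return [(token, mapping[msd]) for token, msd in corpus]

MSDS = {msd for _, msd in CORPUS}
COARSE = build_mapping(MSDS, keep=(0,))        # category only
FINE = build_mapping(MSDS, keep=(0, 1, 3))     # category, type, 4th position
```

Switching from `COARSE` to `FINE` changes only the mapping; the
MSD-annotated corpus stays as it is. Letting an unmapped MSD raise an
error is a deliberate choice in this sketch, so that gaps in a mapping
file surface immediately rather than silently producing untagged tokens.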
Note:
At the time this article went to press, some extremely good results
were obtained in the automatic tagging of Romanian texts. The tagger,
a brand new version of the Birmingham tagger authored by Oliver
Manson, was trained on about 180,000 words of hand-tagged text extracted
from "1984" and "Republic" and tested on about
50,000 unseen words from the same books. The tagset contained
82 tags (including 9 tags for punctuation), and the accuracy of
the tagging process was well beyond our expectations: 95.5%
for "Republic" and 97.1% for "1984".
To our knowledge, these are the best results
obtained so far for an Eastern European language.
Details on the tagger, the tagset design, and
other experimental data are presented in [16].
References
- J. SINCLAIR, Corpus Typology, EAGLES DOCUMENT EAG-CWG-IR-2,
Version of October 1994.
- W. TEUBERT, Language Resources for Language Technology,
in Recent Advances in Romanian Language Technology, D. Tufiș,
P. Anderson (eds.), Edit. Academiei Române, București,
1997.
- ANA-MARIA BARBU, D. TUFIȘ, C. DIACONU, LIDIA
DIACONU, A Plea for Corpus-based Linguistics, in Language
and Technology, D. Tufiș (ed.), Edit. Academiei Române,
București, 1996 (in Romanian).
- N. IDE, J. VERONIS, Corpus Encoding Standard,
MULTEXT/EAGLES Report,
http://www.lpl.univ-aix.fr/projects/multext/CES/CES1.html,
1995.
- D. TUFIȘ, Șt. BRUDA,
Structure Markup in CES and Preliminary Statistics on Romanian
Translation of Plato's "Republic", in TELRI Newsletter,
no. 5, April 1997.
- C.M. SPERBERG-MCQUEEN, L.
BURNARD, Text Encoding Initiative: Guidelines for Electronic
Text Encoding and Interchange, Chicago and Oxford Press, 1994.
- N. IDE, J. VERONIS (eds.),
Text Encoding Initiative: Background and Context, Dordrecht,
Kluwer Academic Publishers, 1995.
- M. MONACHINI, N. CALZOLARI
(eds.), Synopsis and Comparison of Morphosyntactic Phenomena
Encoded in Lexicons and Corpora. A Common Proposal and Applications
to European Languages, EAGLES Report, October 1994.
- N. CALZOLARI, M. MONACHINI
(eds.), Common Specifications and Notation for Lexicon Encoding
and Preliminary Proposal for the Tagsets, MULTEXT Report,
March 1995.
- T. ERJAVEC, N. IDE, D. TUFIȘ,
Development of Common Lexical Specifications for Six Eastern
European Languages and Their Application to Stochastic Part of
Speech Tagging, in Proceedings of ACH/ALLC'97 Conference,
Kingston, Ontario, Canada, June 3-7, 1997.
- T. ERJAVEC, N. IDE, V. PETKEVIC,
J. VERONIS, MULTEXT-EAST: Multilingual Text Tools and Corpora
for Central and Eastern European Languages, in Proceedings
of the Trans European Language Resource Infrastructure First Conference,
Tihany, September 1996.
- D. TUFIȘ, Morpho-lexical
processing in a multilingual environment, Tutorial Notes at
EUROLAN'97 Summer School on Corpus Linguistics, Tușnad, July
1997,
http://www.infoiasi.ro/eurolan/annl and also
http://www.racai.ro/~tufis.
- D. TUFIȘ, A Unification-Based
Description of Gender-Agreement in Romanian, (forthcoming).
- D. FARKAS, D. ZEC, Agreement
and Pronominal Reference, in Guglielmo Cinque and Giuliana
Giusti (eds.), Advances in Romanian Linguistics, John Benjamins
Publishing Company, Amsterdam-Philadelphia, 1995.
- D. TUFIȘ, A Generic
Platform for Developing Language Resources and Applications,
in Proceedings of the Trans European Language Resource Infrastructure
Second Conference, Kaunas, April 1997.
- O. MANSON, D. TUFIȘ,
Probabilistic Tagging in a Multilingual Environment: Making
an English Tagger Understand Romanian, in Proceedings of the
Third TELRI International Seminar, Tuscan World Center, Montecatini,
October 1997.