Tools for annotation and exploitation of parallel corpora: the case of the IJS-ELAN Slovene-English corpus

Tomaž Erjavec
Department of Intelligent Systems
Institute "Jožef Stefan"
Ljubljana, Slovenia

Talk given at the Centre for Corpus Linguistics, University of Birmingham, February 9th 2001

The talk presents the production chain used to build the IJS-ELAN parallel corpus, together with experiments in morphosyntactically tagging and lemmatising the corpus and in extracting bilingual terminological lexica from it. The common strand in this work is availability: open standards are used to encode the resources, the corpus itself is freely downloadable, and the tools used in the production and exploitation of the resources are, for the most part, publicly available.

The first part of the talk presents the current release of the IJS-ELAN corpus. The corpus, produced with the support of the EU actions ELAN and TELRI, is a one-million-word Slovene-English parallel corpus composed of fifteen recent terminology-rich texts. The corpus is segmented, sentence-aligned and tokenised, and is encoded according to the Guidelines for Text Encoding and Interchange (TEI). We give an overview of the encoding of the corpus and of the tools used to produce it: Perl, the Vanilla aligner, and the MULTEXT segmenter and tokeniser.

The second part of the talk deals with the more experimental issues of word-class syntactic tagging and the extraction of translation equivalents. Previous work proceeded in the scope of the MULTEXT-East project, where we developed, for seven languages, morphosyntactic tagsets and descriptions, medium-sized inflectional lexica, and a small hand-annotated corpus (Orwell's 1984). These tagging resources are now being used in a series of experiments in automatic tagging centred on the IJS-ELAN corpus. We describe our work on tagger evaluation, the methods used to improve the accuracy of tagging with the tri-gram tagger TnT, and the problem of lemmatising unknown words. For this last task we used the machine learning program CLOG, which learns inflectional analysis rules by building decision lists of affixes. We also briefly touch on the question of tagging and lemmatising the English part of the corpus.

Finally, we describe some tools for extracting translation equivalents from parallel corpora. Very simple programs can already extract useful, if limited, translations; as an example we show how the Perl library Approx can be used to extract cognates. We then present the publicly available systems 21 and UPLUG; the first is limited to single words, while the second is already able to extract multiword units. We conclude with a discussion of [Tufis, TELRI'00], where, given sufficiently rich knowledge (a tagged and lemmatised corpus) and various prior assumptions about the possible translation equivalence patterns, extremely high precision can be achieved.
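
To make the cognate-extraction idea concrete, here is a minimal Perl sketch. It assumes that the "Approx" library named above is the CPAN module String::Approx, which provides agrep-style approximate string matching; the aligned sentence pair, the minimum word length, and the 30% edit-distance threshold are invented for illustration only. A real pipeline would loop over all aligned segments of the corpus and filter the resulting candidate pairs, e.g. by frequency.

    #!/usr/bin/perl
    # Hypothetical cognate-spotting sketch over one aligned sentence pair.
    # Assumes "Approx" above refers to the CPAN module String::Approx.
    use strict;
    use warnings;
    use String::Approx 'amatch';

    # Invented toy data; a real run would iterate over the TEI-aligned
    # segments of the corpus.
    my @sl = qw(elektronski dokument je standardiziran format);
    my @en = qw(the electronic document is a standardised format);

    for my $word (@sl) {
        next if length($word) < 5;   # very short words give spurious matches
        # Allow edits up to 30% of the word's length, ignoring case.
        my @hits = amatch($word, [ 'i', '30%' ], @en);
        print "$word\t=>\t@hits\n" if @hits;
    }

On this toy pair the sketch should pick up candidates such as dokument/document and format/format; the point is only that approximate matching alone already yields usable, if noisy, cognate pairs, which is the "useful, if limited" behaviour described above.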