Tools for annotation and exploitation of parallel corpora: the case of the IJS-ELAN Slovene-English corpus

Tomaž Erjavec
Department of Intelligent Systems
Institute "Jožef Stefan"
Ljubljana, Slovenia

Talk given at the Centre for Corpus Linguistics, University of Birmingham, February 9th 2001

The talk presents the production chain used to build the IJS-ELAN parallel corpus, together with experiments in morphosyntactically tagging and lemmatising the corpus and in extracting bilingual terminological lexica from it. The common strand in this work is availability: open standards are used to encode the resources, the corpus itself is freely downloadable, and the tools used in the production and exploitation of the resources are, for the most part, publicly available.

The first part of the talk presents the current release of the IJS-ELAN corpus. The corpus, produced with the support of the EU actions ELAN and TELRI, is a one-million-word Slovene-English parallel corpus composed of fifteen recent terminology-rich texts. The corpus is segmented, sentence-aligned and tokenised, and is encoded according to the Guidelines for Text Encoding and Interchange (TEI). We give an overview of the encoding of the corpus and of the tools used to produce it: Perl, the Vanilla aligner, and the MULTEXT segmenter and tokeniser.

The second part of the talk deals with the more experimental issues of word-class syntactic tagging and the extraction of translation equivalents. Previous work proceeded in the scope of the MULTEXT-East project, where we developed, for seven languages, morphosyntactic tagsets and descriptions, medium-sized inflectional lexica, and a small hand-annotated corpus (Orwell's 1984). These tagging resources are now being used in a series of experiments in automatic tagging centred on the IJS-ELAN corpus. We describe our work on tagger evaluation, the methods used to improve the accuracy of tagging with the tri-gram tagger TnT, and the problem of lemmatising unknown words. For this last task we used the machine learning program CLOG, which learns inflectional analysis rules by building decision lists of affixes. We also briefly touch on the question of tagging and lemmatising the English part of the corpus.

Finally, we describe some tools for extracting translation equivalents from parallel corpora. Very simple programs can already extract useful, if limited, translations; as an example we show how the Perl library Approx can be used to extract cognates. We then present the publicly available systems 21 and UPLUG; the first is limited to single words, while the second is already able to extract multiword units. We conclude with a discussion of [Tufis, TELRI'00], where, given sufficiently rich knowledge (a tagged and lemmatised corpus) and various prior assumptions about the possible translation equivalence patterns, extremely high precision can be achieved.
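
To make the cognate-extraction idea concrete, here is a minimal Perl sketch. It assumes that the "Approx" library named above is the CPAN module String::Approx, which provides agrep-style approximate string matching; the aligned sentence pair, the minimum word length, and the 30% edit-distance threshold are invented for illustration only. A real pipeline would loop over all aligned segments of the corpus and filter the resulting candidate pairs, e.g. by frequency.

    #!/usr/bin/perl
    # Hypothetical cognate-spotting sketch over one aligned sentence pair.
    # Assumes "Approx" above refers to the CPAN module String::Approx.
    use strict;
    use warnings;
    use String::Approx 'amatch';

    # Invented toy data; a real run would iterate over the TEI-aligned
    # segments of the corpus.
    my @sl = qw(elektronski dokument je standardiziran format);
    my @en = qw(the electronic document is a standardised format);

    for my $word (@sl) {
        next if length($word) < 5;   # very short words give spurious matches
        # Allow edits up to 30% of the word's length, ignoring case.
        my @hits = amatch($word, [ 'i', '30%' ], @en);
        print "$word\t=>\t@hits\n" if @hits;
    }

On this toy pair the sketch should pick up candidates such as dokument/document and format/format; the point is only that approximate matching alone already yields usable, if noisy, cognate pairs, which is the "useful, if limited" behaviour described above.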