Tools

To ensure the project's feasibility, MULTEXT uses only state-of-the-art methods in tool development. These methods yield a set of tools that is freely available, coherent, extensible, and language-independent. The tools are implemented under UNIX. All MULTEXT tools follow an engine-based design in which all language-dependent material is supplied as data. Extending the tools in MULTEXT-East to cover Central and Eastern European (CEE) languages will therefore generally involve only providing the appropriate tables and rules. Some adjustments to the engines are nevertheless expected, given the new range of problems posed by different language families. The tools fall into two general categories:

Corpus annotation tools:

segmenter: marks sentences, quotations, words, abbreviations, names, terms, etc.;
morphological analyser: provides possible lemmas, morphological features, and parts of speech;
part-of-speech disambiguator: disambiguates parts of speech where alternatives exist;
aligner: provides alignments of passages among parallel texts;
prosody tagger: automatically models the F0 curve and derives a symbolic coding of intonation from the speech signal;
post-editing tools: assist in hand validation of automatically annotated corpora.
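
The engine-based design described earlier can be illustrated with a small sketch. This is not MULTEXT code: the function name segment_sentences, the ABBREVIATIONS table, and its entries are all hypothetical, and the splitting rule is deliberately naive. The point is only that the engine contains no language-specific logic, so covering a new language means supplying a new table.

```python
# Hypothetical language-dependent data: one abbreviation table per language.
# These entries are illustrative and are not taken from MULTEXT resources.
ABBREVIATIONS = {
    "en": {"dr.", "mr.", "etc."},
    "sl": {"dr.", "npr.", "itd."},
}


def segment_sentences(text, abbreviations):
    """Language-independent engine (simplified assumption).

    Ends a sentence after any whitespace-separated token that finishes
    with '.', '!' or '?', unless the token is a listed abbreviation.
    """
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that lacks final punctuation
        sentences.append(" ".join(current))
    return sentences


# The same engine runs on two languages; only the data table changes.
english = segment_sentences("Dr. Smith arrived. He left early.", ABBREVIATIONS["en"])
slovene = segment_sentences("To je npr. dobro. Hvala.", ABBREVIATIONS["sl"])
```

A real segmenter would need far richer rules (quotations, names, terms), but under this design those would also arrive as data rather than as changes to the engine.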

Corpus exploitation tools:

indexing tools: construct indexes for fast access to data;
search and retrieval tools: browse, build concordances, retrieve collocations, etc., based on a given word or words, pattern, syntactic category, etc.;
statistical and quantitative tools: generate basic statistics, such as frequency and mutual information, for words and collocates (by pattern or part of speech), as well as word lists, lists by syntactic category, etc.
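
As a rough illustration of the statistics mentioned in the last item, the sketch below computes word frequencies and pointwise mutual information for adjacent word pairs. It assumes the common definition PMI(x, y) = log2(P(x, y) / (P(x) P(y))) with simple relative-frequency estimates; the source does not say which formulation or estimator the MULTEXT tools use, and the function name pmi_table is hypothetical.

```python
import math
from collections import Counter


def pmi_table(tokens):
    """Return (word frequencies, PMI score for each adjacent word pair).

    Probabilities are plain relative frequencies (an assumption): unigrams
    over all tokens, bigrams over all adjacent pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return unigrams, scores


# Toy input for illustration only.
freqs, scores = pmi_table("new york is a new city".split())
ranked = sorted(scores, key=scores.get, reverse=True)  # strongest pairs first
```

On a corpus this small the scores are meaningless; such estimates only become informative over large samples, which is one reason fast indexing matters for these tools.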

All tools are integrated by means of a common user interface into a general-purpose corpus manipulation system suitable for NLP research.

Tomaz Erjavec
Mon May 20 13:01:13 MDT 1996