To ensure the project's feasibility, MULTEXT relies only on
state-of-the-art methods in tool development. The project uses these
methods to produce a set of tools that is freely available, coherent,
extensible, and language-independent. The tools are implemented under
UNIX. All MULTEXT tools are designed with an engine-based approach
where all language-dependent materials are provided as data.
Therefore, extending the tools in MULTEXT-East to cover Central and
Eastern European (CEE) languages will generally involve only
providing the appropriate tables and rules. However, some adjustments
are expected in the engines, given the new range of problems posed by
different families of languages. The tools fall into two general
categories:
Corpus annotation tools:
- segmenter: marks sentences, quotations, words, abbreviations,
names, terms, etc.;
- morphological analyser: provides possible lemmas, morphological
features, and parts of speech;
- part-of-speech disambiguator: disambiguates parts of speech
where alternatives exist;
- aligner: provides alignments of passages among parallel texts;
- prosody tagger: derives automatic modelling of F0 curve and
symbolic coding of intonation from the speech signal;
- post-editing tools: assist in hand validation of automatically
annotated corpora.
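
The engine-based approach described above, in which language-dependent
material is supplied as data, can be illustrated with a toy segmenter.
This is a minimal sketch and not the MULTEXT implementation: the
function name, the table layout, and the abbreviation entries are all
assumptions made for illustration.

```python
# Hypothetical per-language data tables; the entries are illustrative only.
LANGUAGE_DATA = {
    "en": {"abbreviations": {"dr.", "mr.", "etc."}},
    "sl": {"abbreviations": {"dr.", "npr.", "itd."}},
}


def segment(text, lang):
    """Split text into sentences with a language-independent engine.

    A token ending in '.', '!' or '?' closes a sentence unless it is
    listed as an abbreviation in the data table for `lang`.
    """
    abbreviations = LANGUAGE_DATA[lang]["abbreviations"]
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing material that has no closing punctuation.
        sentences.append(" ".join(current))
    return sentences
```

In this sketch, covering a further language means adding an entry to
the table; the engine itself is unchanged, which mirrors the division
of labour between engines and data described above.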
Corpus exploitation tools:
- indexing tools: construct indexes for fast access to data;
- search and retrieval tools: browse, build concordances, retrieve
collocations, etc., based on a given word or words, pattern, syntactic
category, etc.;
- statistical and quantitative tools: generate word lists, lists by
syntactic category, etc., and compute basic statistics such as
frequency and mutual information for words and collocates (pattern or
part of speech).
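
As an illustration of the kind of statistics listed above, the
following sketch computes word frequencies and a mutual information
score for adjacent word pairs. It is not the MULTEXT tool: the function
name is hypothetical, and the score used here is pointwise mutual
information over adjacent pairs, which is one common formulation and is
assumed rather than taken from the project.

```python
import math
from collections import Counter


def bigram_statistics(tokens):
    """Return word frequencies and, per adjacent pair, (count, PMI).

    PMI is log2( P(w1, w2) / (P(w1) * P(w2)) ), with probabilities
    estimated by relative frequency in `tokens`.
    """
    word_freq = Counter(tokens)
    pair_freq = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = len(tokens) - 1
    stats = {}
    for (w1, w2), count in pair_freq.items():
        p_pair = count / n_pairs
        p1 = word_freq[w1] / n_words
        p2 = word_freq[w2] / n_words
        stats[(w1, w2)] = (count, math.log2(p_pair / (p1 * p2)))
    return word_freq, stats


word_freq, stats = bigram_statistics("the cat sat on the mat".split())
```

On a realistic corpus, pairs would normally be filtered by a minimum
count before ranking, since scores of this kind are unreliable for rare
pairs.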
All tools are integrated by means of a common user interface into a
general-purpose corpus manipulation system suitable for NLP research.
Tomaz Erjavec
Mon May 20 13:01:13 MDT 1996