This task will produce the language-specific resources needed by the
various annotation tools:
- Segmentation rules. These rules describe the form of
sentence boundaries, quotations, numbers, punctuation,
capitalization, etc. They will be improved by stepwise
refinement: applying the segmenter to each language,
evaluating the results, and improving the rules.
- Special tokens. The language-specific data required by the
segmenter includes lists of special tokens (frequent abbreviations
and names, titles, patterns for proper names, etc.) with their
types. These lists can be derived in part semi-automatically from
the MULTEXT-East corpus and developed by stepwise refinement:
applying the segmenter to each language, evaluating the results,
and improving the lists.
- Morphological rules. This task will also provide the
morphological rules for the MULTEXT-East languages, needed by the
morphological tools. The rules should provide exhaustive treatment
of inflection and minimal derivation. Each lemma in the lexical
lists used by the project will be associated with its part(s) of
speech (POS) and its morphological rules.
- Lexical lists. This task will produce the lexical lists needed
by the morphological analyser and the POS disambiguator. Each list
will contain no fewer than 15,000 lemmas with the following
information: inflected-form / POS / morphological information /
lemma. These lists will be produced by starting from similar lists
available from various partners and mapping them to a POS tagset
developed in Task 2.1. The lists will be augmented and improved by
stepwise refinement, which involves applying the morphological
analyser and POS disambiguator to each language.
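
The resources listed above can be illustrated with a small sketch. It is not the project's actual segmenter or lexicon format. The special-token entries, the `split_sentences` and `parse_entry` names, the `LexEntry` field names, and the use of "/" as a field separator are all assumptions made for illustration. The sketch shows how a typed special-token list might stop a segmenter from breaking a sentence after an abbreviation, and how one lexical-list record with the four fields named above might be read.

```python
from dataclasses import dataclass

# Hypothetical special-token list mapping each token to its type.
# The entries are illustrative only, not taken from the project data.
SPECIAL_TOKENS = {"Dr.": "TITLE", "Prof.": "TITLE", "etc.": "ABBR", "e.g.": "ABBR"}


def split_sentences(text, special_tokens=SPECIAL_TOKENS):
    """Split text into sentences at tokens ending in '.', '!' or '?',
    unless the token is a listed special token (e.g. an abbreviation)."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token not in special_tokens:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without final punctuation
        sentences.append(" ".join(current))
    return sentences


@dataclass(frozen=True)
class LexEntry:
    """One lexical-list record; the field names are assumptions."""
    form: str   # inflected form
    pos: str    # part of speech
    morph: str  # morphological information
    lemma: str


def parse_entry(line, sep="/"):
    """Parse 'inflected-form / POS / morphological information / lemma'.
    The separator is an assumed convention for this sketch."""
    fields = [field.strip() for field in line.split(sep)]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    return LexEntry(*fields)


if __name__ == "__main__":
    print(split_sentences("Dr. Smith arrived. He left."))
    print(parse_entry("walked / V / past / walk"))
```

In a refinement loop of the kind described above, the segmenter would be run on a corpus sample, wrongly split sentences would be inspected, and the special-token list or the rules would be extended before the next run.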
1.2. Language-specific resources (DATA+REPORT, public)
Tomaz Erjavec
Mon May 20 13:01:13 MDT 1996