Task 1.2. Language-specific resources

This task will produce the language-specific resources needed by the various annotation tools:

Segmentation rules. These rules describe the form of sentence boundaries, quotations, numbers, punctuation, capitalization, and similar phenomena. They will be improved by step-wise refinement: the segmenter is applied to each language, the results are evaluated, the rules are revised, and the cycle is repeated.
Special tokens. The language-specific data required by the segmenter includes lists of special tokens (frequent abbreviations and names, titles, patterns for proper names, etc.) together with their types. These lists can be derived in part semi-automatically from the MULTEXT-East corpus and then developed by step-wise refinement: the segmenter is applied to each language, the results are evaluated, the lists are revised, and the cycle is repeated.
Morphological rules. This task will also provide the morphological rules for the MULTEXT-East languages, as required by the morphological tools. The rules should treat inflection exhaustively and derivation minimally. Each lemma in the lexical lists used by the project will be associated with its part(s) of speech (POS) and its morphological rules.
Lexical lists. This task will produce the lexical lists needed by the morphological analyser and the POS disambiguator. Each list will contain no fewer than 15,000 lemmas, with each entry giving the following information: inflected form / POS / morphological information / lemma. The lists will be built from similar lists already available from various partners, mapped to a POS tagset developed in Task 2.1. They will then be augmented and improved by step-wise refinement, applying the morphological analyser and POS disambiguator to each language.
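
As an illustration only, the four-field lexical entry described above (inflected form / POS / morphological information / lemma) might be read and indexed as in the following sketch. The "/" separator on a single line, the field encodings, the names LexEntry, parse_entry and build_lexicon, and the English sample entries are all assumptions made for this example; the task description does not fix a file format.

```python
from collections import defaultdict
from typing import NamedTuple


class LexEntry(NamedTuple):
    """One lexical-list entry (hypothetical in-memory representation)."""
    form: str   # inflected form
    pos: str    # part of speech
    msd: str    # morphological information (encoding is an assumption)
    lemma: str  # lemma the inflected form belongs to


def parse_entry(line: str, sep: str = "/") -> LexEntry:
    """Split one 'form / POS / morph / lemma' line into its four fields."""
    fields = [field.strip() for field in line.split(sep)]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    return LexEntry(*fields)


def build_lexicon(lines):
    """Index entries by inflected form; an ambiguous form keeps all readings."""
    lexicon = defaultdict(list)
    for line in lines:
        if line.strip():  # skip blank lines
            entry = parse_entry(line)
            lexicon[entry.form].append(entry)
    return dict(lexicon)


# Invented sample entries, for illustration only.
SAMPLE = [
    "walks / V / pres 3sg / walk",
    "walks / N / pl / walk",
    "walked / V / past / walk",
]

lexicon = build_lexicon(SAMPLE)
```

Indexing by inflected form keeps every reading of an ambiguous form together (here "walks" as verb and as noun), which is the kind of lookup a morphological analyser would hand on to a POS disambiguator.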

Deliverables:

1.2. Language-specific resources (DATA+REPORT, public)

Tomaz Erjavec
Mon May 20 13:01:13 MDT 1996