One of the goals of MULTEXT is to develop standards for encoding
text corpora. We distinguish four levels of document markup:
- Level 0. Document-wide markup:
  - bibliographic description of the document, etc.
  - character sets and entities
  - description of encoding conventions
- Level 1. Gross structural markup:
  - structural units of text, such as volume, chapter, etc., down to paragraph level
  - footnotes, titles, headings, tables, figures, etc.
- Level 2. Markup for sub-paragraph structures:
  - sentences, quotations
  - words
  - abbreviations, names, dates, terms, cited words, etc.
- Level 3. Markup for linguistic annotation:
  - morphological information
  - syntactic information--e.g., parts of speech
  - alignment of parallel texts
  - prosody
Level 0 provides global information about the text, its content, and
its encoding. Level 1 includes universal text elements down to
paragraph level, which is the smallest unit that can be identified
language-independently. Level 2 explicitly marks sub-paragraph
structures, which are language-dependent and usually signalled
(sometimes ambiguously) by typographical marks in the text. Level 3
enriches the text with the results of linguistic analyses.
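As a rough illustration of how the four levels can coexist in one document, the following sketch parses a tiny marked-up fragment and reports which levels are present. The element names (teiHeader, div, head, p, s, w) are borrowed from common TEI usage, but the `doc` wrapper, the `pos` attribute, the `LEVEL_OF` table, and the `levels_present` function are assumptions made for this example and are not taken from the CES itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from element name to markup level (illustration only).
LEVEL_OF = {
    "teiHeader": 0,                  # document-wide information
    "div": 1, "head": 1, "p": 1,     # gross structure down to the paragraph
    "s": 2, "w": 2, "name": 2,       # sub-paragraph structures
}

# A tiny invented fragment; the "pos" attribute stands in for level 3 annotation.
SAMPLE = """<doc>
<teiHeader><title>Sample</title></teiHeader>
<text><div type="chapter"><head>One</head>
<p><s><w pos="NOUN">Time</w> <w pos="VERB">flies</w></s></p>
</div></text>
</doc>"""


def levels_present(root):
    """Return the sorted list of markup levels found under root."""
    found = set()
    for elem in root.iter():
        if elem.tag in LEVEL_OF:
            found.add(LEVEL_OF[elem.tag])
        if "pos" in elem.attrib:
            # Linguistic annotation attached to an element counts as level 3.
            found.add(3)
    return sorted(found)


if __name__ == "__main__":
    print(levels_present(ET.fromstring(SAMPLE)))
```

A text encoded only to paragraph level would report levels 0 and 1; adding sentence and word tags brings in level 2, and attaching linguistic annotation to those words brings in level 3.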
The TEI Guidelines provide the basis for markup at levels 0 (the
TEI header), 1 and 2, as well as many elements of level 3. In
collaboration with Eagles, MULTEXT is extending the TEI scheme in
order to specify a TEI-conformant Corpus Encoding Style (CES) that is
optimally suited to NLP research and can therefore serve as a widely
accepted TEI-based style for European corpus work. Application of the
CES to CEE languages, which may require minor
modifications to accommodate CEE language-specific information and
structures, will provide a test of both the TEI Guidelines and
the MULTEXT and Eagles extensions to them.
Tomaz Erjavec
Mon May 20 13:01:13 MDT 1996