Markup

One of the goals of MULTEXT is to develop standards for encoding text corpora. We distinguish four levels of document markup:

Level 0.

Document-wide markup:

bibliographic description of the document, etc.
character sets and entities
description of encoding conventions

Level 1.

Gross structural markup:

structural units of text, such as volume, chapter, etc., down to paragraph level
footnotes, titles, headings, tables, figures, etc.

Level 2.

Markup for sub-paragraph structures:

sentences, quotations
words
abbreviations, names, dates, terms, cited words, etc.

Level 3.

Markup for linguistic annotation:

morphological information
syntactic information, e.g., parts of speech
alignment of parallel texts
prosody

Level 0 provides global information about the text, its content, and its encoding. Level 1 includes universal text elements down to paragraph level, which is the smallest unit that can be identified language-independently. Level 2 explicitly marks sub-paragraph structures, which are language-dependent and usually signalled (sometimes ambiguously) by typographical marks in the text. Level 3 enriches the text with the results of linguistic analyses.
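
As an illustration of how the four levels can coexist in one encoded document, the following minimal Python sketch parses a small TEI-style fragment and counts the markup found at each level. The element names (header, div, head, p, s, w, abbr, name), the ana attribute, and the function count_levels are assumptions chosen for this example only; they are not taken from the CES or from the TEI Guidelines.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical TEI-style fragment. Element and attribute names are
# illustrative assumptions, not the actual CES or TEI definitions.
SAMPLE = """
<doc>
  <header>
    <title>Sample text</title>
  </header>
  <text>
    <div type="chapter">
      <head>Chapter 1</head>
      <p>
        <s><abbr>Dr.</abbr> <name>Smith</name> <w ana="V">arrived</w> <w ana="R">today</w>.</s>
      </p>
    </div>
  </text>
</doc>
"""

# Assumed mapping from element name to markup level (1 or 2).
LEVEL_BY_TAG = {
    "div": 1, "head": 1, "p": 1,             # gross structure
    "s": 2, "w": 2, "abbr": 2, "name": 2,    # sub-paragraph structure
}

# Assumed attribute that carries linguistic annotation (level 3).
ANNOTATION_ATTR = "ana"


def count_levels(xml_text):
    """Return a Counter mapping markup level (0-3) to number of occurrences."""
    counts = Counter()

    def walk(elem, in_header):
        # Everything inside the header is document-wide markup (level 0).
        in_header = in_header or elem.tag == "header"
        if in_header:
            counts[0] += 1
        elif elem.tag in LEVEL_BY_TAG:
            counts[LEVEL_BY_TAG[elem.tag]] += 1
        # An annotation attribute outside the header counts as level 3.
        if not in_header and ANNOTATION_ATTR in elem.attrib:
            counts[3] += 1
        for child in elem:
            walk(child, in_header)

    walk(ET.fromstring(xml_text.strip()), False)
    return counts


if __name__ == "__main__":
    for level, n in sorted(count_levels(SAMPLE).items()):
        print(f"Level {level}: {n}")
```

For the sample fragment this yields 2 items at level 0, 3 at level 1, 5 at level 2, and 2 at level 3. Level 3 is modelled here as an attribute on a level 2 element, which is only one possible design; linguistic annotation could equally be held in separate elements or documents.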

The TEI Guidelines provide the basis for markup at levels 0 (the TEI header), 1 and 2, as well as many elements of level 3. In collaboration with Eagles, MULTEXT is extending the TEI scheme in order to specify a TEI-conformant Corpus Encoding Style (CES) that is optimally suited to NLP research and can therefore serve as a widely accepted TEI-based style for European corpus work. Application of the CES to CEE languages, which may require minor modifications to accommodate CEE language-specific information and structures, will provide a test of both the TEI Guidelines and the extensions to them made by MULTEXT and Eagles.

Tomaz Erjavec
Mon May 20 13:01:13 MDT 1996