
Markup

 

One of the goals of MULTEXT is to develop standards for encoding text corpora. We distinguish four levels of document markup:

Level 0. Document-wide markup
Level 1. Gross structural markup
Level 2. Markup for sub-paragraph structures
Level 3. Markup for linguistic annotation

Level 0 provides global information about the text, its content, and its encoding. Level 1 includes universal text elements down to the paragraph, which is the smallest unit that can be identified language-independently. Level 2 explicitly marks sub-paragraph structures, which are language-dependent and usually signalled (sometimes ambiguously) by typographical marks in the text. Level 3 enriches the text with the results of linguistic analyses.
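
As a rough illustration of how the four levels might coexist in one encoded document, the following Python sketch builds a tiny sample and counts its elements per level. The element and attribute names are only loosely modelled on TEI usage and are assumptions made for this example; they are not the CES tag set, and the mapping in LEVEL_OF and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from illustrative element names to markup levels.
# These names are placeholders, not the actual CES tag set.
LEVEL_OF = {
    "teiHeader": 0, "title": 0,      # level 0: document-wide markup
    "text": 1, "body": 1, "p": 1,    # level 1: gross structure, down to the paragraph
    "s": 2, "abbr": 2,               # level 2: sub-paragraph structures
    "w": 3,                          # level 3: linguistic annotation
}


def build_sample():
    """Build a small document that carries markup at all four levels."""
    root = ET.Element("TEI")
    header = ET.SubElement(root, "teiHeader")
    ET.SubElement(header, "title").text = "Sample text"
    body = ET.SubElement(ET.SubElement(root, "text"), "body")
    paragraph = ET.SubElement(body, "p")
    sentence = ET.SubElement(paragraph, "s")
    # Each word carries the result of a (made-up) linguistic analysis.
    for form, lemma, tag in [("Dogs", "dog", "NOUN"), ("bark", "bark", "VERB")]:
        word = ET.SubElement(sentence, "w", lemma=lemma, pos=tag)
        word.text = form
    return root


def elements_by_level(root):
    """Count elements per markup level, ignoring names not in LEVEL_OF."""
    counts = {}
    for element in root.iter():
        level = LEVEL_OF.get(element.tag)
        if level is not None:
            counts[level] = counts.get(level, 0) + 1
    return counts


if __name__ == "__main__":
    sample = build_sample()
    print(ET.tostring(sample, encoding="unicode"))
    print(elements_by_level(sample))
```

Keeping the level assignment in a separate table, rather than in the document itself, means the same text can be inspected at any depth of encoding without changing the markup.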

The TEI Guidelines provide the basis for markup at levels 0 (the TEI header), 1 and 2, as well as many elements of level 3. In collaboration with Eagles, MULTEXT is extending the TEI scheme to specify a TEI-conformant Corpus Encoding Style (CES) that is optimally suited to NLP research and can therefore serve as a widely accepted TEI-based style for European corpus work. Applying the CES to CEE languages, which may require minor modifications to accommodate CEE language-specific information and structures, will test both the TEI Guidelines and the MULTEXT and Eagles extensions to them.






Tomaz Erjavec
Mon May 20 13:01:13 MDT 1996