
The Corpus Encoding

At this stage of the project, 16 of the 19 component corpora have been encoded as stand-alone SGML documents and validated against Level 1 of the Corpus Encoding Standards DTD, V3.15. The CES DTD, together with its documentation, can be obtained from http://www.cs.vassar.edu/CES/.

The Level 1 markup includes a TEI-conformant header (file, encoding, profile and revision descriptions) and universal text elements down to the level of the paragraph, e.g. textual divisions, paragraphs, titles and headings, footnotes, tables and poems. Some CES Level 2 markup has also been included, e.g. quoted material, rendition information and, to varying degrees, abbreviations, dates, names, numbers and sentences.

In general, each component corpus (e.g. Estonian Fiction) is currently encoded as a separate "corpus", that is, with its own header, making no reference to the other MULTEXT-East component corpora.

For most corpus components, a "sampler" is also available, containing the complete header (but marked as "sample") and a small portion of the corpus text. These samplers usually represent the more heavily marked-up portion of the component corpus.

Details of the markup of the component corpora can be found in the "Corpus Encoding" sections of Chapter 2, Chapter 3 and Chapter 4.

Normalised SGML

The intention is to have the corpus encoded in normalised SGML (nSGML), which imposes further restrictions on SGML documents in order to improve readability and to reduce the level of SGML awareness needed by tools that process the corpora. In the present milestone M corpus, not all nSGML criteria are satisfied. According to MLCC (p. 37), a document in nSGML format satisfies the following criteria:

  1. The document is a valid SGML document according to some supplied DTD.
  2. The document is coded using one of the ISO character sets, with embedded character entities where necessary.
    (Currently, some of the component documents are encoded using SGML entities only.)
  3. The reference concrete syntax is used, with processing 8-bit clean in data and attribute values.
  4. No capacity/length restrictions.
  5. No short refs or tag minimisation.
  6. No SUBDOCS.
  7. No marked sections.
  8. All end-tags present (except for empty elements).
  9. All entity references terminated with ";".
  10. No SGML elements are broken across multiple lines.
    (This does not currently hold in the corpus.)
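
Some of the purely lexical criteria can be screened without a full SGML parser. The following is a minimal sketch in Python, not part of the MULTEXT-East tools or the CES distribution: the function name check_nsgml and the regular expressions are hypothetical, and criterion 10 is interpreted here as "no tag is broken across lines", which is an assumption. It covers criteria 7, 9 and 10 only and is heuristic, so it may report false positives (for instance inside comments). Criterion 1 still requires a validating SGML parser.

```python
import re

# Heuristic, regex-based screening for nSGML criteria 7, 9 and 10.
# This is NOT an SGML parser: comments and declarations are not skipped.

# Criterion 7: any marked-section open delimiter, e.g. "<![ IGNORE [".
MARKED_SECTION = re.compile(r"<!\[")
# Criterion 9: "&name" or "&#123", with the optional ";" captured separately.
ENTITY_REF = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9.\-]*)(;?)")
# Criterion 10 (assumed reading): a start- or end-tag containing a newline.
BROKEN_TAG = re.compile(r"</?[A-Za-z][^<>]*\n[^<>]*>")


def check_nsgml(text):
    """Return a list of (criterion, line_number, message), ordered by line."""
    problems = []

    def line_of(pos):
        # 1-based line number of a character offset.
        return text.count("\n", 0, pos) + 1

    for m in MARKED_SECTION.finditer(text):
        problems.append((7, line_of(m.start()), "marked section"))
    for m in ENTITY_REF.finditer(text):
        if m.group(2) != ";":
            problems.append((9, line_of(m.start()),
                             "unterminated &" + m.group(1)))
    for m in BROKEN_TAG.finditer(text):
        problems.append((10, line_of(m.start()), "tag broken across lines"))
    return sorted(problems, key=lambda p: (p[1], p[0]))


if __name__ == "__main__":
    import sys
    # Usage: python check_nsgml.py FILE...   (file names supplied by the user)
    for path in sys.argv[1:]:
        with open(path, encoding="latin-1") as handle:
            for criterion, line, message in check_nsgml(handle.read()):
                print(f"{path}:{line}: criterion {criterion}: {message}")
```

Reading the files as latin-1 is also an assumption, chosen only because it never fails on 8-bit input; the actual character set of each component corpus should be taken from its header.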

Further corpus work

A full list of the files in the MULTEXT-East corpus is given in Appendix A.3. Currently, all 19 component corpora have been encoded as stand-alone SGML documents and checked for SGML conformance against the Corpus Encoding Standards DTD, V3.15. This DTD, along with the accompanying declaration and entity files, is also part of the current MULTEXT-East corpus file set.

To harmonise and combine the component corpora into one corpus, the following steps are still needed:

The effort estimated for the above tasks is approximately 2-4 person-months.

In the second year of the project, the following tasks are then to be performed on the MULTEXT-East CES level 1 corpus:






Tomaz Erjavec
Sat May 18 20:25:31 MDT 1996