
The Corpus Encoding

At this stage of the project, 16 of the 19 component corpora have been encoded as stand-alone SGML documents and validated against Level 1 of the Corpus Encoding Standards DTD, V3.15. The CES DTD, along with its documentation, can be obtained from http://www.cs.vassar.edu/CES/.

The Level 1 markup includes a TEI-conformant header (file, encoding, profile and revision descriptions) and universal text elements down to the level of the paragraph, e.g. textual divisions, paragraphs, titles and headings, footnotes, tables and poems. Some CES Level 2 markup has also been included, e.g. quoted material, rendition information and, to varying degrees, abbreviations, dates, names, numbers and sentences.

In general, each component corpus (e.g. Estonian Fiction) is currently encoded as a separate "corpus", that is, with its own header, making no reference to the other MULTEXT-East component corpora.

For most corpus components, a "sampler" is also available, containing the complete header (but marked as "sample") and a small portion of the corpus text. The samplers usually contain the more heavily marked-up portion of the component corpus.

The details on the mark-up of the component corpora can be found in the 'Corpus Encoding' sections of Chapter 2, Chapter 3, and Chapter 4.

Normalised SGML

The intention is to have the corpus encoded in normalised SGML (nSGML), which imposes further restrictions on SGML documents in order to improve readability and to reduce the level of SGML awareness needed by the tools that are to process the corpora. In the present milestone M corpus, not all nSGML criteria are satisfied. A document in nSGML format satisfies the following criteria (MLCC p.37):

Document is a valid SGML document according to some supplied DTD.
Document is coded using one of the ISO character sets, with embedded character entities where necessary.
(Currently some of the component documents are encoded using SGML entities only.)
Reference concrete syntax, with 8-bit clean processing in data and attribute values.
No capacity/length restrictions.
No short refs or tag minimisation.
No SUBDOCS.
No marked sections.
All end-tags present (except for empty elements).
All entity references terminated with ";".
No SGML elements are broken across multiple lines.
(This does not currently hold in the corpus.)
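
Some of the criteria above can be checked mechanically. The following is a minimal sketch in Python, not part of the MULTEXT-East tool set, that looks for three kinds of violation: marked sections, entity references not terminated with ";", and tags broken across lines. The function name nsgml_problems and the regular expressions are assumptions made for illustration; they do not replace validation with an SGML parser, and they assume that attribute values contain no "<" or ">" characters.

```python
import re

# Illustrative patterns only; a real check should use an SGML parser.
MARKED_SECTION = re.compile(r"<!\[")
# '&' followed by an entity name or a numeric character reference,
# with an optional terminating ';' captured separately.
ENTITY_REF = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9.\-]*)(;?)")
# A start-tag or end-tag that is opened but not closed on the same line.
UNCLOSED_TAG = re.compile(r"</?[A-Za-z][^<>]*$")


def nsgml_problems(text):
    """Return (line_number, message) pairs for simple nSGML violations."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if MARKED_SECTION.search(line):
            problems.append((number, "marked section"))
        for match in ENTITY_REF.finditer(line):
            if match.group(2) != ";":
                problems.append(
                    (number, "unterminated entity reference &" + match.group(1))
                )
        if UNCLOSED_TAG.search(line):
            problems.append((number, "tag broken across lines"))
    return problems


if __name__ == "__main__":
    sample = (
        "<p>Caf&eacute; and &amp more</p>\n"
        "<p rend=\n"
        '"it">x</p>\n'
        "<![ IGNORE [ y ]]>"
    )
    for number, message in nsgml_problems(sample):
        print(number, message)
```

A check of this kind only flags candidate lines for inspection; the remaining criteria (validity against a DTD, absence of tag minimisation, and so on) need a validating parser.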

Further corpus work

A full list of the files in the MULTEXT-East corpus is given in Appendix A.3. Currently, all the 19 component corpora have been encoded as stand-alone SGML documents and checked for SGML conformance against the Corpus Encoding Standards DTD, V3.15. This DTD, along with the accompanying declaration and entity files, is also part of the current MULTEXT-East corpus file set.

To harmonise and combine the component corpora into one corpus, the following steps are still needed:

converting the corpus into nSGML, and possibly into a text-markup invariant form, i.e. ensuring that each line consists of either only a tag or only data;
standardising the corpus as regards whitespace and rendition information (values of the rend attribute);
re-checking the encoding practices used by the partners, and harmonising them where this seems required (this is especially important for the parallel corpus part, where the MULTEXT aligner could help in discovering discrepancies in the encoding practices);
further standardising the header information in the component corpora, which is important for combining the component corpora into one CES corpus (one SGML document);
with the help of, and as a test for, the MULTEXT tools, further tagging the corpora for word-level units, i.e. for names, abbreviations, dates and numbers.
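
The first step above, bringing a document into a form where each line holds either only a tag or only data, can be sketched as follows. This is an illustration in Python under stated assumptions, not the conversion procedure used in the project: the function name one_tag_or_data_per_line is hypothetical, attribute values are assumed to contain no "<" or ">", and all whitespace (including line breaks inside tags and spaces next to tags) is treated as insignificant and collapsed, which may not be appropriate for every component corpus.

```python
import re

# Split the text on tags while keeping the tags themselves.
# Assumption: no '<' or '>' occurs inside attribute values.
TAG = re.compile(r"(<[^<>]+>)")


def one_tag_or_data_per_line(text):
    """Rewrite text so that every line is either a single tag or pure data."""
    pieces = []
    for piece in TAG.split(text):
        # Collapse runs of whitespace, including line breaks inside a tag.
        piece = " ".join(piece.split())
        if piece:
            pieces.append(piece)
    return "\n".join(pieces)


if __name__ == "__main__":
    print(one_tag_or_data_per_line('<p>Hello <hi rend="it">world</hi>.</p>'))
```

Because the sketch discards whitespace adjacent to tags, a tool that later rebuilds running text from this form would need its own convention for restoring word spacing.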

The effort estimated for the above tasks is approximately 2 to 4 person-months.

In the second year of the project, the following tasks are then to be performed on the MULTEXT-East CES Level 1 corpus:

pairwise sentence alignment between the English original and the six translations of the Multilingual Parallel Corpus, using the MULTEXT tools and hand-validation;
part-of-speech tagging of the corpus, using the MULTEXT tools and hand-validating a subset of the corpus;
speech corpus translation, recording, digitising and alignment of speech and transcription.


Tomaz Erjavec
Sat May 18 20:25:31 MDT 1996