At this stage of the project, 16 of the 19 component corpora have been encoded as stand-alone SGML documents and validated against Level 1 of the Corpus Encoding Standards DTD, V3.15. The CES DTD, along with its documentation, can be obtained from http://www.cs.vassar.edu/CES/.
The Level 1 markup includes a TEI-conformant header (file, encoding, profile and revision descriptions) and universal text elements down to the level of the paragraph, e.g. textual divisions, paragraphs, titles and headings, footnotes, tables and poems. Some CES Level 2 markup has also been included, e.g. quoted material, rendition information and, to varying degrees, abbreviations, dates, names, numbers and sentences.
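As a rough illustration of the header structure described above, the following Python sketch reports which of the four header descriptions have no start-tag in a piece of SGML text. The element names (fileDesc, encodingDesc, profileDesc, revisionDesc, cesHeader) are assumed from the TEI/CES header conventions rather than taken from the MULTEXT-East files, and the function name and sample text are hypothetical. This is a quick sanity check only and does not replace validation against the DTD with an SGML parser.

```python
import re

# Assumed element names for the four header descriptions (TEI/CES
# conventions); they are not copied from the MULTEXT-East files.
HEADER_PARTS = ("fileDesc", "encodingDesc", "profileDesc", "revisionDesc")


def missing_header_parts(sgml_text):
    """Return the header descriptions that have no start-tag in sgml_text."""
    missing = []
    for name in HEADER_PARTS:
        # Matches "<name>" or "<name attr=...>", but not the end-tag "</name>".
        # Case-insensitive matching is an assumption about the SGML declaration.
        pattern = r"<" + name + r"(\s[^>]*)?>"
        if not re.search(pattern, sgml_text, flags=re.IGNORECASE):
            missing.append(name)
    return missing


# Hypothetical header fragment that lacks a revision description.
sample = """<cesHeader>
<fileDesc>...</fileDesc>
<encodingDesc>...</encodingDesc>
<profileDesc>...</profileDesc>
</cesHeader>"""

print(missing_header_parts(sample))
```

A check of this kind only looks for tags in the text; it says nothing about whether the elements are correctly nested or whether their content is valid.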
In general, each component corpus (e.g. Estonian Fiction) is currently encoded as a separate ``corpus'', that is, with its own header, making no reference to the other MULTEXT-East component corpora.
For most corpus components, a ``sampler'' is also available, containing the complete header (but marked as ``sample'') and a small portion of the corpus text. These samplers usually represent the more heavily marked-up portion of the component corpus.
The details on the mark-up of the component corpora can be found in the 'Corpus Encoding' sections of Chapter 2, Chapter 3, and Chapter 4.
The intention is to have the corpus encoded in normalised SGML (nSGML), which imposes further restrictions on SGML documents in order to improve readability and reduce the level of SGML awareness needed by tools that are to process the corpora. In the present milestone M corpus, not all nSGML criteria are satisfied. According to MLCC (p. 37), a document in nSGML format satisfies the following criteria:
A full list of the files in the MULTEXT-East corpus is given in Appendix A.3. Currently, all 19 component corpora are encoded as stand-alone SGML documents and have been checked for SGML conformance against the Corpus Encoding Standards DTD, V3.15. This DTD, along with the accompanying declaration and entity files, is also part of the current MULTEXT-East corpus file set.
To harmonise and combine the component corpora into one corpus, the following steps are still needed:
The estimated effort for the above tasks is approximately 2--4 person-months.
In the second year of the project, the following tasks are then to be performed on the MULTEXT-East CES level 1 corpus: