As a result of this need, the European projects MULTEXT (LRE) and EAGLES (in particular, the EAGLES Text Representation subgroup), together with the Vassar/CNRS collaboration (supported by the U.S. National Science Foundation), have joined efforts to develop a Corpus Encoding Standard (CES) optimally suited for use in language engineering, which can serve as a widely accepted set of encoding standards for corpus-based work. The overall goal is the identification of a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and linguistic information) as well as general architecture (so as to be maximally suited for use in a text database). The CES also provides encoding conventions for more extensive encoding and for linguistic annotation.
The CES is an application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language), conformant to the TEI Guidelines for Electronic Text Encoding and Interchange of the Text Encoding Initiative. The TEI Guidelines are expressly designed to be applicable across a broad range of applications and disciplines; they therefore not only treat a vast array of textual phenomena but are also designed with an eye toward maximum generality and flexibility. Most applications will use only those parts of the TEI that are required to meet their needs. The CES is such an application; we have utilized the TEI modular DTD and the TEI customization mechanisms to select those pieces of the TEI that are appropriate for corpus encoding.
The TEI is an ongoing project and is not yet complete in some areas; as a result, some areas of importance for corpus encoding are not covered by the TEI Guidelines. Therefore, developing the CES has involved not only selecting from the TEI Guidelines but also, in some cases, extending them to meet the specific needs of corpus-based work in language engineering. All results and specifications developed for the CES are fed back to the TEI as input for further revisions of the Guidelines.
The CES has also been developed taking into account several practical realities surrounding the encoding of corpora intended for use in language engineering research and applications. In particular, at the present time and for the foreseeable future, many corpora for language engineering will be adapted from legacy data, that is, pre-existing electronic data encoded in some arbitrary format (typically word processor, typesetter, or similar formats intended for printing). The vast quantities of data involved, and the difficulty and cost of translating them into usable formats, imply that the CES must be designed so that this translation does not require prohibitively large amounts of manual intervention to achieve minimum conformance to the standard. However, the markup that would be most desirable for the linguist is not achievable by automatic means. Therefore, a major feature of the CES is the provision for a series of increasingly refined encodings of text, beyond the minimum requirements.
Due to the need for massive amounts of data, many corpora intended for use in language engineering applications are currently being created. Electronic texts are obtained by
Corpora are used in language engineering to gather real language evidence, both qualitative and quantitative. Qualitative evidence consists of examples which can be used for the construction of computational lexicons, grammars, and multi-lingual lexicons and term banks, for lexicography, etc. Quantitative information consists of statistics which indicate frequent or characteristic uses of language. These statistics can also be used to guide preference-based parsers, assist in lexicography, determine translation equivalents, etc. In addition, statistics can be used to drive morphological taggers, POS taggers, alignment programs, sense taggers, etc. Common operations on corpora for the purposes of language engineering include extraction of sub-corpora; sophisticated search and retrieval, including collocation extraction, concordance generation, generation of lists of linguistic elements, etc.; and the determination of statistics such as frequency information, averages, mutual information scores, etc.
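To make the quantitative operations mentioned above concrete, the following is a minimal sketch of computing word frequencies and pointwise mutual information for adjacent word pairs over a tokenized text. The function name `bigram_pmi`, the whitespace tokenization, and the toy sentence are assumptions made for illustration only; the CES itself does not prescribe any particular statistic or implementation.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return (word frequencies, PMI score for each adjacent word pair).

    Illustrative only: probabilities are simple relative frequencies,
    with no smoothing and no frequency cutoff.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    pmi = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        pmi[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return unigrams, pmi


# Toy input; a real corpus would be tokenized from its encoded form.
tokens = "the cat sat on the mat the cat ran".split()
freqs, pmi = bigram_pmi(tokens)
print(freqs.most_common(2))
print(round(pmi[("the", "cat")], 3))
```

On real corpora, raw PMI tends to overrate rare pairs, so a minimum frequency threshold is commonly applied before ranking candidate collocations.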
We do not address corpora intended for other applications, such as stylistic studies, socio-linguistics, historical studies, information retrieval, etc., although these uses are not excluded a priori (in fact, many of the features required for these applications may be the same as those needed for language engineering). Treating a restricted domain enables development of a standard tighter than that of the TEI, by providing specific encoding solutions rather than general or multiple ones, and, most importantly, by providing standards for elements particularly important in that domain (e.g., linguistic annotation).
The CES also covers encoding conventions for linguistic annotation of text and speech, including morphosyntactic tagging, parallel text alignment, prosody, phonetic transcription, etc.
The CES is intended to cover those areas of corpus encoding on which there exists consensus among the language engineering community, or on which consensus can be easily achieved. Areas where no consensus can be reached (for example, sense tagging) are not treated at this time.
The CES provides a TEI-conformant Document Type Definition (DTD) to be used for encoding primary data, at various levels of encoding, together with its documentation. These levels are described in section 1.4. The first level of primary data encoding is the minimum encoding level required to make the corpus (re)usable across all possible language engineering applications. Succeeding levels provide for increasing enhancement in the amount of encoded information and increasing precision in the identification of text elements. Automatic methods to achieve markup at each level are for the most part increasingly complex, and therefore more costly; the sequence is designed to accommodate a series of increasingly information-rich instantiations of the text at a minimum of cost.
The CES recommends that the encoding of linguistic annotation be maintained in SGML documents separate from the primary data, with which it is associated by hyper-links. Therefore, in addition to the DTD for primary data, the CES also provides a series of independent DTDs for documents containing the different kinds of annotation information.
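The separation of annotation from primary data can be sketched as follows. This is an illustration of the general idea only, not of the CES DTDs: the element and attribute names (`tok`, `ann`, `id`, `target`, `pos`) are hypothetical, XML is used as a stand-in for SGML so that the example can be parsed with the Python standard library, and a simple id reference stands in for the hyper-link mechanism.

```python
import xml.etree.ElementTree as ET

# Primary data document (hypothetical markup): tokens carry identifiers only.
PRIMARY = """<text>
  <tok id="t1">The</tok>
  <tok id="t2">cat</tok>
  <tok id="t3">sat</tok>
</text>"""

# Separate annotation document (hypothetical markup): each entry points
# at a primary-data element by id instead of being embedded in the text.
ANNOTATION = """<annotations>
  <ann target="t1" pos="DET"/>
  <ann target="t2" pos="NOUN"/>
  <ann target="t3" pos="VERB"/>
</annotations>"""


def resolve(primary_xml, annotation_xml):
    """Follow each annotation's link and pair it with the text it targets."""
    primary = ET.fromstring(primary_xml)
    by_id = {el.get("id"): el for el in primary.iter() if el.get("id")}

    resolved = []
    for ann in ET.fromstring(annotation_xml).iter("ann"):
        target = by_id.get(ann.get("target"))
        if target is None:
            continue  # skip links that point at no primary element
        resolved.append((target.get("id"), target.text, ann.get("pos")))
    return resolved


links = resolve(PRIMARY, ANNOTATION)
for token_id, word, pos in links:
    print(token_id, word, pos)
```

One consequence of this arrangement is that the primary document is never modified when annotation is added, so several annotation documents can point at the same primary data.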
We recognize that changes in the specifications present problems for those who have previously implemented the standard. To alleviate this problem, we have adopted the following development strategy:
All current CES documents and DTDs will continue to be available at the following site:
<URL: http://www.cs.vassar.edu/CES/>
Anyone actively implementing the standard should consult this site regularly.
In developing the CES we have looked at the work of other TEI-based corpus applications, including in particular the British National Corpus Project and the English-Norwegian Parallel Corpus Project. The various modifications of the TEI which have been developed by these groups and independently in the CES are often very similar, and it is at times difficult to know where an idea or strategy originated. We would therefore like to offer here a general acknowledgement of the work of these other projects and their influence on the CES.
We welcome and encourage user input concerning the CES.