Corpus Encoding Standard - Document CES 1. Part 0. Version 1.2. Last modified 18 June 1996.

Part 0

Introduction

0.1. Background
0.2. Scope of the CES
0.3. Overview of the CES
0.4. Status of this document
0.5. Key to tag descriptions

0.1. Background

The language engineering community has recently revived its interest in the use of empirical methods, thus creating a demand for large-scale corpora. Numerous data-gathering efforts exist on both sides of the Atlantic to provide widespread access to both mono- and bilingual resources of sufficient size and coverage for data-oriented work, including the U.S. Linguistic Data Consortium, the European Corpus Initiative (ECI), ICAME, the British National Corpus (BNC), and, more recently, the European Language Resources Association (ELRA). The rapid multiplication of such efforts has made it critical for the language engineering community to create a set of standards for encoding corpora.

As a result of this need, the European projects MULTEXT (LRE) and EAGLES (in particular, the EAGLES Text Representation subgroup), together with the Vassar/CNRS collaboration (supported by the U.S. National Science Foundation), have joined efforts to develop a Corpus Encoding Standard (CES) optimally suited for use in language engineering, which can serve as a widely accepted set of encoding standards for corpus-based work. The overall goal is the identification of a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and linguistic information) as well as general architecture (so as to be maximally suited for use in a text database). The CES also provides encoding conventions for more extensive encoding and for linguistic annotation.

The CES is an application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language), conformant to the TEI Guidelines for Electronic Text Encoding and Interchange of the Text Encoding Initiative. The TEI Guidelines are expressly designed to be applicable across a broad range of applications and disciplines; they therefore not only treat a vast array of textual phenomena but are also designed for maximum generality and flexibility. Most applications will use only those parts of the TEI that are required to meet their needs. The CES is such an application; we have utilized the TEI modular DTD and the TEI customization mechanisms to select those pieces of the TEI that are appropriate for corpus encoding.

The TEI is an ongoing project and is not yet complete in some areas; as a result, there are some areas of importance for corpus encoding that the TEI Guidelines do not cover. Therefore, developing the CES has involved not only selecting from the TEI Guidelines but also, in some cases, extending them to meet the specific needs of corpus-based work in language engineering. All results and specifications developed for the CES are fed back to the TEI as input for further revisions of the Guidelines.

The CES has also been developed taking into account several practical realities surrounding the encoding of corpora intended for use in language engineering research and applications. In particular, at the present time and for the foreseeable future, many corpora for language engineering will be adapted from legacy data, that is, pre-existing electronic data encoded in some arbitrary format (typically a format intended for printing, such as a word processor or typesetter format). The vast quantities of data involved and the difficulty (and cost) of the translation into usable formats imply that the CES must be designed in such a way that this translation does not require prohibitively large amounts of manual intervention to achieve minimum conformance to the standard. However, the markup that would be most desirable for the linguist is not achievable by automatic means. Therefore, a major feature of the CES is the provision for a series of increasingly refined encodings of text, beyond the minimum requirements.

0.2. Scope of the CES

0.2.1. Text types

The term corpus typically designates a collection of linguistic data, written, spoken, or both, in one or multiple languages. In some cases, the term corpus (as opposed to terms such as collection, archive, etc.) is further restricted to apply to collections constructed according to various linguistic criteria such as representativeness and balance across a given domain, set of languages, etc. (for a fuller discussion, see the EAGLES Text Typology subgroup document). Here, we use the term corpus to refer to any collection of linguistic data, whether or not it is selected or structured according to some design criteria. According to this definition, a corpus can potentially contain any text type, including not only prose, newspapers, poetry, drama, etc., but also word lists, dictionaries, etc. The CES is also intended to cover transcribed spoken data.

Due to the need for massive amounts of data, many corpora intended for use in language engineering applications are currently being created. Electronic texts are obtained by

typing the text into the machine by hand;
scanning the text with scanners;
acquiring texts already in electronic form, either in existing archives or in the form of material usually prepared for print publication.

The third is at present the most usual source of material for inclusion in linguistic corpora. As a result, a wide range of text types must be accommodated by the CES, including law records, technical manuals, transcriptions of debates, etc., as well as newspapers (which are an important source of material for corpora), many of which have irregular formats that require special consideration for encoding.

0.2.2. Languages

The CES applies to monolingual corpora including texts from a variety of western and eastern European languages, as well as multi-lingual corpora and parallel corpora comprising texts in any of these languages.

0.2.3. Applications

The CES is intended to be used for encoding corpora used as a resource in language engineering, including all areas of natural language processing, machine translation, lexicography, etc.

Corpora are used in language engineering to gather real language evidence, both qualitative and quantitative. Qualitative evidence consists of examples which can be used for the construction of computational lexicons, grammars, and multi-lingual lexicons and term banks, for lexicography, etc. Quantitative information consists of statistics which indicate frequent or characteristic uses of language. These statistics can also be used to guide preference-based parsers, assist in lexicography, determine translation equivalents, etc. In addition, statistics can be used to drive morphological taggers, POS taggers, alignment programs, sense taggers, etc. Common operations on corpora for the purposes of language engineering include extraction of sub-corpora; sophisticated search and retrieval, including collocation extraction, concordance generation, generation of lists of linguistic elements, etc.; and the determination of statistics such as frequency information, averages, mutual information scores, etc.
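
As an illustration of the quantitative operations listed above, the following minimal Python sketch computes token frequencies, mutual information scores, and a simple concordance over a tokenized text. It is not part of the CES. The function names and the sample sentence are hypothetical, and pointwise mutual information over adjacent token pairs is assumed here as one common reading of "mutual information scores".

```python
import math
from collections import Counter


def frequency_and_pmi(tokens):
    """Return unigram counts and pointwise mutual information (PMI)
    for each pair of adjacent tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    pmi = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        pmi[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return unigrams, pmi


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence
    of keyword, with up to `width` tokens of context on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample data; a real corpus would be read from encoded files.
tokens = "the cat sat on the mat".split()
counts, pmi = frequency_and_pmi(tokens)
kwic = concordance(tokens, "the", width=2)
```

Operations of this kind presuppose that the text has already been segmented into tokens, which is one reason consistent encoding of segmentation matters for corpus reuse.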

We do not address corpora intended for other applications, such as stylistic studies, socio-linguistics, historical studies, information retrieval, etc., although these uses are not excluded a priori (in fact, many of the features required for these applications may be the same as those needed for language engineering). Treating a restricted domain enables development of a standard tighter than that of the TEI, by providing specific encoding solutions rather than general or multiple ones, and, most importantly, by providing standards for elements particularly important in that domain (e.g., linguistic annotation).

0.2.4. Encoded facts

The CES distinguishes between primary data, which is "unannotated" data in electronic form, most often originally created for non-linguistic purposes such as publishing, broadcasting, etc., and linguistic annotation, which comprises information generated and added to the primary data as a result of some linguistic analysis. The CES covers the encoding of objects in the primary data that are seen to be relevant to corpus-based work in language engineering research and applications, such as

large units of discourse, such as paragraphs, chapters, etc. together with titles, footnotes, etc.;
sub-paragraph-level elements of interest for linguistic analyses, such as sentences, quotations in dialogue, names, dates, abbreviations, terms, etc.

The CES also covers encoding conventions for linguistic annotation of text and speech, including morphosyntactic tagging, parallel text alignment, prosody, phonetic transcription, etc.

The CES is intended to cover those areas of corpus encoding on which there exists consensus among the language engineering community, or on which consensus can be easily achieved. Areas where no consensus can be reached (for example, sense tagging) are not treated at this time.

0.3. Overview of the CES

In its present form, the CES provides the following:

a set of metalanguage level recommendations (particular profile of SGML use, character sets, etc.);
tagsets and recommendations for documentation of encoded data;
tagsets and recommendations for encoding primary data, including written texts across all genres, for the purposes of corpus-based work in language engineering;
tagsets and recommendations for encoding linguistic annotation commonly associated with texts in language engineering, currently including:
- segmentation of the text into sentences and words (tokens),
- morpho-syntactic tagging,
- parallel text alignment.

The CES provides a TEI-conformant Document Type Definition (DTD) to be used for encoding various levels of primary data encoding together with its documentation. These levels are described in section 1.4. The first level of primary data encoding is the minimum encoding level required to make the corpus (re)usable across all possible language engineering applications. Succeeding levels provide for increasing enhancement in the amount of encoded information and increasing precision in the identification of text elements. Automatic methods to achieve markup at each level are for the most part increasingly complex, and therefore more costly; the sequence is designed to accommodate a series of increasingly information-rich instantiations of the text at a minimum of cost.

The CES recommends that the encoding of linguistic annotation be maintained in SGML documents separate from the primary data, with which it is associated by hyper-links. Therefore, in addition to the DTD for primary data, the CES also provides a series of independent DTDs for documents containing the different kinds of annotation information.
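
The idea of keeping annotation apart from the primary data and linking the two can be pictured with the following minimal Python sketch. It does not reproduce the CES DTDs or their linking mechanism: the identifiers, the record layout, the tag labels, and the function name resolve are all hypothetical and serve only to illustrate the separation.

```python
# Primary data: text units keyed by identifier (identifiers are hypothetical).
primary = {
    "s1": "The cat sat.",
    "s2": "It purred.",
}

# A separate annotation "document": each record points at a unit of the
# primary data by identifier instead of embedding the text itself.
annotations = [
    {"target": "s1", "type": "pos", "value": ["DET", "NOUN", "VERB"]},
    {"target": "s2", "type": "pos", "value": ["PRON", "VERB"]},
]


def resolve(annotations, primary):
    """Pair each annotation with the primary text it links to.

    Raises KeyError if an annotation points at a unit that does not exist,
    which is the stand-off equivalent of a broken link.
    """
    resolved = []
    for ann in annotations:
        target = ann["target"]
        if target not in primary:
            raise KeyError(f"annotation points at unknown unit {target!r}")
        resolved.append((primary[target], ann["type"], ann["value"]))
    return resolved


linked = resolve(annotations, primary)
```

With this arrangement the primary data can remain unchanged while annotation documents of different kinds are added, replaced, or omitted independently, provided the identifiers they point to stay stable.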

0.4. Status of this document

The current version of the CES is a first draft of the standard. It has not been widely implemented, and the intention is to continue to develop the CES on the basis of input and feedback from users after it is put to greater use. Therefore, this document will continue to evolve and should not be regarded as "final".

We recognize that changes in the specifications present problems for those who have previously implemented the standard. To alleviate this problem, we have adopted the following development strategy:

the CES is for the most part being developed "bottom-up", beginning with relatively minimal specifications to which we can easily add, rather than attempting to be comprehensive at the outset;
to the extent possible, all prior versions of the CES DTDs are upwardly compatible with the newer versions, so that previously encoded texts can be parsed with the newer DTDs.

The current version of the CES has the following immediate and major limitations:

it provides only general means to encode newspapers, which represent a large portion of the available corpora at the present time;
it does not cover the encoding of speech, spoken transcriptions, or annotation for either of these;
it does not provide tutorial matter concerning corpus encoding.

These areas are under development.

All current CES documents and DTDs will continue to be available at the following site:

<URL: http://www.cs.vassar.edu/CES/>

Anyone actively implementing the standard should consult this site regularly.

In developing the CES we have looked at the work of other TEI-based corpus applications, including in particular the British National Corpus Project and the English-Norwegian Parallel Corpus Project. The various modifications of the TEI which have been developed by these groups and independently in the CES are often very similar, and it is at times difficult to know where an idea or strategy originated. We would therefore like to offer here a general acknowledgement of the work of these other projects and their influence on the CES.

We welcome and encourage user input concerning the CES.

0.5. Key to tag descriptions

Throughout this document, the tables describing the tags should be interpreted in the following way:

<tag>

description of element content

attribute 1: description of first attribute; VALUE1 first attribute value allowed; VALUE2* second attribute value allowed (* indicates default)
