This document is the first version of the Corpus Encoding Standard (CES). The CES is designed to be optimally suited for use in language engineering research and applications, and is intended to serve as a widely accepted set of encoding standards for corpus-based work in natural language processing. The CES is an application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language) compliant with the specifications of the TEI Guidelines for Electronic Text Encoding and Interchange of the Text Encoding Initiative.
The CES specifies a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and typographic information) as well as general architecture (so as to be maximally suited for use in a text database). It also provides encoding specifications for linguistic annotation, together with a data architecture for linguistic corpora.
The CES is being developed in a bottom-up fashion, starting with minimal specifications and expanding them based on feedback from its use and on input from the research community in general. We invite and encourage comments on and discussion of any aspect of the CES.
This document results from the joint effort of the European projects EAGLES (in particular, the EAGLES Text Representation subgroup), MULTEXT (LRE), and MULTEXT-EAST (Copernicus), together with the Vassar/CNRS collaboration supported by the U.S. National Science Foundation. The Centre National de la Recherche Scientifique (CNRS) has also supported the integration effort.
Please report suggestions or problems to priestdo@cs.vassar.edu.
This document is also available as a tar file (approx. 200k, tar.gz format).