Corpus Encoding Standard

CES is an effort supported by EAGLES (through the EAGLES Text Representation subgroup) and by the European projects MULTEXT (LRE) and MULTEXT-East. It aims to develop a Corpus Encoding Standard (CES) optimally suited for use in language engineering, one that can serve as a widely accepted set of encoding standards for corpus-based work. The overall goal is to identify a minimal encoding level that corpora must achieve to be considered standardized, both in terms of descriptive representation (the marking of structural and linguistic information) and in terms of general architecture (so as to be maximally suited for use in a text database). CES also provides encoding conventions for more extensive encoding and for linguistic annotation. CES is, to a large extent, TEI conformant.

CES provides:

a set of metalanguage-level recommendations (a particular profile of SGML use, character sets, etc.);
tagsets and recommendations for documentation of encoded data;
tagsets and recommendations for encoding primary data, including written texts across all genres, for the purposes of corpus-based work in language engineering;
tagsets and recommendations for encoding linguistic annotation commonly associated with texts in language engineering, including:
- segmentation of the text into sentences and words (tokens),
- morpho-syntactic tagging,
- parallel text alignment.

Linguistic annotation is encoded in separate documents, linked to the primary data.
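
The separation of annotation from primary data can be illustrated with a minimal sketch. It is not taken from the CES DTDs: the element names (text, s, annotations, tok), the attributes (id, target, form, tag), and the function resolve are all hypothetical, chosen only to show the idea of an annotation document that points into a primary document by identifier. XML syntax is used here in place of SGML so that the example can be parsed with the Python standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical primary document. Element and attribute names are
# illustrative assumptions, not the actual CES tagset.
PRIMARY = """<text>
  <s id="s1">Corpora need standards.</s>
  <s id="s2">Annotation lives elsewhere.</s>
</text>"""

# Hypothetical annotation document, kept separate from the primary data.
# Each <tok> refers to a sentence in PRIMARY through its "target" attribute.
ANNOTATION = """<annotations>
  <tok target="s1" form="Corpora" tag="N"/>
  <tok target="s1" form="need" tag="V"/>
  <tok target="s2" form="Annotation" tag="N"/>
</annotations>"""


def resolve(primary_xml, annotation_xml):
    """Attach each annotation to the primary-data sentence it points to."""
    primary = ET.fromstring(primary_xml)
    # Index the primary sentences by identifier.
    result = {
        s.get("id"): {"text": s.text, "tokens": []}
        for s in primary.iter("s")
    }
    for tok in ET.fromstring(annotation_xml).iter("tok"):
        target = tok.get("target")
        if target not in result:
            # A link that does not resolve is an error in the annotation.
            raise KeyError(f"annotation points at unknown id: {target}")
        result[target]["tokens"].append((tok.get("form"), tok.get("tag")))
    return result


if __name__ == "__main__":
    for sid, entry in resolve(PRIMARY, ANNOTATION).items():
        print(sid, entry["text"], entry["tokens"])
```

In this arrangement the primary document is never modified: a further layer of annotation could be added as another separate document that refers to the same identifiers.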

Tomaz Erjavec
1/9/2000