Corpus Encoding Standard - Document CES 1. Title page. Version 1.4. Last modified 14 October 1996.

Corpus Encoding Standard

Nancy Ide, Coordinator

Abstract

This document is the first version of the Corpus Encoding Standard (CES). The CES is designed for use in language engineering research and applications, and is intended to serve as a widely accepted set of encoding standards for corpus-based work in natural language processing. The CES is an application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language) that complies with the specifications of the TEI Guidelines for Electronic Text Encoding and Interchange of the Text Encoding Initiative.

The CES specifies a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and typographic information) as well as general architecture (so as to be maximally suited for use in a text database). It also provides encoding specifications for linguistic annotation, together with a data architecture for linguistic corpora.

The CES is being developed in a bottom-up fashion, starting with minimal specifications and expanding them based on feedback from its use and on input from the research community in general. We invite and encourage comments on, and discussion of, any aspect of the CES.

Contents

CES Part 0. Introduction
- 0.1. Background
- 0.2. Scope of the CES
- 0.3. Overview of the CES
- 0.4. Status of this document
- 0.5. Key to tag descriptions
CES Part 1. General Principles
- 1.1. Definitions
- 1.2. Interchange vs. local processing
- 1.3. Levels of standardization
- 1.4. Types of information
- 1.5. Criteria
- 1.6. Customization of the TEI
CES Part 2. Recommendations common to all documents
- 2.1. Metalanguage recommendations
- 2.2. Character sets
CES Part 3. The header
- 3.1. Global attributes
- 3.2. Document structure
- 3.3. The File description
- 3.4. The Encoding description
- 3.5. The Profile description
- 3.6. The revision description
- 3.7. Use of decls attribute
- 3.8. Examples
- 3.9. The cesHeader DTD
- 3.10. The cesHeader DTD in hypertext navigable format
CES Part 4. Encoding primary data
- 4.0. Overview
- 4.1. Levels of encoding for primary data
- 4.2. Level 1 conformance
- 4.3. Level 2 conformance
- 4.4. Level 3 conformance
- 4.5. The cesDoc DTD for primary data
  - 4.5.1. Global attributes
  - 4.5.2. Element classes represented by entities in the cesDoc DTD
  - 4.5.3. Content models represented by entities in the cesDoc DTD
  - 4.5.4. Top-level structure
  - 4.5.5. Text body
  - 4.5.6. Text divisions
  - 4.5.7. Contents of text divisions
  - 4.5.8. Paragraph-level elements
  - 4.5.9. Sub-paragraph (phrase-level) elements
  - 4.5.10. Reference systems
  - 4.5.11. Encoding names
  - 4.5.12. Handling punctuation
  - 4.5.13. Encoding morpho-syntactic annotation in the primary data
  - 4.5.14. The cesDoc DTD
  - 4.5.15. The cesDoc DTD in hypertext navigable format
  - 4.5.16. The cesDoc DTD instantiated as a TEI customization
CES Part 5. Encoding linguistic annotation
- 5.0. Overview
- 5.1. Locators
- 5.2. Encoding conventions for segmentation and grammatical annotation
- 5.3. Encoding conventions for parallel text alignment
CES Part 6. Encoding speech
- [under construction]
CES Part 7. Encoding linguistic annotation for speech
- [under construction]
CES Annex 1 : Relevant standards
CES Annex 2 : Bibliography
CES Annex 3 : List of relevant URLs
CES Annex 4 : Tag index
CES Annex 5 : DTDs and related files
CES Annex 6 : DTDs in hypertext navigable format
CES Annex 7 : How to use the CES
CES Annex 8 : Consistency
CES Annex 9 : Minimization
CES Annex 10 : Overlapping hierarchies
CES Annex 11 : Installation and Revision notes (text file)

Acknowledgements

This document results from the joint effort of the European projects EAGLES (in particular, the EAGLES Text Representation subgroup), MULTEXT (LRE), and MULTEXT-EAST (Copernicus), together with the Vassar/CNRS collaboration supported by the U.S. National Science Foundation. The Centre National de la Recherche Scientifique (CNRS) has also supported the integration effort.

Primary authors and contributors

Nancy Ide: Department of Computer Science
Vassar College
Poughkeepsie, New York 12601 USA
tel : (+1) 914 437 5988
fax : (+1) 914 437 7498
e-mail : ide@cs.vassar.edu
and Laboratoire Parole et Langage
CNRS & Université de Provence
29, Avenue Robert Schuman, 13621 Aix-en-Provence Cedex 1, France
tel : (+33) 42 95 36 34
fax : (+33) 42 59 50 96
e-mail: ide@univ-aix.fr

Greg Priest-Dorman: Department of Computer Science
Vassar College
Poughkeepsie, New York 12601 USA
tel : (+1) 914 437 5990
fax : (+1) 914 437 7498
e-mail : priestdo@cs.vassar.edu

Jean Véronis: Laboratoire Parole et Langage
CNRS & Université de Provence
29, Avenue Robert Schuman, 13621 Aix-en-Provence Cedex 1, France
tel : (+33) 42 95 36 34
fax : (+33) 42 59 50 96
e-mail: veronis@univ-aix.fr

Other contributors: Lou Burnard, Oxford University Computing Service, Oxford, England; Dominic Dunlop, British National Corpus, Oxford, England; Ole Norling-Christensen, University of Copenhagen, Denmark; Eva Ejerhed, University of Umeå, Sweden; Tomaz Erjavec, Jozef Stefan Institute, Ljubljana, Slovenia; Hans van Halteren, University of Nijmegen, The Netherlands; Geoffrey Leech, University of Lancaster, England; Ole Norling-Christensen, The Society for Danish Language and Literature, Denmark; Daniel Ridings, University of Göteborg, Sweden; Laurent Romary, Centre de Recherche en Informatique de Nancy, France; John Sinclair, University of Birmingham, England; Henry Thompson, University of Edinburgh, Scotland

Please report suggestions or problems to priestdo@cs.vassar.edu.
This document is also available as a tar file (approx. 200k, tar.gz format).
