Corpus Encoding Standard - Document CES 1. Part 2. Version 1.1. Last modified 1 April 1996.

Part 2

Recommendations common
to all documents

2.1. Metalanguage recommendations
2.2. Character sets

2.1. Metalanguage recommendations

The CES constitutes a TEI-conformant application of SGML (ISO 8879). CES documents may be parsed using any SGML parser.

2.1.1. Tag Syntax

All elements in a document are delimited by tags. There are two forms of tag: a start-tag, marking the beginning of an element, and an end-tag, marking its end.

The CES uses the "reference concrete syntax" of SGML, which specifies that tags are delimited by the characters "<" and ">" and contain the name of the element (its gi, for generic identifier). In end-tags, the gi is preceded by "/". The gi may consist of upper and lower case letters and the digits 0-9.
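As an illustration of the tag syntax just described, the following Python sketch picks out start-tags and end-tags and reports the gi of each. The pattern is a simplification (it assumes a gi begins with a letter and does not handle ">" inside attribute values), the function name is hypothetical, and the element names in the sample are used only as examples; it is not a substitute for an SGML parser.

```python
import re

# "<", optional "/" (end-tag), a gi made of letters and digits,
# optional attribute text, then ">".
TAG = re.compile(r"<(?P<end>/)?(?P<gi>[A-Za-z][A-Za-z0-9]*)(?P<attrs>[^<>]*)>")

def list_tags(text):
    """Return (kind, gi) pairs for each tag found in text, in document order."""
    return [("end" if m.group("end") else "start", m.group("gi"))
            for m in TAG.finditer(text)]

sample = '<p><hi rend="it">corpus</hi> encoding</p>'
tags = list_tags(sample)
# tags is [("start", "p"), ("start", "hi"), ("end", "hi"), ("end", "p")]
```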

2.1.2. Element names

The CES adopts the strategy of the TEI application of SGML by extending the legal length of names from 8 to 32 characters. Case is not significant in tag or attribute names. However, we recommend the following conventions, following the TEI:

Lower case letters are used for identifiers, unless they are derived from more than one English word, in which case the first letter of the second word is capitalized.
Attributes are indicated within the start-tag, and take the form of an attribute name, an equal sign and the attribute value, which may be a number, a string literal or a quoted literal.
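The naming convention and the attribute form can be sketched in Python as follows. The function names are hypothetical, the treatment of identifiers built from more than two words (capitalizing every word after the first) is an assumption, and the attribute helper always produces a quoted literal without escaping quotation marks inside the value.

```python
NAME_LIMIT = 32  # maximum name length given above

def make_identifier(words):
    """Join English words: first word in lower case, later words capitalized."""
    first = words[0].lower()
    rest = [w.lower().capitalize() for w in words[1:]]
    name = first + "".join(rest)
    if len(name) > NAME_LIMIT:
        raise ValueError(f"{name!r} is longer than {NAME_LIMIT} characters")
    return name

def format_attribute(name, value):
    """Render an attribute as name="value" (quoted literal form)."""
    return f'{name}="{value}"'

tag_name = make_identifier(["tag", "usage"])      # "tagUsage"
attribute = format_attribute("version", "3.9")    # 'version="3.9"'
```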

2.1.3. TEI Metalanguage extensions

For the purposes of encoding the complexity and wide range of texts treated by the TEI, the TEI has significantly extended its metalanguage level specification beyond what is offered by SGML. For instance, the TEI provides additional mechanisms for

defining the meaning of characters (the Writing System Declaration: chapter 25 in TEI-P3)
defining well-formed feature structures (Feature System Declaration: chapter 26 in TEI-P3)
specifying complex intra- and inter-textual references (extended pointers: chapter 14, section 2 in TEI-P3)

All of these extensions are adopted in the CES.

2.1.4. Tag Minimization

SGML permits various kinds of minimization, or abbreviatory conventions. The TEI interchange format prohibits the use of most minimization techniques (e.g., short references, omission of generic identifiers in start and end tags) allowed in ISO 8879. The CES adopts the TEI prohibition against the use of minimization techniques in general:

every non-empty element in the distributed form of a corpus must have both a start-tag and an end-tag.
all attributes are supplied in the form "attributeName=value".
end-tags are routinely omitted on empty elements.
exceptions to this rule are laid out for pervasive and commonly marked elements. For such elements, end tags may be omitted and attribute values may be given without any associated attribute name, to serve the interests of compactness and readability. Exceptions are explicitly described in the encoding recommendations for these elements.
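The first rule above can be checked mechanically. The following Python sketch reports start-tags that have no matching end-tag. It is a rough approximation under stated assumptions: tags are found with a simplified pattern, the set of empty elements must be supplied by the caller in lower case, stray end-tags are ignored, and the exceptions for pervasive elements are not modelled. The function and parameter names are hypothetical.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")

def unclosed_elements(text, empty_elements=frozenset()):
    """Return the gis of start-tags lacking an end-tag, in order of detection."""
    stack, unclosed = [], []
    for match in TAG.finditer(text):
        is_end = match.group(1) == "/"
        gi = match.group(2).lower()      # case is not significant in names
        if gi in empty_elements:
            continue                     # empty elements carry no end-tag
        if not is_end:
            stack.append(gi)
        elif gi in stack:
            # Anything opened after the matching start-tag was never closed.
            while stack[-1] != gi:
                unclosed.append(stack.pop())
            stack.pop()
    return unclosed + stack

report = unclosed_elements("<p><s>one<s>two</s></p>")   # ["s"]
```

An empty result means every non-empty element found by the pattern has both tags.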

2.2. Character sets

A universal character set (UCS) that will cover all languages is under development by ISO and the Unicode Consortium. The results of the work so far on this character set have been approved as the Universal Multiple-Octet Coded Character Set standard, ISO/IEC 10646-1. UCS will likely be the accepted encoding standard for characters in the future.

UCS encodes each character in four bytes, thus providing a single character set to encode all the world's languages.
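The four-byte form can be observed with Python's "utf-32-be" codec, used here as a stand-in for the four-octet UCS encoding (an assumption of this sketch; the codec name comes from Python, not from the CES). Each character occupies four bytes regardless of script.

```python
text = "a\u00e9\u03a9"   # Latin a, e with acute accent, Greek capital omega

# Big-endian four-byte form, without a byte-order mark.
encoded = text.encode("utf-32-be")
per_character = [ch.encode("utf-32-be").hex() for ch in text]

# len(encoded) is 12: three characters, four bytes each.
# per_character is ["00000061", "000000e9", "000003a9"]
```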

However:

controversy still exists over some details of the scheme;
the standard is not complete;
some languages are not yet covered;
the standard is not yet supported in practice; only 8-bit character sets are generally supported.

Although there is little doubt that this standard will eventually become the basis for character representation, its full specification and implementation are far enough away that, for present purposes, it is necessary to provide a temporary solution.

For corpora intended for use in language engineering applications, much interchange will be accomplished via CD-ROM or ftp. Ftp allows binary interchange and can be used to safely transmit any 8-bit character set. Moreover, data interchange is becoming increasingly reliable, due to major international efforts towards standardization such as the Internet effort. For example, TCP/IP and many network applications (e.g., ftp, WWW) are "8-bit clean". In addition, recent standards have been proposed to guarantee delivery by automatically packing and unpacking data as required:

MIME (Multi-purpose Internet Mail Extensions: RFC-1521 and RFC-1522)
UTF-7 (RFC-1642).

Even when these standards are not yet implemented, files can be safely transferred by using universally available encoding programs such as 'uuencode'.

Therefore, we recommend that all data be distributed using the recommendations below for character sets. In the case of blind interchange, data should be encoded using 'uuencode'.
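Where the 'uuencode' program itself is not at hand, the same packing can be approximated in Python with the standard binascii module. This is a minimal sketch: the file name, the permission mode and the function names are illustrative assumptions, and the output follows the traditional layout of a "begin" line, encoded lines of at most 45 input bytes, a terminator line and "end".

```python
import binascii

def uuencode(data, name="corpus.sgm", mode=0o644):
    """Pack 8-bit data into 7-bit-safe uuencoded lines."""
    lines = [f"begin {mode:o} {name}\n".encode("ascii")]
    for start in range(0, len(data), 45):   # b2a_uu takes at most 45 bytes
        lines.append(binascii.b2a_uu(data[start:start + 45]))
    lines.append(b"`\nend\n")               # zero-length line, then "end"
    return b"".join(lines)

def uudecode(encoded):
    """Recover the original bytes from the output of uuencode above."""
    body = encoded.splitlines(keepends=True)[1:-2]   # drop begin, terminator, end
    return b"".join(binascii.a2b_uu(line) for line in body)

original = "na\u00efve caf\u00e9".encode("iso8859_1")   # 8-bit Latin 1 data
packed = uuencode(original)
restored = uudecode(packed)
```

The packed form contains only 7-bit characters, so it survives transfer channels that are not 8-bit clean.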

Our recommendation has the merit of being reasonably compatible with UCS, thus facilitating future migration to that standard.

The CES recommendations have been adopted by the EAGLES Tool subgroup for its Guidelines for Linguistic Software Development--see especially Part 1-1: Characters.

2.2.1. ISO 8859-X

The CES recommends the use of the ISO 8859-X series for all the following scripts: Arabic, Cyrillic, Greek, Hebrew, Latin.

The following is a rough list of the languages accommodated in the ISO 8859 series. See also the graphic representation of the code tables.

ISO-8859-1 - Latin 1: Western Europe and Americas: Afrikaans, Basque, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish and Swedish.
ISO-8859-2 - Latin 2: Latin-written Slavic and Central European languages: Czech, German, Hungarian, Polish, Romanian, Croatian, Slovak, Slovene.
ISO-8859-3 - Latin 3: Esperanto, Galician, Maltese, and Turkish.
ISO-8859-4 - Latin 4: Scandinavia/Baltic (mostly covered by 8859-1 also): Estonian, Latvian, and Lithuanian. It is an incomplete predecessor of Latin 6.
ISO-8859-5 - Cyrillic: Bulgarian, Byelorussian, Macedonian, Russian, Serbian and Ukrainian.
ISO-8859-6 - Arabic: Non-accented Arabic.
ISO-8859-7 - Modern Greek: Greek.
ISO-8859-8 - Hebrew: Non-accented Hebrew.
ISO-8859-9 - Latin 5: Same as 8859-1 except for Turkish instead of Icelandic.
ISO-8859-10 - Latin 6: Lappish/Nordic/Eskimo languages. Adds the last Inuit (Greenlandic) and Sami (Lappish) letters that were missing in Latin 4 to cover the entire Nordic area.
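In practice, choosing a part of the series amounts to choosing a one-byte-per-character encoding for the language at hand. The sketch below uses Python's built-in codecs for the ISO 8859 parts. The table is a partial, illustrative selection from the list above, and the table and function names are hypothetical.

```python
# Partial mapping from language to the Python codec for its ISO 8859 part.
CHARSET_FOR_LANGUAGE = {
    "French": "iso8859_1",
    "Polish": "iso8859_2",
    "Maltese": "iso8859_3",
    "Russian": "iso8859_5",
    "Greek": "iso8859_7",
    "Hebrew": "iso8859_8",
    "Turkish": "iso8859_9",
}

def encode_for(language, text):
    """Encode text in the ISO 8859 part listed for the given language."""
    return text.encode(CHARSET_FOR_LANGUAGE[language])

polish = encode_for("Polish", "\u0141\u00f3d\u017a")         # the city name Lodz
russian = encode_for("Russian", "\u043c\u0438\u0440")        # the word "mir"
# Each character occupies exactly one byte: len(polish) is 4, len(russian) is 3.
```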

A list of characters used by a large number of languages is provided in "Characters and character sets for various languages" (Alvestrand, 1995).

See also "ISO 8859-1 National Character Set FAQ" (Gschwind, 1995).

Shortcomings of the ISO 8859 series

The ISO 8859 series lacks the Dutch ligature ij, the French ligature oe, and German-style low-high quotation marks („ “), as well as several other characters.

There are also Bulgarian and Ukrainian characters missing from ISO 8859-5.
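Gaps of this kind can be detected before encoding. The following Python sketch lists the characters of a text that a given character set cannot represent; the function name is hypothetical and the sample strings are illustrative.

```python
def missing_characters(text, charset):
    """Return the characters of text that charset cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing

# French oe ligature and German quotation marks against Latin 1.
latin1_gaps = missing_characters("cœur „Zitat“", "iso8859_1")
# Ukrainian letter ghe with upturn against ISO 8859-5.
cyrillic_gaps = missing_characters("ґанок", "iso8859_5")
```

Here latin1_gaps holds the ligature and both quotation marks, and cyrillic_gaps holds the single Ukrainian letter. Such characters are candidates for the entity references described in section 2.2.3.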

2.2.2. Languages not covered by the ISO 8859 series

[THIS SECTION IS UNDER DEVELOPMENT]

The recommendations above do not provide for Asian languages, including Chinese, Japanese, and Korean. Independent standards have been developed for these languages. The CES specifications for these cases are under development.

If it is necessary to encode a text in a language not covered by the ISO 8859-X series, it is required to use

an ISO standard character set, if one exists; or
a Writing System Declaration (see TEI P3, chapter 25) documenting the use of any non-ISO character set.

It is also required that the character set used is fully documented in the header providing the encoding description for the corpus; see the description of <wsdUsage>.

Note that the TEI provides several pre-defined Writing System Declarations, including:

The official languages of the European Community, using the character set ISO 8859-1;
Hebrew (using ISO 8859-8);
Russian (using ISO 8859-5).

2.2.3. Entities for odd characters

Characters not available in the character set that has been selected for the document as a whole must be represented by entity references, which take the form of an ampersand (&) followed by a mnemonic for the character, and terminated by a semicolon (;) where this is necessary to resolve ambiguity. All entities used in a document must be declared in the DTD.
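The substitution described above can be sketched in Python: characters that the chosen character set cannot represent are replaced by entity references, and all other characters are left alone. The name table is a small illustrative excerpt using ISO entity names (oelig, mdash, pound), the function name is hypothetical, and every reference is terminated with a semicolon for simplicity.

```python
# Partial table from character to ISO entity name (illustrative excerpt).
ENTITY_NAMES = {
    "\u0153": "oelig",   # oe ligature
    "\u2014": "mdash",   # dash the width of an "m"
    "\u00a3": "pound",   # pound sterling sign
}

def to_entity_references(text, charset):
    """Replace characters missing from charset by entity references."""
    out = []
    for ch in text:
        try:
            ch.encode(charset)
            out.append(ch)                     # available: keep as is
        except UnicodeEncodeError:
            # Raises KeyError when no entity name is known for ch.
            out.append("&" + ENTITY_NAMES[ch] + ";")
    return "".join(out)

sample = "c\u0153ur \u2014 \u00a35"
in_latin1 = to_entity_references(sample, "iso8859_1")
in_latin2 = to_entity_references(sample, "iso8859_2")
```

With Latin 1 as the base character set the pound sign is kept as a character, whereas with Latin 2 it is also replaced, giving "c&oelig;ur &mdash; &pound;5".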

We recommend the use of ISO entities. Standard public entity names can be declared by a reference to a standard public entity set, e.g.,

       <!ENTITY % ISOLat1 PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN">
       %ISOLat1;
       <!ENTITY % ISOLat2 PUBLIC "ISO 8879-1986//ENTITIES Added Latin 2//EN">
       %ISOLat2;
       <!ENTITY % ISOGrk1 PUBLIC "ISO 8879-1986//ENTITIES Greek Letters//EN">
       %ISOGrk1;
       <!ENTITY % ISOGrk2 PUBLIC "ISO 8879-1986//ENTITIES Monotoniko Greek//EN">
       %ISOGrk2;
       <!ENTITY % ISOCyr1 PUBLIC "ISO 8879-1986//ENTITIES Russian Cyrillic//EN">
       %ISOCyr1;
       <!ENTITY % ISOCyr2 PUBLIC "ISO 8879-1986//ENTITIES Non-Russian Cyrillic//EN">
       %ISOCyr2;

etc.

Many of the characters that commonly need to be represented are included in the ISO entity sets ISOpub and ISOnum. These sets include, for example, entities for the special characters "&" and "<" (&amp; and &lt;), which are part of the SGML markup syntax and cannot be entered literally in an SGML document. They also contain entities such as &mdash; (for the dash the width of an "m"), &pound; (for British sterling), etc. The ISOpub and ISOnum entity sets are declared as follows:

<!ENTITY % ISOPUB PUBLIC "ISO 8879-1986//ENTITIES Publishing//EN"> %ISOPUB;

<!ENTITY % ISONUM PUBLIC "ISO 8879-1986//ENTITIES Numeric and Special Graphic//EN"> %ISONUM;

Note that these entity sets are declared in all the CES DTDs.

If no standard entity name exists or a standard entity is to be renamed, normal SGML syntax can be used to declare an appropriate entity, as follows:

<!ENTITY foo '[unprintable]'> 

Declarations of entities and entity sets not already included in the DTD for the document are added at the top of the encoded document, as in this example:

       <!doctype cesDoc PUBLIC "-//CES//DTD//cesDoc//EN" [
       <!ENTITY igcy     "i`"    --=small i grave, Cyrillic--    >
       <!ENTITY Igcy     "I`"    --=capital I grave, Cyrillic--  >
       <!ENTITY % ISOcyr1  PUBLIC 
            "ISO 8879-1986//ENTITIES Russian Cyrillic//EN"       >
       %ISOcyr1;
       <!ENTITY % ISOcyr2  PUBLIC
            "ISO 8879-1986//ENTITIES Non-Russian Cyrillic//EN"   >
       %ISOcyr2;
       ]>
       <cesDoc version="3.9">...

Notes:

SGML entities should not be used to replace characters of the base character set (this applies to local character sets only, not blind interchange).
Transliteration should not be used to replace appropriate character sets, either for local processing or interchange.

2.2.4. Shifting among character sets

When different character sets are mixed in a single document, the following methods can be used (possibly in conjunction), either explicitly or implicitly:

Explicitly:
- A wsd attribute can be used on any tag to indicate that the tag's content is encoded in the specified character set. The value of the attribute is the character set name (ISO-8859-1, etc.). WSD stands for "writing system declaration", borrowed from the TEI terminology.
- A lang attribute can be used on any tag to indicate that the tag's content is in the specified language. This method assumes a mapping between languages and character sets. The value of the attribute is composed of one of the following:
  - a two-letter code from ISO 639 (e.g., "en" for English);
  - a three-letter code from ISO 639-2 (e.g., "eng" for English);
  - one of the above extended by a country code from ISO 3166 (e.g., "en.uk" or "eng.uk" for English as spoken in the United Kingdom).
Implicitly:
- All instances of a given element can be associated with a particular character set, using the wsd attribute on the <tagUsage> element in the header.
- All instances of a given element can be associated with a particular language (using the lang attribute on <tagUsage>) which is in turn associated with a particular character set (using the wsd attribute on the corresponding <language> element in the header).

These implicit methods are useful when there is a systematic mapping between tags and character sets (e.g., a list of words in one character set, with their translations in another).
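The word-list case can be sketched in Python as follows. The element names "orth" and "trans", the table standing in for the <tagUsage> declarations, and the rule that an explicit wsd value takes precedence over the table are all assumptions made for illustration. The character set names are passed directly to Python's decoder, which accepts names of the form ISO-8859-1.

```python
# Hypothetical stand-in for wsd values declared per element in the header.
TAG_WSD = {"orth": "ISO-8859-1", "trans": "ISO-8859-5"}

def decode_content(gi, raw, wsd=None):
    """Decode an element's raw bytes using an explicit wsd, else the table."""
    charset = wsd or TAG_WSD[gi]
    return raw.decode(charset)

# A word in Latin 1 and its translation in Cyrillic, as they might be stored.
entry = [
    ("orth", "caf\u00e9".encode("ISO-8859-1")),
    ("trans", "\u043a\u0430\u0444\u0435".encode("ISO-8859-5")),
]
decoded = [decode_content(gi, raw) for gi, raw in entry]
```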

The CES provides global lang and wsd attributes, as well as appropriate mechanisms in the CES header to document correspondences between languages or tags and particular character sets.
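The forms of the lang attribute value described in this section can be checked with a short Python sketch. It verifies only the shape of the value (a two- or three-letter language code, optionally followed by a period and a two-letter country code); it does not check membership in ISO 639 or ISO 3166, and the restriction to lower case is an assumption based on the examples given. The function name is hypothetical.

```python
import re

LANG = re.compile(r"^(?P<language>[a-z]{2,3})(?:\.(?P<country>[a-z]{2}))?$")

def parse_lang(value):
    """Split a lang value into (language code, country code or None)."""
    match = LANG.match(value)
    if match is None:
        raise ValueError(f"not a recognized lang value: {value!r}")
    return match.group("language"), match.group("country")

simple = parse_lang("en")          # ("en", None)
extended = parse_lang("eng.uk")    # ("eng", "uk")
```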

Note that the language tagging mechanism will still be valid with UCS. "Unicode characters do not specify the language of the text they represent; that is, they are completely language neutral. If the language of a character or character string must be known to accomplish a particular type of process (e.g. language sensitive collation), then a higher-level protocol must be used to specify the language." [from Unicode's "Basic Principles"].

2.2.5. International Phonetic Alphabet

[THIS SECTION IS UNDER DEVELOPMENT]

The TEI provides a pre-defined Writing System Declaration (WSD) for transcribing the International Phonetic Alphabet. This is distributed by the TEI both as an SGML entity set and as a TEI Writing System Declaration documenting the entity set:

-//TEI P3: 1994//ENTITIES International Phonetic Alphabet//EN

The CES recommends using the SGML entities and providing the TEI WSD (with reference to it in the <wsdUsage> element in the header) when the IPA system is used in a document.
