Corpus Encoding Standard - Document CES 1. Part 4.5. Version 1.9. Last modified 5 December 1996.

Part 4.5

The cesDoc DTD
for primary data

4.5.1. Global attributes
4.5.2. Element classes represented by entities in the cesDoc DTD
4.5.3. Content models represented by entities in the cesDoc DTD
4.5.4. Top-level structure
4.5.5. Text body
4.5.6. Text divisions
4.5.7. Contents of text divisions
4.5.8. Paragraph-level elements
4.5.9. Sub-paragraph (phrase-level) elements
4.5.10. Reference systems
4.5.11. Encoding names
4.5.12. Handling punctuation
4.5.13. Encoding morpho-syntactic annotation in the primary data
4.5.14. The cesDoc DTD
4.5.15. The cesDoc DTD in hypertext navigable format
4.5.16. The cesDoc DTD instantiated as a TEI customization

4.5. The cesDoc DTD: description

This section defines the cesDoc DTD, which is used for Level 1, Level 2, and Level 3 CES-conformant encodings. The cesDoc DTD defines the required structure for marking Level 1 conformant documents down to the paragraph level. It also defines additional elements at the sub-paragraph level which may appear, but are not required, in a Level 1 encoding, and which are used in Level 2 and Level 3 encodings.

The cesDoc DTD specifies rules which determine where the included elements may legally appear in a document conforming to this DTD. The rules are expressed formally in the DTD for the document, which is given at the end of the section. This section also provides informal semantics for the use of the defined elements.

4.5.1. Global attributes

Five global attributes are defined in the cesDoc DTD:

id

a unique identifier for the element bearing the ID value.

n

a number or other label for the element, not necessarily unique within the corpus.

lang

indicates that the tag's content is in the specified language. The value of the lang attribute, which should be the same as that appearing on a <language> element in the header document which describes that language, is composed of one of the following:

a two-letter code from ISO 639 (e.g., "en" for English);
a three-letter code from ISO 639-2 (e.g., "eng" for English);
one of the above extended by a country code from ISO 3166 (e.g., "en.uk" or "eng.uk" for English as spoken in the United Kingdom).

wsd

indicates that the tag's content is encoded in the specified character set. The value of the attribute is the character set name (ISO-8859-1, etc.) which should be the same as that appearing on a <writingSystem> element in the header document which describes that character set.

rend

provides information about rendition in an original printed version. The rend attribute may take one of the following values, although other values are also valid:

BO bold face

BX boxed

IT italic font

RO roman font

UL underlined

CA capital letters

The global attributes are defined at the top of the cesDoc DTD and represented by an entity, A.TEXT. This entity is used to represent the list of global attributes on the attribute declarations for most elements in the document.
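
As an illustration of this mechanism, the following sketch shows how a parameter entity such as A.TEXT can carry the global attribute definitions into the attribute list of an element, and how the attributes might then appear on a tag. The declared value types and defaults shown are assumptions made for the example; the authoritative declarations are those of the DTD itself (section 4.5.14). The identifier and attribute values on the sample tag are invented.

```sgml
<!-- Illustrative only: declared value types and defaults are assumed -->
<!ENTITY % A.TEXT '
          id      ID      #IMPLIED
          n       CDATA   #IMPLIED
          lang    CDATA   #IMPLIED
          wsd     CDATA   #IMPLIED
          rend    CDATA   #IMPLIED'>

<!-- The entity reference expands to the five definitions above -->
<!ATTLIST p  %A.TEXT; >

<!-- A paragraph bearing all five global attributes (values invented) -->
<p id=P12 n=12 lang=en wsd="ISO-8859-1" rend=IT>...</p>
```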

4.5.2. Element classes represented by entities in the cesDoc DTD

For modularity and readability, the cesDoc DTD follows the TEI model of creating element classes for groups of elements which commonly appear together in content models. These element classes differ from the TEI's in two major ways:

elements are grouped into classes only on the basis of common appearance in content models, and not on the basis of shared attributes;
the CES element classes are far simpler, comprising a shallow hierarchy with no common elements.

Element classes are defined in the cesDoc DTD by declaring an entity that represents a group of elements. This entity can then appear in the content model of some element and indicates that all of the members of that class may appear at a common location.

The cesDoc DTD defines the following element classes (class names consistent with similar TEI classes):

M.INTER: paragraph-level elements, i.e., elements which can appear inside <div> elements at the paragraph level, or between paragraphs
M.PHRASE: phrase-level elements, i.e., elements which can appear inter-mixed with PCDATA at the sub-paragraph level. M.PHRASE contains the class M.TOKEN
M.TOKEN: elements which are regarded as individual tokens even when they may contain sub-constituents.
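
The way a class entity is declared and then used in a content model can be sketched as follows. The membership of M.TOKEN follows the list given in section 4.5.9.1; the membership of M.PHRASE is abridged, and the content model and tag minimization shown for <head> are assumptions made for the example rather than the declarations of the DTD.

```sgml
<!-- Illustrative only: membership of M.PHRASE is abridged here -->
<!ENTITY % M.TOKEN  'abbr | num | name | date | measure | time | term'>
<!ENTITY % M.PHRASE '%M.TOKEN; | title | foreign | mentioned | distinct | hi'>

<!-- Any member of the class may appear wherever the entity is referenced -->
<!ELEMENT head  - -  (#PCDATA | %M.PHRASE;)* >
```

Adding an element to a class then requires a change in one declaration only, and every content model that references the class is updated with it.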

4.5.3. Content models represented by entities in the cesDoc DTD

It is similarly useful to define entities that represent content models which are frequently used in defining elements: shared content models are then readily apparent, and each can be modified in a single place. The content models defined by entities in the cesDoc DTD are:

BASE.SEQ: the base content model for token level elements, including PCDATA, possibly inter-mixed with <abbr> and <num> elements.
PAR.SEQ: elements that can appear at the paragraph level--i.e., in between paragraphs, at the same level as <p>. This includes the elements in class M.INTER plus <p> and <sp>.
PHRASE.SEQ: phrase-level elements, consisting of PCDATA inter-mixed with the elements in class M.PHRASE.
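
A sketch of how such content model entities might be declared and used is given below. The class memberships are abridged and inferred from the descriptions in this section; the exact form of each entity (grouping, occurrence indicators) and the two element declarations are assumptions made for the example. The DTD in section 4.5.14 remains the reference.

```sgml
<!-- Illustrative only: class membership abridged and inferred -->
<!ENTITY % M.INTER    'caption | quote | poem | list | figure | bibl | note | table'>
<!ENTITY % M.PHRASE   'abbr | num | name | date | measure | time | term'>

<!ENTITY % BASE.SEQ   '(#PCDATA | abbr | num)*'>
<!ENTITY % PAR.SEQ    '(p | sp | %M.INTER;)*'>
<!ENTITY % PHRASE.SEQ '(#PCDATA | %M.PHRASE;)*'>

<!-- A content model entity stands for the whole model of an element -->
<!ELEMENT name  - -  %BASE.SEQ; >
<!ELEMENT p     - -  %PHRASE.SEQ; >
```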

4.5.4. Top-level structure

The top level structure of the cesDoc DTD is as follows:

<cesCorpus>

contains the whole of a CES encoded corpus, comprising a single corpus header and one or more cesDoc elements, each containing a single text header and a text. Additionally, the <cesCorpus> element can be recursively nested, and sequences of this element can appear at any nested level, in order to identify sub-corpora. In addition to the global attributes, it has the following attributes:

type: used to identify the type of a sub-corpus (by language, genre, etc.) when nested <cesCorpus> elements are used.
version: provides the version of the cesDoc DTD to which this corpus is compliant. If different parts of the corpus were created using different versions of the DTD (this is possible since any version is upward-compatible with its successor), then the value here reflects the highest version number used in the corpus--i.e., the version with which the corpus can be parsed.
TEIform: provides the TEI element which corresponds to this element.

<cesDoc>

a single document, either forming part of or derived from a corpus, containing a <cesHeader> element followed by a <text> element. In addition to n and id, this element has the following attributes:

type: indicates the type of document (text, spoken data, etc.); the default is text.
version: provides the version of the cesDoc DTD to which this text is compliant.
TEIform: provides the TEI element which corresponds to this element.

The <cesDoc> element can contain the following:

<cesHeader>

contains the header for the corpus or text. This element is fully described in section 3.

<text>

contains an individual text.

complete

specifies whether this text is complete or a sample.

Y in principle, all of the original has been transcribed
N a sample of the original has been taken

decls

specifies one or more IDs associated with elements in the text header or corpus header that apply to this element.

4.5.5. Text contents

The <text> tag may contain one occurrence of one of the following:

<group>

groups together a sequence of distinct texts that are regarded as a unit, such as a sequence of prose essays, poems, etc. Global attributes plus:

decls: specifies one or more IDs associated with elements in the text header or corpus header that apply to this element.

<body>

contains the body of the text, excluding any front or back matter. Global attributes plus:

decls: specifies one or more IDs associated with elements in the text header or corpus header that apply to this element.

The <body> element may contain:

a <group> element, grouping together a sequence of distinct texts that are regarded as a unit, such as a sequence of prose essays, poems, etc.
An optional sequence of paragraph level elements of arbitrary length, followed by one or more elements sub-dividing the text. The body may have no structural divisions within it at all.

Note that there is no provision for the encoding of front and back matter such as cover pages, tables of contents, appendixes, etc., in the current CES recommendations. For the most part, such material is unnecessary for corpus linguistics and should not be included. However, where desired, front and back matter can be encoded using the TEI elements <front> and <back>; see Annex 7 for the means to accomplish this.
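
The top-level structure described in sections 4.5.4 and 4.5.5 can be sketched as the following skeleton. It is an illustration only: the contents of the headers are elided, the identifiers and the division type are invented, and attributes that the DTD may require on these elements (for example on <cesHeader>) are not shown.

```sgml
<cesCorpus>
  <cesHeader> <!-- corpus header, contents elided --> </cesHeader>
  <cesDoc id=DOC1 type=text>
    <cesHeader> <!-- text header, contents elided --> </cesHeader>
    <text complete=N>
      <body>
        <p>A body with no structural divisions, only paragraphs.</p>
      </body>
    </text>
  </cesDoc>
  <cesDoc id=DOC2 type=text>
    <cesHeader> <!-- text header, contents elided --> </cesHeader>
    <text>
      <body>
        <div type=CHAPTER n=1>
          <p>A body sub-divided into chapters.</p>
        </div>
      </body>
    </text>
  </cesDoc>
</cesCorpus>
```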

4.5.6. Text divisions

Written texts exhibit a variety of different structural forms. Some have very little organization at levels higher than the paragraph, while others have a complex hierarchy of parts, sections, chapters etc. Novels are divided into chapters, newspapers into sections, reference works into articles, etc.

The following element is used to represent textual divisions of all kinds:

<div>: any subdivision of a written text, e.g. chapter, section, sub-section, article, etc.

If a text has any structural subdivision, then at least those at the highest level should be identified.

The <div> element has the following attributes:

type

categorises the division in some respect, e.g. as a chapter, section etc.

complete

specifies whether this division is complete or a sample.

Y* the full text of the original has been transcribed
N a sample of the original text has been taken

decls: specifies one or more IDs associated with elements in the text header or corpus header that apply to this element.

The n global attribute can be used to carry an identifying name or number used within the text for a given division, for example, a chapter number, as in the following example:

<div type=CHAPTER n=5>

The type attribute is required and is used to characterize the division. A set of precise values will be provided by EAGLES/PAROLE.

The content of the <div> tag is defined to consist of one or more division head elements (optional) followed by a sequence of paragraph-level elements, followed by one or more division closing elements (optional).

4.5.7. Contents of text divisions

Below the level of text divisions, there are three general groups of elements which may appear:

Division head elements: information such as section titles, bylines, etc. that often appears at the beginning of text sections.
Paragraph-level elements: further division of the text, into paragraphs, etc.
Division closer elements: information such as datelines, bylines, etc. that can appear at the end of a text section, especially in newspapers, etc.

Division head elements include:

<opener>

groups together any opening material that is not a heading at the start of a division, including in particular <dateline> and <keywords>.

<head>

contains any heading, for example, the title of a section. This element can also appear inside the <list> and <poem> elements to mark the title of a list or poem. It can contain any phrase-level element.

type: gives the type of header, e.g., main, sub, unspecified, etc.

<byline>

contains the primary statement of responsibility given for a work on its title page or at the head or ending of the work, most often applicable to newspapers. Can contain any phrase-level element plus the tag <docAuthor> for the author's name.

Division closing elements include:

<closer>: groups together material appearing at the end of a division, including in particular <dateline> and <keywords>.
<byline>: same as above.

The <keywords> element can contain terms and lists of terms that may appear at the beginning or end of a text as identifying material.

The <dateline> element can contain untagged prose intermixed with markup for dates, times, names, addresses, abbreviations, and numbers.
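
By way of illustration, a newspaper article encoded as a single division might look as follows. The text, the value of the type attribute, and the names are invented; the relative order of the division head elements and the use of <term> inside <keywords> are assumptions based on the descriptions above, not a statement of what the DTD requires.

```sgml
<div type=ARTICLE n=3>
  <head type=main>Harbour bridge reopens</head>
  <byline>By <docAuthor>J. Durand</docAuthor></byline>
  <opener>
    <dateline><name type=PLACE>Marseille</name>,
      <date ISO8601="1996-12-05">5 December 1996</date></dateline>
  </opener>
  <p>The bridge reopened to traffic this morning.</p>
  <p>Repairs had taken three months.</p>
  <closer>
    <keywords><term>bridge</term> <term>traffic</term></keywords>
  </closer>
</div>
```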

4.5.8. Paragraph-level elements

A number of divisions of text occur at what is called the paragraph-level, since the most common such division at this level is <p> (paragraph). There are in addition several other elements which may appear directly within structural divisions (that is, not nested within some other element).

<p>: a paragraph in a written text.
<sp>: contains material marked as "written to be spoken" or "written as spoken", usually by the presence of a speaker prefix, for example in a play script or printed interview.
<caption>: (1) a heading, title etc. attached to a picture or diagram (2) a "pull quote" or other text about or extracted from a text and superimposed upon it to draw attention to it.
<quote>: a quotation from some author other than that of the surrounding text, usually either embedded or displayed.
<poem>: a poem, or an extract from one, embedded or quoted within a text.
<list>: a collection of distinct items flagged as such by special layout in written texts, often functioning as a single syntactic unit.
<figure>: indicates the location of a graphic, illustration, or figure.
<bibl>: a loosely-structured bibliographic citation appearing within a corpus text.
<note>: any form of note, usually a footnote. This tag is used only for notes that are a part of the original data, not notes which may be added by the encoder, etc.
<table>: contains text displayed in tabular form, in rows and columns.

The paragraph-level elements are discussed in more detail in the following sub-sections.

NB: only the <p> element is required below the division level for minimal Level 1 CES conformance.

4.5.8.1. Captions

We distinguish between <head> elements, which can appear only at the start of a text division and are logically associated with it (for example, chapter titles, newspaper headlines, etc.), and <caption> elements, which are logically independent of the position they may have within a textual division (e.g., captions attached to pictures or figures, "pull-quotes" embedded within the text, or "by-lines" identifying authorship and provenance of a newspaper or periodical article).

The type attribute may be used to indicate the function of the caption:

type

categorizes the caption.

BYLINE caption containing authorship of an article

DISPLAY extra-textual caption (displayed box, etc.)

ATTACHED caption describing a figure, photograph, etc.

UNSPEC* not specified or unknown

A caption can be placed at a point other than where it appears, so as not to interrupt the normal flow of a text, by using it with the <ptr> tag. See the section on Pointing and reference.

4.5.8.2. Quotations

A quotation is a (usually long) extract from some other work than the text itself which is embedded within it. It is set off from the paragraphs that surround it typographically, by spacing similar to that for paragraphs (e.g., white space before and after). It may contain paragraphs, s-units, dialogue (marked with <q>) or any other phrase-level element.

In the CES, the use of the <quote> tag is sharply distinguished from that of the <q> tag, which is used to mark quoted material that appears inside a paragraph.
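
The distinction can be illustrated with the following invented passage, in which a short piece of quoted speech inside a paragraph is marked with <q> and a displayed extract from another document is marked with <quote>. The treatment of the quotation marks themselves is not shown here; it is governed by the section on Handling punctuation.

```sgml
<p>The chairman was brief: <q>We cannot accept these terms.</q>
He then read from the original agreement.</p>
<quote>
  <p>Either party may withdraw on giving ninety days' notice
  in writing to the other.</p>
</quote>
<p>No notice, he observed, had been given.</p>
```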

4.5.8.3. Spoken paragraphs

The <sp> element is used to mark parts of a written text which are intended to be spoken, for example the speeches in a dramatic text, or which comprise the transcription of a speech, interview, debate, etc. typically intended for publication (i.e., which have been transcribed to be read as text). Such parts are generally readily identifiable by the use of conventions such as speaker prefixes (the label supplying the name of the speaker) and stage directions. The <sp> element takes the following attribute:

who: name of the speaker

The <sp> element contains:

<speaker>

contains the speech prefix used in the original source to identify the speaker of a passage written to be spoken.

<stage>

contains any kind of stage direction within a dramatic text.

type: indicates the kind of stage direction.

The <sp> element is not intended to identify speaker turns identified in a spoken text, i.e. one which has been transcribed from audio tape. The <sp> element is used only for speaker turns identified as such in a written text.

The <speaker> element is used to tag a label or prefix identifying the speaker or speakers, and is followed by a sequence of paragraphs.

The <stage> element, when it appears, will normally be relocated to the end of a paragraph in which it occurs. The <ptr> element can be used to indicate its original position; see the section on Pointing and reference.
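
A short invented exchange from a play script might be encoded as follows. The value of the type attribute on <stage> is invented, and placing the stage direction after the paragraph, as a sibling of <p> within <sp>, is one reading of the relocation rule above; the DTD determines where <stage> may actually appear.

```sgml
<sp who="Anna">
  <speaker>ANNA.</speaker>
  <p>I have waited long enough. Tell him I have gone.</p>
  <stage type=exit>She leaves by the garden door.</stage>
</sp>
<sp who="Paul">
  <speaker>PAUL.</speaker>
  <p>He will not believe it.</p>
</sp>
```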

4.5.8.4. Poems

Poems or fragments of verse or song may appear between paragraphs. Where they are distinguished from the surrounding text, they are marked using the <poem> element, which contains an optional series of <head> elements followed by one or more <lg> or <l> (for line) elements. The <l> element is used to mark metrical lines, rather than typographic lines:

<lg>

groups verse lines (marked by <l>), most often into stanzas. Use the type attribute to identify the reason for the grouping.

<l>

a line of verse.

part

indicates whether the verse line is metrically complete.

U* metricality is not known or inapplicable

Y the line is metrically complete

N the line is metrically incomplete

Note that the <lg> element may be recursively nested, in order to provide for sub-groupings of lines. In this case, the n attribute should be used to indicate the nesting level (e.g., n=1 for outer level, n=1.1 for nested sub-level, etc.); see the section on Reference systems.
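
The nesting of line groups and the use of the n attribute to record the nesting level can be sketched as follows. The title, the values of the type attribute, and the placeholder line contents are invented for the example.

```sgml
<poem>
  <head>Evening Song</head>
  <lg type=stanza n=1>
    <lg type=couplet n=1.1>
      <l part=Y>first line of the first couplet</l>
      <l part=Y>second line of the first couplet</l>
    </lg>
    <lg type=couplet n=1.2>
      <l part=Y>first line of the second couplet</l>
      <l part=N>an incomplete closing line</l>
    </lg>
  </lg>
</poem>
```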

4.5.8.5. Lists

A list consists of an optional <head> element, followed by one or more <item> elements, each of which may optionally be prefixed by a <label> element:

<item>: an item within a list.
<label>: an enumerator or other label attached to a list item. Lists may or may not be marked. Where marked, they may appear within or between paragraphs.

The <label> element is used to hold the identifier or tag sometimes attached to a list item, for example "(a)", or a word or phrase used for a similar purpose. However, note that for the purposes of corpus-based work, it is usually preferable to regard list labels as rendition information and to encode them in the n attribute, rather than as part of the document content.

The <item> element may appear only inside lists. It contains the same elements as a paragraph, and may therefore contain one or more nested lists.
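
The two treatments of list labels described above can be compared in the following invented example. In the first encoding the labels are kept as document content; in the second they are regarded as rendition information and recorded in the n attribute.

```sgml
<!-- Labels kept as document content -->
<list>
  <head>Documents required</head>
  <label>(a)</label> <item>a valid passport</item>
  <label>(b)</label> <item>two recent photographs</item>
</list>

<!-- Labels treated as rendition information and recorded in n -->
<list>
  <head>Documents required</head>
  <item n="(a)">a valid passport</item>
  <item n="(b)">two recent photographs</item>
</list>
```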

4.5.8.6. Figures

Figures are marked with the following tag, which enables a reference to a stored image in another file:

<figure>

indicates the location of a graphic, illustration, or figure.

entity: names the external entity within which the graphic image of the figure is stored.

The <figure> element contains an optional <head> element for the figure title or heading, followed by an optional sequence of paragraphs for commentary or caption, an optional <figDesc> element, and an optional <body> element for including the graphic itself, where desired. The <figure> element can be empty, serving only to mark the presence of a figure in the text.

<figDesc>: contains a brief prose description of the appearance or content of a graphic figure, for use when documenting an image without displaying it.

Note that in many instances, figures will not be retained at all in the encoded version of the text. In this case, the <gap> element should be used to indicate the omission.
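
The three situations described above might be encoded as follows. The example is a sketch under stated assumptions: the entity name harbour1 is invented and is assumed to be declared elsewhere as an external entity, the attribute values on <gap> are invented, and <gap> is written without an end tag on the assumption that it is declared as an empty element.

```sgml
<!-- Figure retained, with a title and a prose description -->
<figure entity=harbour1>
  <head>The old harbour</head>
  <figDesc>Black-and-white photograph of fishing boats at low tide.</figDesc>
</figure>

<!-- Figure marked as present, with no further information -->
<figure></figure>

<!-- Figure not retained in the encoded text -->
<gap desc="photograph of the old harbour" reason="figure omitted">
```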

4.5.8.7. Annotations (<note> and <bibl>)

Annotations and bibliographic citations or references are marked using the following elements:

<note>

any form of note, usually a footnote. This tag marks only notes that are a part of the original text, not notes that may be added by the encoder, etc.

place

for a written text, specifies the location of an original note in the source text.

FOOT note at foot of page.

END note at end of current division or text.

SIDE note in left or right margin.

UNSPEC* placement unknown or unspecified.

<bibl>

a loosely-structured bibliographic citation appearing within a corpus text.

Original notes may contain paragraphs, s-units, dialogue, and any other phrase-level element. The global n attribute can be used to indicate the value of a numbered note.

Like captions, notes are often moved from their original location in the original data and placed at another point so as not to interrupt the normal flow of a text, by using the <ptr> tag as follows (see the section on Pointing and reference):

      Here is a text, with a "1" at the end for a
      footnote. [1].
      <<Then, this note appears at
      this point in the original.>>
      But we would like to keep the text together.

This can be encoded as

      <p>Here is a text, with a "1" at the end for a
      footnote. <ptr target=N1 n=1 rend=bracketed>
      But we would like to keep the text together.</p>
      <note id=N1 place=foot>Then, this note appears at
      this point in the original.</note>

Bibliographic citations or references within running texts are marked using the <bibl> element, which can contain any phrase-level element plus the <author> element.

4.5.8.8. Tables

The <table> element is used to include tables in the text. It takes the attributes:

rows: indicates the number of rows in the table.
cols: indicates the number of columns in the table.

Note that in many instances, tables will not be retained at all in the encoded version of the text. In this case, the <gap> element should be used to indicate the omission.

4.5.9. Sub-paragraph (phrase-level) elements

The cesDoc DTD also includes tags for marking sub-paragraph-level elements. Marking sub-paragraph elements is not required for Level 1 documents, but some are required for Level 2 and Level 3 documents.

Certain phrase-level elements are commonly tagged in the early stages of the markup process, since they are signalled by the typography in legacy data or in printed versions serving as the copy. It is therefore desirable to provide some guidance for the inclusion of sub-paragraph markup in Level 1 documents.

The phrase-level elements that are provided for in the cesDoc DTD are selected on the basis of their relevance for corpus-based work. There are five main categories of phrase-level elements:

elements of linguistic interest;
elements indicating editorial changes to the original text;
the <hi> element for marking typographically distinct words or phrases, especially when the purpose of the highlighting is not yet determined;
elements for identifying s-units (typically orthographic sentences) and quoted dialogue;
elements for pointing and reference.

The cesDoc DTD imposes a relatively strict structure on sub-paragraph elements, intended to disallow options and impose a structure which is most suited to the needs of corpus-handling tools. Adherence to this structure for Level 1 documents is recommended, but not required.

4.5.9.1. Linguistic elements

There have been two main defining forces behind the choice of elements:

the needs of corpus-annotation tools, such as morpho-syntactic taggers, whose performance can often be improved by pre-identification of elements such as names, addresses, titles, dates, measures, foreign words and phrases, etc.
the need to identify objects which have intrinsic linguistic interest, or are often useful for the purposes of translation, text alignment, etc., such as abbreviations, names, terms, linguistically distinct words and phrases, etc.

The phrase-level elements identifying linguistically relevant elements are:

<abbr>

contains an abbreviation of any sort. Consult Handling Punctuation for guidelines for encoding abbreviations.

expan: contains the expansion of the abbreviation

<date>

contains a date in any format.

ISO8601: ISO 8601 normalized form of the date

<list>

a collection of distinct items flagged as such by special layout in written texts, often functioning as a single syntactic unit. Note that <list> is the only phrase-level element which is also a paragraph-level element; its content model is exactly the same in both instances. For its full definition see section 4.5.8.5.

<measure>

contains a number, word, or phrase indicating a quantity.

type

the type attribute takes one of the following values:

WEIGHT

LENGTH

COUNT

AREA

VOLUME

CURRENCY

TEMPERATURE

value

contains the ISO 4217 code for currency representation when the type attribute specifies CURRENCY.

<name>

contains a proper noun or noun phrase.

type

indicates the type of proper noun. Suggested values include:

PERSON

PLACE

ORG

LANGUAGE

See Encoding Names.

<num>

contains a number, written in any form.

value: contains the normalized value of the number.

<term>

contains a single-word, multi-word or symbolic designation which is regarded as a technical term.

<time>

contains a phrase defining a time of day in any format.

ISO8601

ISO 8601 normalized form of the time.

type

the type attribute takes one of the following values:

24HOUR

DESCRIPTIVE

<distinct>

identifies a word or phrase regarded as linguistically distinct (e.g., archaic, technical, dialect, etc.).

<foreign>

identifies a word or phrase as belonging to some language other than that of the surrounding text. Use the global lang attribute to indicate the language.

<mentioned>

marks words or phrases mentioned, not used.

<title>

contains the title of a work, whether article, book, journal, or series, including any alternative titles or subtitles.

The linguistic elements fall into two groups, which determine their content models:

elements which are, for many purposes of language engineering such as morpho-syntactic tagging, regarded as individual tokens, even when they may contain sub-constituents. In the CES this group includes names, dates, times, measures, abbreviations, and terms. These elements therefore may contain PCDATA. They may also contain the <abbr> and <num> elements; abbreviations and numbers are frequently identified and tagged automatically, and therefore their placement must be relatively free. Note that to avoid unnecessary recursive nesting of elements, the <abbr> element cannot contain another <abbr> tag, and <num> cannot contain another <num>.
This group of elements, which comprises the element class M.TOKEN, includes:
- <abbr>
- <num>
- <name>
- <date>
- <measure>
- <time>
- <term>
elements which may contain sub-constituents which are treated by corpus-analytic tools as tokens, or may be regarded as tokens in themselves. Each of these elements can contain any other phrase-level element, except itself (i.e., there is no recursive nesting of elements allowed). It is assumed that tokenizing tools may further analyze the content of these elements in order to identify constituent tokens where they exist.
This group of elements includes the following elements:
- <title>
- <foreign>
- <mentioned>
- <distinct>

This latter group also includes another tag, the <hi> tag, which is used to mark information which is rendered specially in some original, but for which the function of the highlighting is either unknown or unspecified. In later phases of up-translation when the function of the highlighting is determined, <hi> tags are very often changed to one of the other more descriptive tags in this group. See section 4.5.9.2., below, for a full discussion of the use of the <hi> tag.
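
An invented sentence tagged with several of the linguistic elements is given below as an illustration. All attribute values are examples only: the dates, times, and amounts are made up, and FRF is used as the ISO 4217 code for the French franc. Note that the <name> and <measure> elements, as token-level elements, contain only PCDATA together with <abbr> and <num>.

```sgml
<p><name type=PERSON><abbr expan="Doctor">Dr.</abbr> Martin</name> reached
<name type=PLACE>Lyon</name> on
<date ISO8601="1996-12-05">5 December 1996</date> at
<time ISO8601="14:30" type="24HOUR">14:30</time> and paid
<measure type=CURRENCY value="FRF"><num value="200">two hundred</num>
francs</measure> for a copy of <title>Le Monde</title>. He described
the price as <foreign lang=fr>raisonnable</foreign>. The word
<mentioned>tariff</mentioned> appears twice on the receipt.</p>
```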

4.5.9.2. Rendition information

In general it is not desirable to mark typographic features of a given printing of a text in texts designated for use in corpus-based research. However, there are circumstances under which it is desirable to retain this information. In particular, certain items of linguistic interest may be marked by typography in the original; e.g., linguistic emphasis and foreign words are often rendered in italics. In addition, some applications (e.g., machine translation which attempts to reproduce the format of the original) demand retaining the rendition information.

In the process of up-translation from legacy data, a first step is often to translate relevant typographic information into SGML, with no attempt to interpret the significance of the rendering (e.g., that the italics signify a foreign word). Interpretation is often too costly because it is ambiguous (e.g., italics signify not only foreign words, but also emphasis, titles, etc.). In such cases the <hi> element can be used. Normally, in later phases of up-translation, <hi> tags are changed to more descriptive tags, such as <title>, <foreign>, <mentioned>, or <distinct>.

<hi>

marks a word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made. The rend attribute should provide the original rendition information when its function has not yet been determined.

rend

describes the rendition or presentation of the highlighted item.

BO bold face

BX boxed

IT italic font

RO roman font

UL underlined

CA capital letters

Note: Several values from the list may be specified where appropriate, separated by spaces, e.g., "ro it".

When the <hi> tag is used, no claim about the reason is made. This may be the case in a Level 1 encoding, since determining the reasons for highlighting (e.g., presence of a foreign word, vs. emphasis, vs. a title, etc.) demands human intervention and is therefore too costly in the early stages of up-translation. Note that typographically highlighted phrases and the kind of highlighting used may be recorded in one of two ways:

using the global rend attribute
using the <hi> element with a rend attribute

The first method specifies an attribute on some element which contains all of and only the highlighted phrase. In this case, the function of the highlighting is clear (for example, to mark a heading), and the boundaries of the highlighted phrase therefore coincide with the boundaries of some other element. The rend attribute is given on the tag for that element, for example

<head rend=bo>The world beyond</head>

The second method inserts a new tag indicating that what it contains is highlighted. It is used

when the function of the highlighting is not clear;
where there is no tag identifying the feature concerned;
where the highlighted phrase is not co-terminous with some other element.

The rend attribute must be supplied on the <hi> element. The rend attribute is optional on all other elements.

Note that in cases where the <hi> element often appears with the same value for rend, a default value can be provided on the <tagUsage> element. When this mechanism is used, the rend attribute need be given only when the default does not apply to the given occurrence of the <hi> element.
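
The use of <hi> in an early encoding and its later replacement by more descriptive tags can be sketched with an invented sentence. In the second version the italics have been interpreted as a title and a foreign phrase; the third highlighted word, whose function corresponds to none of the descriptive tags, remains marked with <hi>. Retaining the rend attribute on the descriptive tags is optional and is shown here only to indicate that the rendition information need not be lost.

```sgml
<!-- First pass: typography recorded, function not yet interpreted -->
<p>He read <hi rend=it>Germinal</hi> and pronounced it a
<hi rend=it>tour de force</hi>, not <hi rend=it>quite</hi> what he expected.</p>

<!-- Later pass: highlighting interpreted where the function is clear -->
<p>He read <title rend=it>Germinal</title> and pronounced it a
<foreign lang=fr rend=it>tour de force</foreign>, not
<hi rend=it>quite</hi> what he expected.</p>
```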

Both the start and end tag for any SGML element must be contained within the start and end tag of any of its ancestors in the tree for that document. Since by definition <hi> elements can appear only within <p> elements, this means that where, for example, an italicized passage contains more than one paragraph or starts within a paragraph and spans one or more others, the <hi> element must be closed at the end of the enclosing element, and then re-opened within the next. For example, an italicized passage which crosses a <p> boundary must be tagged as follows:

<p>This is the start of a paragraph which <hi rend=it>switches to
italics here and then goes on for several paragraphs.</hi></p>
<p><hi rend=it>This second paragraph is all in italics</hi></p>
<p><hi rend=it>This is the last bit of italics</hi> and the rest is
in roman.</p>

That is, the <hi> element is closed before the end of the first paragraph and re-opened at the start of the next. Note that the following encoding is not acceptable:

<p>This is the start of a paragraph which <hi rend=it>switches to
italics here and then goes on for several paragraphs.</hi></p>
<p rend=it>This second paragraph is all in italics</p>
<p><hi rend=it>This is the last bit of italics</hi> and the rest is
in roman.</p>

This second encoding mixes different styles of marking the same feature for a given span of text, which will cause problems for retrieval.

4.5.9.3. Editorial corrections

The following tags are used to mark editorial changes:

<corr>

contains the correct form of a passage apparently erroneous in the copy text.

sic: gives the original form
resp: gives the name of the responsible editor
cert: used to indicate the degree of certainty with which the change has been made.

<gap>

indicates a point where material has been omitted in a transcription, whether for editorial sampling practice, or because the material is illegible.

desc: describes the omitted text
reason: gives the reason for the omission (sampling, illegible, etc.)
resp: gives the name of the responsible editor
cert: used to indicate the degree of certainty with which the change has been made.

Note that the <gap> element is useful for noting the omission of material which is often uninteresting for corpus-based language engineering applications, in particular, figures, tables, etc.

<reg>

contains text which has been regularized or normalized in some sense.

orig: gives the original form
resp: gives the name of the responsible editor
cert: used to indicate the degree of certainty with which the change has been made.

4.5.9.4. S-units and quoted dialogue

The segmentation of texts into s-units, or orthographic sentences, is usually accomplished by special tools. The results of such segmentation are, in the CES model, considered as a type of annotation and stored in a separate file, which has advantages for ease of processing. However, in some cases it is desirable to mark s-units and/or quoted dialogue in the primary data. We therefore provide mechanisms for marking these elements.

In some cases only quoted dialogue is marked in the primary data, because the identification of quoted dialogue can be accomplished automatically (by detecting quotation marks etc.).

<s>

identifies an s-unit within a document, typically an orthographic sentence.

next: gives the id reference of a subsequent <s> element which contains a continuation of the current sentence.
prev: gives the id reference of a previous <s> element which contains the beginning fragment of the current sentence.
type: indicates the type of sentence.
broken: indicates whether this <s> element is broken between two or more <s> elements (linked using the next and prev attributes).

<q>

contains quoted dialogue or other quoted material appearing inside a paragraph.

next: gives the id reference of a subsequent <q> element which contains a continuation of the current quote.
prev: gives the id reference of a previous <q> element which contains the beginning fragment of the current quote.
type: indicates the type of quote.
who: indicates the speaker of the quote.
broken: indicates whether this <q> element is broken between two or more <q> elements (linked using the next and prev attributes).

When s-units are tagged, no split should be made at a colon or semi-colon, even when it is followed by a word beginning with a capital letter (unless there is an end-of-paragraph marker).

When both <s> and <q> are marked, the problem of overlapping hierarchies can arise. For this reason it has been necessary to allow for mutually recursive nesting of <s> and <q> tags in the cesDoc DTD, a practice which is otherwise avoided. This allows all the following encodings:

<s><q>Indeed yes,</q> she replied.</s>

<q rend="PRE lsquo POST rsquo"><s>I know precisely what you are feeling.</s><s>I know all about your contempt, your hatred, your disgust.</s><s>But don't worry, I am on your side!</s></q><s>And then the flash of intelligence was gone...

However, the CES recommends that the <p> - <s> - <q> hierarchy be retained if possible--that is, the hierarchy of <s> elements is treated as primary, and the hierarchy of <q> elements is treated as secondary. In a case such as the one above, this can be accomplished by breaking the quotes and using the next and prev attributes together with the global id attribute to associate the fragments, as follows:

<s><q id=q1 type=part next=q2>I know precisely what you are feeling.</q></s> <s><q id=q2 type=part prev=q1 next=q3>I know all about your contempt, your hatred, your disgust.</q></s><s><q id=q3 type=part prev=q2>But don't worry, I am on your side!</q></s> <s>And then the flash of intelligence was gone...

In the following case, this method solves the problem of overlapping hierarchies:

<s>According to the visiting leader, the economy of the country is <q id=q1 type=part next=q2>better than ever.</q></s> <q id=q2 type=part prev=q1><s>It is in fact in very good shape.</s></q></p>

NOTE: The strategy that retains the <p> - <s> - <q> hierarchy is required for Level 3 conformance.
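As a minimal sketch of the fragment-linking strategy, the following Python function generates one <s> element per sentence of a multi-sentence quotation and chains the enclosed <q> fragments with id, prev and next. The function name `chain_quote_fragments` and the id prefix are assumptions for illustration; the attribute order simply follows the examples above.

```python
def chain_quote_fragments(sentences, prefix="q"):
    """Return one <s><q ...>...</q></s> string per quoted sentence.

    sentences: list of sentence strings that together form one quotation.
    prefix: hypothetical id prefix; ids are prefix + running number.
    """
    total = len(sentences)
    out = []
    for i, text in enumerate(sentences, start=1):
        attrs = [f"id={prefix}{i}"]
        if total > 1:
            attrs.append("type=part")          # fragment of a longer quote
        if i > 1:
            attrs.append(f"prev={prefix}{i - 1}")  # link to preceding fragment
        if i < total:
            attrs.append(f"next={prefix}{i + 1}")  # link to following fragment
        out.append(f"<s><q {' '.join(attrs)}>{text}</q></s>")
    return out
```

The <s> hierarchy stays primary: each <q> fragment is wholly contained in a single <s>, and the quotation is reassembled by following the next and prev links.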

4.5.9.5. Pointing and reference

References in the text which refer to another part of it can be tagged with

<ref>

a reference to another location in the current document, in terms of one or more identifiable elements, possibly modified by additional text or comment. Attributes include the global attributes plus the following:

corresp

points to elements that correspond to the current element in some way.

next

gives the id reference of an element which contains a continuation of the current element.

prev

gives the id reference of an element which contains the previous portion of the current element.

type

indicates the type of pointer, e.g., aggregating, aligning, etc.

resp

specifies the creator of the pointer.

crdate

specifies when the pointer was created.

targType

indicates the type of data being linked, e.g., paragraph, sentence, etc.

targOrder

specifies whether the order of the identifiers in the targets list is significant. Values:

Y Yes: the order of the IDREFs specified as the value of the targets attribute should be followed when the elements are combined.

N No: the order of the IDREFs specified as the value of the targets attribute has no significance.

U* Unspecified: no claim is made about the order of the IDREFs specified as the value of the targets attribute.

evaluate

specifies the intended meaning when the target or targets are pointers themselves. Values:

ALL if the element pointed to is itself a pointer, then the target of that pointer will be taken, and so on, until an element is found that is not a pointer.

ONE if the element pointed to is itself a pointer, then its pointer (whether a target or not) is taken as the target of this pointer.

NONE no further evaluation of targets is carried out beyond that needed to find the element specified in the pointer's target.

target

provides the IDs of two or more <xptr> elements that point to the locations of the elements to be associated.

In some cases it is desirable to move an element to another location in the encoded text. This is common for footnotes which occur in-line in the electronic text, but which appear as footnotes, endnotes, etc. in a printed version. It is also common for captions, figures, bibliographic citations, and stage directions.

<ptr>

a pointer to another location in the current document in terms of one or more identifiable elements. Attributes include the global attributes plus the following:

corresp

points to elements that correspond to the current element in some way.

next

gives the id reference of an element which contains a continuation of the current element.

prev

gives the id reference of an element which contains the previous portion of the current element.

type

indicates the type of pointer, e.g., aggregating, aligning, etc.

resp

specifies the creator of the pointer.

crdate

specifies when the pointer was created.

targType

indicates the type of data being linked, e.g., paragraph, sentence, etc.

targOrder

specifies whether the order of the identifiers in the targets list is significant. Values:

Y Yes: the order of the IDREFs specified as the value of the targets attribute should be followed when the elements are combined.

N No: the order of the IDREFs specified as the value of the targets attribute has no significance.

U* Unspecified: no claim is made about the order of the IDREFs specified as the value of the targets attribute.

evaluate

specifies the intended meaning when the target or targets are pointers themselves. Values:

ALL if the element pointed to is itself a pointer, then the target of that pointer will be taken, and so on, until an element is found that is not a pointer.

ONE if the element pointed to is itself a pointer, then its pointer (whether a target or not) is taken as the target of this pointer.

NONE no further evaluation of targets is carried out beyond that needed to find the element specified in the pointer's target.

target

provides the IDs of two or more <xptr> elements that point to the locations of the elements to be associated.

Examples:

     Here is a text.
     This caption appears at this point.
     But we would like to keep the text together.

This can be encoded as

     <p>Here is a text.
     <ptr target=C1>
     But we would like to keep the text together.</p>
     <caption id=C1>This caption appears at this point.</caption>

The note in the following example originally appeared at the location of the <ptr> tag:

The <name type=org>Ministry of Truth</name>, &mdash; <name type=org lang=ns>Minitrue</name>, in <name>Newspeak</name><ptr target=N1 rend=asterisk> &mdash; was startlingly different from any other object in sight...</p>
<note place=foot id=N1><name>Newspeak</name> was the official language of <name type=place>Oceania</name>. For an account of its structure and etymology see Appendix.</note>
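The three values of the evaluate attribute described above for <ref> and <ptr> can be read as three ways of following a chain of pointers. The Python sketch below is one possible interpretation, not a normative definition: it assumes a hypothetical dictionary `pointers` that maps the id of every pointer element to the id it targets, with non-pointer elements simply absent from the dictionary.

```python
def resolve(target_id, pointers, evaluate="NONE"):
    """Return the id finally designated by a pointer to target_id.

    pointers: dict mapping the id of each pointer element to its target id
              (an assumed data structure; ids not in it are not pointers).
    evaluate: "ALL", "ONE" or "NONE", as in the attribute description.
    """
    if evaluate == "NONE":
        # No further evaluation: the element pointed to is the result.
        return target_id
    if evaluate == "ONE":
        # Follow exactly one further level if the target is itself a pointer.
        return pointers.get(target_id, target_id)
    if evaluate == "ALL":
        # Keep following until an element is found that is not a pointer.
        seen = set()
        current = target_id
        while current in pointers:
            if current in seen:
                raise ValueError(f"pointer cycle at {current!r}")
            seen.add(current)
            current = pointers[current]
        return current
    raise ValueError(f"unknown evaluate value: {evaluate!r}")
```

The cycle check is an addition for safety in the sketch; the attribute description itself does not say what should happen when pointers refer to one another in a loop.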

4.5.10. Reference systems

For purposes of alignment or other reference to elements within a text, a reference system can be built up using the id attribute on appropriate elements.

We recommend the following strategy:

supply a unique identifying label in the id attribute of the <body> tag
for each nested division, give each unit an identifier which is built up by successively adding to the identifier of the text; for example

          <body id=ORW1>
            <div type=part id=ORW1.1>
              <div type=chapter id=ORW1.1.1>
                 <div type=section id=ORW1.1.1.1>
                 </div>
              </div>
            </div>
          </body>

for each paragraph, add another layer to the immediately superordinate identifier, as follows:

          <div type=section id=ORW1.1.1.1>
               <p id=ORW1.1.1.1.p1></p>
               <p id=ORW1.1.1.1.p2></p>
          </div>

for each s-unit, add another layer to the superordinate identifier on the enclosing <p> element:

          <div type=section id=ORW1.1.1.1>
               <p id=ORW1.1.1.1.p1>
                 <s id=ORW1.1.1.1.p1.s1></s>
               </p>
          </div>
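The recommended identifier scheme is easy to generate mechanically. The following Python sketch shows one way to do so; the helper names `child_id` and `number_paragraphs`, and the representation of a division as a list of paragraphs that are each a list of sentence strings, are assumptions made for the example.

```python
def child_id(parent_id, index, prefix=""):
    """Add one layer to a superordinate identifier, e.g. ORW1.1 -> ORW1.1.1."""
    return f"{parent_id}.{prefix}{index}"


def number_paragraphs(div_id, paragraphs):
    """Return SGML lines for the paragraphs and s-units of one division.

    div_id: id of the enclosing <div> element.
    paragraphs: list of paragraphs, each a list of sentence strings.
    """
    lines = []
    for p_index, sentences in enumerate(paragraphs, start=1):
        p_id = child_id(div_id, p_index, "p")
        lines.append(f"<p id={p_id}>")
        for s_index, sentence in enumerate(sentences, start=1):
            s_id = child_id(p_id, s_index, "s")
            lines.append(f"  <s id={s_id}>{sentence}</s>")
        lines.append("</p>")
    return lines
```

For instance, calling `child_id` three times starting from `ORW1` yields the section identifier `ORW1.1.1.1`, and `number_paragraphs` then extends it with `.p1`, `.p1.s1`, and so on.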

4.5.11. Encoding names

When a string of characters is tagged as a name, many corpus-handling tools treat the string as a single token (e.g. some morpho-syntactic taggers) and do not perform additional analysis.

Titles and roles

For English, we can state the following rules:

Titles such as "Mr." and role names such as "Secretary" are not considered part of a person name:

Mme. <name>Edith Cresson</name>
(or: <abbr>Mme.</abbr> <name>Edith Cresson</name>)
President <name>Boris Yeltsin</name>

Appositives such as "Jr." are considered part of a person name:

<name>Sammy Davis, Jr.</name>

Where these rules can be used for encoding other languages they should be followed.

Possessives and inflected forms

In English the possessive is formed by the addition of "'s" which is tokenized separately, and should not be encoded as a part of the name:

<name>Winston</name>'s

In English, adjectival forms such as "Estonian" should not be tagged with the <name> tag. More generally, for any language, only nouns or noun phrases should be marked as names.

Forms of names with punctuation

Punctuation is normally considered to be a separate token, and should be encoded outside the <name> tag. See the discussion in the next section.

Examples:

Jaguar is made in <name type=place>Britain</name>.

<name type=place>France</name>-based

<name type=place>U.S.</name>-<name type=place>Japan</name> trade negotiations

Forms not to be tagged as names

Laws, diseases, prizes, etc. named after people or saints, etc. should not be tagged with <name type=person>.
Street addresses, street names, adjectival forms of place names should not be tagged as <name type=place>.

4.5.12. Handling punctuation

Punctuation should be left as in the original text, except in the cases noted below.

Note that punctuation and special characters are treated by many corpus-handling tools as separate tokens. For example, a text such as

                  <q>Ignorance is strength.</q>

may be tokenized as

                      TOKEN   Ignorance 
                      TOKEN   is 
                      TOKEN   strength
                      TOKEN   .

Full stops and ellipses

The full stop should be kept as both a part of an abbreviation and as an end-of-sentence indicator. The disambiguation of the two uses is accomplished by the marking of abbreviations and/or s-units, when such markup is provided.

Ellipses should be regularized so that the three periods are contiguous, with no spaces in between.

Full stops appearing as a part of abbreviations should not be separated from the rest of the abbreviation string when the abbreviation is marked with the <abbr> tag, even though the full stop may serve a double function (i.e., also signal end-of-sentence).

Example:

I'm back in the U.S.

should be tagged as

I'm back in the <abbr>U.S.</abbr>

even though the period is both part of the abbreviation and a signal of end-of-sentence.
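The ellipsis rule given earlier in this subsection lends itself to a simple regular-expression pass. The sketch below is one possible implementation in Python; the function name `regularize_ellipses` is hypothetical, and the pattern only handles three periods separated by spaces or tabs within a line.

```python
import re

# Three periods separated only by optional spaces or tabs.
_SPACED_ELLIPSIS = re.compile(r"\.(?:[ \t]*\.){2}")


def regularize_ellipses(text):
    """Make the three periods of an ellipsis contiguous."""
    return _SPACED_ELLIPSIS.sub("...", text)
```

Because the periods must be separated by whitespace alone, abbreviations such as "U.S." are left untouched.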

Hyphens and dashes

Line-end (soft) hyphens should be removed where they are not part of the regular spelling of the word. In cases of doubt, guidance should be sought elsewhere in the same text or in dictionaries. If doubt still remains, a hyphen should be retained rather than removed.

Dashes are marked by an entity reference (&mdash;). No distinction should be made between different types of dashes.

Apostrophes

Apostrophes should be left as they are in the original text. Note that the apostrophe can be ambiguous with the single quotation mark (e.g., in English the possessive "Joneses'"). This may be disambiguated by the marking of quotations.

Punctuation and tokens identified by the encoder

There is a small class of tags which mark the presence of tokens that have been isolated and classified by the encoder. Among the elements included in the cesDoc DTD, the following may be used to identify individual tokens:

                      <abbr>
                      <date>
                      <num>
                      <measure>
                      <name>
                      <term>
                      <time>

For many tools, when such an element is identified in the input stream, it is not desirable to further tokenize the string inside the tag; rather, the string inside the tag can be regarded as a single token (possibly with the type indicated by the tag name). In some languages, for instance, lexical lookup routines and morpho-syntactic taggers may be able to assume that an element with the tag <name> is a single token with the grammatical category PROPER NOUN (Np). For example,

<name type=person>Big Brother</name>

can be tokenized as

TOKEN(name) Big Brother

Similarly, the string

<date>April 4th, 1984</date>

can be tokenized as

TOKEN(date) April 4th, 1984

Therefore, punctuation that is not a part of an identified token should not appear within the tag (except in abbreviations, as discussed under "Full stops and ellipses" above). For example, the text

The Ministry of Love, which maintained law and order.

should be encoded as

The <name type=org>Ministry of Love</name>, which maintained law and order.

Other examples:

<name type=org>Jaguar</name> company in <name type=place>Britain</name>.

...he had been born in <date>1944</date> or <date>1945</date>; but it...

...the three slogans of the <name type=org>Party</name>:...
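The tokenization behaviour described in this subsection can be illustrated with a small Python sketch. It is a toy tokenizer written for this example, not a description of any particular tool: the names `TOKEN_TAGS` and `tokenize` are hypothetical, other tags are simply skipped, and words are approximated by runs of word characters.

```python
import re

# Tags whose content is treated as one token identified by the encoder.
TOKEN_TAGS = ("abbr", "date", "num", "measure", "name", "term", "time")

_PATTERN = re.compile(
    # 1. an encoder-identified token: <tag ...>body</tag>
    r"<(?P<tag>" + "|".join(TOKEN_TAGS) + r")\b[^>]*>(?P<body>.*?)</(?P=tag)>"
    # 2. any other start or end tag (matched so that it can be skipped)
    r"|<[^>]+>"
    # 3. an ordinary word
    r"|(?P<word>\w+)"
    # 4. a single punctuation or special character
    r"|(?P<punct>[^\w\s])",
    re.DOTALL,
)


def tokenize(text):
    """Return (label, string) pairs for a fragment of encoded text."""
    tokens = []
    for match in _PATTERN.finditer(text):
        if match.group("tag"):
            # The whole element content is one token typed by the tag name.
            label = f"TOKEN({match.group('tag')})"
            tokens.append((label, match.group("body")))
        elif match.group("word"):
            tokens.append(("TOKEN", match.group("word")))
        elif match.group("punct"):
            tokens.append(("TOKEN", match.group("punct")))
    return tokens
```

With this sketch the comma after `</name>` in the Ministry of Love example becomes a token of its own, which is why punctuation that is not part of the identified token is kept outside the tag.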

Punctuation and quotations

When the <q> or <quote> tag is used, any quotation marks or other typographical device for indicating quoted dialogue should be removed from the text. The rend attribute can be used to indicate the means by which the quotation was originally marked in the text (this is not required). In these cases, the value of the rend attribute should be one of the following, which are consistent with entity names in ISOpub and ISOnum:

laquo    angle quotation mark, left
raquo    angle quotation mark, right
lsquo    single quotation mark, left
rsquo    single quotation mark, right
ldquo    double quotation mark, left
rdquo    double quotation mark, right
lsquor   rising single quote, left (low)
ldquor   rising dbl quote, left (low)
rdquor   rising dbl quote, right (high)
rsquor   rising single quote, right (high)
mdash    dash the width of lowercase m

Note that it is required to eliminate quotation marks etc. marking a quotation for Level 2 and 3 conformant encodings, since the rendition conventions for dialogue are language-specific and therefore not a part of the "content" proper.

In principle, encode punctuation as inside or outside the <q> tag according to the position of the quotation marks in the original, as in these examples:

('dealing on the free market', it was called)
(<q rend="PRE lsquo POST rsquo">dealing on the free market</q>, it was called)
The dark-haired girl behind Winston had begun crying out `Swine! Swine! Swine!'
The dark-haired girl behind <name type=person>Winston</name> had begun crying out <q rend="PRE lsquo POST rsquo">Swine! Swine! Swine!</q>
'I am with you,' O'Brien seemed to be saying to him.
<q rend="PRE lsquo POST rsquo">I am with you,</q> <name type=person>O'Brien</name> seemed to be saying to him.
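The conversion shown in these examples, in which the quotation marks are removed and recorded in the rend attribute, might be sketched in Python as follows. The mapping from Unicode quotation characters to the entity names listed above is an assumption of this example (the source text may use other character codes), and `quote_to_q` is a hypothetical helper name.

```python
# Assumed mapping from typographic quotation characters to entity names.
REND_NAMES = {
    "\u00ab": "laquo",
    "\u00bb": "raquo",
    "\u2018": "lsquo",
    "\u2019": "rsquo",
    "\u201c": "ldquo",
    "\u201d": "rdquo",
}


def quote_to_q(quoted):
    """Turn a string delimited by quotation marks into a <q> element.

    The first and last characters must be quotation marks found in
    REND_NAMES; they are removed and recorded in the rend attribute.
    """
    pre, post = quoted[0], quoted[-1]
    if pre not in REND_NAMES or post not in REND_NAMES:
        raise ValueError("string is not delimited by known quotation marks")
    rend = f"PRE {REND_NAMES[pre]} POST {REND_NAMES[post]}"
    return f'<q rend="{rend}">{quoted[1:-1]}</q>'
```

Punctuation inside the original quotation marks stays inside the <q> element, and punctuation outside them is left outside, since only the delimiting characters are removed.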

In cases where the <q> tag is used for text that is not enclosed in quotation marks in the original, leave punctuation that is not a part of the actual cited text outside the <q> tags:

BIG BROTHER IS WATCHING YOU, the caption beneath it ran.
<q rend=ca type=slogan><name type=person>Big Brother</name> is watching you</q>, the caption beneath it ran.
Never mind, it doesn't matter, he thought. ["Never mind, it doesn't matter" in italics]
<q rend=it>Never mind, it doesn't matter</q>, he thought.
Eureka! he shouted. ["Eureka!" in italics]
<q rend=it>Eureka!</q> he shouted.

Note, however, that the tokenization of the text should not be affected by the position of the punctuation relative to the closing tag; the same set of tokens is ultimately generated in either case.

Punctuation in <s> tags

Sentence terminating punctuation should always appear within an enclosing set of <s> and </s> tags:

<s><q rend=it>Eureka!</q> he shouted.</s>
<s>The dark-haired girl behind <name type=person>Winston</name> had begun crying out <q rend="PRE lsquo POST rsquo">Swine! Swine! Swine!</q></s>

Punctuation in other tags

Because tokenizers typically tokenize the text within tags such as <hi> and <foreign> in the ordinary way, rather than treating it as a single token, punctuation can appear either inside or outside the closing tag without effect. Therefore, given this text:

She ordered a croque monsieur. ["croque monsieur" in italics]

either of the two following encodings is acceptable:

She ordered a <foreign rend="it">croque monsieur</foreign>.

She ordered a <foreign rend="it">croque monsieur.</foreign>

4.5.13. Encoding morpho-syntactic annotation in the primary data

The CES recommends that linguistic annotation be encoded in a separate SGML document with its own DTD, which is linked to the primary data (see section 5). However, we recognize that for some applications it is still desirable to retain morpho-syntactic annotation in the same SGML document as the primary data. Therefore, the CES provides means to accomplish this in-file tagging. To implement it, a pre-defined module containing all the required definitions for the morpho-syntactic information must be brought in at the beginning of the document.

Using this method, the CES provides two different types of in-file annotation:

annotation using the same elements for marking tokens and associated base forms, part-of-speech tags, morphosyntactic descriptions, etc. as defined in section 5.2.4 and section 5.2.5, describing the cesAna DTD.
annotation using TEI-like elements for marking linguistic segments, such as <w> (word), <phr> (phrase), etc. With this option morphosyntactic tagging is accomplished using attributes on the <w> element.

Using the cesAna elements for tokens and morphosyntactic annotation

The full description of elements for this kind of encoding is provided in section 5.2.4 and section 5.2.5. They include:

<tok>: contains a token, consisting of its orthographic form in the original document, followed optionally by disambiguated corpus tag and/or one or more alternative sets of morphosyntactic information associated with the token.
<orth>: contains the orthographic form of the token as it appears in the original.
<disamb>: contains one or more disambiguated corpus tags associated with the token.
<lex>: contains one or more alternative sets of morphosyntactic information associated with the token.
<base>: the base or lemmatized form for the morphosyntactic information given in the associated <msd> element.
<msd>: the morphosyntactic description, specified in EAGLES-compliant format.
<ctag>: contains the corpus tag associated with the morphosyntactic information.

To enable the inclusion of these elements and provide for their appearance in appropriate locations, it is necessary to include the following in the SGML document in which they are used:

       <!doctype cesDoc PUBLIC "-//CES//DTD//cesDoc//EN"
       [
       <!ENTITY % token.elt PUBLIC "-//CES//DTD//ENTITIES Token//EN">
       %token.elt;
       ]>
       <cesdoc version="4.1">
          ...

The definitions between the "[" and "]" bring in all the required additional elements and modify the definition of M.TOKEN to consist only of the element <tok>, thus replacing the definition in the main cesDoc DTD. This results in the modification of the content model PHRASE.SEQ to consist of a series of <tok> elements, possibly intermixed with the elements <foreign>, <title>, <distinct>, <mentioned>, and <hi>.

As a result, any element whose content model is %PHRASE.SEQ now may contain a series of tokens rather than elements such as <abbr>, <num>, <date>, etc. The elements <foreign>, <title>, <distinct>, <mentioned>, and <hi> may be interspersed with tokens, when they themselves contain tokens. For example, a sentence containing the multi-word highlighted phrase "red house" could be encoded as

       <tok>
         <orth>He</orth>
         <ctag>NP</ctag>
       </tok>
       <tok>
         <orth>bought</orth>
         <ctag>VB</ctag>
       </tok>
       <tok>
         <orth>a</orth>
         <ctag>DT</ctag>
       </tok>
       <hi rend=it>
          <tok>
            <orth>red</orth>
            <ctag>AD</ctag>
          </tok>
          <tok>
            <orth>house</orth>
            <ctag>NN</ctag>
          </tok>
       </hi>
       ...

This strategy is intended to reflect the fact that most morphosyntactic tagging systems have specific tags for abbreviations, names, etc., and that therefore primary data which includes morpho-syntactic tagging would not include explicit CES tags for such elements, but would instead associate tokens with appropriate morpho-syntactic tags using the <tok> element. However, the interaction between the phrasal CES tags in the standard cesDoc DTD and morpho-syntactic tags in annotated primary data needs more consideration to find the optimal system. As it stands, the proposed system should allow for most possibilities we currently envision; we welcome user input on this matter.

Using the linguistic segmentation elements

This option works exactly as the one above, except that it brings in a different set of elements:

<cl>

represents a grammatical clause. Attributes include:

type: indicates the type of clause.

<phr>

represents a grammatical phrase. Attributes include:

type: indicates the type of phrase.

<w>

represents a grammatical (not necessarily orthographic) word. Attributes include:

type: indicates the type of word.
base: identifies the word's lemma.
ctag: provides part-of-speech information in the form of a corpus tag.

<m>

represents a grammatical morpheme. Attributes include:

type: indicates the type of morpheme.

<c>

represents a character. Attributes include:

type: indicates the type of character.

To use this option, include the following at the beginning of the document:

       <!doctype cesDoc PUBLIC "-//CES//DTD//cesDoc//EN"
       [
       <!ENTITY % word.elt PUBLIC "-//CES//ENTITIES Word//EN">
       %word.elt;
       ]>
       <cesdoc version="4.1">
          ...

In this case, the definitions between the "[" and "]" bring in the additional linguistic segmentation elements and modify the definition of M.TOKEN to consist only of these elements, resulting in the modification of the content model PHRASE.SEQ to consist of a series of any of the linguistic segmentation elements, possibly intermixed with the elements <foreign>, <title>, <distinct>, <mentioned>, and <hi>. As a result, any element whose content model is %PHRASE.SEQ now may contain a series of any or all of the linguistic segmentation elements, possibly interspersed with <foreign>, <title>, <distinct>, <mentioned>, and <hi>.

Using this option, a very concise representation of word/part-of-speech annotation can be obtained. Consider the following alternative encoding of the above example:

       <w base=he ctag=NP>He</w>
       <w base=buy ctag=VB>bought</w>
       <w base=a ctag=DT>a</w>
       <hi rend=it>
          <w base=red ctag=AD>red</w>
          <w base=house ctag=NN>house</w>
       </hi>
       ...
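To make the relation between the two in-file annotation styles concrete, the following Python sketch renders the same token data in both forms. It is illustrative only: the function names `as_tok` and `as_w` and the (orth, base, ctag) tuple layout are assumptions, the <tok> form is limited to the <orth> and <ctag> children used in the example earlier in this section, and attribute values are left unquoted as in the examples, which presumes they contain only name characters.

```python
def as_tok(token):
    """Render one (orth, base, ctag) tuple as a <tok> element."""
    orth, _base, ctag = token
    return "\n".join(
        [
            "<tok>",
            f"  <orth>{orth}</orth>",
            f"  <ctag>{ctag}</ctag>",
            "</tok>",
        ]
    )


def as_w(tokens):
    """Render (orth, base, ctag) tuples as concise <w> elements."""
    return [f"<w base={base} ctag={ctag}>{orth}</w>" for orth, base, ctag in tokens]
```

The <w> form carries the lemma and corpus tag as attributes, which is what makes it the more compact of the two representations.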

The cesDoc DTD

The cesDoc DTD in hypertext navigable format

The cesDoc DTD instantiated as a TEI customization
