Corpus
Encoding Standard
-
Document CES 1. Part 4.5. Version 1.9. Last modified 5 December 1996.
Part 4.5
The cesDoc DTD
for primary data
This section defines the cesDoc DTD, which is used for Level 1, Level 2,
and Level 3 CES-conformant encodings. The cesDoc DTD defines the
required structure for marking Level 1 conformant
documents down to the paragraph level. It also defines additional elements
at the sub-paragraph level which may appear, but are not required, in a
Level 1 encoding, and which are used in Level 2 and Level 3 encodings.
The cesDoc DTD specifies rules which determine where the included elements may
legally appear in a document conforming to this DTD. The rules are
expressed formally in the DTD for
the document, which is given at the end of the section. This section also
provides informal semantics for the use of the defined elements.
Five global attributes are defined in the cesDoc DTD:
-
- id
- a unique identifier for the element bearing the ID value.
- n
- a number or other label for the element, not necessarily
unique within the corpus.
- lang
- indicates that the tag's content is in the specified
language. The value of the lang attribute should be the same
as that appearing on a <language> element in the header
document which describes that language. It is composed of one of the
following:
- a two-letter code from ISO 639 (e.g., "en" for English);
- a three-letter code from ISO 639-2 (e.g., "eng" for English);
- one of the above extended by a country code from ISO 3166 (e.g.,
"en.uk" or "eng.uk" for English as spoken in the United Kingdom).
- wsd
- indicates that the tag's content is encoded in the
specified character set. The value of the attribute is the character set
name (ISO-8859-1, etc.) which should be the same as that appearing on a
<writingSystem> element in the header document which describes
that character set.
- rend
- provides information about rendition in an original
printed version. The rend attribute may take one of the
following values, although other values are also valid:
- BO bold face
- BX boxed
- IT italic font
- RO roman font
- UL underlined
- CA capital letters
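As a brief illustration of the global attributes, the following sketch shows a single paragraph carrying all five. The identifier, label, and text are invented for this example, and it assumes that <p> accepts the full set of global attributes; the lang and wsd values would need to match <language> and <writingSystem> elements in the header.

```sgml
<!-- Illustrative only: attribute values and text are invented. -->
<p id=p12 n=12 lang=en.uk wsd="ISO-8859-1" rend=IT>A paragraph printed
in italics in the source, written in British English.</p>
```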
The global attributes are defined at the top of the cesDoc DTD and
represented by an entity, A.TEXT. This entity is used to
represent the list of global attributes on the attribute declarations for
most elements in the document.
For modularity and readability, the cesDoc DTD follows the TEI model of creating
element classes for groups of elements which commonly appear together in
content models. These element classes differ from the TEI's in two major
ways:
- elements are grouped into classes only on the basis of common appearance
in content models, and not on the basis of shared attributes;
- the CES element classes are far more simplified, comprising a shallow
hierarchy with no common elements.
Element classes are defined in the cesDoc
DTD by declaring an entity that represents a group of elements. This entity can
then appear in the content model of some element and indicates that all of the
members of that class may appear at a common location.
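The mechanism can be sketched with a parameter entity. The names below (m.example, container, elemA, elemB, elemC) are hypothetical and are not taken from the cesDoc DTD; the sketch only shows how a class entity is declared once and then referenced in a content model.

```sgml
<!-- Hypothetical names throughout; consult the DTD itself for the
     classes it actually declares. -->
<!ENTITY % m.example "elemA | elemB | elemC">

<!-- Any member of the class may appear, mixed with text, in container. -->
<!ELEMENT container - - (#PCDATA | %m.example;)*>
<!ELEMENT (elemA | elemB | elemC) - - (#PCDATA)>
```

Adding a new member to the class then requires a change in one place only, the entity declaration.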
The cesDoc DTD defines the following element classes (class names
consistent with
similar TEI classes):
-
- M.INTER
- paragraph-level elements, i.e., elements which can appear inside
<div> elements at the paragraph
level, or between paragraphs
- M.PHRASE
- phrase-level elements, i.e., elements which can appear inter-mixed with
PCDATA at the sub-paragraph level. M.PHRASE
contains the class M.TOKEN
- M.TOKEN
- elements which are regarded as
individual tokens even when they may contain sub-constituents.
It is similarly useful to define entities that represent content models which
are frequently used in defining elements, since common content models are
readily obvious, and modification is simple. The content models defined by
entities in the cesDoc DTD are:
-
- BASE.SEQ
- the base content model for token level elements, including
PCDATA, possibly inter-mixed with <abbr> and
<num> elements.
- PAR.SEQ
- elements that can appear at the paragraph level--i.e., in between
paragraphs, at the same level as <p>. This includes the
elements in class M.INTER plus <p> and
<sp>.
- PHRASE.SEQ
- phrase-level elements, consisting of PCDATA
inter-mixed with the elements in class M.PHRASE.
The top level structure of the cesDoc DTD is as follows:
-
- <cesCorpus>
- contains the whole of a CES encoded corpus, comprising a single corpus
header and one or more cesDoc elements, each containing a single text
header and a text. Additionally, the <cesCorpus> element can be recursively nested, and sequences of this element can appear at any nested level, in order to identify sub-corpora. In addition to the global attributes, it has the following attributes:
- type
- used to identify the type of a sub-corpus (by language, genre, etc.) when nested <cesCorpus> elements are used.
- version
- provides the version of the cesDoc DTD to which this corpus is compliant. If different parts of the corpus were created using different versions of the DTD (this is possible since any version is upward-compatible with its successor), then the value here reflects the highest version number used in the corpus--i.e., the version with which the corpus can be parsed.
- TEIform
- provides the TEI element which corresponds to this element.
- <cesDoc>
- a single document, either forming part of or derived from a corpus,
containing a <cesHeader> element, followed by a <text> element,
which in turn contains either a <body> element or a <group> element.
In addition to n and id, this element has the following attributes:
- type
- indicates the type of document (text, spoken data, etc.); the default
is text.
- version
- provides the version of the cesDoc DTD to which this text is compliant.
- TEIform
- provides the TEI element which corresponds to this element.
The <cesDoc> element can contain the following:
-
- <cesHeader>
- contains the header for the corpus or text. This element is fully
described in section 3.
- <text>
- contains an individual text.
- complete
- specifies whether or not this text is complete or a
sample.
-
- Y in principle, all of the original has been
transcribed
-
- N a sample of the original has been taken
- decls
- specifies one or more IDs associated with elements in the text header or corpus header that apply to this element.
The <text> tag may contain one occurrence of one of the following:
-
- <group>
- groups together a sequence of distinct texts that are regarded as a unit,
such as a sequence of prose essays, poems, etc. Global attributes plus:
- decls
- specifies one or more IDs associated with elements in the text header or corpus header that apply to this element.
- <body>
- contains the body of the text, excluding any front or back matter.
Global attributes plus:
- decls
- specifies one or more IDs associated with elements in the text header or corpus header that apply to this element.
The <body> element may contain:
- a <group> element, grouping together a sequence of
distinct texts that are regarded as a unit,
such as a sequence of prose essays, poems, etc.
- an optional sequence of paragraph level elements of arbitrary length,
followed by one or more elements sub-dividing the text. The body may have
no structural
divisions within it at all.
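The containment described in the lists above (<cesDoc>, then <text>, then <body>) can be sketched as the following skeleton. This is a minimal illustration only: the header content is omitted, attributes such as version are left out, the text is invented, and the <div> element used to sub-divide the body is the one described in the discussion of textual divisions.

```sgml
<cesDoc>
<cesHeader>
<!-- header content omitted; see section 3 -->
</cesHeader>
<text>
<body>
<div type=CHAPTER n=1>
<p>First paragraph of the first chapter.</p>
</div>
<div type=CHAPTER n=2>
<p>First paragraph of the second chapter.</p>
</div>
</body>
</text>
</cesDoc>
```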
Note that there is no provision for the encoding of front or back matter such as
cover page, table of contents, appendixes, etc., in the current CES
recommendations. For the most part, such material is unnecessary for corpus
linguistics and should not be included. However, where desired, front and
back matter can be encoded using the TEI elements <front> and
<back>; see Annex 7 for the
means to accomplish this.
Written texts exhibit a variety of different structural forms. Some have very
little organization at levels higher than the paragraphs, while others have a
complex hierarchy of parts, sections, chapters etc. Novels are divided into
chapters, newspapers into sections, reference works into articles, etc.
The following element is used to represent textual divisions of all kinds:
-
- <div>
- any subdivision of a written text, e.g. chapter, section, sub-section,
article, etc.
If a text has any structural subdivision, then at least those at the highest
level should be identified.
The <div> element has the following attributes:
-
- type
- categorises the division in some respect, e.g. as a chapter,
section etc.
- complete
- specifies whether or not this division is complete or a
sample.
- Y* the full text of the original has been
transcribed
- N a sample of the original text has been taken
- decls
- specifies one or more IDs associated with elements in the text header or corpus header that apply to this element.
The n global attribute can be used to carry an identifying name or
number used within the text for a given division, for example, a chapter
number, as in the following example:
<div type=CHAPTER n=5>
The type attribute is required and is used to characterize the
division. A set
of precise values will be provided by EAGLES/PAROLE.
The content of the <div> tag is defined to consist of one
or more division head elements (optional) followed by a sequence of
paragraph-level elements, followed by one or more division closing elements
(optional).
Below the level of text divisions, there are three general groups of elements
which may appear:
-
- Division head elements
- information such as section titles, bylines, etc. that often appears at the
beginning of text sections.
- Paragraph-level elements
- further division of the text, into paragraphs, etc.
- Division closer elements
- information such as datelines, bylines, etc. that can appear at the end
of a text section, especially in newspapers, etc.
Division head elements include:
-
- <opener>
- groups together any opening material that is not a heading at the
start of a division, including in particular <dateline> and
<keywords>.
- <head>
- contains any heading, for example, the title of a section. This element
can also appear inside the <list> and <poem> elements
to mark the title of a list or poem. It can contain any
phrase-level element.
- type
- gives the type of header, e.g., main, sub, unspecified, etc.
- <byline>
- contains the primary statement of responsibility given for a work on
its title page or at the head or ending of the work, most often applicable
to newspapers.
Can contain any phrase-level element plus the tag <docAuthor> for
the author's name.
Division closing elements include:
- <closer>
- groups together material appearing at the end of a division, including
in particular <dateline> and
<keywords>.
- <byline>
- same as above.
The <keywords> element can contain
terms and lists of terms that may appear at the beginning or end of a text as
identifying material.
The <dateline> element can contain untagged prose intermixed
with markup for dates, times, names, addresses, abbreviations, and numbers.
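As an illustration of division head and closing elements, a newspaper article might be encoded along the following lines. The type value ARTICLE, the names, and the text are invented for this sketch, and the relative order of <head> and <byline> is an assumption; the DTD determines what is actually permitted.

```sgml
<div type=ARTICLE n=3>
<head type=main>Council approves new library</head>
<byline>By <docAuthor>A. N. Author</docAuthor></byline>
<p>First paragraph of the article.</p>
<p>Second paragraph of the article.</p>
<closer>
<dateline>Springfield, <date>5 December 1996</date></dateline>
</closer>
</div>
```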
A number of divisions of text occur at what is called the paragraph-level, since
the most common such division at this level is <p> (paragraph).
There are in addition several other elements which may appear directly within
structural divisions (that is, not nested within some other element).
- <p>
- a paragraph in a written text.
- <sp>
- contains material marked as "written to be spoken" or "written as
spoken", usually by the presence of a speaker prefix, for example in a play
script or printed interview.
- <caption>
- (1) a heading, title etc. attached to a picture or diagram (2) a "pull
quote" or other text about or extracted from a text and superimposed upon it
to draw attention to it.
- <quote>
- a quotation from some author other than that of the surrounding text,
usually either embedded or displayed.
- <poem>
- a poem, or an extract from one, embedded or quoted within a text.
- <list>
- a collection of distinct items flagged as such by special layout in
written texts, often functioning as a single syntactic unit.
- <figure>
- indicates the location of a graphic, illustration, or figure.
- <bibl>
- a loosely-structured bibliographic citation appearing within a corpus
text.
- <note>
- any form of note, usually a footnote. This tag is used only for notes
that are a part of the original data, not notes which may be added by the
encoder, etc.
- <table>
- contains text displayed in tabular form, in rows and columns.
The paragraph-level elements are discussed in more detail in the following
sub-sections.
NB: only the <p> element is required below the division
level for minimal Level 1 CES conformance.
We distinguish between <head> elements, which can appear only at
the start of a text division and are logically associated with it (for example,
chapter titles, newspaper headlines etc.) and <caption> elements,
which are logically independent of the position they may have within a textual
division (e.g., captions attached to pictures or figures, "pull-quotes"
embedded within the text, "by-lines" identifying authorship and provenance of
a newspaper or periodical article).
The type attribute may be used to indicate the function of the caption:
- type
- categorizes the caption.
- BYLINE caption containing authorship of an article
- DISPLAY extra-textual caption (displayed box,
etc.)
- ATTACHED caption describing a figure,
photograph, etc.
- UNSPEC* not specified or unknown
A caption can be placed at a point other than where it appears, so as not to
interrupt the normal flow of a text, by using it with the <ptr>
tag. See the section on Pointing and reference.
A quotation is a (usually long) extract from some other work than the text
itself which is embedded within it. It is set off from the paragraphs
that surround it typographically, by spacing similar to that for paragraphs
(e.g., white space before and after). It
may contain paragraphs, s-units, dialogue (marked with <q>) or any
other phrase-level element.
In the CES, the use of the <quote> tag is sharply distinguished
from that of the <q> tag, which is used to mark quoted material
that appears inside a paragraph.
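The distinction can be illustrated as follows. The text is invented; the sketch shows <q> used inside a paragraph and <quote> used at the paragraph level, between paragraphs.

```sgml
<!-- Quoted material inside a paragraph: use q. -->
<p>She paused and said, <q>Indeed yes.</q></p>

<!-- An extract set off from the surrounding paragraphs: use quote. -->
<quote>
<p>A longer extract from another work, displayed with white space
before and after it in the original.</p>
</quote>
```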
The <sp> element is used to mark parts of a written text which are
intended to be spoken, for example the speeches in a dramatic text, or which
comprise the transcription of a speech, interview, debate, etc., typically
intended for publication (i.e., which have been transcribed to be read as
text). Such parts are generally readily identifiable by the use of conventions
such as speaker prefixes (the label supplying the name of the speaker) and
stage directions. The <sp> element takes the following attribute:
- who
- name of the speaker
The <sp> element contains:
- <speaker>
- contains the speech prefix used in the original source to identify the
speaker of a passage written to be spoken.
- <stage>
- contains any kind of stage direction within a dramatic text.
- type
- indicates the kind of stage direction.
The
<sp> element is not intended to identify speaker turns identified
in a spoken text, i.e. one which has been transcribed from audio tape. The
<sp> element is used only for speaker turns identified as such in
a written text.
The <speaker> element is used to tag a label or prefix identifying
the speaker or speakers, and is followed by a sequence of paragraphs.
The <stage> element, when it appears, will normally be relocated
to the end of a paragraph in which it occurs. The <ptr>
element can be used
to indicate its original position; see the section on Pointing and reference.
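A speech from a play script might therefore be encoded roughly as follows. The speaker, the text, and the value of the type attribute on <stage> are invented for this sketch; the stage direction is shown after the paragraph, in keeping with the relocation practice described above.

```sgml
<sp who="Anna">
<speaker>ANNA.</speaker>
<p>I told you the train would be late.</p>
<stage type=exit>Exit.</stage>
</sp>
```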
Poems or fragments of verse or song may appear between paragraphs. Where they
are distinguished from the surrounding text, they are marked using the
<poem> element, which contains an optional series of
<head>
elements followed by one or more <lg> or <l> (for
line) elements. The <l> element is used to mark metrical lines, rather than
typographic lines:
-
- <lg>
- groups verse lines (marked by <l>), most often into stanzas.
Use the type attribute to identify the reason for the grouping.
- <l>
- a line of verse.
- part
- indicates whether the verse line is metrically complete.
- U* metricality is not known or inapplicable
- Y the line is metrically complete
- N the line is metrically incomplete
Note that the <lg> element may be recursively nested, in order
to provide for sub-groupings of lines. In this case, the n attribute
should be used to indicate the nesting level (e.g., n=1 for outer level,
n=1.1 for nested sub-level, etc.); see the section on Reference systems.
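The following sketch shows a poem with one level of nested line groups, numbered as suggested above. The text and the type values (stanza, couplet) are invented for the example, and the values of the part attribute follow the list given above.

```sgml
<poem>
<head>An example song</head>
<lg type=stanza n=1>
<lg type=couplet n=1.1>
<l part=Y>First metrical line of the first couplet,</l>
<l part=Y>second metrical line of the first couplet.</l>
</lg>
<lg type=couplet n=1.2>
<l part=Y>First metrical line of the second couplet,</l>
<l part=N>a line that breaks off</l>
</lg>
</lg>
</poem>
```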
A list consists of an optional <head> element, followed by one or
more <item> elements, each of which may optionally be prefixed by
a <label> element:
- <item>
- an item within a list.
- <label>
- an enumerator or other label attached to a list item. Lists may or
may not be marked. Where marked, they may appear within or between
paragraphs.
The <label> element is used to hold the identifier
or tag sometimes attached to a list item, for example "(a)", or a word or
phrase used for a similar purpose.
However, note that for the purposes of corpus-based work, it is usually
preferable to regard list labels as rendition information and to encode
them in the n attribute, rather than as part of the document
content.
The <item> element may appear only inside lists. It contains the
same elements as a paragraph, and may therefore contain one or more nested
lists.
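The two treatments of list labels can be compared in the following sketch, which uses invented content. The first encoding keeps the labels as document content; the second records them in the n attribute, on the assumption that <item> carries the global attributes.

```sgml
<!-- Labels kept as part of the document content. -->
<list>
<head>Materials</head>
<label>(a)</label><item>paper</item>
<label>(b)</label><item>ink</item>
</list>

<!-- Labels treated as rendition information and held in n. -->
<list>
<head>Materials</head>
<item n="(a)">paper</item>
<item n="(b)">ink</item>
</list>
```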
Figures are marked with the following tag, which enables a reference to a
stored image in another file:
- <figure>
- indicates the location of a graphic, illustration, or figure.
- entity
- names the external entity within which the graphic image of
the figure is stored.
The
<figure> element contains an optional <head> element
for the figure title or heading, followed by an optional sequence of paragraphs
for commentary or caption, an optional <figDesc> element,
and an optional <body> element for including the graphic
itself, where desired. The <figure> element can be empty, serving
only to mark the presence of a figure in the text.
- <figDesc>
- contains a brief prose description of the appearance or content of a
graphic figure, for use when documenting an image without displaying
it.
Note that in many instances, figures will not be retained at all in the
encoded version of the text. In this case, the <gap> element should be used to indicate the
omission.
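A figure that is retained might be encoded along these lines. The entity name fig1 and the text are invented; the sketch assumes that an external entity of that name has been declared for the stored image.

```sgml
<!-- Assumes an external entity named fig1 has been declared. -->
<figure entity=fig1>
<head>A hand press of the period</head>
<figDesc>A line drawing of a wooden printing press, seen from the
front.</figDesc>
</figure>
```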
Annotations and bibliographic citations or references are marked using the
following elements:
- <note>
- any form of note, usually a footnote. This tag marks only notes that
are a part of the original text, not notes that may be added by the encoder,
etc.
- place
- for a written text, specifies the location of an original
note in the source text.
- FOOT note at foot of page.
- END note at end of current division or
text.
- SIDE note in left or right margin.
- UNSPEC* placement unknown or
unspecified.
- <bibl>
- a loosely-structured bibliographic citation appearing within a corpus
text.
Original notes may contain paragraphs, s-units, dialogue, and any
other phrase-level element. The global n attribute can be used to
indicate the value of a numbered note.
Like captions, notes are often moved from their original location in the
original data and placed at another point so as not to
interrupt the normal flow of a text, by using the <ptr>
tag as follows
(see the section on Pointing and reference):
Here is a text, with a "1" at the end for a
footnote. [1].
<<Then, this note appears at
this point in the original.>>
But we would like to keep the text together.
This can be encoded as
<p>Here is a text.
<ptr target=N1 n=1 rend=bracketed>
But we would like to keep the text together.</p>
<note id=N1 place=foot>Then, this note appears at
this point in the original.</note>
Bibliographic citations or references within running texts are marked using the
<bibl> element, which can contain any phrase-level element plus
the <author> element.
The <table> element is used to include tables in the text. It
takes the attributes:
-
- rows
- indicates the number of rows in the table.
- cols
- indicates the number of columns in the table.
Note that in many instances, tables will not be retained at all in the
encoded version of the text. In this case, the <gap> element should be used to indicate the
omission.
The cesDoc DTD also includes tags for marking sub-paragraph-level elements.
Marking sub-paragraph elements is not required for Level 1 documents, but
some are required for Level 2 and Level 3 documents.
Certain phrase-level elements are commonly tagged in the early stages of the
markup process, since they are signalled by the typography in legacy data or in
printed versions serving as the copy. It is therefore desirable to provide some
guidance for the inclusion of sub-paragraph markup in Level 1 documents.
The phrase-level elements that are provided for in the cesDoc DTD are
selected on
the basis of their relevance for corpus-based work. There are five main
categories of phrase-level elements:
- elements of linguistic interest;
- elements indicating editorial changes to the original text;
- the <hi> element for marking typographically distinct words or
phrases, especially when the purpose of the highlighting is not yet
determined;
- elements for identifying s-units (typically orthographic sentences) and
quoted dialogue;
- elements for pointing and reference.
The cesDoc DTD imposes a relatively strict structure on sub-paragraph elements,
intended to disallow options and impose a structure which is most suited to the
needs of corpus-handling tools. Adherence to this structure for Level 1
documents is recommended, but not required.
There have been two main defining forces behind the choice of elements:
- the needs of corpus-annotation tools, such as morpho-syntactic taggers,
whose performance can often be improved by pre-identification of elements such
as names, addresses, title, dates, measures, foreign words and phrases, etc.
- the need to identify objects which have intrinsic linguistic interest, or
are often useful for the purposes of translation, text alignment, etc., such as
abbreviations, names, terms, linguistically distinct words and phrases,
etc.
The phrase-level elements identifying linguistically relevant elements are:
- <abbr>
- contains an abbreviation of any sort. Consult Handling
Punctuation for guidelines for encoding abbreviations.
- expan
- contains the expansion of the abbreviation
- <date>
- contains a date in any format.
- ISO8601
- ISO 8601 normalized form of the date
- <list>
- a collection of distinct items flagged as such by special layout in
written texts, often functioning as a single syntactic unit.
Note that <list> is the only phrase-level element which is also a paragraph-level element; its content model is exactly the same in both instances. For its full definition see section 4.5.8.5.
- <measure>
- contains a number, word, or phrase indicating a quantity.
- type
- the type attribute takes one of the following values:
- WEIGHT
- LENGTH
- COUNT
- AREA
- VOLUME
- CURRENCY
- TEMPERATURE
- value
- contains the ISO 4217 codes for currency representation when
the type attribute specifies currency.
- <name>
- contains a proper noun or noun phrase.
- type
- indicates the type of proper noun. Suggested values include:
- PERSON
- PLACE
- ORG
- LANGUAGE
See Encoding Names.
- <num>
- contains a number, written in any form.
- value
- contains the normalized value of the number.
- <term>
- contains a single-word, multi-word or symbolic designation which is
regarded as a technical term.
- <time>
- contains a phrase defining a time of day in any format.
- ISO8601
- ISO 8601 normalized form of the time.
- type
- the type attribute takes one of the following values:
- AM
- PM
- 24HOUR
- DESCRIPTIVE
- <distinct>
- identifies a word or phrase regarded as linguistically distinct (e.g.,
archaic, technical, dialect, etc.).
- <foreign>
- identifies a word or phrase as belonging to some language other than
that of the surrounding text. Use the global lang attribute to
indicate the language.
- <mentioned>
- marks words or phrases mentioned, not used.
- <title>
- contains the title of a work, whether article, book, journal, or
series, including any alternative titles or subtitles.
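By way of illustration, a sentence tagged with several of these elements and their attributes might look as follows. The names and text are invented; the attribute values follow the lists above, with USD as the ISO 4217 code for the United States dollar and la as the ISO 639 code for Latin.

```sgml
<p><name type=PERSON>Jane Smith</name> of
<name type=ORG>Acme Printing</name> signed the contract on
<date ISO8601="1996-12-05">5 December 1996</date> at
<time type=PM ISO8601="15:30">3:30 p.m.</time>, agreeing a fee of
<measure type=CURRENCY value=USD>200 dollars</measure> for what she
called an <foreign lang=la>ad hoc</foreign> arrangement.</p>
```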
The linguistic elements fall into two groups, which determine their content
models:
- elements which are, for many purposes of language engineering such as
morpho-syntactic tagging, regarded as individual tokens, even when they may
contain sub-constituents. In the CES this group includes names, dates,
times, measures, abbreviations, and terms. These elements therefore may
contain PCDATA. They may also contain the <abbr> and
<num> elements; abbreviations and numbers are frequently
identified and tagged automatically, and therefore their placement must be
relatively free. Note that to avoid unnecessary recursive nesting of
elements, the <abbr> element cannot contain another
<abbr> tag, and <num> cannot contain another
<num>.
This group of elements, which comprises the element class M.TOKEN, includes:
- <abbr>
- <num>
- <name>
- <date>
- <measure>
- <time>
- <term>
- elements which may contain sub-constituents which are treated by
corpus-analytic tools as tokens, or may be regarded as tokens in
themselves. Each of these elements can contain any other phrase-level
element, except itself (i.e., there is no recursive nesting of elements
allowed). It is assumed that tokenizing tools may further analyze the
content of these elements in order to identify constituent tokens where
they exist.
This group of elements includes the following elements:
- <title>
- <foreign>
- <mentioned>
- <distinct>
This latter group also includes another tag, the <hi> tag,
which is used to mark information which is rendered specially in some
original, but for which the function of the highlighting is either unknown
or unspecified. In later phases of up-translation when the function of the
highlighting is determined, <hi> tags are very often changed
to one of the other more descriptive tags in this group. See section 4.5.9.2., below, for a full discussion of the use of
the <hi> tag.
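The nesting permitted for the two groups can be sketched as follows, using invented text. Token-level elements contain text together with <abbr> and <num>, while an element of the second group, here <title>, contains another phrase-level element.

```sgml
<p><name type=PERSON><abbr expan="Doctor">Dr.</abbr> Smith</name> paid
<measure type=CURRENCY value=USD><num value=20>twenty</num>
dollars</measure> for a copy of
<title>The <foreign lang=la>Ars Poetica</foreign> in Translation</title>.</p>
```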
In general it is not desirable to mark typographic features of a given
printing of a text in texts designated for use in corpus-based research.
However, there are circumstances under which it is desirable to retain this
information. In particular, certain items of linguistic interest may be
marked by typography in the original; e.g., linguistic emphasis and foreign
words are often rendered in italics. In addition, some applications (e.g.,
machine translation which attempts to reproduce the format of the original)
demand retaining the rendition information.
In the process of up-translation from legacy data, a first step is often to
translate relevant typographic information into SGML, with no attempt to
interpret the significance of the rendering (e.g., that the italics signify
a foreign word). Interpretation is often too costly because it is ambiguous
(e.g., italics signify not only foreign words, but also emphasis, titles,
etc.). In such cases the
<hi>
element can be used. Normally, in later phases of up-translation,
<hi> tags are changed to more descriptive tags, such as
<title>,
<foreign>,
<mentioned>, or
<distinct>.
-
- <hi>
- marks a word or phrase as graphically distinct from the surrounding
text, for reasons concerning which no claim is made. The rend attribute
should provide the original rendition information when its function has not yet
been determined.
- rend
- describes the rendition or presentation of the
highlighted item.
- BO bold face
- BX boxed
- IT italic font
- RO roman font
- UL underlined
- CA capital letters
Note: Several values from the list may be specified where appropriate,
separated by spaces, e.g., "ro it".
When the <hi> tag is used, no claim about the reason is made.
This may be the case in a Level 1 encoding, since determining the reasons
for highlighting (e.g., presence of a foreign word, vs. emphasis, vs. a
title, etc.) demands human intervention and is therefore too costly in the
early stages of up-translation. Note that typographically highlighted
phrases and the kind of highlighting used may be recorded in one of two
ways:
- using the global rend attribute
- using the <hi> element with a rend attribute
The first method specifies an attribute on some element which contains
all of and only the highlighted phrase. In this case, the function
of the highlighting is clear (for example, to mark a heading), and the
boundaries of the highlighted phrase therefore coincide with the boundaries
of some other element. The rend attribute is given on the tag for
that element, for example
<head rend=bo>The world beyond</head>
The second method inserts a new tag indicating that what it contains is
highlighted. It is used
- when the function of the highlighting is not clear;
- where there is no tag identifying the feature concerned;
- where the highlighted phrase is not co-terminous with some other
element.
The rend attribute must be supplied on the <hi>
element. The rend attribute is optional on all other elements.
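The following sketch, with invented text, shows the <hi> element in running text at two stages of up-translation, followed by an example of multiple rend values. The three fragments are alternatives, not a single document.

```sgml
<!-- Early pass: the italics are recorded, their function is not. -->
<p>She had just finished <hi rend=IT>Middlemarch</hi>.</p>

<!-- Later pass: the highlighted phrase is identified as a title. -->
<p>She had just finished <title rend=IT>Middlemarch</title>.</p>

<!-- Several rendition values, separated by spaces. -->
<p>The notice read <hi rend="BO CA">no entry</hi>.</p>
```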
Note that in cases where the <hi> element often appears with
the same value for rend, a default value can be provided on the <tagUsage> element. When this
mechanism is used, the rend attribute need be given only when the
default does not apply to the given occurrence of the <hi>
element.
Both the start and end tag for any SGML element must be contained within
the start and end tag of any of its ancestors in the tree for that
document. Since by definition <hi> elements can appear only
within <p> elements, this means that where, for example, an
italicized passage contains more than one paragraph or starts within a
paragraph and spans one or more others, the <hi> element must
be closed at the end of the enclosing element, and then re-opened within
the next. For example, an italicized passage which crosses a
<p> boundary must be tagged as follows:
-
<p>This is the start of a paragraph which <hi
rend=it>switches to
italics here and then goes on for several paragraphs.</hi></p>
<p><hi rend=it>This second paragraph is all in
italics</hi></p>
<p><hi rend=it>This is the last bit of italics</hi> and
the rest is
in roman.</p>
That is, the <hi> element is closed before the end of the
first paragraph and re-opened at the start of the next. Note that the
following encoding is not acceptable:
-
<p>This is the start of a paragraph which <hi
rend=it>switches to
italics here and then goes on for several paragraphs.</hi></p>
<p rend=it>This second paragraph is all in italics</p>
<p><hi rend=it>This is the last bit of italics</hi> and
the rest is
in roman.</p>
This second encoding mixes different styles of marking the same feature for
a given span of text, which will cause problems for retrieval.
The following tags are used to mark editorial changes:
-
- <corr>
- contains the correct form of a passage apparently erroneous in the copy
text.
- sic
- gives the original form
- resp
- gives the name of the responsible editor
- cert
- used to indicate the degree of certainty with which the change has been
made.
- <gap>
- indicates a point where material has been omitted in a transcription,
whether for editorial sampling practice, or because the material is
illegible.
- desc
- describes the omitted text
- reason
- gives the reason for the omission (sampling, illegible, etc.)
- resp
- gives the name of the responsible editor
- cert
- used to indicate the degree of certainty with which the change has been
made.
Note that the <gap> element is useful for noting the omission
of material which is often uninteresting for corpus-based language
engineering applications, in particular, figures, tables, etc.
-
- <reg>
- contains text which has been regularized or normalized in some sense.
- orig
- gives the original form
- resp
- gives the name of the responsible editor
- cert
- used to indicate the degree of certainty with which the change has been
made.
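The editorial elements might be used as in the following sketch. The text and the editor's name are invented, the cert attribute is omitted because its permitted values are not listed here, and <gap> is assumed to be an empty element that takes no end tag.

```sgml
<p>The committee <corr sic="recieved" resp="J. Editor">received</corr>
the report and praised its
<reg orig="colour" resp="J. Editor">color</reg> plates.</p>
<p>The figures for each region were as follows.
<gap desc="table of regional figures" reason="sampling"></p>
```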
The segmentation of texts into s-units, or orthographic sentences, is
usually accomplished by special tools. The results of such segmentation are, in
the CES model, considered as a type of annotation and stored in a separate
file, which has advantages for ease of processing. However, in some cases it is
desirable to mark s-units and/or quoted dialogue in the primary data. We
therefore provide mechanisms for marking these elements.
In some cases only quoted dialogue is marked in the primary data, because the
identification of quoted dialogue can be accomplished automatically (by
detecting quotation marks etc.).
-
- <s>
- identifies an s-unit within a document, typically an orthographic
sentence.
- next
- gives the id reference of a subsequent <s> element which contains a continuation of the current sentence.
- prev
- gives the id reference of a previous <s> element which contains the beginning fragment of the current sentence.
- type
- indicates the type of sentence.
- broken
- indicates whether this <s> element is broken between two or more <s> elements (linked using the next and prev attributes).
- <q>
- contains quoted dialogue or other quoted material appearing inside a
paragraph.
- next
- gives the id reference of a subsequent <q> element which contains a continuation of the current quote.
- prev
- gives the id reference of a previous <q> element which contains the beginning fragment of the current quote.
- type
- indicates the type of quote.
- who
- indicates the speaker of the quote.
- broken
- indicates whether this <q> element is broken between two or more <q> elements (linked using the next and prev attributes).
When s-units are tagged, no
split should be made after a colon or semi-colon, even when it is followed by a
word beginning with a capital letter (unless there is an end-of-paragraph marker).
When both <s> and <q> are marked, the problem of
overlapping hierarchies can arise.
For this reason it has been necessary to allow for mutual recursive nesting of
<s> and <q> tags in the cesDoc DTD, a practice
which is otherwise avoided. This allows all the following encodings:
-
<s><q>Indeed yes,</q> she replied.</s>
<q rend="PRE lsquo POST rsquo"><s>I know precisely what you are
feeling.</s><s>I know all about your contempt, your hatred, your
disgust.</s><s>But don't worry, I am on your
side!</s></q><s>And then the flash of intelligence was
gone...
However, the CES recommends that the <p> - <s> - <q>
hierarchy be retained if possible--that is, the hierarchy of
<s> elements is treated as primary, and the hierarchy of
<q> elements is treated as secondary. In a case such as the
one above, this can be accomplished by breaking the quotes and using the
next and prev attributes together
with the global id attribute to associate the fragments, as follows:
-
<s><q id=q1 type=part next=q2>I know precisely what you are
feeling.</q></s> <s><q id=q2 type=part prev=q1
next=q3>I know all about your contempt, your hatred, your
disgust.</q></s><s><q id=q3 type=part prev=q2>But don't
worry, I am on your side!</q></s> <s>And then the flash of
intelligence was gone...
In the following case, this method solves the problem of overlapping
hierarchies:
-
<p><s>According to the visiting leader, the economy of the country is
<q id=q1 type=part next=q2>better than ever.</q></s> <q
id=q2 type=part prev=q1><s>It is in fact in very good
shape.</s></q></p>
NOTE: The strategy that retains the <p> - <s> -
<q>
hierarchy is required for Level 3 conformance.
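The way the next and prev attributes tie broken <q> fragments together can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the fragments are represented as plain dictionaries rather than parsed from SGML, and the function name follow_chain is hypothetical.

```python
def follow_chain(fragments, start_id):
    """Return fragment texts in reading order by following 'next' links.

    Each fragment is a dict with an 'id', a 'text', and optional
    'next'/'prev' keys mirroring the attributes of the same names.
    """
    by_id = {frag["id"]: frag for frag in fragments}
    texts, seen = [], set()
    current = start_id
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in next/prev chain at {current!r}")
        seen.add(current)
        frag = by_id[current]
        texts.append(frag["text"])
        nxt = frag.get("next")
        # A well-formed chain points back: the next fragment's prev is us.
        if nxt is not None and by_id[nxt].get("prev") != current:
            raise ValueError(f"{nxt!r} does not point back to {current!r}")
        current = nxt
    return texts


fragments = [
    {"id": "q1", "next": "q2", "text": "I know precisely what you are feeling."},
    {"id": "q2", "prev": "q1", "next": "q3",
     "text": "I know all about your contempt, your hatred, your disgust."},
    {"id": "q3", "prev": "q2", "text": "But don't worry, I am on your side!"},
]
full_quote = " ".join(follow_chain(fragments, "q1"))
```

Joining the fragments in chain order recovers the complete quotation even though each fragment sits inside its own <s> element.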
References in the text which refer to another part of it can be tagged
with
-
- <ref>
- a reference to another location in the current document, in terms of
one or more identifiable elements, possibly modified by additional text or
comment.
Attributes include the global attributes plus the following:
- corresp
- points to elements that correspond to the current element in some way.
- next
- gives the id reference of an element which contains a continuation of the current element.
- prev
- gives the id reference of an element which contains the previous portion of the current element.
- type
- indicates the type of pointer, e.g., aggregating, aligning, etc.
- resp
- specifies the creator of the pointer.
- crdate
- specifies when the pointer was created.
- targType
- indicates the type of data being linked, e.g., paragraph, sentence, etc.
- targOrder
- specifies whether the order in which the identifiers appear in the targets list is significant. Values:
- Y Yes: the order of the IDREFs specified as the value of the targets attribute should be followed when the elements are combined.
- N No: the order of the IDREFs specified as the value of the targets attribute has no significance.
- U* Unspecified: no claim is made about the order of the IDREFs specified as the value of the targets attribute.
- evaluate
- specifies the intended meaning when the target or targets are pointers themselves. Values:
- ALL if the element pointed to is itself a pointer, then the target of that pointer will be taken, and so on, until an element is found that is not a pointer.
- ONE if the element pointed to is itself a pointer, then its pointer (whether a target or not) is taken as the target of this pointer.
- NONE no further evaluation of targets is carried out beyond that needed to find the element specified in the pointer's target.
- target
- provides the IDs of two or more <xptr> elements that point to the locations of the elements to be associated.
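One possible reading of the three evaluate values can be sketched in Python. The sketch simplifies heavily: each pointer is assumed to have a single target, pointers are held in a dictionary mapping a pointer's id to the id it targets, and ONE is interpreted as a single level of dereferencing. The function name resolve and the ids used are hypothetical.

```python
def resolve(target_id, pointers, evaluate="NONE"):
    """Resolve target_id according to an 'evaluate' value.

    pointers maps the id of each pointer element to the id it targets;
    any id that is not a key of pointers is an ordinary element.
    """
    if evaluate == "NONE":
        # Stop at the element named in the target, pointer or not.
        return target_id
    if evaluate == "ONE":
        # Dereference once if the target is itself a pointer.
        return pointers.get(target_id, target_id)
    if evaluate == "ALL":
        # Keep dereferencing until a non-pointer element is reached.
        seen = set()
        while target_id in pointers:
            if target_id in seen:
                raise ValueError(f"pointer cycle at {target_id!r}")
            seen.add(target_id)
            target_id = pointers[target_id]
        return target_id
    raise ValueError(f"unknown evaluate value: {evaluate!r}")


# Hypothetical ids: ptr1 points to ptr2, which points to paragraph p5.
pointers = {"ptr1": "ptr2", "ptr2": "p5"}
```

With these data, NONE leaves "ptr1" as the target, ONE yields "ptr2", and ALL yields "p5".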
In some cases it is desirable to move an element to another
location in the encoded text. This is common for footnotes which occur in-line
in the electronic text, but which appear as footnotes, endnotes, etc. in a
printed version. It is also common for captions, figures, bibliographic
citations, and stage directions.
-
- <ptr>
- a pointer to another location in the current document in terms of one
or more identifiable elements.
Attributes include the global attributes plus the following:
- corresp
- points to elements that correspond to the current element in some way.
- next
- gives the id reference of an element which contains a continuation of the current element.
- prev
- gives the id reference of an element which contains the previous portion of the current element.
- type
- indicates the type of pointer, e.g., aggregating, aligning, etc.
- resp
- specifies the creator of the pointer.
- crdate
- specifies when the pointer was created.
- targType
- indicates the type of data being linked, e.g., paragraph, sentence, etc.
- targOrder
- specifies whether the order in which the identifiers appear in the targets list is significant. Values:
- Y Yes: the order of the IDREFs specified as the value of the targets attribute should be followed when the elements are combined.
- N No: the order of the IDREFs specified as the value of the targets attribute has no significance.
- U* Unspecified: no claim is made about the order of the IDREFs specified as the value of the targets attribute.
- evaluate
- specifies the intended meaning when the target or targets are pointers themselves. Values:
- ALL if the element pointed to is itself a pointer, then the target of that pointer will be taken, and so on, until an element is found that is not a pointer.
- ONE if the element pointed to is itself a pointer, then its pointer (whether a target or not) is taken as the target of this pointer.
- NONE no further evaluation of targets is carried out beyond that needed to find the element specified in the pointer's target.
- target
- provides the IDs of two or more <xptr> elements that point to the locations of the elements to be associated.
Examples:
-
Here is a text.
This caption appears at this point.
But we would like to keep the text together.
This can be encoded as
-
<p>Here is a text.
<ptr target=C1>
But we would like to keep the text together.</p>
<caption id=C1>This caption appears at this point.</caption>
The note in the following example originally appeared at the location of
the <ptr> tag:
-
<p>The <name type=org>Ministry of Truth</name>, &mdash;
<name type=org lang=ns>Minitrue</name>, in
<name>Newspeak</name><ptr target=N1 rend=asterisk>
&mdash; was startlingly different from any other object in
sight...</p>
<note place=foot id=N1><name>Newspeak</name> was the
official language of <name type=place>Oceania</name>. For an
account of its structure and etymology see Appendix.</note>
For purposes of alignment or other reference to elements within a text, a
reference system can be built up using the id attribute on appropriate
elements.
We recommend the following strategy:
- supply a unique identifying label in the id attribute of the
<body> tag
- for each nested division, give each unit an identifier which is built up
by successively adding to the identifier of the text; for example
<body id=ORW1>
<div type=part id=ORW1.1>
<div type=chapter id=ORW1.1.1>
<div type=section id=ORW1.1.1.1>
</div>
</div>
</div>
</body>
- for each paragraph, add another layer to the immediately superordinate
identifier, as follows:
<div type=section id=ORW1.1.1.1>
<p id=ORW1.1.1.1.p1></p>
<p id=ORW1.1.1.1.p2></p>
</div>
- for each s-unit, add another layer to the superordinate identifier on the
enclosing <p> element:
<div type=section id=ORW1.1.1.1>
<p id=ORW1.1.1.1.p1>
<s id=ORW1.1.1.1.p1.s1></s>
</p>
</div>
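The recommended identifier scheme amounts to appending one layer to the parent's identifier at each level. A minimal Python sketch, assuming the "p" and "s" prefixes shown in the examples above (the helper name child_id is hypothetical):

```python
def child_id(parent_id, index, prefix=""):
    """Build a child identifier by adding one layer to the parent's id."""
    return f"{parent_id}.{prefix}{index}"


text_id = "ORW1"                           # id on <body>
part_id = child_id(text_id, 1)             # first <div type=part>
chapter_id = child_id(part_id, 1)          # first <div type=chapter>
section_id = child_id(chapter_id, 1)       # first <div type=section>
para_id = child_id(section_id, 1, "p")     # first <p> in that section
sent_id = child_id(para_id, 1, "s")        # first <s> in that paragraph
```

Because every identifier contains the identifiers of its ancestors, the position of an element in the text can be read off its id without consulting the document.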
When a string of characters is tagged as a name, many corpus-handling
tools treat the string as a single token (e.g. some morpho-syntactic
taggers) and do not perform additional analysis.
For English, we can state the following rules:
- Titles such as "Mr." and role names such as "Secretary" are not considered
part of a person name:
-
Mme. <name>Edith Cresson</name>
(or: <abbr>Mme.</abbr> <name>Edith
Cresson</name>)
President <name>Boris Yeltsin</name>
- Appositives such as "Jr." are considered part of a person name:
-
<name>Sammy Davis, Jr.</name>
Where these rules can be used for encoding other languages they should be
followed.
In English the possessive is formed by the addition of "'s" which is
tokenized separately, and should not be encoded as a part of the name:
<name>Winston</name>'s
In English, adjectival forms such as "Estonian" should
not be tagged with the <name> tag. More generally, for any
language, only nouns or noun phrases should be marked as names.
Punctuation is normally considered to be a separate token, and should be
encoded outside the <name> tag. See the discussion in the next
section.
Examples:
-
Jaguar is made in <name type=place>Britain</name>.
<name type=place>France</name>-based
<name type=place>U.S.</name>-<name
type=place>Japan</name> trade negotiations
- Laws, diseases, prizes, etc. named after people or saints, etc. should not
be tagged with <name type=person>.
- Street addresses, street names, adjectival forms of place names should not
be tagged as <name type=place>.
Punctuation should be left as in the original text, except in the cases
noted below.
Note that punctuation and special characters are treated by many corpus-handling
tools as separate tokens. For example, a text such as
<q>Ignorance is strength.</q>
may be tokenized as
TOKEN Ignorance
TOKEN is
TOKEN strength
TOKEN .
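A tokenizer that behaves this way can be approximated in a few lines of Python. This is a deliberately naive sketch, not the behaviour of any particular corpus tool: it treats every punctuation character as its own token and therefore would also split abbreviations such as "U.S.", which real tools handle separately.

```python
import re

# A word is a run of word characters; any other non-space character
# (punctuation, special characters) becomes a token of its own.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split plain text into word tokens and punctuation tokens."""
    return TOKEN_RE.findall(text)


tokens = tokenize("Ignorance is strength.")
for token in tokens:
    print("TOKEN", token)
```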
Full stops and ellipses
The full stop should be kept as both a part of an abbreviation and as an
end-of-sentence indicator. The disambiguation of the two uses is
accomplished by the marking of abbreviations and/or s-units, when such
markup is provided.
Ellipses should be regularized so that the three periods are contiguous,
with no spaces in between.
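The regularization of ellipses could be automated roughly as follows. The Python sketch assumes that a spaced ellipsis consists of exactly three periods separated by spaces or tabs; other conventions (four periods, line breaks inside the ellipsis) would need additional rules.

```python
import re

# Three periods separated by one or more spaces or tabs.
SPACED_ELLIPSIS = re.compile(r"\.(?:[ \t]+\.){2}")


def regularize_ellipses(text):
    """Make the three periods of a spaced ellipsis contiguous."""
    return SPACED_ELLIPSIS.sub("...", text)
```

For instance, regularize_ellipses("intelligence was gone. . .") gives "intelligence was gone...", while text that contains only ordinary full stops is left unchanged.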
Full stops appearing as a part of abbreviations should not be separated from
the rest of the abbreviation string when the abbreviation is marked with
the <abbr> tag, even though the full stop may serve a double
function (i.e., also signal end-of-sentence).
Example:
I'm back in the U.S.
should be tagged as
I'm back in the <abbr>U.S.</abbr>
even though the period is both part of the abbreviation and a signal of
end-of-sentence.
Hyphens and dashes
Line-end (soft) hyphens should be removed where they are not part of the
regular spelling of the word. In cases of doubt, guidance should be
sought elsewhere in the same text or in dictionaries. If doubt still
remains, a hyphen should be retained rather than removed.
Dashes are marked by an entity reference (&mdash;). No
distinction should be made between different types of dashes.
Apostrophes
Apostrophes should be left as they are in the original text. Note that the
apostrophe can be ambiguous with the single quotation mark (e.g., in
English the possessive "Joneses'"). This may be disambiguated by the
marking of quotations.
Punctuation and tokens identified by the encoder
There is a small class of tags which mark the presence of tokens that have
been isolated and classified by the encoder. Among the elements included in the
cesDoc DTD, the following may be used to identify individual tokens:
<abbr>
<date>
<num>
<measure>
<name>
<term>
<time>
For many tools, when such an element is identified in the input stream, it
is not desirable to further tokenize the string inside the tag; rather, the
string inside the tag can be regarded as a single token (possibly with the type
indicated by the tag name). For example, in some languages it may be
possible for lexical lookup routines and morpho-syntactic taggers to
assume that an element with the tag <name> is a single token with
the grammatical category PROPER NOUN (Np). For example,
<name type=person>Big Brother</name>
can be tokenized as
TOKEN(name) Big Brother
Similarly, the string
<date>April 4th, 1984</date>
can be tokenized as
TOKEN(date) April 4th, 1984
Therefore, punctuation that is not a part of an identified token should not
appear
within the tag (except abbreviations--see above). For example, the text
The
Ministry of Love, which maintained law and order.
should be encoded as
-
The <name type=org>Ministry of Love</name>, which
maintained law and order.
Other examples:
-
<name type=org>Jaguar</name> company in <name
type=place>Britain</name>.
...he had been born in <date>1944</date> or
<date>1945</date>; but it...
...the three slogans of the <name
type=org>Party</name>:...
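The treatment of encoder-identified elements as single typed tokens can be illustrated with a small Python sketch. It is an approximation under stated assumptions: the listed elements are not nested inside one another, all other markup is simply skipped, and the regular expression stands in for a real SGML parser. The function name tokenize and the tuple representation are hypothetical.

```python
import re

# Elements whose content is kept together as one typed token.
SINGLE_TOKEN_TAGS = ("abbr", "date", "num", "measure", "name", "term", "time")

_PATTERN = re.compile(
    r"<(?P<tag>%s)\b[^>]*>(?P<body>.*?)</(?P=tag)>"  # a whole tagged element
    r"|(?P<markup><[^>]+>)"                          # any other tag: skipped
    r"|(?P<word>\w+)"                                # ordinary word
    r"|(?P<punct>[^\w\s])"                           # punctuation character
    % "|".join(SINGLE_TOKEN_TAGS),
    re.DOTALL,
)


def tokenize(text):
    """Return (type, string) pairs; type is '' for untyped tokens."""
    tokens = []
    for match in _PATTERN.finditer(text):
        if match.group("tag"):
            # Collapse line breaks inside the element to single spaces.
            body = " ".join(match.group("body").split())
            tokens.append((match.group("tag"), body))
        elif match.group("markup"):
            continue
        else:
            tokens.append(("", match.group(0)))
    return tokens


sample = "The <name type=org>Ministry of Love</name>, which maintained law and order."
tokens = tokenize(sample)
```

Because the comma sits outside the <name> element, it surfaces as a separate punctuation token, which is the reason for keeping such punctuation outside the tag.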
When the <q> or <quote> tag is used, any quotation marks
or other typographical device for indicating quoted dialogue should be
removed from the text. The rend attribute can be used to indicate the
means by which the quotation was originally marked in the text (this is
not required). In these cases, the value of the rend attribute should be
one of the following, which are consistent with entity names in ISOpub
and ISOnum:
-
laquo angle quotation mark, left
raquo angle quotation mark, right
lsquo single quotation mark, left
rsquo single quotation mark, right
ldquo double quotation mark, left
rdquo double quotation mark, right
lsquor rising single quote, left (low)
ldquor rising dbl quote, left (low)
rdquor rising dbl quote, right (high)
rsquor rising single quote, right (high)
mdash dash the width of lowercase m
Note that it is required to eliminate quotation marks etc. marking a
quotation for Level 2 and 3 conformant encodings, since the rendition
conventions for dialogue are language-specific and therefore not a part of
the "content" proper.
In principle, encode punctuation as inside or outside the <q>
tag according to the position of the quotation marks in the original, as in
these examples:
- ('dealing on the free market', it was called)
(<q rend="PRE lsquo POST rsquo">dealing on the free
market</q>, it was called)
-
- The dark-haired girl behind Winston had begun crying out `Swine!
Swine! Swine!'
The dark-haired girl behind <name
type=person>Winston</name> had begun crying out <q rend="PRE lsquo
POST rsquo">Swine! Swine! Swine!</q>
- 'I am with you,' O'Brien seemed to be saying to him.
<q rend="PRE lsquo POST rsquo">I am with you,</q> <name
type=person>O'Brien</name> seemed to be saying to him.
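Replacing quotation marks by a <q> element that records them in rend can be sketched in Python. The sketch handles only typographic double quotation marks, on the assumption that they are unambiguous in the input; single quotation marks are left alone because they can be confused with apostrophes. Nested quotations are not handled, and the function name mark_quotes is hypothetical.

```python
import re

# Text between a left and a right double quotation mark (U+201C, U+201D).
_DOUBLE_QUOTED = re.compile("\u201c([^\u201c\u201d]*)\u201d")


def mark_quotes(text):
    """Replace double quotation marks by a <q> element recording them."""
    return _DOUBLE_QUOTED.sub(r'<q rend="PRE ldquo POST rdquo">\1</q>', text)


marked = mark_quotes("\u201cI am with you,\u201d he seemed to be saying.")
```

Punctuation keeps the position it had relative to the quotation marks: the comma above ends up inside the <q> element because it was inside the quotation marks in the input.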
In cases where the <q> tag is used for text that is not
enclosed in quotation marks in the original, leave punctuation that is not a
part of the actual cited text outside the <q> tags:
- BIG BROTHER IS WATCHING YOU, the caption beneath it ran.
<q rend=ca type=slogan><name type=person>Big
Brother</name> is watching you</q>, the caption beneath it ran.
- Never mind, it doesn't matter, he thought. ["Never mind, it doesn't
matter" in italics]
<q rend=it>Never mind, it doesn't matter</q>, he
thought.
- Eureka! he shouted. ["Eureka!" in italics]
<q rend=it>Eureka!</q> he
shouted.
Note, however, that the tokenization of the text should not be affected by
the position of the punctuation relative to the closing tag; the same set of
tokens is ultimately generated in either case.
Sentence terminating punctuation should always appear within an enclosing
set of <s> and </s> tags:
- <s><q rend=it>Eureka!</q> he
shouted.</s>
- <s>The dark-haired girl behind <name
type=person>Winston</name> had begun crying out <q rend="PRE
lsquo POST rsquo">Swine! Swine!
Swine!</q></s>
Because tokenizers typically treat text within tags such as
<hi> and <foreign> as ordinary running text, punctuation can appear either
inside or outside the closing tag without effect. Therefore, given this
text:
She ordered a croque monsieur. ["croque monsieur" in italics]
either of the two following encodings is acceptable:
She ordered a <foreign rend="it">croque
monsieur</foreign>.
She ordered a <foreign rend="it">croque
monsieur.</foreign>
The CES recommends that linguistic annotation be encoded in a separate SGML
document with its own DTD, which is linked to the primary data (see section 5). However, we recognize that for some
applications it is still desirable to retain morpho-syntactic annotation in
the same SGML document as the primary data. Therefore, the CES provides means
to accomplish this in-file tagging.
To implement it, a pre-defined
module containing all the required definitions for the morpho-syntactic
information must be brought in at the beginning of the document.
Using this method, the CES provides two different types of in-file annotation:
- annotation using the same elements for marking tokens and associated base forms, part-of-speech tags, morphosyntactic descriptions, etc. as defined in section 5.2.4 and section 5.2.5, describing the cesAna DTD.
- annotation using TEI-like elements for marking linguistic segments, such as <w> (word), <phr> (phrase), etc. With this option morphosyntactic tagging is accomplished using attributes on the <w> element.
The full description of elements for this kind of encoding is provided in section 5.2.4 and section 5.2.5. They include:
- <tok>
- contains a token, consisting of its orthographic form in the original
document, followed optionally by disambiguated corpus tag and/or one or
more alternative sets of morphosyntactic information associated with the
token.
- <orth>
- contains the orthographic form of the token as it appears in the
original.
- <disamb>
- contains one or more disambiguated corpus tags associated with the token.
- <lex>
- contains one or more alternative sets of morphosyntactic information
associated with the token.
- <base>
- the base or lemmatized form for the morphosyntactic information given
in the associated <msd> element.
- <msd>
- the morphosyntactic description, specified in EAGLES-compliant format.
- <ctag>
- contains the corpus tag associated with the morphosyntactic information.
To enable the inclusion of these elements and provide for their appearance
in appropriate locations, it is necessary to include the following in the
SGML document in which they are used:
<!doctype cesDoc PUBLIC "-//CES//DTD//cesDoc//EN"
[
<!ENTITY % token.elt PUBLIC "-//CES//DTD//ENTITIES Token//EN">
%token.elt;
]>
<cesdoc version="4.1">
...
The definitions between the "[" and "]" bring in all the required
additional elements and modify the definition of M.TOKEN to consist only of the element <tok>, thus replacing the definition in the main cesDoc DTD. This results in the modification of the content model
PHRASE.SEQ to consist of a series of <tok> elements, possibly intermixed with the
elements <foreign>, <title>,
<distinct>, <mentioned>, and <hi>.
As a result, any element whose content model is
%PHRASE.SEQ now may contain a series of tokens
rather than elements such as <abbr>, <num>,
<date>, etc. The elements <foreign>,
<title>, <distinct>, <mentioned>,
and <hi> may be interspersed with tokens, when they
themselves contain tokens. For example, the multi-word highlighted phrase
("red house") in this sentence could be encoded as
<tok>
<orth>He</orth>
<ctag>NP</ctag>
</tok>
<tok>
<orth>bought</orth>
<ctag>VB</ctag>
</tok>
<tok>
<orth>a</orth>
<ctag>DT</ctag>
</tok>
<hi rend=it>
<tok>
<orth>red</orth>
<ctag>AD</ctag>
</tok>
<tok>
<orth>house</orth>
<ctag>NN</ctag>
</tok>
</hi>
...
This strategy is intended to reflect the fact that most morphosyntactic
tagging systems have specific tags for abbreviations, names, etc., and that
therefore primary data which includes morpho-syntactic tagging would not
include explicit CES tags for such elements, but would instead associate
tokens with appropriate morpho-syntactic tags using the <tok>
element. However, the interaction between the phrasal CES tags in the
standard cesDoc DTD and morpho-syntactic tags in annotated primary data
needs more consideration to find the optimal system. As it stands, the
proposed system should allow for most possibilities we currently envision;
we welcome user input on this matter.
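Markup of this kind is normally produced by a tagger rather than by hand. As a minimal Python sketch, assuming the tagger delivers (orthographic form, corpus tag) pairs, the <tok> encoding could be generated as follows; the function names tok and encode are hypothetical.

```python
from xml.sax.saxutils import escape


def tok(orth, ctag):
    """Return one <tok> element holding an <orth> and a <ctag>."""
    return (
        "<tok>\n"
        f"<orth>{escape(orth)}</orth>\n"
        f"<ctag>{escape(ctag)}</ctag>\n"
        "</tok>"
    )


def encode(tagged_words):
    """Encode a sequence of (orth, ctag) pairs as consecutive <tok> elements."""
    return "\n".join(tok(orth, ctag) for orth, ctag in tagged_words)


encoded = encode([("He", "NP"), ("bought", "VB"), ("a", "DT")])
```

The escape call replaces the characters &, < and > so that token text cannot be mistaken for markup.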
This option works exactly as the one above, except that it brings in a different set of elements:
- <cl>
- represents a grammatical clause. Attributes include:
- type
- indicates the type of clause.
- <phr>
- represents a grammatical phrase. Attributes include:
- type
- indicates the type of phrase.
- <w>
- represents a grammatical (not necessarily orthographic) word. Attributes include:
- type
- indicates the type of word.
- base
- identifies the word's lemma.
- ctag
- provides part-of-speech information in the form of a corpus tag.
- <m>
- represents a grammatical morpheme. Attributes include:
- type
- indicates the type of morpheme.
- <c>
- represents a character. Attributes include:
- type
- indicates the type of character.
To use this option, include the following at the beginning of the document:
<!doctype cesDoc PUBLIC "-//CES//DTD//cesDoc//EN"
[
<!ENTITY % word.elt PUBLIC "-//CES//ENTITIES Word//EN">
%word.elt;
]>
<cesdoc version="4.1">
...
In this case, the definitions between the "[" and "]" bring in the
additional linguistic segmentation elements and modify the definition of M.TOKEN to consist only of these elements, resulting in the modification of the content model
PHRASE.SEQ to consist of a series of any of the linguistic segmentation elements, possibly intermixed with the
elements <foreign>, <title>,
<distinct>, <mentioned>, and <hi>.
As a result, any element whose content model is
%PHRASE.SEQ now may contain a series of any or all of the linguistic segmentation elements, possibly interspersed with <foreign>,
<title>, <distinct>, <mentioned>,
and <hi>.
Using this option, a very concise representation of word/part-of-speech annotation can be obtained. Consider the following alternative encoding of the above example:
<w base=he ctag=NP>He</w>
<w base=buy ctag=VB>bought</w>
<w base=a ctag=DT>a</w>
<hi rend=it>
<w base=red ctag=AD>red</w>
<w base=house ctag=NN>house</w>
</hi>
...
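The <w> encoding can be generated in the same spirit. The Python sketch below assumes (orthographic form, base form, corpus tag) triples as input; unlike the example above it quotes the attribute values, which is equally valid SGML. The function name w is hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def w(orth, base, ctag):
    """Return a <w> element carrying base and ctag as attributes."""
    return f"<w base={quoteattr(base)} ctag={quoteattr(ctag)}>{escape(orth)}</w>"


words = [("He", "he", "NP"), ("bought", "buy", "VB"), ("a", "a", "DT")]
encoded = "\n".join(w(orth, base, ctag) for orth, base, ctag in words)
```

Each word occupies a single line, compared with the four lines needed per token in the <tok> encoding.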
The cesDoc DTD
The cesDoc DTD in
hypertext navigable format
The cesDoc DTD instantiated as a
TEI customization