Document CES 1. Part 5. Version 1.5. Last modified 18 August 1996.

Part 5

Encoding Linguistic Annotation

5.0. Overview
5.1. Locators
5.2. Encoding conventions for segmentation and grammatical annotation
5.3. Encoding conventions for parallel text alignment

5.0. Overview

The classical view of a document prepared for use in corpus-based research is one in which annotation is added incrementally to the original as it is generated. The CES adopts a strategy whereby annotation information is not merged with the original, but rather retained in separate SGML documents (with different DTDs) and linked to the original or other annotation documents. The separation of original data and annotation is consistent with other data architecture models, such as the TIPSTER model.

Linkage between original and annotation documents is accomplished using the HyTime-based TEI addressing mechanisms for element linkage.

The separate markup strategy is in essence a finely linked hypertext format where the links signify a semantic role rather than navigational options. That is, the links signify the locations where markup contained in a given annotation document would appear in the document to which it is linked. As such the annotation information comprises remote markup which is virtually added to the document to which it is linked. In principle, the two documents could be merged to form a single document containing all the markup in each. This approach has several advantages for corpus-based research:

it avoids the creation of potentially unwieldy documents--envision, in a worst case, a single document containing segmentation and part of speech markup, plus markup for alignment with translations in the eleven EU languages, plus alignment with the speech recording, plus variant part of speech taggings from several taggers, etc.!
the original or hub document remains stable and is not modified by any process which may add annotation.
it avoids problems with markup containing overlapping hierarchies, which are not allowable in SGML.
different versions of the same kind of annotation (e.g., different POS annotation) can be associated with the text.
it is very much in line with what is evolving in the SGML world, and it is likely that SGML/HyTime software which handles complex linkages will be available in the near future.
annotation can be accomplished by associating the SGML original or other annotation documents with other, pre-existing documents--e.g., instead of generating a document containing POS markup and linking it to the original, links could be made directly with lexicon entries.
it gives easy access to the original SGML document (or, as mentioned above, any among several versions annotated for certain features) for use by other applications.

The hyper-document comprising each text in the corpus and its annotations will consist of several documents. The base or "hub" document is the unannotated document containing only primary data markup. The hub document is "read only" and is not modified in the annotation process. Each annotation document is a proper SGML document with a DTD, containing annotation information linked to its appropriate location in the hub document or another annotation document.

All annotation documents are linked to the SGML original (containing the primary data) or other annotation documents using one-way links. The exception is output of the aligner for parallel texts, which will consist of an SGML document containing only two-way links associating locations in two documents in different languages. The two linked documents are two documents containing the relevant structural information, such as sentence or word boundaries. The overall architecture is described by the figure below.

Following this model, the CES provides DTDs for the different types of annotation information, described below.

5.1. Locators

There are several different means to specify locations in an SGML document. The most common means is to reference a unique identifier, specified in the ID attribute on the target element. However, it is not always possible to use this method, because:

IDs may not exist on the target elements. Adding IDs may be prohibitively expensive or impossible if, for example, a reference is pointing into a "read-only" document, such as the SGML original in the architecture described in section 5.0.
The referenced location does not comprise the entire content of an SGML element--for example, a sub-string of characters inside a <p> element.

Both the TEI and HyTime provide means to handle situations where ID references cannot be used. TEI locators have the advantage of being more compact than HyTime location ladders. Additionally, the TEI notation is easily made compatible with HyTime by the use of the HyTime notloc form, in conjunction with an appropriate notation declaration. It is therefore recommended that TEI locators be used in general. See TEI P3, chapter 14, "Linking, Segmentation, and Alignment", for a complete description and explanation of TEI locators.

Within the CES we have developed a more precise and concise notation for locators, which uses locations in the SGML tree to point to specific elements (e.g., the third child of the second child of the first child of the root). Annotation documents are linked to the hub or to another annotation document via one-way links, specified in two steps:

tree addressing to point to the nearest enclosing tag for the location in question;
reference to the position of a specific character from the location given in the first step.

As an example of the CES addressing method, consider the following text:

       <p>
       L'usine, qui devrait être implantée à Eloyes (Vosges) représente 
       un investissement d'environ 3,7 milliards de yens. Elle fabriquera 
       des pièces détachées pour la filiale de Minolta en RFA.
       </p>

The following is the TEI mechanism for pointing to the first two words inside the <p> element:

       <tok from="CHILD (2) (1) (1) (1) (2) (1) STRLOC (1)" 
            to="CHILD (2) (1) (1) (1) (2) (1) STRLOC (2)">
       <tok from="CHILD (2) (1) (1) (1) (2) (1) STRLOC (3)" 
            to="CHILD (2) (1) (1) (1) (2) (1) STRLOC (7)">

The TEI notation is given in two parts:

CHILD (2) (1) (1) (1) (2) (1) starts by default at the root of the ESIS (Element Structure Information Set) tree, which corresponds to the output of the SGML parser, and descends, taking at each node the designated child (i.e., the second child of the root, the first child of this node, etc.).
STRLOC (n) gives a character offset within the element referenced above.

This notation is far more compact than the equivalent HyTime notation, but it is still bulkier than is often practical in annotation documents containing potentially thousands of locators. Because it is very common to refer to an element by indicating its position in the ESIS tree relative to the root, and (optionally) to then access a particular character within that element, the CES has developed a concise notation for this purpose: ESIS (2.1.1.1.2.1\1) is equivalent to CHILD (2) (1) (1) (1) (2) (1) STRLOC (1). In attribute values the locator is written on its own, so the example above becomes:

       <tok from="2.1.1.1.2.1\1" to="2.1.1.1.2.1\2">
       <tok from="2.1.1.1.2.1\3" to="2.1.1.1.2.1\7">

The locator 2.1.1.1.2.1\1 is exactly equivalent to the TEI notation CHILD (2) (1) (1) (1) (2) (1) STRLOC (1).
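
The equivalence between the two notations is mechanical, and can be illustrated with a short Python sketch. The function name ces_to_tei is hypothetical and is not part of the CES; the sketch assumes only the syntax shown in the examples above (dot-separated child numbers, optionally followed by a backslash and a character offset).

```python
def ces_to_tei(locator: str) -> str:
    """Expand a CES locator such as 2.1.1.1.2.1\\1 into TEI CHILD/STRLOC form.

    Hypothetical helper: it assumes the locator syntax shown in the examples
    above and nothing more.
    """
    # Split the tree path from the optional character offset.
    path, sep, offset = locator.partition("\\")
    steps = path.split(".")
    if not all(step.isdigit() for step in steps):
        raise ValueError(f"malformed tree path in locator: {locator!r}")
    tei = "CHILD " + " ".join(f"({step})" for step in steps)
    if sep:
        if not offset.isdigit():
            raise ValueError(f"malformed character offset in locator: {locator!r}")
        tei += f" STRLOC ({offset})"
    return tei


print(ces_to_tei(r"2.1.1.1.2.1\1"))  # CHILD (2) (1) (1) (1) (2) (1) STRLOC (1)
print(ces_to_tei("2.1.1.1.2.1"))     # CHILD (2) (1) (1) (1) (2) (1)
```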

In the example above, locators are used on the from and to attributes on the <tok> element to reference strings of characters to be considered as single tokens, thus accomplishing the addition of remote markup (i.e., the addition of the <tok> elements) to the referenced document. In alignment documents, specification of a character offset is often not required for alignment, which is typically between the entire content of SGML elements (sentences, paragraphs, tokens) in the aligned documents. The same notation can be used in such instances, omitting the character offset:

       <link fromLoc="2.1.1.1.2.1" toLoc="2.1.1.1.2.1">
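
The two-step addressing described in this section (tree addressing, then a character position) can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the Node class is a toy stand-in for an ESIS tree, the three-element document and the short path 2.1 are invented for the example, character offsets are taken to be 1-based and inclusive (as the <tok> example above suggests), and whitespace in the paragraph is assumed to have been normalized.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class Node:
    """Toy stand-in for an ESIS tree node (hypothetical, for illustration)."""
    name: str
    children: List[Union["Node", str]] = field(default_factory=list)


def parse_locator(locator: str) -> Tuple[List[int], Optional[int]]:
    """Split a CES locator into child steps and an optional character offset."""
    path, sep, offset = locator.partition("\\")
    return [int(step) for step in path.split(".")], (int(offset) if sep else None)


def descend(root: Node, steps: List[int]) -> Node:
    """Step 1: tree addressing, taking the designated (1-based) child at each node."""
    node = root
    for step in steps:
        node = node.children[step - 1]
    return node


def text_of(node: Union[Node, str]) -> str:
    """Concatenate the character data below a node."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node.children)


def span(root: Node, from_loc: str, to_loc: str) -> str:
    """Step 2: select characters from..to (assumed 1-based, inclusive)."""
    steps, start = parse_locator(from_loc)
    _, end = parse_locator(to_loc)
    text = text_of(descend(root, steps))
    if start is None or end is None:
        return text  # no offset given: the entire content of the element
    return text[start - 1:end]


# Invented miniature document: the <p> is the first child of the second child.
doc = Node("doc", [
    Node("header"),
    Node("body", [Node("p", ["L'usine, qui devrait être implantée à Eloyes (Vosges)"])]),
])

print(span(doc, r"2.1\1", r"2.1\2"))  # L'
print(span(doc, r"2.1\3", r"2.1\7"))  # usine
```

Whether character data counts as a child node when steps are numbered depends on the parser output being addressed; the toy tree sidesteps the question by addressing only elements.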

5.2. Encoding conventions for segmentation and grammatical annotation

The cesAna DTD is used for segmentation and grammatical annotation, including:

sentence boundary markup
tokens, each of which consists of the following:
- the orthographic form of the token as it appears in the corpus
- grammatical annotation, comprising one or more sets of the following:
  - the base form (lemma)
  - a morpho-syntactic specification, in the EAGLES annotation style
  - a corpus tag

Allowing more than one possible set of grammatical annotation enables representing data for which lexical lookup or some other morphosyntactic analysis has been performed, but which has not been disambiguated. When disambiguation has been accomplished, an optional element can be included containing the disambiguated form.

The structure of the DTD constituents is based on the overall principle that one or more "chunks" of a text may be included in the annotation document. These chunks may correspond to parts of the document extracted at different times for annotation, or simply to some subset of the text that has been extracted for analysis. For example, it is likely that within any text, only the paragraph content will undergo morphosyntactic analysis, and titles, footnotes, captions, long quotations, etc. will be omitted or analysed separately.

Elements in cesAna documents will, for the most part, use the notation outlined in the section above on Locators to reference locations in the document which is being annotated, since the identification of sentence boundaries, token boundaries, etc. typically involves pointing to the start and end points of sequences of characters which are not the entire content of an SGML element.

5.2.1. Global attributes

Five global attributes are defined in the cesAna DTD:

id

a unique identifier for the element bearing the ID value.

n

a number or other label for the element, not necessarily unique within the document.

type

provides more precise information about the element's function or role.

lang

indicates that the tag's content is in the specified language. The value of the lang attribute should be the same as that appearing on a <language> element in the header document which describes that language, and is composed of one of the following:

a two-letter code from ISO 639 (e.g., "en" for English);
a three-letter code from ISO 639-2 (e.g., "eng" for English);
one of the above extended by a country code from ISO 3166 (e.g., "en.uk" or "eng.uk" for English as spoken in the United Kingdom).
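
As an informal illustration, the shape of such language values can be checked with a regular expression. The Python sketch below is an assumption-laden convenience, not part of the CES: it tests only the form of the value (two or three letters, optionally followed by a period and a two-letter country code, all lower case as in the examples) and does not verify that the codes are actually registered in ISO 639, ISO 639-2 or ISO 3166.

```python
import re

# Shape only: "en", "eng", "en.uk", "eng.uk". Lower case is assumed because
# the examples above use it; registration of the codes is not checked.
LANG_SHAPE = re.compile(r"[a-z]{2,3}(\.[a-z]{2})?")


def has_lang_shape(value: str) -> bool:
    """Return True if value looks like a language code of the form described."""
    return LANG_SHAPE.fullmatch(value) is not None


for candidate in ("en", "eng", "en.uk", "eng.uk", "english", "en-uk"):
    print(candidate, has_lang_shape(candidate))
```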

wsd

indicates that the tag's content is encoded in the specified character set. The value of the attribute is the character set name (ISO-8859-1, etc.) which should be the same as that appearing on a <writingSystem> element in the header document which describes that character set.

The global attributes are defined at the top of the cesAna DTD and represented by an entity, A.ANA. This entity is used to represent the list of global attributes on the attribute declarations for most elements in the document.

5.2.2. Content models represented by entities in the cesAna DTD

One content model is defined by an entity in the cesAna DTD, which describes a sequence of elements used for morpho-syntactic description of tokens:

LEX.SEQ: the base content model for elements used for morphosyntactic description, including one or more <base> elements (used to provide root or base form of a token), followed by one or more <ctag> elements (used to provide the corpus tag associated with the token), followed by one or more <msd> elements (used to provide more complete morpho-syntactic description of the token, such as information which might appear in a lexicon).

5.2.3. Top-level Constituents

The top level structure of the cesAna DTD is as follows:

<cesAna>

a single annotation document, containing a <cesHeader> element, followed by a <chunkList> element. In addition to the global attributes, this element has the following attribute:

version: provides the version of the cesAna DTD to which this document is compliant.

The type attribute should normally be specified on the <cesAna> element, in order to specify the type of annotation contained in the document. Suggested values for the type attribute on the <cesAna> element include:

SENT contains segmentation for orthographic sentences

TOK contains tokenized text

LEX contains morphosyntactic information for tokens

DISAMB contains disambiguated morphosyntactic information

Note that when the document contains more than one type of annotation, a series of values in quotation marks can be given for the attribute, e.g., type="SENT TOK".

<cesHeader>

contains the header for this document. See part 1.3 for a full description.

Note: the cesHeader is optional in the cesAna DTD, for convenience during processing. However, for data conforming to this DTD which is in a final form or which may be interchanged, the cesHeader is required.

<chunkList>

contains one or more "chunks" of annotation.

5.2.4. Chunks

<chunk>

contains a series of sentences (marked with <s> tags), a series of tokens (marked with <tok> tags), a series of paragraph-like elements (marked with <par> tags), or "plain text" data (PCDATA), which is marked with <data> tags (see below). Attributes include:

doc: provides the name and/or location (URL, path/filename, etc.) of the document to which this chunk is linked.
from: provides, using the notation outlined in the section above on Locators, the starting location of the chunk in the original document.
to: provides, using the notation outlined in the section above on Locators, the ending location of the chunk in the original document. This is optional if it can be computed from the data.
domains: optionally specifies the identifier of the element in the original document within which all locations within this chunk lie, when such an ID exists.

When it appears on the <chunk> element, the type attribute can be used to indicate the type of information with which the chunk is associated, e.g., paragraph data, titles, etc. This is useful when specific portions of a text have been extracted for analysis.

5.2.5. Chunk constituents

<par>

marks paragraph boundaries. Contains a series of <tok> elements, a series of <s> elements, or a series of <data> elements--or any inter-mixture of these elements. Note that this tag may be used to mark boundaries of any paragraph-like element, such as a quote, note, etc. In this way, the user can extract a set of paragraph-like elements, all of which are to receive the same treatment, and they will be marked in the same way in the annotation document, which may be useful for alignment, etc. The information about the original tagging can be retained in the type attribute. Attributes:

from: provides, using the notation outlined in the section above on Locators, the starting location of the corresponding element in the original document.
to: provides, using the notation outlined in the section above on Locators, the ending location of the corresponding element in the original document. This is optional if it can be computed from the data.

<s>

Marks sentence boundaries. Contains a series of tokens or <data> elements; nested sentences may also appear. This element may also contain PCDATA in cases where tokens have not been marked. Attributes:

from: provides, using the notation outlined in the section above on Locators, the starting location of the sentence in the original document.
to: provides, using the notation outlined in the section above on Locators, the ending location of the sentence in the original document. This is optional if it can be computed from the data.
next: gives the id reference of a subsequent <s> element which contains a continuation of the current sentence.
prev: gives the id reference of a previous <s> element which contains the beginning fragment of the current sentence.
broken: indicates whether this <s> element is broken between two or more <s> elements (linked using the next and prev attributes).
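
The use of next and prev to chain the fragments of a broken sentence can be sketched in Python as follows. The SFragment record, the identifiers s1 to s3 and the sample text are invented for the illustration; joining the fragments with a single space is likewise an assumption, since the CES description above does not say how fragments are to be recombined.

```python
from typing import Dict, NamedTuple, Optional


class SFragment(NamedTuple):
    """Hypothetical record for one <s> element: its text and its id references."""
    text: str
    prev: Optional[str] = None
    next: Optional[str] = None


def reassemble(fragments: Dict[str, SFragment], start_id: str) -> str:
    """Follow prev back to the first fragment, then next forward to the last."""
    seen = set()
    current = start_id
    while fragments[current].prev is not None:
        if current in seen:
            raise ValueError("cycle in prev references")
        seen.add(current)
        current = fragments[current].prev

    parts = []
    seen = set()
    cursor: Optional[str] = current
    while cursor is not None:
        if cursor in seen:
            raise ValueError("cycle in next references")
        seen.add(cursor)
        parts.append(fragments[cursor].text)
        cursor = fragments[cursor].next
    return " ".join(parts)  # joining with a space is an assumption


sentences = {
    "s1": SFragment("He said,", next="s3"),
    "s2": SFragment("(a parenthetical sentence)"),
    "s3": SFragment("that it was late.", prev="s1"),
}

print(reassemble(sentences, "s3"))  # He said, that it was late.
```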

<tok>

contains a token, consisting of its orthographic form in the original document, optionally followed by a disambiguated corpus tag and/or one or more alternative sets of morphosyntactic information associated with the token. Attributes:

from: provides, using the notation outlined in the section above on Locators, the starting location of the token in the original document.
to: provides, using the notation outlined in the section above on Locators, the ending location of the token in the original document. This is optional if it can be computed from the data.

Note that the type attribute is used on the <tok> element to provide the type or class of the token (e.g., name, date, abbr, etc.).
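
One way in which an omitted to value might be computed from the data is sketched below in Python. The helper implied_to is hypothetical. It assumes that offsets are 1-based and inclusive, that they count characters rather than bytes, and that the content of <orth> is identical to the string in the original document, which need not hold when the orthographic form has been modified by processing.

```python
def implied_to(from_loc: str, orth: str) -> str:
    """Derive the ending locator of a token from its start and its length.

    Hypothetical helper: assumes 1-based inclusive character offsets and an
    <orth> string identical to the text in the original document.
    """
    path, sep, offset = from_loc.partition("\\")
    if not sep:
        raise ValueError("locator has no character offset")
    end = int(offset) + len(orth) - 1
    return f"{path}\\{end}"


print(implied_to(r"1.2.1\1", "Les"))      # 1.2.1\3
print(implied_to(r"1.2.1\17", "basent"))  # 1.2.1\22
```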

<data>: contains "plain text" (PCDATA) extracted from a document. This element can be used to distinguish plain text when it is interspersed with <s>, <par>, etc. elements. This is useful when using tools which work with the SGML ESIS tree, in which PCDATA is treated in the same way as a proper SGML element--that is, it exists at a (leaf) node of the tree. Marking this data with a <data> tag identifies the data as PCDATA and enables uniform treatment of nodes of the ESIS tree.

5.2.6. Token constituents

<orth>

contains the orthographic form of the token as it appears in the original, and as it may appear in a lexicon, possibly modified by processing (e.g., a compound may appear as "in_spite_of").

<disamb>

groups one or more disambiguated corpus tags and/or full morphosyntactic descriptions associated with the token.

<lex>

groups one or more alternative sets of morphosyntactic information associated with the token.

<base>

the base or lemmatized form for the morphosyntactic information given in the associated <msd> element.

<msd>

the morphosyntactic description, specified in EAGLES-compliant format.

<ctag>

contains a corpus tag. When this tag appears within the <lex> element, it gives the corpus tag associated with the accompanying morphosyntactic information. Attributes:

certainty: provides the level of certainty associated with this corpus tag assignment for the token in question, usually expressed as a percentage.

5.2.7. Example

The following example shows the use of most of the options provided in the cesAna DTD. This set of annotation data could be the final result after tokenization, segmentation, lexical lookup or morphosyntactic analysis, and part of speech disambiguation. All the original options for morphosyntactic class are retained here, and the disambiguated tag is provided in the <disamb> element.

Note also that the header for this text is stored in another file and included in this document as an entity.


     <!doctype cesAna PUBLIC "-//CES//DTD cesAna//EN">
     <cesAna version="1.5" type="SENT TOK LEX DISAMB" doc=MyText1>
     <cesHeader version="2.3">
         ...
     </cesHeader>
       <chunkList>
         <chunk doc="MyText1" from='1.2.1\1'>
           <s >
             <tok class='tok' from='1.2.1\1'>
               <orth>Les</orth>
               <disamb>
                   <ctag>DMP</ctag>
               </disamb>         
               <lex>
                   <base>le</base>
                   <msd>Da-fp--d</msd>
                   <ctag>DFP</ctag>
               </lex>
               <lex>
                   <base>le</base>
                   <msd>Da-mp--d</msd>
                   <ctag>DMP</ctag>
               </lex>
               <lex>
                   <base>le</base>
                   <msd>Pp3fpj-</msd>
                   <ctag>PPJ</ctag>
               </lex>
               <lex>
                   <base>le</base>
                   <msd>Pp3mpj-</msd>
                   <ctag>PPJ</ctag>
               </lex>
             </tok>
             <tok class='tok' from='1.2.1\5'>
               <orth>critères</orth>
               <disamb>
                   <ctag>NCMP</ctag>
               </disamb>         
               <lex>
                   <base>critère</base>
                   <msd>Ncmp-</msd>
                   <ctag>NCMP</ctag>
               </lex>
             </tok>
             <tok  class='tok' from='1.2.1\14'>
               <orth>se</orth>
               <disamb>
                   <ctag>PPJ</ctag>
               </disamb>         
               <lex>
                   <base>se</base>
                   <msd>Pp3msj-</msd>
                   <ctag>PPJ</ctag>
               </lex>
               <lex>
                   <base>se</base>
                   <msd>Pp3fpj-</msd>
                   <ctag>PPJ</ctag>
               </lex>
               <lex>
                   <base>se</base>
                   <msd>Pp3fsj-</msd>
                   <ctag>PPJ</ctag>
               </lex>
               <lex>
                   <base>se</base>
                   <msd>Pp3mpj-</msd>
                   <ctag>PPJ</ctag>
               </lex>
             </tok>
             <tok  class='tok' from='1.2.1\17'>
               <orth>basent</orth>
               <disamb>
                   <ctag>VM3P</ctag>
               </disamb>         
               <lex>
                   <base>baser</base>
                   <msd>Vmip3p--</msd>
                   <ctag>VM3P</ctag>
               </lex>
               <lex>
                   <base>baser</base>
                   <msd>Vmsp3p--</msd>
                   <ctag>VM3P</ctag>
               </lex>
             </tok>
             <tok  class='tok' from='1.2.1\24'>
               <orth>sur</orth>
               <disamb>
                   <ctag>SP</ctag>
               </disamb>         
               <lex>
                   <base>sur</base>
                   <msd>Afpms-</msd>
                   <ctag>AMS</ctag>
               </lex>
               <lex>
                   <base>sur</base>
                   <msd>Sp</msd>
                   <ctag>SP</ctag>
               </lex>
             </tok>
             ...
           </s>
         </chunk>
       </chunkList>
    </cesAna>

Alternatively, if a more concise set of information is desired, the following could be provided for the first token in the example above:

       <tok  class='tok' from='1.2.1\1'>
         <orth>Les</orth><base>le</base><ctag>DMP</ctag></tok>
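
Because this concise fragment has explicit end tags and quoted attribute values, it happens also to be well-formed XML, so it can be read with a generic XML parser for illustration. The Python sketch below uses the standard library's xml.etree.ElementTree; it would not work on SGML that relies on tag omission or unquoted attribute values, for which a true SGML parser is needed. The record dictionary is merely an illustrative structure.

```python
import xml.etree.ElementTree as ET

# The concise <tok> from the example above, copied as a raw string so that
# the backslash in the locator is preserved.
fragment = r"""<tok class='tok' from='1.2.1\1'>
  <orth>Les</orth><base>le</base><ctag>DMP</ctag></tok>"""

tok = ET.fromstring(fragment)

# Collect the locator and the token constituents into a plain dictionary.
record = {
    "from": tok.get("from"),
    "orth": tok.findtext("orth"),
    "base": tok.findtext("base"),
    "ctag": tok.findtext("ctag"),
}

for key, value in record.items():
    print(f"{key}: {value}")
```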


5.3. Encoding conventions for parallel text alignment

The annotation document containing alignment information consists entirely of links between the documents that have been aligned.

Alignment may be between primary data documents or between annotation documents containing segmentation information for the aligned units (paragraphs, sentences, tokens etc.). Alignment may be between two or more such documents, which should be identified in the cesHeader of the alignment document (see section 5.3.2).

5.3.1. Global attributes

Four global attributes are defined in the cesAlign DTD:

id

a unique identifier for the element bearing the ID value.

n

a number or other label for the element, not necessarily unique within the document.

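lang

indicates that the tag's content is in the specified language, as described in section 5.2.1. The value is composed of one of the following:

a two-letter code from ISO 639 (e.g., "en" for English);
a three-letter code from ISO 639-2 (e.g., "eng" for English);
one of the above extended by a country code from ISO 3166 (e.g., "en.uk" or "eng.uk" for English as spoken in the United Kingdom).

wsd

indicates that the tag's content is encoded in the specified character set, as described in section 5.2.1.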

The global attributes are defined at the top of the cesAlign DTD and represented by an entity, A.ALIGN. This entity is used to represent the list of global attributes on the attribute declarations for most elements in the document.

5.3.2. Top-level Constituents

The top level structure of the cesAlign DTD is as follows:

<cesAlign>

a single annotation document, containing a <cesHeader> element, followed by a <linkList> element. In addition to n and id, this element has the following attributes:

type: indicates the type of alignment. Values:
- PAR: alignment by paragraphs
- SENT: alignment by orthographic sentences
- TOK: alignment by tokens
fromDoc: provides the location (URL, path/filename, etc.) of the first file containing the aligned data.
toDoc: provides the location (URL, path/filename, etc.) of the second file containing the other set of aligned data.
version: provides the version of the cesAlign DTD to which this document is compliant.

Note that the fromDoc and toDoc attributes are provided for the common case where only two files are being aligned. When three or more files are aligned, it is necessary to identify the files using the <translations> element in the header (see below).

<cesHeader>

contains the header for this document. See section 1.3 for a full description.

For alignment documents, an important part of the header is the <translations> element, which should contain, for each document being aligned, a <translation> element identifying and locating the document. The <translations> element is required in order to identify the aligned documents when three or more files are being aligned. When only two files are being aligned, the fromDoc and toDoc attributes on the <cesAlign> element can be used to identify the aligned files.

The n attribute on <translation> elements in the cesHeader may be used to indicate the order in which the aligned documents are referenced in the xtargets attribute on the <link> element (see section 5.3.4.2). When three or more files are being aligned using xtargets, this method of indicating the order of file reference is required. When only two documents are being aligned, the order can be indicated using the fromDoc and toDoc attributes on the <cesAlign>, <linkGrp> and/or <link> elements.

Note: the cesHeader is optional in the cesAlign DTD, for convenience during processing. However, for data conforming to this DTD which is in a final form or which may be interchanged, the cesHeader is required.

<linkList>: contains one or more occurrences of the element <linkGrp>, defined below.

5.3.3. Groups of links

The <linkGrp> element is used to group sets of links. In most cases, a group of links applies to data within a particular text division, paragraph, etc. This can be indicated using the domains attribute.

<linkGrp>

contains a series of links considered to be a group. Attributes include:

type: indicates the type of data with which the group is associated, e.g., paragraph data, titles, etc.
targType: indicates the type of data being linked, e.g., paragraph, sentence, etc.
domains: optionally specifies the identifiers of the elements within which all elements indicated by the contents of this element lie. Its value must consist of at least two valid SGML identifiers in the linked documents.
fromDoc: optionally provides the location (URL, path/filename, etc.) of the first file containing one set of aligned data, when only two files are being aligned.
toDoc: optionally provides the location (URL, path/filename, etc.) of the second file containing the other set of aligned data, when only two files are being aligned.
fromLoc: provides, using the notation outlined in the section above on Locators, the location in the document described in fromDoc that is being linked.
toLoc: provides, using the notation outlined in the section above on Locators, the location in the document described in toDoc that is being linked.

In most instances, the documents being aligned in a cesAlign document will be indicated in the fromDoc and toDoc attributes on the <cesAlign> element (when only two documents are aligned), or using the <translations> element in the cesHeader. However, it is also possible to use the fromDoc and toDoc attributes on the <linkGrp> and <link> elements to indicate the documents being aligned. This may be necessary if a single alignment document contains alignment information for more than one pair of files. Therefore, the attributes fromDoc and toDoc are provided on the <linkGrp> and <link> (see below) elements for use where desired or needed.
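
Since fromDoc and toDoc may appear at three levels, a processor needs some rule for deciding which values apply to a given link. The Python sketch below assumes that the most specific element wins (<link> over <linkGrp> over <cesAlign>); this precedence is an assumption made for the illustration and is not stated in the description above. The function name effective_docs and the file names are hypothetical.

```python
from typing import Mapping, Optional, Tuple


def effective_docs(*levels: Mapping[str, str]) -> Tuple[Optional[str], Optional[str]]:
    """Resolve fromDoc/toDoc for a link.

    levels holds the attributes of each enclosing element, ordered from the
    outermost (<cesAlign>) to the innermost (<link>). Assumed rule: a value
    given on an inner element overrides one given further out.
    """
    from_doc: Optional[str] = None
    to_doc: Optional[str] = None
    for attrs in levels:
        from_doc = attrs.get("fromDoc", from_doc)
        to_doc = attrs.get("toDoc", to_doc)
    return from_doc, to_doc


ces_align = {"fromDoc": "text1.en.ana", "toDoc": "text1.fr.ana"}  # invented names
link_grp = {}                                                      # nothing overridden
link = {"toDoc": "text2.fr.ana"}                                   # overrides toDoc only

print(effective_docs(ces_align, link_grp, link))
# ('text1.en.ana', 'text2.fr.ana')
```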

5.3.4. Links

The following section defines the elements in the cesAlign DTD that are used to link data in two or more SGML documents for the purposes of parallel alignment. Subsequent sections provide a discussion of the various methods available and give examples of their use.

5.3.4.1. Elements for linking in the cesAlign DTD

The following elements in the cesAlign DTD are used for establishing alignment links:

<link>

a link specifying the SGML elements in documents that have been aligned. Attributes include:

targType

indicates the type of data being linked, e.g., paragraph, sentence, etc.

targOrder

specifies whether the order of the identifiers in the targets list is significant. Values:

Y Yes: the order of the IDREFs specified as the value of the targets attribute should be followed when the elements are combined.

N No: the order of the IDREFs specified as the value of the targets attribute has no significance.

U* Unspecified: no claim is made about the order of the IDREFs specified as the value of the targets attribute.

evaluate

specifies the intended meaning when the target or targets are pointers themselves. Values:

ALL if the element pointed to is itself a pointer, then the target of that pointer will be taken, and so on, until an element is found that is not a pointer.

ONE if the element pointed to is itself a pointer, then its target (whether a pointer or not) is taken as the target of this pointer.

NONE no further evaluation of targets is carried out beyond that needed to find the element specified in the pointer's target.

targets

provides the IDs of two or more <xptr> elements that point to the locations of the aligned data in each of the aligned documents.

xtargets

provides the IDs of two or more elements in different SGML documents that point to the locations of the aligned data in each of the aligned documents.

certainty

gives a value indicating the degree of certainty for establishing this link, usually in the form of a percentage.

As for the <linkGrp> element (see above), attributes to handle linkage between two documents are provided on <link>:

fromDoc: optionally provides the location (URL, path/filename, etc.) of the first file containing one set of aligned data, when only two files are being aligned.
toDoc: optionally provides the location (URL, path/filename, etc.) of the second file containing the other set of aligned data, when only two files are being aligned.
fromLoc: provides, using the notation outlined in the section above on Locators, the location in the first document that is being linked.
toLoc: provides, using the notation outlined in the section above on Locators, the location in the second document being linked.

The fromLoc and toLoc attributes are used when the data pointed to in each of these attributes is the entire contents of a single SGML element. For data which is not the entire contents of an SGML element, or when referencing more than two locations (for example, for many-to-one alignments) with the CES locator notation, use the mechanisms outlined in section 5.3.4.3.

<xptr>

a pointer to a location in an external file. Attributes include the global attributes, but in this case, the value of id may be the target specified on a <link> tag. The following additional attributes are defined:

targType: indicates the type of data being linked, e.g., paragraph, sentence, etc.
doc: provides the location (URL, path/filename, etc.) of the file containing the data being pointed to.
from: provides, using the notation outlined in the section above on Locators, the starting location of the data in the original document.
to: provides, using the notation outlined in the section above on Locators, the ending location of the data in the original document.

Note that because the doc attribute on the <xptr> element is defined as #CURRENT, once a value has been specified for this attribute on one instance of <xptr>, all subsequent occurrences of that element will use this value as the default unless it is re-specified. Therefore, verbosity can be reduced by placing all the <xptr> elements that point to the same document sequentially within the alignment document.
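
The effect of a #CURRENT default can be imitated in a few lines of Python, which may help when alignment documents are processed outside an SGML parser. The function apply_current_default and the sample attribute values are hypothetical; the sketch assumes that the <xptr> elements are supplied in document order as dictionaries of their explicitly specified attributes.

```python
from typing import Dict, List


def apply_current_default(xptrs: List[Dict[str, str]], attr: str = "doc") -> List[Dict[str, str]]:
    """Fill in an omitted attribute with the most recently specified value."""
    current = None
    resolved = []
    for attrs in xptrs:
        if attr in attrs:
            current = attrs[attr]  # an explicit value becomes the new default
        elif current is None:
            raise ValueError(f"first occurrence must specify {attr!r}")
        resolved.append({**attrs, attr: current})
    return resolved


# Invented sample: pointers into the same document are grouped together,
# so doc needs to be given only when the document changes.
pointers = [
    {"id": "x1", "doc": "text1.en", "from": r"1.2.1\1"},
    {"id": "x2", "from": r"1.2.2\1"},
    {"id": "x3", "doc": "text1.fr", "from": r"1.2.1\1"},
    {"id": "x4", "from": r"1.2.2\1"},
]

for xptr in apply_current_default(pointers):
    print(xptr["id"], xptr["doc"])
```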

<ptr>

a pointer to one or more locations in the current document, typically for the purpose of aggregating elements for alignment, i.e., in one-to-many or many-to-many alignments. Attributes include the global attributes plus the following:

type

indicates the type of pointer, e.g., aggregating, aligning, etc.

targType

indicates the type of data being linked, e.g., paragraph, sentence, etc.

targOrder

specifies whether the order of the identifiers in the targets list is significant. Values:

Y Yes: the order of the IDREFs specified as the value of the targets attribute should be followed when the elements are combined.

N No: the order of the IDREFs specified as the value of the targets attribute has no significance.

U* Unspecified: no claim is made about the order of the IDREFs specified as the value of the targets attribute.

evaluate

specifies the intended meaning when the target or targets are pointers themselves. Values:

ALL if the element pointed to is itself a pointer, then the target of that pointer will be taken, and so on, until an element is found that is not a pointer.

ONE if the element pointed to is itself a pointer, then its target (whether a pointer or not) is taken as the target of this pointer.

NONE no further evaluation of targets is carried out beyond that needed to find the element specified in the pointer's target.

targets

provides the IDs of two or more <xptr> elements that point to the locations of the elements to be associated.
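
The three values of evaluate can be illustrated with a minimal sketch. It assumes that ONE follows exactly one level of indirection and that each pointer has a single target; the function name resolve and the dictionary of pointers are hypothetical and serve only this example.

```python
def resolve(target_id, pointers, evaluate="NONE"):
    """Return the ID that target_id resolves to under an evaluate mode.

    pointers maps the ID of each pointer element to the ID it points at.
    IDs absent from the map are ordinary, non-pointer elements.
    """
    if evaluate not in ("ALL", "ONE", "NONE"):
        raise ValueError("unknown evaluate value: " + evaluate)
    if evaluate == "NONE" or target_id not in pointers:
        return target_id
    if evaluate == "ONE":
        # Follow a single level of indirection, whatever it leads to.
        return pointers[target_id]
    # ALL: keep following until a non-pointer element is reached.
    seen = set()
    while target_id in pointers:
        if target_id in seen:
            raise ValueError("pointer cycle at " + target_id)
        seen.add(target_id)
        target_id = pointers[target_id]
    return target_id


# p1 points at p2, which in turn points at the sentence s5.
pointers = {"p1": "p2", "p2": "s5"}
print(resolve("p1", pointers, "NONE"))  # p1
print(resolve("p1", pointers, "ONE"))   # p2
print(resolve("p1", pointers, "ALL"))   # s5
```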

5.3.4.2. Linking data using IDs

The most common situation in aligning parallel translations is to align data which comprises the content of an entire SGML element, such as an <s>, <par>, or <tok> element. Especially when the aligned data is not in the SGML original document, it is likely that the elements to be associated will have id attributes by which they can be referenced in the alignment document, in order to specify the elements to be aligned or "linked".

Note that when the SGML ID and IDref mechanism is used to point from one element to another in the same SGML document, the SGML parser will validate the references to ensure that every IDREF points to a valid ID. In the CES, all alignment documents are separate from the documents that are being aligned, and therefore this validation of IDrefs by the SGML parser is lost. However, other software may be used to validate cross-document references, if necessary.

The CES provides a simple means to point to SGML elements in other SGML documents by referring to IDs or any other unique identifying attribute on those elements, using the xtargets attribute on the <link> element. Here is a simple example:

     DOC1: <s id=p1s1>According to our survey, 1988 sales of
           mineral water and soft drinks were much higher than in 1987, reflecting
           the growing popularity of these products.</s>   
           <s id=p1s2>Cola drink manufacturers in particular achieved above-   
           average growth rates.</s>

           <!-- ... -->

     DOC2: <s id=p1s1>Quant aux eaux minérales et aux limonades, elles
           rencontrent toujours plus d'adeptes.</s>
           <s id=p1s2>En effet, notre sondage fait ressortir des ventes    
           nettement supérieures à celles de 1987, pour les boissons 
           à base de cola notamment.</s>

     ALIGN DOC:
           <linkGrp targType="s"> 
             <link xtargets="p1s1 ; p1s1">
             <link xtargets="p1s2 ; p1s2"> 
           </linkGrp>

The IDrefs of the elements to be aligned are given in the xtargets attribute on the <link> element. A semicolon separates the IDref(s) from each document being linked. Many-to-one alignments are specified by providing a list of IDs from any single document, separated by spaces:


           <link xtargets="s1 ; s1 s2">
           <link xtargets="s23 s24 s25 ; s23 s24">

N-to-zero alignments can also be indicated:


           <link xtargets="s1 ; ">

Additionally, any number of files can be aligned using the xtargets attribute:


           <link xtargets="s1 ; s1 ; s1">
           <link xtargets="s1 ; s1 s2 ; s1">
           <link xtargets="s1 ; ; s1">

When more than two files are being aligned, the ordering must be specified in the cesHeader in the alignment document, as indicated above in the description of the cesHeader.
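
As a rough illustration of how software might read an xtargets value, the sketch below splits on the semicolon to obtain one group per document and on white space to obtain the IDs within each group. The function name parse_xtargets is hypothetical; the CES does not prescribe any particular implementation.

```python
def parse_xtargets(value):
    """Split an xtargets value into one list of IDs per aligned document."""
    # A semicolon separates the documents; white space separates the IDs
    # that belong to the same document. An empty group means "no counterpart".
    return [part.split() for part in value.split(";")]


print(parse_xtargets("s1 ; s1 s2"))   # [['s1'], ['s1', 's2']]
print(parse_xtargets("s1 ; "))        # [['s1'], []]
print(parse_xtargets("s1 ; ; s1"))    # [['s1'], [], ['s1']]
```

The position of each group in the result corresponds to the document order, which for more than two files is the order given in the cesHeader.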

Here is a more extended example using xtargets:

     DOC1: <cesDoc version="3.24">
           <cesHeader version="2.3">
                ...
           </cesHeader>
           <text>
              <body id="b1">
                 <div type=sample id="d1">
           <p id="d1p1">
             <s id="d1p1s1">J'ai donc dû choisir un autre métier 
             et j'ai appris à piloter des avions.</s>
             <s id="d1p1s2">J'ai volé un peu partout dans le monde.</s>
             <s id="d1p1s3">Et la géographie, c'est exact, m'a beaucoup servi.</s>
             <s id="d1p1s4">Je savais reconnaître, du premier coup d'oeil, la Chine
             de l'Arizona.</s>
             <s id="d1p1s5">C'est très utile, si l'on est égaré pendant la nuit.</s>
           </p>
                </div>
              </body>
           </text>
           </cesDoc>

     DOC2: <cesDoc version="3.24">
           <cesHeader version="2.3">
              ...
           </cesHeader>
           <text>
              <body id="b1">
                 <div type=sample id="d1">
           <p id="d1p1">
             <s id="d1p1s1">So then I chose another profession, and learned to 
             pilot aeroplanes.</s>
             <s id="d1p1s2">I have flown a little over all parts of the world; 
             and it is true that geography has been very useful to me.</s>
             <s id="d1p1s3">At a glance I can distinguish China from Arizona.</s>
             <s id="d1p1s4">If one gets lost in the night, such knowledge is 
             valuable.</s>
           </p>
                </div>
              </body>
           </text>
           </cesDoc>

     ALIGN DOC: 
           <cesAlign type=sent version=1.6>

           <cesHeader version="2.3">
              ...
             <translations>
               <translation trans.loc="text-f.sgml" lang=fr wsd="ISO8859-1" n=1>
               <translation trans.loc="text-e.sgml" lang=en wsd="ISO8859-1" n=2>
             </translations>
           </cesHeader>

           <linkList>

             <!-- sentence alignments -->
             <linkGrp domains="d1 d1" targType="s">
               <link xtargets="d1p1s1 ; d1p1s1">
               <link xtargets="d1p1s2 d1p1s3 ; d1p1s2">
               <link xtargets="d1p1s4 ; d1p1s3">
               <link xtargets="d1p1s5 ; d1p1s4">
             </linkGrp>

           </linkList>

           </cesAlign>
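
Since the SGML parser cannot validate references that cross document boundaries, a separate check may be useful. The following sketch assumes that the IDs of each aligned document have already been collected into a set and that the xtargets values have been extracted as strings; the function name missing_references is hypothetical.

```python
def missing_references(links, doc_ids):
    """Return (link index, document index, id) for every ID that a link
    names but that the corresponding document does not contain.

    links   -- list of xtargets values, one per <link>
    doc_ids -- list of ID sets, in the same order as the documents
               (assumes no link names more documents than are listed)
    """
    problems = []
    for link_no, xtargets in enumerate(links):
        for doc_no, part in enumerate(xtargets.split(";")):
            for ref in part.split():
                if ref not in doc_ids[doc_no]:
                    problems.append((link_no, doc_no, ref))
    return problems


# IDs and links taken from the extended example above.
doc1_ids = {"d1p1s1", "d1p1s2", "d1p1s3", "d1p1s4", "d1p1s5"}
doc2_ids = {"d1p1s1", "d1p1s2", "d1p1s3", "d1p1s4"}
links = [
    "d1p1s1 ; d1p1s1",
    "d1p1s2 d1p1s3 ; d1p1s2",
    "d1p1s4 ; d1p1s3",
    "d1p1s5 ; d1p1s4",
]
print(missing_references(links, [doc1_ids, doc2_ids]))  # []
```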

5.3.4.3. Linking data using locators

When the data to be linked does not include IDs on the relevant elements (or IDrefs are, for some reason, not to be used for alignment), or when the data to be linked is not the entire content of an SGML element, it is necessary to reference locations in the documents by the methods outlined in section 5.1., Locators. The examples below all use the CES concise notation described in that section, which combines ESIS tree location and character offset to specify a location.

If the data to be aligned comprise the content of entire SGML elements (such as <s>, <p>, etc.), and when only two files are to be aligned, the fromLoc and toLoc attributes on the <link> element can be used to accomplish the alignment. For example:

<link fromLoc="2.1.1.1.2.1" toLoc="2.1.1.1.3.2">

When the data does not comprise the entire content of an SGML element, it must be referenced by the method outlined in section 5.1., Locators. This demands the use of <xptr> elements, since each target must specify a starting and ending location for the referenced string in each of the aligned documents. Therefore it is necessary to specify something like the following:

     <xptr id=En1 doc=EN104 from="2.1.1.1.2.1\1" to="2.1.1.1.2.1\5">
     <xptr id=Fr1 doc=FR413 from="2.1.1.1.2.1\1" to="2.1.1.1.2.1\8">
     <link targets="En1 Fr1">

For alignments involving three or more documents, this same mechanism is used, since any number of IDs can be specified in the value field of the targets attribute on the <link> element. For example:

     <xptr id=En1 doc=EN104 from="2.1.1.1.2.1\1" to="2.1.1.1.2.1\5">
     <xptr id=Fr1 doc=FR413 from="2.1.1.1.2.1\1" to="2.1.1.1.2.1\8">
     <xptr id=De1 doc=DE234 from="2.1.1.1.2.1\4" to="2.1.1.1.2.1\12">
     <link targets="En1 Fr1 De1">

One-to-many and many-to-many alignments are accomplished by using <ptr> elements to associate <xptr> elements, which then may be linked as a group using the mechanisms above. For example, this encoding aligns two sentences in one text with one in another:

     <xptr id=Es43 from="2.1.1.1.2.1" to="2.1.1.1.3.2">
     <xptr id=Es44 from="2.1.1.1.4.1" to="2.1.1.1.4.2">
     <ptr id=Es43.44 targets="Es43 Es44" targOrder=Y>
     <link id=Fs42 fromLoc="2.1.1.1.6.1" toLoc="2.1.1.1.6.2">
     <link targets="Es43.44 Fs42">

In an n-to-zero alignment, only one IDref would appear in the targets attribute on the <link> element.
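
The locator values used in the examples of this section can be taken apart as sketched below. The sketch assumes, on the basis of those examples only, that the dotted numbers give the tree location and that a number following a backslash gives the character offset; section 5.1 remains the authoritative definition of the notation. The function name parse_locator is hypothetical.

```python
def parse_locator(locator):
    r"""Split a concise locator such as 2.1.1.1.2.1\5 into (path, offset).

    path   -- list of integers read from the dotted tree location
    offset -- integer after the backslash, or None when there is none
    """
    path_text, sep, offset_text = locator.partition("\\")
    path = [int(step) for step in path_text.split(".")]
    offset = int(offset_text) if sep else None
    return path, offset


print(parse_locator(r"2.1.1.1.2.1\5"))  # ([2, 1, 1, 1, 2, 1], 5)
print(parse_locator("2.1.1.1.3.2"))     # ([2, 1, 1, 1, 3, 2], None)
```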
