Corpus Encoding Standard - Document CES 1. Part 3. Version 1.6. Last modified 10 August 1996.

Part 3

The Header

3.1. Global attributes
3.2. Document structure
3.3. The File description
3.4. The Encoding description
3.5. The Profile description
3.6. The revision description
3.7. Use of decls attribute
3.8. Examples
3.9. The CES Header element definitions
3.10. The CES Header element definitions in hypertext navigable format

3. The Header

The header provides information about the electronic text that has been encoded, including not only its title, author, etc. but also information about its encoding. The TEI header provided the first means of documenting electronic texts, and it has been widely adopted and adapted for use in text and corpus encoding.

The CES adopts the use of the TEI header, customizing it using the TEI customization mechanisms to suit the specific needs of corpus-based research. In the CES (as in the TEI) headers are provided for the SGML document containing the entire corpus, as well as for each individual text within a corpus.

The CES header is, for the most part, a subset of the TEI header (see TEI P3, chapter 5, "The Header", and chapter 23, "Language Corpora"). There are the following exceptions:

elements have been added for more precision in the specifications;
attributes have been added to existing elements;
attribute values have been constrained to allow only a given set of values;
element content models are simplified, to contain either a sequence of tags in sub-categories, or plain text (PCDATA).

The minimal requirements of the CES header are the same as those for the TEI header.

The CES header still requires further work to determine exactly which elements and information are appropriate for corpora. We intend to develop a more constrained model with a precise template, to facilitate and regularize the creation of corpus and text headers.

3.1. Global attributes

Three global attributes are defined, which may appear on any element in the header:

id

a unique identifier for the element bearing the ID value.

n

a number or other label for the element, not necessarily unique within the corpus.

lang

indicates that the tag's content is in the specified language. The value of the lang attribute is composed of one of the following:

a two-letter code from ISO 639 (e.g., "en" for English);
a three-letter code from ISO 639-2 (e.g., "eng" for English);
one of the above extended by a country code from ISO 3166 (e.g., "en.uk" or "eng.uk" for English as spoken in the United Kingdom).
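
The three forms listed above can be checked mechanically. The following Python sketch is illustrative only and is not part of the CES: the function name is_ces_lang_value is hypothetical, and the sketch assumes (from the examples "en.uk" and "eng.uk") that codes are written in lower case and that the country code is two letters joined to the language code with a full stop. It checks only the shape of the value, not whether the codes are actually registered in ISO 639 or ISO 3166.

```python
import re

# Shape of a lang value: a 2- or 3-letter language code, optionally
# followed by "." and a 2-letter country code (assumed from "en.uk").
_LANG_VALUE = re.compile(r"[a-z]{2,3}(\.[a-z]{2})?")


def is_ces_lang_value(value: str) -> bool:
    """Return True if value has the shape of a CES lang attribute value."""
    return _LANG_VALUE.fullmatch(value) is not None
```

With these assumptions, "en", "eng", "en.uk" and "eng.uk" are accepted, while a value such as "english" is rejected.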

The global attributes for elements in the header are defined at the top of the header.elt and represented by an entity, A.HEADER. This entity stands for the list of global attributes in the attribute declarations for most elements in the header of all CES DTDs.

3.2. Document structure

Each text in the corpus (i.e. each <cesDoc> element) has its own header, referred to as a text header. The whole corpus also has a header, referred to as the corpus header, which contains information applicable to the whole corpus (possibly with some local overriding). Both corpus and text headers are represented by <cesHeader> elements. The type attribute is used to distinguish the two.

The root of the CES header element tree is the <cesHeader> element, defined as follows:

<cesHeader>

contains the descriptive and declarative information making up an "electronic title page" prefixed to every text, or to the corpus as a whole.

type

specifies the kind of document to which the header is attached.

CORPUS the header is attached to the corpus.

TEXT* the header is attached to a single text.

creator

specifies the agency responsible for creating the header.

version

specifies the version and revision of the CES header.elt used to encode this header. This number is found near the top of the header.elt itself.

status

specifies the revision status of the header.

NEW* this is the first version of the header

UPDATE header has been updated.

date.created

specifies the date on which the header content was created.

date.updated

specifies the date on which the header content was last updated.

The <cesHeader> element contains the following four elements:

<fileDesc>: contains a full bibliographic description of the corpus itself or of a text within it.
<encodingDesc>: documents the relationship between an electronic text and the source or sources from which it was derived.
<profileDesc>: provides further information about various aspects of a text, specifically the language used, the situation and date of its production, the participants and their setting, and a descriptive classification for it.
<revisionDesc>: summarizes the revision history for a file.

These elements are tagged as follows:

          <cesHeader>
               <fileDesc></fileDesc>
               <encodingDesc></encodingDesc>
               <profileDesc></profileDesc>
               <revisionDesc></revisionDesc>
          </cesHeader>

3.3. The File description

The file description, represented by the <fileDesc> element, is the first of the four main constituents of the header and the only one that is required. It documents the electronic file itself, i.e. (in the case of a corpus header) the whole corpus, or (in the case of a text header) the individual text to which the header applies.

It contains the following elements:

<titleStmt>: groups information concerning the title of the corpus or the individual text and its constituent texts.
<editionStmt>: contains any additional information relating to a particular version of a text.
<extent>: provides the size of the electronic text as stored on some carrier medium.
<publicationStmt>: groups information concerning the publication or distribution of the corpus and its constituent texts.
<sourceDesc>: supplies a bibliographic description of the copy text(s) from which an electronic text was derived or generated.

Further detail is given in the following subsections. Note that these relate only to the electronic file (the corpus text itself); bibliographic and other details of the written or spoken text from which it derives are given in the source description.

Note that the <titleStmt> describes the machine-readable file, while the source text is specified in the <sourceDesc>. The title in the <titleStmt> should indicate that this is a machine-readable version and should not be identical to the title of the source text.

<titleStmt>, <publicationStmt>, and <sourceDesc> are required.

The minimal header has the following structure:


        <cesHeader version="2.0">
            <fileDesc>
                 <titleStmt>
                     <h.title></h.title>
                 </titleStmt>
                 <publicationStmt>
                     <distributor></distributor>
                     <pubAddress></pubAddress>
                     <availability></availability>
                     <pubDate></pubDate>
                 </publicationStmt>    
                 <sourceDesc>
                     <biblStruct>
                          <monogr>
                               <h.title></h.title>
                               <h.author></h.author>
                               <imprint>
                                    <pubPlace></pubPlace>
                                    <publisher></publisher>
                                    <pubDate></pubDate>
                               </imprint>
                          </monogr>
                     </biblStruct>
                 </sourceDesc>
            </fileDesc>
        </cesHeader>
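
The requirement that <fileDesc> contain <titleStmt>, <publicationStmt>, and <sourceDesc> can be checked with a few lines of code. The following Python sketch is only an illustration under stated assumptions: it presumes the header is available as well-formed XML (explicit end tags, quoted attribute values), which is stricter than the SGML shown in this document, and the names REQUIRED_IN_FILEDESC and missing_required are hypothetical. It does not replace validation against the CES DTD.

```python
import xml.etree.ElementTree as ET

# Elements that must appear inside <fileDesc>, as stated above.
REQUIRED_IN_FILEDESC = ("titleStmt", "publicationStmt", "sourceDesc")


def missing_required(ces_header: ET.Element) -> list:
    """List the required header elements absent from a <cesHeader>."""
    file_desc = ces_header.find("fileDesc")
    if file_desc is None:
        # Without <fileDesc>, none of its required children can be present.
        return ["fileDesc", *REQUIRED_IN_FILEDESC]
    return [name for name in REQUIRED_IN_FILEDESC
            if file_desc.find(name) is None]
```

An empty list means that the three required components are present; it says nothing about whether their content is complete.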

Note that if the lang or wsd attributes are used on elements in the main text, the header must include a <profileDesc> element containing <langUsage> (for use of lang) and/or <wsdUsage> (for use of wsd).

3.3.1. Title statement

This element consists of an <h.title> element followed by zero or more <respStmt> elements. These sub-elements are used throughout the header, wherever the title of a work or a statement of responsibility is required.

<h.title>: the title of the electronic file, including alternative titles or subtitles.
<respStmt>: supplies information about any person or institution responsible for the intellectual content of a text, edition, or electronic transcription.

<respStmt> in turn contains the following elements:


<respType>: contains a phrase describing the nature of a person's or institution's intellectual responsibility.
<respName>: the publisher of the corpus or text expressed as the proper name of a person, place or institution.

3.3.2. Edition statement

In the corpus header, the version attribute on the <editionStmt> element is used to indicate both a version number and a revision number, in the form "version.revision", where "version" changes if texts are added to or removed from the corpus, and "revision" changes if amendments are made within texts or the corpus header.

In individual text headers, the version attribute carries only a revision number.

This tag can be empty. For example:

<editionStmt version='1'>

This element corresponds to the TEI <editionStmt>, except that its content is an unstructured note.
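
The two conventions for the version attribute (the "version.revision" form in the corpus header, a bare revision number in text headers) can be illustrated with a short Python sketch. The function name parse_edition_version and the returned tuple layout are assumptions made for this illustration, as is the restriction to purely numeric components; the CES itself specifies only the form of the attribute value.

```python
from typing import Optional, Tuple


def parse_edition_version(value: str,
                          corpus_header: bool) -> Tuple[Optional[int], int]:
    """Split an <editionStmt> version attribute into (version, revision).

    Corpus headers use "version.revision"; text headers carry only a
    revision number, so version is returned as None for them.
    """
    if corpus_header:
        # Raises ValueError if the value is not exactly "version.revision".
        version, revision = value.split(".")
        return int(version), int(revision)
    return None, int(value)
```

For example, the value "2.3" in a corpus header would denote revision 3 of version 2 of the corpus, while the value "1" in a text header denotes revision 1 of that text.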

3.3.3. Extent statement

This element corresponds to the TEI <extent> element in that it describes the number of words in the whole corpus or in an individual text. It differs in that it contains specific tags for specifying the size of the text or corpus in terms of words and bytes.

<extent>: describes the approximate size of the electronic text as stored on some carrier medium, specified in words (corpus header) and additionally in Kb (corpus texts).

The <extent> tag contains:

<wordCount>

contains the count of words in the text.

<byteCount>

contains the count of bytes in the file containing the text together with its markup.

units

gives the unit in which the byte count is measured.

BYTES bytes

KB* kilobytes

MB megabytes

GB gigabytes

<extNote>

a descriptive note supplying additional information of any kind relating to the extent information provided within a corpus or text header.

For the purposes of the word count value, a "word" is considered to be an orthographic word--i.e., a string of characters surrounded by blanks. Punctuation not surrounded by white space is not considered as a word. This criterion is used as a default since this sort of count can be achieved fairly simply by automatic means. If any other definition is used it should be documented in the optional <extNote> tag; e.g.,

<extNote>Punctuation marks counted separately in the wordcount.</extNote>

The <byteCount> tag gives the size of the text including its tags, in its representation as a text file encoded in an 8-bit ISO character set, which is useful for calculating media requirements or file download times.
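
The default word and byte counts described above can be obtained by simple automatic means, as the following Python sketch suggests. It is an illustration only: the function names are hypothetical, ISO 8859-1 is assumed as the 8-bit character set, and the caller is assumed to supply the text both with and without its markup (removing tags is outside the scope of the sketch).

```python
def count_orthographic_words(plain_text: str) -> int:
    """Count strings of characters separated by white space."""
    return len(plain_text.split())


def count_bytes(marked_up_text: str, encoding: str = "iso-8859-1") -> int:
    """Count the bytes of the text, tags included, in an 8-bit encoding."""
    return len(marked_up_text.encode(encoding))


def extent_element(plain_text: str, marked_up_text: str) -> str:
    """Build an <extent> element reporting words and size in bytes."""
    words = count_orthographic_words(plain_text)
    size = count_bytes(marked_up_text)
    return (
        "<extent>"
        f"<wordCount>{words}</wordCount>"
        f'<byteCount units="BYTES">{size}</byteCount>'
        "</extent>"
    )
```

Because words are delimited by white space only, punctuation attached to a word is counted as part of that word; a character that cannot be represented in the chosen 8-bit character set causes the byte count to fail rather than return a misleading figure.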

3.3.4. Publication statement

This corresponds to the TEI <publicationStmt> but has a narrower focus, since it relates only to the public availability of the electronic text.

It contains the following sub-elements:

<distributor>

gives the name of the person or institution who distributes the text or corpus.

<pubAddress>

contains a postal address of the distributor.

<telephone>

gives the telephone number of the person or institution who distributes the text or corpus, in a format conforming to ITU-T/CCITT Recommendation E.123.

<fax>

gives the fax number of the person or institution who distributes the text or corpus, in a format conforming to ITU-T/CCITT Recommendation E.123.

<eAddress>

gives an electronic address of the person or institution who distributes the text or corpus. Note that more than one occurrence of this tag can appear, so that multiple addresses (possibly of different types) can be included. Attribute:

type

gives the type of the electronic address (email address, web site, ftp site, etc.). Suggested values include:

EMAIL* the value is an electronic mail address.

WWW the value is a web site address.

FTP the value is an ftp address.

<availability>

supplies information about the availability of a text, for example, any restrictions on its use or distribution, its copyright status, etc.

region

specifies the territories within which rights in the electronic text apply. Suggested values include:

WORLD* the text is freely available.

EU European Union only

status

supplies a code identifying the current availability of the text. Values are:

RESTRICTED the text is not freely available.

UNKNOWN* the status of the text is unknown.

FREE the text is freely available.

<idno>

supplies a number (e.g., ISBN) used to identify a bibliographic item.

<pubDate>

the publication date expressed in any format

value: specifies standard value for this date in ISO 8601 (Representation of dates and times) format
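
The division of labour between the element content (any format) and the value attribute (ISO 8601) can be sketched as follows. This Python fragment is illustrative only; the function name pub_date_element is hypothetical, and it assumes that a complete calendar date is known, which is written as YYYY-MM-DD in ISO 8601.

```python
from datetime import date


def pub_date_element(display: str, when: date) -> str:
    """Build a <pubDate> whose value attribute is an ISO 8601 date.

    display is the date as it should appear in the header, in any
    format; when supplies the standardized value.
    """
    return f'<pubDate value="{when.isoformat()}">{display}</pubDate>'
```

For instance, the display form "10 August 1996" paired with the date 10 August 1996 would carry the value "1996-08-10".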

3.3.5. Source description

This element corresponds to the TEI <sourceDesc>, except that its content is constrained to include only the following possible sub-elements:

<biblStruct>: contains a structured bibliographic citation, in which only bibliographic sub-elements appear and in a specified order.
<biblFull>: contains a bibliographic citation for a text which has been previously encoded in electronic form. This element contains the same elements as the <fileDesc> element, and is intended to include the header of the electronic text from which the current document is derived.

The headers of individual texts will each contain at least one of the above elements to specify their source. When a particular text contains items derived from more than one bibliographic source or recording, all relevant sources for which information is available are listed in the text header, and individual <div> elements are associated with the correct citation or recording by means of the decls attribute.

If an electronic text has been derived from a previous electronic version of the text, then the source description will contain a <biblFull> element. If this version had itself been derived from another electronic version, then this <biblFull> element could contain yet another <biblFull> element, and so on for as many recursive levels as required. If the electronic text described in any <biblFull> element is derived from a print source, it contains a <biblStruct> element describing that source.
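
The recursive derivation chain described above can be followed programmatically. The Python sketch below is an illustration under stated assumptions: the header is presumed to be available as well-formed XML, the function name derivation_depth_and_source is hypothetical, and a nested <biblFull> is looked for first inside the <sourceDesc> of the enclosing <biblFull> (since <biblFull> contains the same elements as <fileDesc>) and otherwise directly inside it.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def derivation_depth_and_source(
        source_desc: ET.Element) -> Tuple[int, Optional[ET.Element]]:
    """Follow nested <biblFull> elements down to the print source.

    Returns the number of electronic versions traversed and the
    <biblStruct> found at the end of the chain (None if there is none).
    """
    depth = 0
    current = source_desc
    while True:
        bibl_struct = current.find("biblStruct")
        if bibl_struct is not None:
            return depth, bibl_struct
        bibl_full = current.find("biblFull")
        if bibl_full is None:
            return depth, None
        depth += 1
        # The next level of the chain is expected in the <sourceDesc>
        # of this <biblFull>; fall back to the <biblFull> itself.
        nested = bibl_full.find("sourceDesc")
        current = nested if nested is not None else bibl_full
```

A depth of 0 means that the text was encoded directly from the source described by the <biblStruct>; a depth of 2 means that two earlier electronic versions lie between the current text and its print source.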

The <biblStruct> element

The <biblStruct> element has the following component sub-elements:

<analytic>: contains bibliographic elements describing an item (e.g. an article or poem) published within a monograph, journal, or periodical and not as an independent publication.
<monogr>: contains bibliographic elements describing an item (e.g. a book or journal) published as an independent item (i.e. as a separate physical object).

At least one <monogr> element must be present in a <biblStruct> element. It may contain the following elements:

<h.title>

the title of a work.

<h.author>

in a bibliographic reference, contains the name of an author (personal or corporate) of a work; names should be given in a canonical form, with surnames preceding forenames.

<respStmt>

supplies information about any person or institution responsible for the intellectual content of a text, edition, or electronic transcription.

<edition>

provides bibliographic details for an edition of some text.

<imprint>

groups information relating to the publication or distribution of a bibliographic item.

<idno>

supplies a standard (e.g., ISBN) number used to identify a bibliographic item.

type

a name or abbreviation (e.g., ISBN) identifying what type of identifying number is given. Unless provided explicitly the default value is:

ISBN* the value is an ISBN number.

<biblScope>

defines the scope of a bibliographic reference, for example as a list of page numbers, or a named subdivision of a larger work.

type

identifies the type of information conveyed by the element.

PP the element contains a page number or page range.

VOL the element contains a volume number.

ISSUE the element contains an issue number, or volume and issue numbers.

<biblNote>

a descriptive note supplying additional information of any kind relating to a bibliographic item described within a corpus or text header.

Published texts must contain at least one <imprint> element, which can contain the following elements:

<publisher>

proper name of a person, place or institution.

type

categorises the name. Legal values are:

PERSON name of a person

PLACE name of a place

ORG name of an organization

<pubDate>

a calendar date in any format.

value: specifies standard value for this date in ISO 8601 format

<pubPlace>

place of publication for a book, article, etc.

The <analytic> element is used when multiple monographic records are grouped together into single items. When the item described by a bibliographic citation forms a part of some other bibliographic item (as, for example, a newspaper article within a newspaper, or a journal article within a collection), a monographic description should be given for the newspaper or collection, prefixed by an analytic description for the individual component, enclosed within an <analytic> element. This contains a mixture of the elements <h.author>, <respStmt>, and <h.title> in any order and repeated as necessary.

3.4. The Encoding description

The second major component of the header, the encoding description, contains information about the relationship between an encoded text and its original source and describes the editorial and other principles employed throughout the corpus.

The <encodingDesc> element has the following six components:

<projectDesc>: describes in detail the purpose for which an electronic file was encoded.
<samplingDecl>: contains a prose description of the rationale and methods used in sampling texts in the creation of the corpus.
<editorialDecl>: provides details of editorial principles and practices applied during the encoding of a text.
<tagsDecl>: provides detailed information about the tagging applied to an SGML document.
<refsDecl>: specifies how canonical references are constructed for this text.
<classDecl>: contains a series of <category> elements, defining the classification codes used for texts within the corpus.

3.4.1. The <projectDesc> element

This element provides information about the project for and by which the text or corpus was created, together with any other relevant information concerning the process by which it was assembled or collected. The content of this element is an unstructured note. Example:

      <projectDesc>
           The MULTEXT project is assembling a corpus consisting of
           mono-lingual texts in seven Eastern and Western European
           languages, together with parallel translations in each of
           these languages. The original texts were acquired in various
           forms and marked up for conformance with the MULTEXT/EAGLES
           Corpus Encoding Standard, to test and validate that scheme.
           
           MULTEXT has also developed a suite of annotation tools which
           have been tested on the texts in the corpus. 
      </projectDesc>

A minimal encoding description can contain only the <projectDesc> element. In this case, a prose description of the encoding methods can be provided. If documentation of encoding principles exists elsewhere (e.g., in a printed manual, at a given URL, or on an ftp site), this information should be provided.

If no <conformance> element is provided in an <editorialDecl> element within the encoding description, the CES conformance level must be provided here.

3.4.2. The <samplingDecl> element

This is also an unstructured note, which contains information about the methods for text sampling in the corpus. This element is relevant only in the corpus header. It provides details about the systematic inclusion or exclusion of portions of texts, the rationale, and the means by which this is noted in the encoding, if any. For example (adapted from the English-Norwegian Parallel Corpus Project manual):

      <samplingDecl>
           The texts of the core corpus are mostly extracts from books. 
           The extracts are between 10,000 and 15,000 words long (30 - 40  
           pages), and are taken from the beginning of the texts. The front  
           matter, prefaces, forewords, list of contents, etc., are not  
           included in the extracts. In some cases, introductions have been  
           left out as well, e.g. introductions by scholars to works of  
           fiction.
           
           Omission of passages in the text may be marked by an 
           <omit> tag. 
      </samplingDecl>

3.4.3. The <editorialDecl> element

The <editorialDecl> element contains the following elements, each specifying a particular kind of editorial practice used for some portion of the corpus.

Where the same principles apply across the whole corpus (e.g., for the <segmentation> element), they can be documented only once within the corpus header.

Where different parts of the corpus apply different practices (as for example with the <quotation> or <hyphenation> elements), all possible practices can be defined in the corpus header, and particular parts of the corpus can specify the editorial practices applicable to them by using the decls attribute. When this method is used, if a practice is not explicitly associated with a part of the corpus in this way, it is assumed not to apply to it.

<conformance>

provides the CES level of conformance for the text or corpus.

level: gives the level of CES conformance (legal values are 1, 2, or 3).

<transduction>

describes the principles according to which the text has been transduced, either in transcribing it from audio tape to written form, or in converting from an electronic original.

<correction>

specifies a set of correction practices applied in creating one or more components of the corpus.

<quotation>

specifies editorial practice adopted with respect to quotation marks in the original.

marks

indicates whether or not quotation marks are retained as tag content in the text.

NONE no quotation marks have been retained

SOME some quotation marks have been retained

ALL* all quotation marks have been retained

form

specifies how quotation marks are indicated within the text.

STD use of quotation marks has been standardized; open and close quote marks are distinct.

NONSTD open and close quote marks are represented indiscriminately by the

UNKNOWN* use of quotation marks is unknown.

<hyphenation>

summarizes the way in which end-of-line hyphenation in a source text has been treated in an encoded version of it.

<segmentation>

describes the principles according to which the text has been segmented, for example into sentences, tone-units, graphemic strata, etc.

<normalization>

specifies a set of normalization practices applied in creating one or more components of the corpus.

method

indicates whether normalization was made silently (without notation) or marked by including editorial tags.

TAGS normalization indicated with tags

SILENT* normalization made silently

3.4.4. The <tagsDecl> element

This element is used differently in corpus and in text headers. In the corpus header, it is used to list all the element names actually used within the corpus, together with a brief description of each element's function. In text headers, the same element is used to specify the number of SGML elements actually tagged within each text. In both cases it consists of a number of <tagUsage> elements, defined as follows:

<tagUsage>

supplies information about the usage of a specific element within the corpus or text with which this header is associated.

gi

the name (generic identifier) of the element indicated by the tag.

occurs

specifies the number of occurrences of this element within the text.

wsd

can be used on a <tagUsage> element to indicate that for every appearance of the described element in the text, the content defaults to the specified character set. Therefore the declaration

<tagUsage gi=term occurs=5 wsd="ISO 8859-5">

indicates that the content of all <term> elements is in the ISO 8859-5 character set.

Note that the global attribute lang can similarly be used in a <tagUsage> element to indicate that for every appearance of the described element in the text, the content defaults to the specified language.

In the corpus header, each <tagUsage> element contains a brief description of the element specified by its gi attribute; the occurs attribute is not supplied. In text headers, the <tagUsage> elements may be empty, but the occurs attribute is always supplied.

A typical written text has a tag declaration like the following:


            <tagsDecl>
               <tagUsage gi=name occurs=256>
               <tagUsage gi=div occurs=7>
               <tagUsage gi=head occurs=7>
               <tagUsage gi=p occurs=705>
               <tagUsage gi=reg occurs=2>
               <tagUsage gi=sic occurs=1>
               <tagUsage gi=body occurs=1>
            </tagsDecl>

A PERL script to automatically generate <tagUsage> elements with appropriate values for tags in any SGML text is available at

<URL: http://www.cs.vassar.edu/~priestdo/research/scripts/tagusage.txt>
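
For illustration, the kind of counting involved can be sketched in Python as follows. This is not the PERL script referred to above, and its behaviour may differ from it; the function name tag_usage is hypothetical. The sketch counts explicit start tags only, so elements whose start tags are omitted through SGML tag minimization are not counted, markup inside comments is not skipped, and element names are compared exactly as written.

```python
import re
from collections import Counter

# Matches the generic identifier of a start tag such as <p> or
# <div type=chapter>; end tags and declarations do not match.
_START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.\-]*)")


def tag_usage(sgml_text: str) -> str:
    """Return a <tagsDecl> element counting each start tag in the text."""
    counts = Counter(m.group(1) for m in _START_TAG.finditer(sgml_text))
    lines = ["<tagsDecl>"]
    for gi, occurs in sorted(counts.items()):
        lines.append(f"   <tagUsage gi={gi} occurs={occurs}>")
    lines.append("</tagsDecl>")
    return "\n".join(lines)
```

The output follows the form of the tag declaration shown above, with one <tagUsage> element per element name, listed alphabetically.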

3.4.5. The <refsDecl> element

This element is useful for encoding corpora since it provides information about references, which are often used in the alignment of parallel texts. In particular, it is common to use ID values on tags marking paragraphs and sentences as references in links associating two parallel texts. See, for example, the English-Norwegian Parallel Corpus Project and The Lingua Parallel Concordancing Project.

     <refsDecl>
          A reference system is built up using the identifiers of the 
          following text units: text, division, paragraph, s-unit.
          Each nested division has an identifier which is built up by 
          successively adding to the identifier of the text. Each  
          paragraph has an identifier which adds yet another layer to the
          immediately superordinate identifier. S-units are numbered  
          within the nearest division, as shown above. After alignment,  
          each s-unit in the core corpus has a "corresp"  
          attribute containing a reference to the corresponding unit(s) in  
          the parallel text. 
  
      </refsDecl>

3.4.6. The <classDecl> element

The following scheme outlines a means of defining a set of text categories for classifying texts in the corpus. A standardized set of text categories is under development by the EAGLES Corpus Working Group on Text Typology, which may eventually eliminate the need to explicitly provide a descriptive taxonomy in the corpus header.

The <classDecl> element contains the descriptive taxonomy used to classify texts within the corpus. It occurs once, in the corpus header, and consists of one or more <taxonomy> elements. The <taxonomy> element in turn contains either a set of <category> elements, each representing a particular textual classification feature and a value for that feature; or one of the elements <h.bibl> or <biblStruct>, providing a bibliographic citation for documentation of a categorization scheme, followed optionally by a set of <category> elements. The <h.bibl> element contains PCDATA only, for cases where only a very simple citation is required.

<taxonomy>: defines a typology used to classify texts.

<category>: contains an individual descriptive category or feature-value pair.

The global id attribute is required for the <category> element, since it is used to associate a <catRef> within a text header with the descriptive category appropriate to it. The category element contains a set of <catDesc> elements:

<catDesc>: describes a category within the text typology, in the form of a brief prose description.

The <catDesc> element is used to contain the value for a feature within a <category>, unless that category is further subdivided, in which case a nested <category> element may be used.

Within the <textClass> element of the header for each text, a <catRef> element is provided, the target attribute of which lists the identifiers of all <category> elements applicable to that text.

When a standard set of text categories is developed, it is anticipated that an attribute on <textClass> will provide the category. Unless the standard categories are extended, no pointer to <category> elements in the corpus header will be required.
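
The association between a <catRef> in a text header and the <category> elements in the corpus header can be sketched in Python as follows. The sketch is illustrative only: it assumes that both headers are available as well-formed XML, that multiple identifiers in the target attribute are separated by blanks, and that each <category> carries its description in a <catDesc> child. The function name resolve_categories is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def resolve_categories(class_decl: ET.Element,
                       cat_ref: ET.Element) -> List[str]:
    """Map each identifier in a <catRef> target to its <catDesc> text."""
    descriptions = {}
    # iter() also visits <category> elements nested in other categories.
    for category in class_decl.iter("category"):
        desc = category.find("catDesc")
        if desc is not None and desc.text:
            descriptions[category.get("id")] = desc.text.strip()
    # Assumption: identifiers in target are separated by white space.
    # An identifier with no matching <category> raises KeyError.
    return [descriptions[ident]
            for ident in cat_ref.get("target", "").split()]
```

A reference to an identifier that is not declared in the corpus header is treated as an error here, since every target is expected to point to a declared <category>.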

3.5. The Profile description

The third component of the header is the profile description. The <profileDesc> element has the following components:

<creation>: contains information about the origination of a text.
<langUsage>: groups information describing the languages, sublanguages, registers, dialects etc. represented within a text.
<wsdUsage>: groups information describing the character set(s) used within a text.
<textClass>: groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc.
<translations>: groups information about existing translations of the text.
<annotations>: groups information about existing annotation files associated with the text.

These components appear in individual text headers, since they describe features of particular texts.

3.5.1. The <creation> element

This element is used to record details concerning the origination of the text, whether or not covered elsewhere.

3.5.2. The <langUsage> element

This element contains one or more <language> elements, each identifying a language used in the text:

<language>

characterizes a language, sublanguage, register, dialect, etc., used within a single text.

iso639

gives the standard language code from ISO 639 in one of the following forms:

a two-letter code from ISO 639 (e.g., "en" for English);
a three-letter code from ISO 639-2 (e.g., "eng" for English);
one of the above extended by a country code from ISO 3166 (e.g., "en.uk" or "eng.uk" for English as spoken in the United Kingdom).

type

indicates the type of language, e.g., sublanguage, dialect, etc.

Example:


      <langUsage>
          <language id="fr" iso639="fr">French</language>
          <language id="en" iso639="en">English</language>
          <language id="la" iso639="la">Latin</language>
      </langUsage>

The value of the id attribute on any <language> element should be given as a value for the global lang attribute when it is used on a tag in the text or header to refer to this language. For example,

              She ate <foreign lang=fr>croissants</foreign>

When more than one character set is used in a text, the wsd attribute should be used on each <language> tag to associate the language with a particular character set.

3.5.3. The <wsdUsage> element

This element contains one or more <writingSystem> elements, each identifying a character set used in the text:

<writingSystem>: characterizes a character set used within a single text.

Example:

      <wsdUsage>
          <writingSystem id="ISO 8859-1">ISO character set for western 
                   European languages</writingSystem>
          <writingSystem id="ISO 8859-5">ISO character set for 
                   Cyrillic</writingSystem>
      </wsdUsage>

The value of the id attribute on any <writingSystem> element should be given as a value for the global wsd attribute when it is used on a tag in the text or header to refer to this character set. For example,

       This is a patch of Cyrillic: 
       <foreign lang=bu wsd="ISO 8859-5">


       </foreign>

When a Writing System Declaration describing a transcription scheme is provided as an auxiliary document, the value of the wsd attribute on the <writingSystem> element must be an entity that points to this document. Usually, the entity expands to be the name of the file in which the Writing System Declaration is stored. Note that for this reason, the type of the wsd attribute on the <writingSystem> element is ENTITY (indicating that its value must be an SGML entity). In all other instances, whether in the header or text, the type of the wsd attribute is CDATA.

3.5.4. The <textClass> element

This element contains references to the text classification scheme and descriptive keywords which together describe the text concerned. The following elements are used for these purposes:

<catRef>

specifies one or more defined categories within some taxonomy or text typology.

target: identifies the text category or categories, by means of an IDREF pointing to one or more <category> elements defined in the corpus header.
scheme: identifies the classification scheme.

<h.keywords>

contains a list of keywords or phrases identifying the topic or nature of a text, each of which is tagged as a term. A standard list will be provided by EAGLES/PAROLE.

The <h.keywords> element contains one or more technical terms:

<keyTerm> contains a technical term or phrase, particularly in a list of descriptive keywords.
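
A minimal sketch of a <textClass> element is given below. The category identifier and the keywords are invented for illustration; the sketch assumes that a <category> element with the id "fiction" is defined in the corpus header and that, as in the TEI, <catRef> is an empty element:

      <textClass>
          <catRef target="fiction">
          <h.keywords>
              <keyTerm>dystopia</keyTerm>
              <keyTerm>totalitarianism</keyTerm>
          </h.keywords>
      </textClass>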

3.5.5. The <translations> element

This element groups information about translations of the text which exist, usually within the same corpus. The following elements are used for these purposes:

<translation>

gives information about a translation of the text. The global lang attribute and the wsd attribute are required on this tag. Additionally, this tag has the following optional attribute:

trans.loc: provides information (path/file name, URL, etc.) about the location of the translation.

Note that endtag omission is allowed for the <translation> element, since in some cases all relevant information is supplied in attributes only. Thus, where appropriate, this element can function as an empty element, e.g.:

      <translation trans.loc="1984.sl.ces" lang=sl wsd="ISO8859-1" n=1>
      <translation trans.loc="1984.es.ces" lang=es wsd="ISO8859-1" n=2>
      <translation trans.loc="1984.ro.ces" lang=ro wsd="ISO8859-1" n=3>
      ...

<translator>: gives the name of the translator.

3.5.6. The <annotations> element

This element groups information about annotation documents associated with the text. The following elements are used for these purposes:

<annotation>

gives information about an annotation file associated with the text. Attributes:

type

indicates the type of annotation. Values include:

SEGMENT annotation file contains segmentation into sentences and words.

GRAM annotation file contains morpho-syntactic category information for the words in the text.

ALIGN annotation file contains alignment links to a parallel translation.

ann.loc

provides information (path/file name, URL, etc.) about the location of the annotation file.

trans.loc

for annotation files containing alignment information, provides information (path/file name, URL, etc.) about the location of the file containing the aligned text.

Note that endtag omission is allowed for this element, since in some cases all relevant information is supplied in attributes only. Thus, where appropriate, this element can function as an empty element (as for <translation>, shown above).
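
By analogy with the <translation> example, a text accompanied by three annotation files might be described as follows. The file names are hypothetical, and each <annotation> is written as an empty element because all of its information is carried by attributes:

      <annotations>
          <annotation type=SEGMENT ann.loc="1984.en.seg">
          <annotation type=GRAM ann.loc="1984.en.gram">
          <annotation type=ALIGN ann.loc="1984.en-sl.align"
                      trans.loc="1984.sl.ces">
      </annotations>

Only the ALIGN annotation carries trans.loc, since that attribute applies to files containing alignment information.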

3.6. The revision description

The revision description is the fourth element in the header. It is used to record details of any significant change to the corpus. The <revisionDesc> element has the following component:

<change>: summarizes a particular change or correction made to a particular version of an electronic text which is shared between several researchers.

Multiple <change> elements are allowed; one should appear for each change.

Unlike its counterpart in the TEI scheme, the <change> element must here contain the following elements:

<changeDate>

gives the date of the change.

value: specifies a standard value for this date, in ISO 8601 format.

<respName>

specifies the person responsible for the change.

<h.item>

specifies the nature of the change(s). One or more occurrences of this element may appear within each <change> element.

When any significant change is made to any component of the corpus, the following steps should be taken:

a <change> element is added to the <revisionDesc> of the text affected
the update attribute of the text header is changed to the date of the change
the value of the status attribute of the text header is set to UPDATE
the revision number specified on the version attribute of the <editionStmt> of the corpus header is incremented
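
As an illustration, a <change> element recording one such revision might look like the following sketch. The date, name, and descriptions are invented for the example:

      <revisionDesc>
          <change>
              <changeDate value="1996-08-10">10 August 1996</changeDate>
              <respName>A. Student</respName>
              <h.item>Corrected typing errors in the first division</h.item>
              <h.item>Added FOREIGN tags for Latin phrases</h.item>
          </change>
      </revisionDesc>

The update and status attributes of the text header and the version attribute of the corpus header's <editionStmt> would be changed at the same time, as described in the steps above.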

3.7. Use of decls attribute

The decls attribute is specified for the element <body> and the larger division elements (<div>).

It is used for two purposes:

to supply a specific title for parts of composite works;
to specify encoding or other declarations applicable to all or part of a text where a number of possibilities have been provided for in the header.

Its value is a list of identifiers, each of which has been supplied elsewhere in a text or corpus header as the identifier for one of the following elements: <biblStruct>, <editorialDecl> and its constituents (<correction>, <hyphenation>, <quotation>, <segmentation> and <transduction>), and <textClass>.

For these elements, the corpus header will normally contain several mutually incompatible options, for example, several editorial declarations. Individual texts, or portions of texts, specify explicitly which of the available options applies to them by using the decls attribute. In cases where the set of declarable elements applies only within portions of a single text, they will be specified in the text header rather than the corpus header.

Declarable elements, once specified, are inherited by all sub-components. That is, if the decls attribute of a <body> element specifies a particular value for some declarable element, that value is understood to apply to all components of the text unless over-ridden. If the decls attribute of a <div> within that text specifies a different value, the new value applies to the contents of that <div> only; the value specified by the <body> applies to all subsequent <div> elements in the same text, unless they also specify a different decls value.
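
The inheritance rule can be sketched as follows. The identifiers seg.par and seg.sent are hypothetical, and the sketch assumes that the header supplies two alternative <segmentation> declarations carrying these identifiers:

      <segmentation id="seg.par">Marked up to the level of
               paragraph.</segmentation>
      <segmentation id="seg.sent">Marked up to the level of
               sentence.</segmentation>

The text then selects among them:

      <body decls="seg.par">
          <div> ... </div>
          <div decls="seg.sent"> ... </div>
          <div> ... </div>
      </body>

Here the first and third <div> elements inherit seg.par from the <body>, while the second <div> overrides it with seg.sent for its own contents only.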

For non-declarable elements, the header of an individual text will specify only those respects (if any) in which it differs from the defaults stated in the corpus header.

This is a simplification of the decls mechanism described in the TEI Guidelines.

3.8. Example


  <cesHeader version="2.0">
      <fileDesc>
           <titleStmt>
               <h.title>Machine-readable version of 1984, ch. 1</h.title>
               <respStmt>
                    <respType>typed in and marked with CES tags </respType>
                    <respName>A. Student</respName>
               </respStmt>
           </titleStmt>
           <extent>
               <wordcount>6571 </wordcount>
               <bytecount units="bytes">6571 </bytecount>
           </extent>
           <publicationStmt>
               <distributor>Laboratoire Parole et Langage, CNRS</distributor>
               <pubAddress>29, avenue Robert Schuman
                        Aix-en-Provence, France</pubAddress>
               <telephone>+33 42 95 36 33</telephone>
               <fax>+33 42 59 50 96</fax>
               <eAddress>phonetic@univ-aix.fr</eAddress>
               <availability status=restricted>
                   internal use only--cannot be distributed</availability>
               <pubDate>6571</pubDate>
           </publicationStmt>
           <sourceDesc>
               <biblStruct>
                    <monogr>
                         <h.title>Nineteen Eighty-four</h.title>
                         <h.author>George Orwell</h.author>
                         <imprint>
                              <pubPlace>New York</pubPlace>
                              <publisher>New American Library</publisher>
                              <pubDate>1949; reprinted 1961</pubDate>
                         </imprint>
                    </monogr>
               </biblStruct>
           </sourceDesc>
      </fileDesc>
      <encodingDesc>
           <projectDesc>
             This English version of the first chapter of Orwell's 1984 is 
             encoded for use in the MULTEXT-EAST project. The English is  
             to serve as the base for the parallel corpus, and will be aligned 
             to versions of the text in Romanian, Bulgarian, Estonian,  
             Slovenian, Czech, and Hungarian.
           </projectDesc>
           <editorialDecl>
               <conformance level=1>CES Level 1</conformance>
               <correction status=medium method=silent></correction>
               <quotation marks=none form=std>Rendition attribute values on Q 
                     and QUOTE tags are adapted from ISOpub and ISOnum standard 
                     entity set names
               </quotation>
               <segmentation>Marked up to the level of paragraph plus 
                     marking of particular sub-paragraph elements: NAME, DATE, 
                     FOREIGN.
               </segmentation>
           </editorialDecl>
           <tagsDecl>
               <tagUsage gi=body occurs=1></tagUsage>
               <tagUsage gi=date occurs=5></tagUsage>
               <tagUsage gi=div occurs=2></tagUsage>
               <tagUsage gi=foreign occurs=4></tagUsage>
               <tagUsage gi=hi occurs=4></tagUsage>
               <tagUsage gi=name occurs=149></tagUsage>
               <tagUsage gi=note occurs=1></tagUsage>
               <tagUsage gi=num occurs=2></tagUsage>
               <tagUsage gi=p occurs=41></tagUsage>
               <tagUsage gi=ptr occurs=1></tagUsage>
               <tagUsage gi=q occurs=22></tagUsage>
               <tagUsage gi=quote occurs=3></tagUsage>
           </tagsDecl>
      </encodingDesc>
      <profileDesc>
           <langUsage>
               <language id="fr" iso639="fr">French</language>
               <language id="en" iso639="en">English</language>
               <language id="la" iso639="la">Latin</language>
               <language id="ns">Newspeak</language>
           </langUsage>
      </profileDesc>
  </cesHeader>

3.9. The CES Header element definitions

3.10. The CES Header element definitions in hypertext navigable format
