Introductory course at ESSLLI 2002

Annotation of Language Resources

Lecture I.

The XML Recommendation

Tomaž Erjavec
Department of Intelligent Systems
Institute Jožef Stefan
Jamova 39, SI-1000 Ljubljana


This lecture introduces the Extensible Markup Language, XML, and discusses the motivation for its development, its history and its building blocks, i.e. elements, attributes and entities, and how they fit together.

1. XML

The first part of the tutorial deals with XML, the Extensible Markup Language. This part of the tutorial is modelled after A Gentle Introduction to XML, a chapter of the TEI Guidelines.

1.1. Introduction to XML

1.1.1. What is XML?

  • XML is a definition of device-independent, system-independent methods of storing and processing texts in electronic form
  • XML is a “metalanguage” -- a language for describing other languages -- which lets you design your own customised markup languages for different types of documents
  • XML is a project of the World Wide Web Consortium (W3C), and the development of the specification is being supervised by their XML Working Group; hence, it is an open and non-proprietary specification
  • XML is a subset of SGML (Standard Generalized Markup Language), the international standard metalanguage for text markup systems (ISO 8879)

1.1.2. What is a markup language?

markup (equivalently, encoding)
making explicit an interpretation of text
markup language
a set of markup conventions used together for encoding texts.
A markup language must specify:
  • how markup is to be distinguished from text,
  • what the markup means,
  • what markup is allowed,
  • what markup is required

1.1.3. What is XML used for?

  • XML is designed to improve the functionality of the Web by providing more flexible and adaptable information identification.
  • However, XML is not just useful for Web pages; below is a list of uses traditionally given to SGML:
    • managing large amounts of valuable, (predominately) textual data;
    • ensuring longevity of the texts;
    • enabling interchange of data between computer platforms;
    • allowing multiple exploitation of texts.

1.1.4. XML and NLP

In Natural Language Processing, XML is increasingly used to store language resources and, more recently, to act as an inter-process format:
Text annotation was the first area where SGML really became accepted (text analysis is annotation); to a large extent this was due to the Text Encoding Initiative Guidelines.
Among lexical resources, SGML was first used for machine readable dictionaries, and has been very popular for encoding terminological databases (work by ISO and LISA); a recent initiative is the Open Lexicon Interchange Format.
Ontologies have been given a strong impetus by the “Semantic Web” buzz, esp. with the introduction of DAML+OIL.
Language Technology toolsets and workbenches
More and more LT software is XML aware, i.e. it allows import and export of XML annotated data (MATE, GATE) or uses XML for interprocess communication (LTG tools).

1.1.5. History of SGML: the evolution of the standard

In 1969, Charles Goldfarb led an IBM research project on integrated law office information systems. With E. Mosher and R. Lorie he invented the Generalized Markup Language (GML) as a means of allowing text editing, formatting, and information retrieval subsystems to share documents.
The first working draft of the SGML standard was published in 1980 by ANSI. By 1983, the sixth working draft was recommended as an industry standard (GCA 101-1983). Major adopters included the US IRS and DoD.
A draft ISO standard was published in October 1985, and was adopted by the Office of Official Publications of the EU. Another year of review and comment resulted in the final text, which was published in record time after approval:
ISO 8879: Information processing---Text and office systems---Standard Generalized Markup Language (SGML), ([Geneva]: ISO, 1986).

1.1.6. History of SGML/XML: the '90s and into the new millennium

The most famous SGML application, HTML, first version released in 1992.
Over the years various Technical Corrigenda to ISO 8879 were made, the most important probably being the Web SGML Adaptations Annex in 1997.
W3C Recommendation on XML 1.0 released February 1998. (2nd edition 6 October 2000 corrects some errata)
A number of companion W3C Recommendations released since then:
  • DOM Level 1 V1.0 (October 1998)
  • XML Namespaces V1.0 (January 1999)
  • XPath V1.0 (November 1999)
  • XSLT V1.0 (November 1999)
  • XHTML V1.0 (January 2000)
  • XML Schema V1.0 (May 2001)
  • XLink V1.0 (June 2001)
  • XPointer V1.0 (September 2001)
  • XSL V1.0 (October 2001)
  • XML Information Set V1.0 (October 2001)
  • XPath 2.0 WD (April 2002)
  • etc., etc.

1.1.7. Characteristics of XML: descriptive markup

  • Markup codes categorise parts of a document; they do not tell what processing is to be carried out at particular points in a document (procedural markup). Compare:
    • “the following item is a paragraph”
    • “skip down one line, move 5 quads right”
  • In XML, instructions needed to process a document for some particular purpose (for example, to format it) are sharply distinguished from the descriptive markup which occurs within the document. Usually, they are collected outside the document in so-called stylesheets.

1.1.8. Characteristics of XML: document types

  • Documents are regarded as having types; the type of a document is formally defined by its constituent parts and their structure. A Document Type Definition will specify which elements are allowed, which required, what order they must appear in, how they are nested, etc. An XML document type is also known as an XML application.
  • If documents are of known types, a validating parser, provided with a definition of a document's type, can check that any document claiming to be of that type does in fact conform to the specification. Such XML documents are valid; those that do not have a known document type, but only conform to XML syntax, are well-formed.
  • Different documents of the same type can be processed in a uniform way. Programs can be written which take advantage of the knowledge encapsulated in the document structure information, and which can thus behave in a more intelligent fashion. This is especially useful for XML authoring environments and development work.

1.1.9. Characteristics of XML: data independence

  • XML documents, whatever language or writing system they employ, use the same underlying universal character set known as Unicode.
  • For technical and historical reasons, it is often necessary to translate Unicode encoded texts into some smaller or less general encoding scheme.
  • XML provides a general purpose mechanism for string substitution, inherited from SGML, i.e. a simple machine-independent way of stating that a particular string of characters in the document should be replaced by some other string when the document is processed. The strings defined by this string-substitution mechanism are called entities.

1.1.10. XML vs. HTML vs. SGML

Difference between XML and HTML:
  • XML is extensible: it does not contain a fixed set of tags;
  • XML focuses on the meaning of data, not its presentation;
  • XML documents must be well-formed according to a defined syntax, and may be formally validated.
Difference between XML and SGML:
  • many things that are parametrisable in SGML are fixed in XML, e.g. tag omission, the shape of tags;
  • many things that are allowed in SGML are disallowed in XML, e.g. certain combinations of elements;
  • in SGML each document must have assigned a document type; this is not necessary in XML.

1.2. XML Structures

1.2.1. Elements

... Rosalind's remarks <quote>This is the silliest stuff
that ere I heard of!</quote> clearly indicate ...

  • < is the tag open delimiter
  • > is the tag close delimiter
  • quote is a generic identifier
  • <quote> is a start tag
  • </quote> is an end tag
  • <quote>This is ... heard of!</quote> is an element
  • This is ... heard of! is the content of the element

1.2.2. Content models

   <anthology>
      <poem><title>The SICK ROSE</title>
         <stanza>
           <line>O Rose thou art sick.</line>
           <line>The invisible worm,</line>
           <line>That flies in the night</line>
           <line>In the howling storm:</line>
         </stanza>
         <stanza>
           <line>Has found out thy bed</line>
           <line>Of crimson joy:</line>
           <line>And his dark secret love</line>
           <line>Does thy life destroy.</line>
         </stanza>
      </poem>
      <!-- more poems go here    -->
   </anthology>

Characteristics of a well-formed XML document:
  • there should be a single element containing the whole document: this is known as the root element;
  • each element, except root, should be completely contained in some other element; i.e. elements may not partially overlap one another;
  • the tags marking the start and end of each element must always be present.
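
The three well-formedness conditions above can be tried out with any XML parser. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

ok = "<poem><title>The SICK ROSE</title><line>O Rose thou art sick.</line></poem>"
overlap = "<poem><title>The SICK ROSE<line></title>O Rose</line></poem>"
no_end_tag = "<poem><line>O Rose thou art sick.</poem>"
two_roots = "<poem/><poem/>"

for sample in (ok, overlap, no_end_tag, two_roots):
    print(is_well_formed(sample), sample)
```

Only the first sample passes: the others break, in turn, the nesting rule, the end-tag rule and the single-root rule.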

1.2.3. Validating document structure

A valid XML document must be well-formed, and must also contain or make reference to a formal statement of its element grammar, and conform to this statement.
The conformity is checked by a validating parser. The ability to perform such validation is one of the key advantages of using XML.
The formal statement of the document type grammar may be provided by:
  • a Document Type Definition (DTD) or by
  • an XML schema.

1.2.4. The Document Type Definition

The DTD mechanism is inherited from SGML and uses special syntax to specify the element grammar of the document. Note that in SGML the DTD syntax and types of allowed combinations were much richer than in XML.
The Document Type Definition consists of:
the formal part
formally specifies what markup can appear in the document type and in what relation to other markup and data
the documentation
specifies what the defined elements mean: <date>Joe</date> can be formally OK, but is nonsense.
The formal DTD consists of:
  1. Element declarations;
  2. Attribute declarations;
  3. Entity specifications;

1.3. Elements

1.3.1. Element declarations: an example

An example DTD:

<!ELEMENT anthology   (poem+)>
<!ELEMENT poem        (title?, stanza+)>
<!ELEMENT title       (#PCDATA) >
<!ELEMENT stanza      (line+)   >
<!ELEMENT line        (#PCDATA) >

Each element declaration is composed of:
  • <! is the markup declaration open delimiter, used for meta-level constructs, e.g. declarations, comments, etc.
  • ELEMENT is the declaration type keyword; other keywords are ATTLIST and ENTITY
  • anthology is the generic identifier; in XML, GIs are case sensitive
  • (poem+) is the content model; reserved words start with #, e.g. #PCDATA means "parsed character data"
  • > is the markup declaration close delimiter

1.3.2. Content model operators

  • ( open bracket for grouping
  • ) close bracket
  • , follows
  • | or
  • ? maybe
  • * repeated 0 or more times
  • + repeated once or more times
A more complex poem type:

<!ELEMENT poem 
          ( title?,
            ( (line+ )
            | (refrain?, (stanza, refrain?)+)
            )
          ) >
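
The content model operators listed above behave like regular expression operators applied to sequences of child elements. The sketch below, in Python, translates the two poem models by hand into regular expressions and checks sequences of child names against them. It is only an illustration of the idea, not how a validating parser is implemented; the names MODELS and matches are hypothetical.

```python
import re

# Hand-translated content models. Each child name is followed by a space
# so that adjacent names cannot run into one another.
MODELS = {
    # (title?, stanza+)
    "simple": re.compile(r"(title )?(stanza )+"),
    # (title?, ((line+) | (refrain?, (stanza, refrain?)+)))
    "complex": re.compile(
        r"(title )?((line )+|(refrain )?(stanza (refrain )?)+)"
    ),
}

def matches(model, children):
    """Check a sequence of child element names against a named model."""
    sequence = "".join(name + " " for name in children)
    return MODELS[model].fullmatch(sequence) is not None

print(matches("simple", ["title", "stanza", "stanza"]))           # True
print(matches("simple", ["stanza", "title"]))                     # False
print(matches("complex", ["line", "line"]))                       # True
print(matches("complex", ["refrain", "stanza", "refrain", "stanza"]))  # True
print(matches("complex", ["line", "stanza"]))                     # False
```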

1.3.3. Constraints on content models

Mixed content
If an element contains #PCDATA and element content, #PCDATA must always appear as the first option in an alternation; the group containing it must use the star operator; it may appear once only, and in the outermost model group.

<!ELEMENT listitem1  (#PCDATA | para)*>           <!-- OK -->
<!ELEMENT listitem2  (#PCDATA | para | note)*>    <!-- OK -->

<!ELEMENT listitem3  (para | #PCDATA)*>           <!-- WRONG! -->
<!ELEMENT listitem4  (#PCDATA | para)>            <!-- WRONG! -->
<!ELEMENT listitem5  (#PCDATA | para | #PCDATA)>  <!-- WRONG! -->
<!ELEMENT listitem6  (para | (#PCDATA | note)*)>  <!-- WRONG! -->

Content model ambiguity
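XML parsing is deterministic, so content models must not be ambiguous: on seeing an element, the parser must be able to tell which part of the model it matches without looking ahead.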

<!ELEMENT x (a, (b | c)   )>  <!-- OK -->
<!ELEMENT x ((a, b)|(a, c))>  <!-- WRONG! -->

1.3.4. Empty content

Normally, elements are containers, but they can also be points in the document. To distinguish empty elements from those with content in well-formed XML documents, they have a special form: the tag ends with a slash.
In the DTD:
<!ELEMENT pageBreak EMPTY>
In the document:

The page ends here.
<pageBreak/>
Here starts a new one.

Note that for XML processors, <gi/> and <gi></gi> are identical. Also note that neither of these forms is by default permitted for elements declared as EMPTY in SGML, where empty elements are represented by a start-tag in isolation, unless the SGML declaration has been modified to permit the first XML style cited above. Conversion of the way empty elements are represented is thus usually necessary when processing SGML legacy data in an XML environment.
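
That the two forms of an empty element are equivalent for an XML processor can be checked directly. A small sketch, assuming Python's standard xml.etree.ElementTree; the enclosing p element and the helper describe are chosen here for illustration only.

```python
import xml.etree.ElementTree as ET

short = ET.fromstring(
    "<p>The page ends here.<pageBreak/>Here starts a new one.</p>")
long_ = ET.fromstring(
    "<p>The page ends here.<pageBreak></pageBreak>Here starts a new one.</p>")

def describe(p):
    """Summarise what the parser reports for the pageBreak element."""
    brk = p.find("pageBreak")
    return (brk.tag, brk.text, len(brk), brk.tail)

print(describe(short))
print(describe(short) == describe(long_))   # True
```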

1.4. Attributes

1.4.1. Introducing attributes

Attributes are used to describe information which is in some sense descriptive of a specific element occurrence but not regarded as part of its content.
<table id="P1" status='revised'> ... </table>
  • attribute values are supplied in the document instance as attribute-value pairs inside the start-tag for the element occurrence;
  • in XML, the value must always be given inside matching quotation marks, either single or double;
  • the order in which attribute-value pairs are supplied inside a tag has no significance;
  • even if different elements have attributes with the same name they are always regarded as different and may have different types of values assigned to them.
  • an XML processor can use the values of the attributes in any way it chooses; the id attribute is a slightly special case in that, by convention, it is always used to supply a unique value to identify a particular element occurrence, which may be used for cross reference purposes.

1.4.2. Declaring attributes

Attributes are declared in the DTD. As well as specifying its name and the element to which it is to be attached, it is possible to specify (within limits) what kind of value is acceptable for an attribute and a default value:

<!ATTLIST table
          id       ID                        #REQUIRED
          type     CDATA                     #IMPLIED
          status   (draft|revised|published) "draft"
          version  NMTOKEN                   #FIXED "1.0" >

Declared values:
  • CDATA: any valid character data;
  • NMTOKEN: composed only of characters legal for defining elements and attributes
  • ID: a unique identifier ; IDREF: a reference to an ID
  • ENTITY: the name of an (unparsed) entity declared in the DTD
  • NMTOKENS, IDREFS, ENTITIES: a list of whitespace separated values.
Default values:
  • explicit default value, as in the example above;
  • keyword #IMPLIED: a value need not be supplied;
  • keyword #REQUIRED: a value must be supplied;
  • keyword #FIXED: the value, if present, must be the default.
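
Default values are filled in by the parser, so an application sees them even when the document does not give them. The sketch below assumes Python's standard xml.parsers.expat module, which does not validate but does read ATTLIST declarations placed in the internal DTD subset; the function name start_tag_attributes is hypothetical.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE table [
<!ATTLIST table
          id       ID                        #REQUIRED
          type     CDATA                     #IMPLIED
          status   (draft|revised|published) "draft"
          version  NMTOKEN                   #FIXED "1.0" >
]>
<table id="P1">...</table>"""

def start_tag_attributes(doc):
    """Return the attributes reported for the first start-tag."""
    reported = []
    parser = xml.parsers.expat.ParserCreate()
    # The handler receives specified and defaulted attributes alike.
    parser.StartElementHandler = lambda name, attrs: reported.append(dict(attrs))
    parser.Parse(doc, True)
    return reported[0]

print(start_tag_attributes(DOC))
```

Only id is written in the document; status and version come from the declaration, and the #IMPLIED attribute type is simply absent.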

1.4.3. Identifiers

ID and IDREF(S) are a mechanism built into XML used for cross-referencing between elements inside a document:
In the DTD:

<!ATTLIST poem     id       ID     #IMPLIED  >

<!ATTLIST poemRef  target   IDREF  #REQUIRED >

In the document:

<poem id="p001"><title>The SICK ROSE</title>
   ...text of the poem...
</poem>
<!-- more poems go here -->

<!-- and somewhere else in the document: -->

Blake's poem on the sick rose <poemRef target="p001"/>

A validating XML parser will check that:
  • each ID value is unique in the document
  • each IDREF value refers to an existing ID
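
The two checks can be imitated by hand when no validating parser is available. A minimal sketch in Python, assuming (since the DTD is not consulted) that the ID attribute is called id and the IDREF attribute target, as in the declarations above; the note element and the reference to p999 are invented for the example.

```python
import xml.etree.ElementTree as ET

def check_references(root, id_attr="id", ref_attr="target"):
    """Return (duplicate IDs, dangling references) found below root."""
    seen, duplicates = set(), set()
    for element in root.iter():
        value = element.get(id_attr)
        if value is None:
            continue
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    dangling = {
        element.get(ref_attr)
        for element in root.iter()
        if element.get(ref_attr) is not None
        and element.get(ref_attr) not in seen
    }
    return duplicates, dangling

doc = ET.fromstring("""
<anthology>
  <poem id="p001"><title>The SICK ROSE</title></poem>
  <note>Blake's poem on the sick rose <poemRef target="p001"/>
        and a missing one <poemRef target="p999"/></note>
</anthology>""")
print(check_references(doc))   # (set(), {'p999'})
```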

1.5. Entities

1.5.1. Introducing entities

XML provides a method of encoding and naming arbitrary parts of the content of a document in a portable way. In XML the word entity has a special sense: it means a named part of a marked up document, irrespective of any structural considerations. An entity might be a string of characters or a whole file of text. Entities are declared in a DTD in the same way as elements or attributes, and they are included in a document, or even in a DTD, using a construction known as an entity reference.
Entities are divided into:
General entities
used in the document instance, and further divided into:
  • internal entities
  • external entities
Parameter entities
used in the DTD, also divided into:
  • internal parameter entities
  • external parameter entities

1.5.2. Entity use

An example:
  • in the DTD:
    <!ENTITY xml-url "">
    <!ENTITY xml-ref "<A href='&xml-url;'>&xml-url;</A>">
  • in the document:
    <hint>Read about XML at 
    &xml-ref;.</hint>
  • after processing:
    <hint>Read about XML at 
    <A href=''></A>.</hint>
  • entities can contain elements, but only complete ones; this is wrong:
    <!ENTITY xml-ref "<div type='chapter' level='1' rend='bold'>">
  • entities can contain entity references, but not recursive ones; this is wrong:
    <!ENTITY mirror "This is a &rorrim;">
    <!ENTITY rorrim "This is a &mirror;">
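
Entity substitution as described above can be observed with a non-validating parser, provided the declarations sit in the internal subset. The following Python sketch uses the standard xml.etree.ElementTree module; the URL is a placeholder chosen for this example, and the helper parses is hypothetical.

```python
import xml.etree.ElementTree as ET

# The URL below is a placeholder, not the one used in the lecture.
DOC = """<!DOCTYPE hint [
  <!ENTITY xml-url "http://www.example.org/xml">
  <!ENTITY xml-ref "<A href='&xml-url;'>&xml-url;</A>">
]>
<hint>Read about XML at &xml-ref;.</hint>"""

hint = ET.fromstring(DOC)
link = hint.find("A")          # the element came from the entity text
print(hint.text, link.get("href"), link.text, link.tail)

RECURSIVE = """<!DOCTYPE hint [
  <!ENTITY mirror "This is a &rorrim;">
  <!ENTITY rorrim "This is a &mirror;">
]>
<hint>&mirror;</hint>"""

def parses(text):
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(parses(RECURSIVE))       # False: recursive entity reference
```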

1.5.3. Entities and characters

A special type of entity is the character entity, used for cases where the system supports a character but it cannot be entered directly. Instead, a character reference is given, which starts with &# followed by the decimal number of the character, or with &#x followed by its hexadecimal number, and ends with a semicolon, e.g. &#259; or &#x103; for ă (LATIN SMALL LETTER A WITH BREVE).
Characters can also be encoded as general entities with descriptive names. A number of such entity sets for XML (derived from SGML entity sets) already exist, and can be imported into XML files.

<!-- iso-lat2.ent (initially distributed with DocBook XML DTD) -->
<!-- Derived from the corresponding ISO 8879 standard entity set
     and Unicode character mappings provided by Sebastian Rahtz -->

<!ENTITY abreve "&#x0103;"> <!--LATIN SMALL LETTER A WITH BREVE-->
<!ENTITY amacr  "&#x0101;"> <!--LATIN SMALL LETTER A WITH MACRON-->

With such definitions we can then write a Romanian text in 7-bit ASCII:

&Icirc;ntr-o zi senin&abreve; &scedil;i friguroas&abreve; de aprilie,
pe c&acirc;nd ceasurile b&abreve;teau ora treisprezece, Winston Smith,
cu b&abreve;rbia &icirc;nfundat&abreve; &icirc;n piept pentru a
sc&abreve;pa de v&acirc;ntul care-l lua pe sus, se strecur&abreve;
iute prin u&scedil;ile de sticl&abreve; ale Blocului Victoria,...

This would be displayed as:
Într-o zi senină şi friguroasă de aprilie,
pe când ceasurile băteau ora treisprezece, Winston Smith,
cu bărbia înfundată în piept pentru a
scăpa de vântul care-l lua pe sus, se strecură
iute prin uşile de sticlă ale Blocului Victoria,...
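
The correspondence between characters, numeric references and named entities can be checked in a few lines. This is a sketch in Python using only the standard library; the element name w is made up, and the abreve declaration is copied from the entity set excerpt above.

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal references denote the same character.
word = ET.fromstring("<w>senin&#259; senin&#x103;</w>").text
print(word)

# The opposite direction: squeezing Unicode text into 7-bit ASCII.
ascii_bytes = "senin\u0103".encode("ascii", "xmlcharrefreplace")
print(ascii_bytes)             # b'senin&#259;'

# A named entity works once it has been declared, here in the
# internal subset rather than in an imported entity set.
DOC = """<!DOCTYPE w [ <!ENTITY abreve "&#x0103;"> ]>
<w>friguroas&abreve;</w>"""
named = ET.fromstring(DOC).text
print(named)
```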

1.5.4. Predefined entities

Note that entities are also the only way to include the characters < and & in XML documents, as they have special meaning for the parser.
For well-formed XML documents (i.e. those without a DTD) the following entities are taken as predefined by XML processors:

<!ENTITY lt     "&#38;#60;">     <!-- less than,    < -->
<!ENTITY gt     "&#62;">         <!-- greater than, > -->
<!ENTITY amp    "&#38;#38;">     <!-- ampersand,    & -->
<!ENTITY apos   "&#39;">         <!-- apostrophe,   ' -->
<!ENTITY quot   "&#34;">         <!-- quote,        " -->
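
In practice the predefined entities are rarely typed by hand; software escapes the special characters when writing XML and the parser restores them when reading. A small Python sketch with the standard xml.sax.saxutils helpers; the code element and the sample string are invented.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = "if a < b && c > d"
escaped = escape(raw)          # replaces &, < and >
print(escaped)                 # if a &lt; b &amp;&amp; c &gt; d

# The parser turns the references back into the original characters.
element = ET.fromstring("<code>" + escaped + "</code>")
print(element.text == raw)     # True
print(unescape(escaped) == raw)  # True
```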

1.5.5. External entities

External entity references are substituted by the contents of files. External entities are referred to by SYSTEM and, optionally, PUBLIC identifiers:

<!ENTITY Chap1  SYSTEM "P4X/p4chap2.xml">
<!ENTITY Chap2  SYSTEM "">
<!ENTITY Chap3  PUBLIC "-//TEI//TEXT Guidelines Chapter on XML//EN"
  • In XML, external entities must have a SYSTEM identifier, the value of which is a URI (Uniform Resource Identifier)
    [a URI identifies a resource by meta-information of any kind; in contrast, a URL locates a resource on the net, which means if you have a URL and the appropriate protocol you can retrieve the resource.]
  • Mapping from PUBLIC to SYSTEM identifiers is performed via the catalog file, which the XML processor must be aware of;
  • PUBLIC identifiers have a formalised structure, where fields are separated by //:
    1. designates the registering body (e.g. ISO); if none, then value is -
    2. identifies the owner of the PUBLIC identifier
    3. gives a) the type of the entity referred to (e.g. TEXT, DTD, ENTITIES) and b) a descriptive name of the entity
    4. gives the ISO 639 language code for the human language in which the entity is written
External entities are referenced in the document just as internal ones are:

<body> &Chap1; &Chap2; &Chap3; </body>
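
The four-field structure of PUBLIC identifiers described above is easy to take apart mechanically. The following Python sketch splits an identifier on // and separates the entity type from the descriptive name; the class PublicId and the function parse_public_id are hypothetical names, and no error handling for malformed identifiers is attempted.

```python
from typing import NamedTuple

class PublicId(NamedTuple):
    registration: str   # "-" when there is no registering body
    owner: str
    text_class: str     # e.g. TEXT, DTD, ENTITIES
    description: str
    language: str       # ISO 639 code

def parse_public_id(fpi: str) -> PublicId:
    """Split a PUBLIC identifier into its fields."""
    registration, owner, text, language = fpi.split("//")
    text_class, _, description = text.partition(" ")
    return PublicId(registration, owner, text_class, description, language)

print(parse_public_id("-//TEI//TEXT Guidelines Chapter on XML//EN"))
```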

1.5.6. Unparsed entities and notations

Entities may contain non-textual data, e.g. digitised pictures or video. When such entities are declared it is essential to indicate that they contain data which an XML parser or processor cannot handle in the same way as the surrounding data:

<!ENTITY fig1 SYSTEM "figure1.png" NDATA png>

<!NOTATION png PUBLIC
    '-//TEI//NOTATION IETF RFC2083 Portable Network Graphics//EN'>

  • NDATA keyword indicates the entity contains unparsed data
  • png is the (unique) name of a defined notation, which indicates to the processor how to treat the external entity
  • the notation is defined with its own declaration and is given a SYSTEM or PUBLIC identifier
The unparsed entity must be referenced in an ENTITY attribute value, not in content:
  • this is OK:
    This is illustrated in the following Figure: <image src="fig1"/>
  • this is WRONG:
    This is illustrated in the following Figure: &fig1;
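
The parser does not read an unparsed entity; it only reports the declaration, and the application must connect the attribute value to it. A sketch of this division of labour, assuming Python's standard xml.parsers.expat module; the doc root element and the function resolve_images are invented for the example.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE doc [
  <!NOTATION png PUBLIC
    '-//TEI//NOTATION IETF RFC2083 Portable Network Graphics//EN'>
  <!ENTITY fig1 SYSTEM "figure1.png" NDATA png>
  <!ATTLIST image src ENTITY #REQUIRED>
]>
<doc>This is illustrated in the following Figure: <image src="fig1"/></doc>"""

def resolve_images(doc):
    """Map each image's src entity name to (system identifier, notation)."""
    unparsed, images = {}, []
    parser = xml.parsers.expat.ParserCreate()

    def unparsed_entity(name, base, system_id, public_id, notation):
        # Called once for every NDATA entity declaration.
        unparsed[name] = (system_id, notation)

    def start(name, attrs):
        if name == "image":
            images.append(attrs["src"])

    parser.UnparsedEntityDeclHandler = unparsed_entity
    parser.StartElementHandler = start
    parser.Parse(doc, True)
    return {src: unparsed[src] for src in images}

print(resolve_images(DOC))     # {'fig1': ('figure1.png', 'png')}
```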

1.5.7. Parameter entities

Parameter entities are used in the markup declarations (in the DTD). To distinguish them from general entities, they are introduced by a percent sign.
  • External parameter entities, used for multi-file DTDs, e.g. for external entity sets:
    <!ENTITY % ISOlat2 SYSTEM "iso-lat2.ent">
    %ISOlat2;
  • Internal parameter entities, used e.g. to factor out repetitive parts of the markup declarations:
    <!ENTITY % global-att '
               id        ID       #IMPLIED
               lang      IDREF    #IMPLIED' >
    <!ATTLIST div
              %global-att;
              status     CDATA    #IMPLIED  >
    <!ATTLIST table
              %global-att;
              border     (no|yes) "yes"     >
  • A parameter entity can be defined more than once in the DTD, in which case only the first definition applies. This makes it easy to modify existing markup declarations by redefining parameter entities. This feature is heavily exploited in the TEI.
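
The rule that the first definition wins holds for general entities as well, which makes it easy to observe with a non-validating parser. The Python sketch below therefore uses a general entity as a stand-in for a parameter entity; the entity name edition and both replacement texts are invented.

```python
import xml.etree.ElementTree as ET

# A local declaration placed before the shared one takes precedence.
DOC = """<!DOCTYPE d [
  <!ENTITY edition "local override">
  <!ENTITY edition "default from the shared declarations">
]>
<d>&edition;</d>"""

text = ET.fromstring(DOC).text
print(text)    # local override
```

This is why declarations in the internal subset, which is read before the external DTD, can override those in the DTD file.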

1.6. Other declarations

1.6.1. Marked sections

It is occasionally necessary to mark some portion of an XML document for special treatment. A marked section can be used in the document instance (CDATA marked section) for disabling XML processing in a part of the document, or in the DTD (conditional marked section) for including / ignoring certain parts of the markup declaration.
In CDATA marked sections, the only markup recognised is the closing delimiter ]]>:

<p>The &lt;term&gt; element may be used to mark any 
technical term:
<![CDATA[
    This <term>recursion</term> is giving me a headache.
]]></p>
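
What a processor makes of a CDATA marked section can be seen by parsing one. A minimal Python sketch with the standard xml.etree.ElementTree module, using a shortened version of the paragraph above on a single line:

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    "<p>The &lt;term&gt; element may be used to mark any technical term: "
    "<![CDATA[This <term>recursion</term> is giving me a headache.]]></p>"
)
print(len(p))    # 0: the tags inside the section did not create elements
print(p.text)
```

The application receives the same kind of character data from the escaped &lt;term&gt; and from the CDATA section; the two are only different ways of writing it.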

1.6.2. Conditional marked sections

Conditional marked sections are used in the DTD:
  • INCLUDE / IGNORE keywords:
    <![INCLUDE[
       <!ELEMENT poem (stanza+)> 
       <!ELEMENT stanza (line+)> 
    ]]>
    <![IGNORE[
       <!ELEMENT poem (couplet+)> 
       <!ELEMENT couplet (line,line)> 
    ]]>
  • Making a two-way switch:
    <!ENTITY % stanzas "INCLUDE">
    <![%stanzas;[
       <!ELEMENT poem (stanza+)> 
       <!ELEMENT stanza (line+)> 
       <!ENTITY % couplets "IGNORE">
    ]]>
    <!ENTITY % couplets "INCLUDE">
    <![%couplets;[
       <!ELEMENT poem (couplet+)> 
       <!ELEMENT couplet (line,line)> 
    ]]>

1.6.3. Processing instructions

Processing instructions allow documents to contain instructions for applications. They are not part of the document's character data, but must be passed through to the application. The PI begins with a target used to identify the application to which the instruction is directed. The target names "XML", "xml", and so on are reserved for standardisation of the XML specification. The XML Notation mechanism may be used for formal declaration of PI targets.
  • A PI for the TeX processing system:
    <?tex \newpage ?>
  • At the beginning of XML documents:
    <?xml version="1.0" encoding="iso-8859-1"?>
  • Invoking a stylesheet:
    <?xml-stylesheet type="text/xsl" href="tlslides.xsl"?>
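
How processing instructions reach an application can be shown with an event-based parser. The sketch below assumes Python's standard xml.parsers.expat module; the doc element and the function name processing_instructions are invented. The document is passed as bytes so that the encoding declaration is honoured.

```python
import xml.parsers.expat

DOC = rb"""<?xml version="1.0" encoding="iso-8859-1"?>
<?xml-stylesheet type="text/xsl" href="tlslides.xsl"?>
<doc>First page.<?tex \newpage ?>Second page.</doc>"""

def processing_instructions(doc):
    """Collect (target, data) pairs in document order."""
    found = []
    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = (
        lambda target, data: found.append((target, data.strip()))
    )
    parser.Parse(doc, True)
    return found

for target, data in processing_instructions(DOC):
    print(target, "|", data)
```

The XML declaration on the first line looks like a PI but is not reported as one, in line with the reserved status of the xml target.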

1.7. Putting It All Together

1.7.1. A complete XML document

An XML document consists of:
  • XML declaration (giving general properties of the document)
  • Document Type Declaration (a.k.a. DOCTYPE declaration), giving the root element type and designating the DTD. It is necessary only for valid XML documents
  • The root element, that contains the text and further markup.
A complete document:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE anthology [
    <!ELEMENT anthology (poem+)>
    <!ELEMENT poem      (title?, stanza+)>
    <!ELEMENT title     (#PCDATA) >
    <!ELEMENT stanza    (line+)   >
    <!ELEMENT line      (#PCDATA) >
]>
<anthology>
      <poem><title>The SICK ROSE</title>
          <stanza>
              <line>O Rose thou art sick.</line>
              <line>The invisible worm,</line>
              <line>That flies in the night</line>
              <line>In the howling storm:</line>
          </stanza>
          <stanza>
              <line>Has found out thy bed</line>
              <line>Of crimson joy:</line>
              <line>And his dark secret love</line>
              <line>Does thy life destroy.</line>
          </stanza>
      </poem>
</anthology>

1.7.2. The Document Type Declaration

Specifies the root element of the document, the external entity containing the DTD, and/or the (part of the) DTD contained in the internal subset:
  • using a SYSTEM identifier:
    <!DOCTYPE anthology SYSTEM "anthology.dtd">
  • using both SYSTEM and PUBLIC identifiers:
    <!DOCTYPE anthology PUBLIC "-//XXX//DTD Anthology//EN"
  • using external and internal subsets:
    <!DOCTYPE anthology SYSTEM "anthology.dtd" [
      <!ENTITY jbw "Jabberwocky">
    ]>

1.7.3. An example

A complex example - the TEI parametrisation:

<!DOCTYPE TEI.2
          PUBLIC "-//TEI P3//DTD Main Document Type//EN" 
          "tei2.dtd" [
  <!ENTITY % TEI.prose 'INCLUDE'>
  <!ENTITY tla "Three Letter Acronym">
  <!ENTITY % x.phrase  'myTag|'>         
  <!ELEMENT myTag (#PCDATA)    >
  <!-- any other special-purpose declarations or
       re-declarations go  here -->
]>
<TEI.2>
  <!-- This is an instance of a modified TEI.2 type document, which
       may contain <myTag>my special tags</myTag> and references 
       to my usual entities such as &tla;. -->
</TEI.2>

1.7.4. Language identification

In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document. In valid documents, this attribute, like any other, must be declared if it is used. The values of the attribute are currently the two-letter language codes as defined by [ISO 639].

<div xml:lang="en">
    Sprinkle rosemary over crisps and let them cool. The crisps may be
    made several days in advance and kept in a plastic bag at room
    temperature. Serve with a dollop of 
    <term xml:lang="fr">creme fraiche</term>.
</div>
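
A language declaration applies to the whole content of its element unless an inner element overrides it. The following Python sketch computes the language in force for each element, assuming the attribute is spelled xml:lang as the XML Recommendation defines it; the function name languages and the emph element are invented for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def languages(element, inherited=None):
    """Yield (tag, language) pairs; children inherit from ancestors."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)

div = ET.fromstring(
    '<div xml:lang="en">Serve with a dollop of '
    '<term xml:lang="fr">creme fraiche</term> on <emph>warm</emph> crisps.</div>'
)
print(list(languages(div)))
# [('div', 'en'), ('term', 'fr'), ('emph', 'en')]
```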