Foundational course at ESSLLI 2005

Annotation of Language Resources

Lecture II.

XML-Related Recommendations

Tomaž Erjavec
Department of Knowledge Technologies
Jožef Stefan Institute
Jamova 39
SI-1000 Ljubljana
Slovenia

Abstract

This lecture discusses developments related to XML, in particular XML Schemas, XML Namespaces, XPath, and the XML transformation language, XSLT.

1. XML-Related Proposals
- 1.1. XML Namespaces
  - 1.1.1 XML Namespaces: Motivation
  - 1.1.2 XML Namespaces I.
  - 1.1.3 XML Namespaces II.
  - 1.1.4 XML Namespace Myths
- 1.2. XML Schemas
  - 1.2.1 Beyond DTDs
  - 1.2.2 W3C Schemas: an Example XML Document
  - 1.2.3 XML Schemas: an Example Schema
  - 1.2.4 RELAX NG Features
  - 1.2.5 RELAX NG Example
- 1.3. Identity of XML Documents
  - 1.3.1 When are two XML documents the same?
  - 1.3.2 XML Normal Forms
- 1.4. Formatting and Transforming XML
  - 1.4.1 Formatting and Transforming XML: Introduction
  - 1.4.2 XSL History
- 1.5. XPath
  - 1.5.1 XPath: Introduction
  - 1.5.2 XPath: Introduction 2
  - 1.5.3 Examples I
  - 1.5.4 Examples II
  - 1.5.5 XPath Expressions
  - 1.5.6 XPath Expressions, cont.
  - 1.5.7 Location Steps
  - 1.5.8 XPath Functions
  - 1.5.9 Abbreviated Syntax Expressions
  - 1.5.10 XPath String Functions
- 1.6. XSLT
  - 1.6.1 XSLT: Introduction
  - 1.6.2 Stylesheets
  - 1.6.3 Invoking Stylesheets
  - 1.6.4 Templates
  - 1.6.5 Implied Templates
  - 1.6.6 Selective and Repeated Processing
  - 1.6.7 Prefixes and Suffixes
  - 1.6.8 Tag Replacement and Namespaces
  - 1.6.9 Element Values
  - 1.6.10 Attribute Values
  - 1.6.11 Breaking Well-Formedness
  - 1.6.12 Second Try
  - 1.6.13 Disable Output Escaping
  - 1.6.14 Caution!
  - 1.6.15 Top-Level Elements
  - 1.6.16 Contextual Formatting
  - 1.6.17 Template Priorities
  - 1.6.18 Modes
  - 1.6.19 Attribute Values
  - 1.6.20 Conditional Constructs
  - 1.6.21 Sorting
  - 1.6.22 Numbering
  - 1.6.23 Advanced Numbering
  - 1.6.24 Linking with IDs
  - 1.6.25 Linking with Keys
  - 1.6.26 Variables
  - 1.6.27 Named Templates
  - 1.6.28 Other XSLT Features
  - 1.6.29 XSLT Version 2.0
- 1.7. XSL
  - 1.7.1 Introduction to XSL
  - 1.7.2 Formatting Objects
  - 1.7.3 Templates and Content
  - 1.7.4 An example
- 1.8. Other XML Companion Recommendations
  - 1.8.1 Other XML Related Recommendations

XML-Related Proposals

In this part we look at the following XML-related proposals:

XML Namespaces
XML Schemas
XPath
XSLT
XSL

XML Namespaces

XML Namespaces: Motivation

A single XML document could usefully contain elements and attributes ("markup vocabulary") that are defined for and used by multiple software modules.
Such documents pose problems of recognition and collision. Software modules need to be able to recognise the tags and attributes which they are designed to process, even in the face of "collisions" occurring when markup intended for some other software package uses the same element type or attribute name.
Therefore document constructs should have universal names, whose scope extends beyond their containing document; such universal names are defined by the XML Namespaces specification (January 1999).
Namespaces make use of the notion of a Uniform Resource Identifier (URI), which identifies a resource by meta-information of any kind; in contrast, a URL locates a resource on the net, which means that if you have a URL and the appropriate protocol, you can retrieve the resource.

XML Namespaces I.

<?xml version="1.0" ?>
<html:html 
      xmlns:html="http://www.w3.org/HTML/1998/html4"
      xmlns:nms="http://www.names.net/address">

 <html:head><html:title>Addresses</html:title></html:head>
 <html:body>
   <nms:addresses nms:version="1.0">
   <html:hr/>
   <nms:person xmlns:nms="http://www.names.net/address-addendum">
     <nms:title>Mr.</nms:title>
     <nms:first>Simon</nms:first>
     <nms:last>Schuster</nms:last>
   </nms:person>
   <html:hr/>
<!-- ... -->
   </nms:addresses>
 </html:body>
</html:html>

XML Namespaces provide a two-part naming system for element types and attributes
The xmlns prefixed attributes give the URI and the local prefix of the namespaces
Qualified names consist of the prefix, colon, and local part of the name
The meaning of the prefix of qualified names is inherited - and possibly overridden - by child elements
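
As a rough illustration of the two-part naming, the sketch below parses a cut-down variant of the document above with Python's standard xml.etree.ElementTree module, which reports every qualified name in expanded form as {URI}local-part. The shortened document and the variable names are assumptions made for this example; they are not part of the lecture material.

```python
import xml.etree.ElementTree as ET

# Cut-down, well-formed variant of the example document (an assumption
# made for this sketch; the namespace URIs are the ones used above).
DOC = """\
<html:html xmlns:html="http://www.w3.org/HTML/1998/html4"
           xmlns:nms="http://www.names.net/address">
  <html:body>
    <nms:addresses nms:version="1.0">
      <nms:person xmlns:nms="http://www.names.net/address-addendum">
        <nms:first>Simon</nms:first>
      </nms:person>
    </nms:addresses>
  </html:body>
</html:html>
"""

root = ET.fromstring(DOC)

# ElementTree replaces each prefix by its URI: "{URI}local-part".
expanded_names = [elem.tag for elem in root.iter()]
for name in expanded_names:
    print(name)

# The inner declaration overrides the nms prefix for person and its children.
addresses = root.find(".//{http://www.names.net/address}addresses")
person = root.find(".//{http://www.names.net/address-addendum}person")

# Prefixed attribute names are expanded in the same way as element names.
version = addresses.get("{http://www.names.net/address}version")
print(version)
```

Note that the prefix itself never appears in the expanded names: only the URI and the local part identify an element or attribute.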

XML Namespaces II.

<?xml version="1.0" ?>
<html xmlns="http://www.w3.org/HTML/1998/html4"
      xmlns:nms="http://www.names.net/address">
 <head><title>Addresses</title></head>
 <body xml:lang="en">
   <nms:addresses nms:version="1.0">
   <hr/>
   <nms:person>
     <nms:title>Mr.</nms:title>
     <nms:first>Simon</nms:first>
     <nms:last>Schuster</nms:last>
   </nms:person>
   <hr/>
<!-- ... -->
   </nms:addresses>
 </body>
</html>

The default namespace is introduced by the attribute xmlns, without a local prefix
The prefix xml is by definition bound to the namespace name http://www.w3.org/XML/1998/namespace
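
A small sketch, again assuming Python's standard xml.etree.ElementTree, suggests that a prefixed declaration and a default declaration lead to the same expanded names, and that the xml prefix works without being declared. The two one-line documents are made up for the example.

```python
import xml.etree.ElementTree as ET

HTML_NS = "http://www.w3.org/HTML/1998/html4"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# The same fragment, once with an explicit prefix and once with a
# default namespace (both fragments are illustrative assumptions).
prefixed = ET.fromstring(
    f'<html:body xmlns:html="{HTML_NS}" xml:lang="en"><html:hr/></html:body>'
)
defaulted = ET.fromstring(
    f'<body xmlns="{HTML_NS}" xml:lang="en"><hr/></body>'
)

# The expanded names are identical, whichever way the namespace was declared.
same_names = [e.tag for e in prefixed.iter()] == [e.tag for e in defaulted.iter()]
print(same_names)

# The xml prefix needs no declaration; it is bound by definition.
lang = defaulted.get(f"{{{XML_NS}}}lang")
print(lang)
```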

XML Namespace Myths

There is less to XML Namespaces than meets the eye!

The URI is not a URL - it does not need to refer to a DTD or to be accessible; it is never resolved:

<?xml version="1.0" ?>
<isbn:book xmlns:isbn="urn:ISBN:0-395-36341-6"
           xmlns:nms="brrr://completely-silly-address/ha/ha">
...

The XML Namespaces recommendation is compatible with XML 1.0, hence it does not provide a way to validate a document against two or more DTDs; in fact, it is almost impossible to validate a document using XML Namespaces against a DTD.
An overview of common misconceptions about XML namespaces is given in Ronald Bourret: Namespace Myths Exploded (2000).

XML Schemas

Beyond DTDs

Document Type Definitions (DTDs) are the traditional way to declare document types and to validate SGML/XML documents. However, they have two problems:

DTDs can impose only weak constraints on attribute and element content
DTDs themselves are not written in XML, so tools to process (edit, validate, present) XML do not work with them

Several proposals exist to address these shortcomings:

- XML Schema
  - W3C Recommendation
- RELAX NG (Regular Language Description for XML -- Next Generation)
  - Based on TREX (James Clark) and RELAX (Murata Makoto)
  - Moving towards an ISO standard
- Schematron
  - Rick Jelliffe (Academia Sinica)
  - Moving towards an ISO standard (ISO CD)

Validators exist that implement all of the above proposals and also convert DTDs to schemas.

W3C Schemas: an Example XML Document

Example XML document from the W3C Schema Primer:


<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
    <shipTo country="US">
        <name>Alice Smith</name>
        <street>123 Maple Street</street>
        <city>Mill Valley</city>
        <state>CA</state>
        <zip>90952</zip>
    </shipTo>
    <billTo country="US">
        <name>Robert Smith</name>
        <street>8 Oak Avenue</street>
        <city>Old Town</city>
        <state>PA</state>
        <zip>95819</zip>
    </billTo>
    <comment>Hurry, my lawn is going wild!</comment>
    <items>
        <item partNum="872-AA">
            <productName>Lawnmower</productName>
            <quantity>1</quantity>
            <USPrice>148.95</USPrice>
            <comment>Confirm this is electric</comment>
        </item>
        <item partNum="926-AA">
            <productName>Baby Monitor</productName>
            <quantity>1</quantity>
            <USPrice>39.98</USPrice>
            <shipDate>1999-05-21</shipDate>
        </item>
    </items>
</purchaseOrder>

XML Schemas: an Example Schema

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

 <xsd:annotation>
  <xsd:documentation xml:lang="en">
   Purchase order schema for Example.com.
   Copyright 2000 Example.com. All rights reserved.
  </xsd:documentation>
 </xsd:annotation>

 <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>

 <xsd:element name="comment" type="xsd:string"/>

 <xsd:complexType name="PurchaseOrderType">
  <xsd:sequence>
   <xsd:element name="shipTo" type="USAddress"/>
   <xsd:element name="billTo" type="USAddress"/>
   <xsd:element ref="comment" minOccurs="0"/>
   <xsd:element name="items"  type="Items"/>
  </xsd:sequence>
  <xsd:attribute name="orderDate" type="xsd:date"/>
 </xsd:complexType>

 <xsd:complexType name="USAddress">
  <xsd:sequence>
   <xsd:element name="name"   type="xsd:string"/>
   <xsd:element name="street" type="xsd:string"/>
   <xsd:element name="city"   type="xsd:string"/>
   <xsd:element name="state"  type="xsd:string"/>
   <xsd:element name="zip"    type="xsd:decimal"/>
  </xsd:sequence>
  <xsd:attribute name="country" type="xsd:NMTOKEN" fixed="US"/>
 </xsd:complexType>

 <xsd:complexType name="Items">
  <xsd:sequence>
   <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
    <xsd:complexType>
     <xsd:sequence>
      <xsd:element name="productName" type="xsd:string"/>
      <xsd:element name="quantity">
       <xsd:simpleType>
        <xsd:restriction base="xsd:positiveInteger">
         <xsd:maxExclusive value="100"/>
        </xsd:restriction>
       </xsd:simpleType>
      </xsd:element>
      <xsd:element name="USPrice"  type="xsd:decimal"/>
      <xsd:element ref="comment"   minOccurs="0"/>
      <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
     </xsd:sequence>
     <xsd:attribute name="partNum" type="SKU" use="required"/>
    </xsd:complexType>
   </xsd:element>
  </xsd:sequence>
 </xsd:complexType>

 <!-- Stock Keeping Unit, a code for identifying products -->
 <xsd:simpleType name="SKU">
  <xsd:restriction base="xsd:string">
   <xsd:pattern value="\d{3}-[A-Z]{2}"/>
  </xsd:restriction>
 </xsd:simpleType>

</xsd:schema>
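
To make two of the constraints in this schema concrete, here is a minimal hand-written check of the SKU pattern and of the quantity restriction in Python. It is only a sketch of what these two facets mean, not a schema validator, and the function names is_sku and is_quantity are invented for the example.

```python
import re

# Same regular expression as the xsd:pattern facet of the SKU type.
SKU_PATTERN = re.compile(r"\d{3}-[A-Z]{2}")


def is_sku(value: str) -> bool:
    # XML Schema patterns must match the whole value, hence fullmatch().
    return SKU_PATTERN.fullmatch(value) is not None


def is_quantity(value: str) -> bool:
    # xsd:positiveInteger restricted by maxExclusive="100": 1 to 99.
    value = value.strip()
    if re.fullmatch(r"\+?[0-9]+", value) is None:
        return False
    return 1 <= int(value) < 100


print(is_sku("872-AA"), is_sku("872-aa"))
print(is_quantity("1"), is_quantity("100"))
```

A DTD could declare partNum only as CDATA or NMTOKEN, so neither check could be expressed there; this is the kind of "weak constraint" that schema languages address.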

RELAX NG Features

Features of RELAX NG are that it:

is simple
is easy to learn
has both an XML syntax and a compact non-XML syntax
supports XML namespaces
treats attributes uniformly with elements so far as possible
has unrestricted support for unordered content
has unrestricted support for mixed content
has a solid theoretical basis
can partner with a separate datatyping language (such as W3C XML Schema Datatypes)

RELAX NG Example

Example of a RELAX NG schema:

<element name="addressBook" xmlns="http://relaxng.org/ns/structure/1.0">
  <zeroOrMore>
    <element name="card">
      <attribute name="type"><text/></attribute>
      <element name="name"><text/></element>
      <element name="email"><text/></element>
      <optional><element name="note"><text/></element></optional>
    </element>
  </zeroOrMore>
</element>

Compact notation:

element addressBook {
  element card {
    attribute type { text },
    element name { text },
    element email { text },
    element note { text }?
  }*
}

Identity of XML Documents

When are two XML documents the same?

<anthology>                             <anthology
  <poem id = "p001" rend = "center">      ><poem rend='center'
     <title>The SICK ROSE</title>           id='p001'><title>The SICK ROSE</title>
     <line>O Rose thou art sick.</line>       <line>O Rose thou art sick.</line>
  </poem></anthology>                     </poem> <!--end of the poem-->
                                        </anthology>

Two XML documents are "the same" when they are logically equivalent within an application context
Differences that are irrelevant:
- Order of attributes and usage of quotes
- Non-significant whitespace (whether whitespace in content is significant depends on the presence of a DTD)
- Representation of characters: a literal ü vs. the character references &#252; or &#xFC;
- Entity references vs. their expansion (e.g. &euml; for ë, SYSTEM entities)
- Comments

XML Normal Forms

W3C Recommendation "Canonical XML" describes a method for generating a physical representation, the canonical form, of an XML document that accounts for the permissible changes
XML canonicalization is defined in terms of the XPath definition of a node-set

W3C Recommendation "XML Information Set" defines an abstract data model called the XML Information Set (Infoset).
Its purpose is to provide definitions for use in other specifications that need to refer to the information in a well-formed XML document
An XML document's information set (a tree) consists of a number of information items (nodes); each information item has a set of associated named properties.
Information items are very similar to nodes in XPath
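
The idea of a canonical form can be tried out with the canonicalize() function in Python's standard xml.etree.ElementTree module (Python 3.8 or later). Note the assumptions: that function implements the later C14N 2.0 version rather than the original Canonical XML recommendation, and the two input strings are shortened variants of the poem documents shown earlier.

```python
from xml.etree.ElementTree import canonicalize

# Two serialisations that differ in attribute order, quoting,
# whitespace inside tags, and a trailing comment.
doc_a = (
    '<anthology><poem id = "p001" rend = "center">'
    "<title>The SICK ROSE</title></poem></anthology>"
)
doc_b = (
    "<anthology\n><poem rend='center'\n   id='p001'>"
    "<title>The SICK ROSE</title></poem><!--end of the poem--></anthology>"
)

# Comments are dropped by default; attributes are sorted and re-quoted.
canon_a = canonicalize(doc_a)
canon_b = canonicalize(doc_b)

print(canon_a)
print(canon_a == canon_b)
```

Whitespace in element content is kept by default, so documents that differ only in indentation would still need the strip_text option (or a DTD-aware comparison) to compare equal.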

Formatting and Transforming XML

This part of the course deals with the XSL family of W3C recommendations, in particular XPath, XSLT, and XSL. The structure and examples follow the book:
Neil Bradley: The XSL Companion. Addison-Wesley, 2000.

Formatting and Transforming XML: Introduction

XML markup is supposed to be descriptive (e.g. <title>) rather than presentational (e.g. <bold>). But, sooner or later, we do want to render the documents. How do we do this?

rendering built directly into software (e.g. HTML browsers)
direct conversion to output format with XML aware transformation software (e.g. with XSLT to HTML)
conversion to intermediary, abstract presentation oriented format, and from there to final output format (e.g. with XSLT to XSL to PDF)

Styling languages:

HTML: CSS (Cascading Style Sheets)
SGML: DSSSL (Document Style Semantics and Specification Language)
XML: XSL (eXtensible Stylesheet Language)

XSL History

A stylesheet language named XSL was originally proposed to the W3C in 1997. During its gestation, however, the proposal was pulled apart into three separate standards:

XPath (V1.0, November 1999): defines a mechanism for locating information in XML documents, and has many other uses besides that in formatting documents
XSLT (V1.0, November 1999): defines a means of transforming XML documents into other data formats (XML or otherwise), including (but not limited to) formatting languages
XSL (V1.0, October 2001): is now properly used only to name a proposed standard for embedding formatting information in documents using XML elements.

XPath

XPath: Introduction

XML Path Language (XPath) Version 1.0, W3C Recommendation November 1999 (Version 2.0, W3C Working Draft 30 April 2002).

The primary purpose of XPath is to address parts of an XML document; however, it has a natural subset that can be used for testing whether or not a node matches a pattern.
XPath uses a compact, non-XML syntax; it gets its name from its use of a path notation as in URLs for navigating through the hierarchical structure of an XML document.

XPath: Introduction 2

XPath operates on the abstract, logical structure of an XML document (its InfoSet), rather than its surface syntax; it models an XML document as a tree of nodes. There are different types of nodes, including element nodes, attribute nodes and text nodes.
The primary syntactic construct in XPath is the expression, which is evaluated to yield an object of type: node-set (an unordered collection of nodes without duplicates), boolean, number, or string.
The full syntax of XPath expressions is cumbersome, so various abbreviations are allowed.
Expression evaluation occurs with respect to its context node.

Examples I

*: selects all element children of the context node
@name: selects the attribute name of the context node
@*: selects all the attributes of the context node
para[1]: selects the first para child of the context node
*/para: selects all para grandchildren of the context node
/doc/chapter[5]/section[2]: selects the second section of the fifth chapter of the doc child of the root node
//para: selects all the para descendants of the document root and thus selects all para elements in the same document as the context node
//olist/item: selects all the item elements in the same document as the context node that have an olist parent
.: selects the context node
.//para: selects the para element descendants of the context node

Examples II

..: selects the parent of the context node
../@lang: selects the lang attribute of the parent of the context node
para[@type="warning"][5]: selects the fifth para child of the context node that has a type attribute with value warning
para[5][@type="warning"]: selects the fifth para child of the context node if that child has a type attribute with value warning
chapter[title="Introduction"]: selects the chapter children of the context node that have one or more title children with string-value equal to Introduction
chapter[title]: selects the chapter children of the context node that have one or more title children
employee[@secretary and @assistant]: selects all the employee children of the context node that have both a secretary attribute and an assistant attribute
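
Some of these expressions can be tried out with the findall() method of Python's standard xml.etree.ElementTree module. This is only a sketch: ElementTree supports a limited subset of XPath (no attribute selection such as @name, no functions, and positional predicates that do not always follow XPath when chained with other predicates), and the small test document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented test document; <doc> serves as the context node below.
doc = ET.fromstring(
    "<doc>"
    "<chapter><title>Introduction</title>"
    "<para type='warning'>w1</para><para>p2</para></chapter>"
    "<chapter><title>Methods</title>"
    "<para>p3</para><para type='warning'>w4</para></chapter>"
    "</doc>"
)


def texts(path):
    """Evaluate path with <doc> as context node; return the text of each hit."""
    return [node.text for node in doc.findall(path)]


print(texts("*/para"))                          # all para grandchildren
print(texts(".//para"))                         # all para descendants
print(texts("chapter/para[1]"))                 # first para child of each chapter
print(texts("chapter/para[@type='warning']"))   # para children with type=warning
print(texts("chapter[title='Introduction']/title"))
print(len(doc.findall("chapter[title]")))       # chapters that have a title child
```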

XPath Expressions

An XPath expression contains one or more location steps, separated by slashes

Each location step has the following form:

axis-name :: node-test [predicate]*

For example:

child::para[attribute::type="warning"]

The XPath axis contains a part of the document, defined from the perspective of the context node
The node test makes a selection from the nodes on that axis
By adding predicates, it is possible to select a subset from these nodes. If the expression in the predicate returns true, the node remains in the selected set, otherwise it is removed

XPath Expressions, cont.

axis-name :: node-test [predicate]*

Some axis names: child, parent, descendant, ancestor, self, ancestor-or-self, following-sibling, following, attribute, ...
The ancestor, descendant, following, preceding and self axes partition a document (ignoring attribute and namespace nodes): they do not overlap and together they contain all the nodes in the document.
Some node tests: a literal name, *, text(), ...
A predicate is an expression; it is composed of values, operators and other XPath expressions. XPath also defines a set of functions for use in predicates.

Location Steps

Location steps are similar to file system addressing (child:: axes below omitted):

A: selects all elements A that are children of the context node
A/B: selects B elements that are children of A
A//B: selects B elements that are descendants of A
/A: selects the root element A
/A//B: selects B elements that are descendants of the root element A

XPath Functions

XPath also defines a number of functions. Here are some examples of their use in expressions:

child::text(): selects all text node children of the context node
child::para[last()]: selects the last para child of the context node
child::para[position()=1]: selects the first para child of the context node

Abbreviated Syntax Expressions

Full syntax is cumbersome, so various abbreviations are allowed:

child:: can be omitted from a location step
attribute:: can be abbreviated to @
[position()=n] can be abbreviated to [n]
etc.

For example, child::para[position()=1][attribute::type="warning"] can be abbreviated to para[1][@type="warning"]

XPath String Functions

A number of functions are also defined for strings; some examples are:

concat(string, string, string*): returns the concatenation of its arguments.
contains(string, string): returns true if the first argument string contains the second argument string, and otherwise returns false. For example,
contains("abc", "b") returns true.
substring-before(string, string): returns the substring of the first argument string that precedes the first occurrence of the second argument string in the first argument string, or the empty string if the first argument string does not contain the second argument string. For example,
substring-before("1999/04/01","/") returns "1999".
normalize-space(string?): returns the argument string with whitespace normalised by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space. For example
normalize-space(" a b c ") returns "a b c"

etc.
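
The behaviour of these four functions can be mimicked in a few lines of plain Python. The sketch below is an emulation written for illustration, not an XPath implementation; the Python function names simply mirror the XPath ones.

```python
import re


def concat(*parts: str) -> str:
    # XPath concat() takes two or more strings and joins them.
    return "".join(parts)


def contains(haystack: str, needle: str) -> bool:
    return needle in haystack


def substring_before(text: str, sep: str) -> str:
    # Empty result when sep does not occur (or is empty), as in XPath.
    if not sep:
        return ""
    head, found, _tail = text.partition(sep)
    return head if found else ""


def normalize_space(text: str) -> str:
    # XML whitespace is space, tab, carriage return and line feed only.
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


print(contains("abc", "b"))
print(substring_before("1999/04/01", "/"))
print(normalize_space("  a   b c  "))
```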

XSLT

XSLT: Introduction

XSL Transformations (XSLT) Version 1.0 W3C Recommendation; 16 November 1999 (Version 2.0, W3C Working Draft; April 2002)

XSLT defines a means of transforming XML documents into other data formats. It can be used for:

transforming XML documents into documents using the XSL stylesheet language, i.e. for formatting XML documents
as a general XML transformation language, used to transmit data between applications
at present, the most popular use is to convert XML into HTML.

Processors:

An XSLT processor converts XML input into XML/XSL output when supplied with an XSLT stylesheet
An XSL processor converts XSL documents into device-dependent output formats.

Stylesheets

How do we specify an XSLT stylesheet?

It is an XML document, and the root element is
```
<xsl:stylesheet> ... </xsl:stylesheet>
```
To indicate that XSLT can be used for more than just styling, an equivalent element is:
```
<xsl:transform> ... </xsl:transform>
```

Typical form:


<xsl:stylesheet 
     version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
.
.
.
</xsl:stylesheet>

Invoking Stylesheets

There are three ways of invoking stylesheets:

Unlinked stylesheet:

xsltproc tlslides.xsl esslli05.xml > esslli05.html

Referenced stylesheet:

<?xml-stylesheet type="text/xsl" href="tlslides.xsl"?>
<!DOCTYPE TEI.2 SYSTEM 'teixlite.dtd'>

Embedded stylesheet:

<?xml-stylesheet type="text/xsl" href="#localStyle"?>
<!DOCTYPE TEI.2 SYSTEM 'teixlite.dtd'>
<TEI.2>
  <xsl:stylesheet version="1.0"
       xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
       id="localStyle">
  ...
  </xsl:stylesheet>
...
</TEI.2>

Templates

Stylesheets are composed (mostly) of templates. A typical template could look like this:

<template match="para">  
  <apply-templates/>
</template>

A template specifies the transformation that is to be applied to a specific part of the source document
The match attribute, whose value is a pattern (a subset of XPath expressions), specifies which node the template applies to; matching is done with reference to the context node
<apply-templates/> triggers processing of the child nodes of the context node

Implied Templates

Elements without a matching template are processed by default, just as if the following template were included:
```
<template match="*">  
  <apply-templates/>
</template>
```
Text nodes are output by default, just as if the following template were included:
```
<template match="text()">  
  <value-of select="."/>
</template>
```
To ignore an element, e.g. <hide>, the following is therefore needed:
```
<template match="hide"/>
```

Selective and Repeated Processing

It is possible to process only selected elements:


<template match="chapter">  
  <apply-templates select="title"/>
  <apply-templates select="para"/>
</template>

This processes the <title> twice:


<template match="chapter">  
  <apply-templates select="title"/>
  <apply-templates/>
</template>

Prefixes and Suffixes

Text can be inserted before and after the node:


<template match="chapter">  
  This text will appear before the content of chapter
  <apply-templates/>
  This text will appear after the content of chapter
</template>

Using the <text> element allows for better control of whitespace:


<template match="quote">  
  <text> "</text>
  <apply-templates/>
  <text>" </text>
</template>

<template match="lb">
  <text>&#10;</text>
</template>

Tag Replacement and Namespaces

Because the main purpose of XSLT is to convert to XML or HTML, replacing or inserting tags is very common. This can be done in two different ways:

Using the XSLT <element>:

<template match="book">  
  <element name="HTML">
    <element name="HEAD">
      <element name="TITLE">The Title</element>
    </element>
    <element name="BODY">
      <apply-templates/>
    </element>
  </element>
</template>

Using XML Namespaces:

<xsl:template match="book">  
  <HTML>
    <HEAD><TITLE>The Title</TITLE></HEAD>
    <BODY>
      <xsl:apply-templates/>
    </BODY>
  </HTML>
</xsl:template>

Element Values

Input:

<para>Hello <hi>world</hi>!</para>

Copying document fragments:

<xsl:template match="para">  
  <xsl:copy-of select="."/>
</xsl:template>

Output:    <para>Hello <hi>world</hi>!</para>

Accessing content element as a string:

<xsl:template match="para">  
  <xsl:value-of select="."/>
</xsl:template>

Output:    Hello world!

Accessing specific elements:

<xsl:template match="para">  
  <xsl:value-of select="hi"/>
</xsl:template>

Output:    world

Attribute Values

Accessing attribute content:

Input:    
<para type="important">Hello world!</para>

Template: 

  <xsl:template match="para">  
    <P>
    [TYPE: <xsl:value-of select="@type"/>]
    <xsl:apply-templates/>
    </P>
  </xsl:template>


Output:    

  <P>[TYPE: important] Hello world!</P>

Breaking Well-Formedness

An XSLT stylesheet must be a well-formed XML document, and it also outputs only well-formed XML. This can sometimes be problematic:

Input:    
  <first>John</first>  <last>Smith</last>
  <first>Frank</first> <last>Furter</last>

Intended output:    
  <p>John Smith</p>
  <p>Frank Furter</p>

First try:

  <xsl:template match="first">  
    <P> <xsl:value-of select="."/>   <!-- WRONG! -->
  </xsl:template>

  <xsl:template match="last">  
    <xsl:value-of select="."/> </P>  <!-- WRONG! -->
  </xsl:template>

Second Try

<xsl:template match="first">  
  &lt;p&gt; <xsl:value-of select="."/>  <!-- Escape <p> -->
</xsl:template>
<xsl:template match="last">  
  <xsl:value-of select="."/> &lt;/p&gt; <!-- Escape </p> -->
</xsl:template>

However, this doesn't work, because the escaped characters are written out as text, not as tags:

&lt;p&gt; John Smith &lt;/p&gt; 
&lt;p&gt; Frank Furter &lt;/p&gt;

Disable Output Escaping

Output escaping can be disabled:

<xsl:template match="first">  
  <xsl:text disable-output-escaping="yes">&lt;p&gt;</xsl:text> 
  <xsl:value-of select="."/>
</xsl:template>
<xsl:template match="last">  
  <xsl:value-of select="."/> 
  <xsl:text disable-output-escaping="yes">&lt;/p&gt;</xsl:text> 
</xsl:template>

Input:    
  <first>John</first>  <last>Smith</last>
  <first>Frank</first> <last>Furter</last>

Output:    
  <p>John Smith</p>
  <p>Frank Furter</p>

Caution!

The perceived need to disable output escaping (D-O-E) usually comes from thinking about the transformation in the wrong way; XSLT is not about writing out start and end tags in a linear stream, but about transforming one tree structure into another.

So, a better way of implementing the transformation would be:


  ...
  <xsl:apply-templates select="first"/>
  ...

<xsl:template match="first">
  <P>
    <xsl:apply-templates/>
    <xsl:apply-templates select="following-sibling::last[1]"/>
  </P>
</xsl:template>
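
The tree-to-tree view is not specific to XSLT. As a loose analogy only, the following Python sketch performs the same pairing on an element tree and never writes a start or end tag by hand. The wrapper elements names and body are assumptions, since the lecture shows only the first and last elements, and the space between the two names is added for readability.

```python
import xml.etree.ElementTree as ET

# Source tree; the <names> wrapper is an assumption made for this sketch.
source = ET.fromstring(
    "<names><first>John</first><last>Smith</last>"
    "<first>Frank</first><last>Furter</last></names>"
)

# Result tree, built node by node rather than as a stream of tags.
result = ET.Element("body")
children = list(source)
for i, node in enumerate(children):
    if node.tag != "first":
        continue
    # Rough counterpart of following-sibling::last[1].
    last = next((n for n in children[i + 1:] if n.tag == "last"), None)
    p = ET.SubElement(result, "P")
    p.text = node.text + (" " + last.text if last is not None else "")

output = ET.tostring(result, encoding="unicode")
print(output)
```

Because the P elements are created as nodes, the serialised result is well-formed by construction.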

Top-Level Elements

These elements are children of <stylesheet>:

External definitions:

<xsl:import href="tbl1.xsl"/>  <!--first element, included at end-->
<xsl:include href="tbl2.xsl"/> <!--included here-->

Output specification:

<xsl:output 
   method="xml"
   version="1.0"
   encoding="ISO-8859-1"
   standalone="no"
   doctype-system="tei2.dtd"
   doctype-public="-//TEI P3//DTD Main Document Type//EN"
   indent="yes"
   cdata-section-elements="code eg"
   media-type="text/xml"/>

Treatment of element whitespace:

<xsl:preserve-space elements="head p"/>
<xsl:strip-space elements="div"/>

Also some others, in particular, <template>

Contextual Formatting

Templates can be sensitive to the context of an element:

Trivial case:

<xsl:template match="div">  ... </xsl:template>
<xsl:template match="head"> ... </xsl:template>
<xsl:template match="p">    ... </xsl:template>

Specific ancestor, child and sibling:

<xsl:template match="A//X">   ... </xsl:template>
<xsl:template match="X[C]">   ... </xsl:template>
<xsl:template match="P[S]/X"> ... </xsl:template>

Specific attribute and attribute value:

<xsl:template match="X[@a]">   ... </xsl:template>
<xsl:template match="X[@a='v']"> ... </xsl:template>

Template Priorities

Only one XSLT template can apply to a specific instance of an element. If two or more templates match an element, the conflict is resolved using template priority:

With an explicit attribute:

<xsl:template match="para//emph"  priority="1"> ... </xsl:template>
<xsl:template match="quote//emph" priority="2"> ... </xsl:template>

With default values:
- priority="0": explicit element name (match="p")
- priority="-0.25": any element in a given namespace (match="ns:*")
- priority="-0.5": any element (match="*") or another node test (match="node()")
- priority="0.5": other cases, i.e. contextual match (match="p/emph")
Conflict resolution: if two templates remain applicable, the processor usually issues a warning and selects the one closer to the end of the stylesheet.

Modes

The same content can be used in various parts of the output but may need to be formatted differently. This context-dependent behaviour can be achieved using modes:

  <xsl:template match="title"> 
    <!-- formatting for <title> in the body of document -->
  </xsl:template>

  <xsl:template match="title" mode="toc">
    <!-- formatting for <title> in the table of contents -->
  </xsl:template>

  ...

    <!-- Generating the table of contents: -->
    <xsl:apply-templates mode="toc" select="//title"/>

Attribute Values

Use of the XSL element <attribute>:

Input:
<image file="house.jpg" x="100" y="100">My house</image>

Output:
<IMG SRC="house.jpg" HEIGHT="100" WIDTH="100" ALT="My house"/>

Template:
<xsl:template match="image">
  <xsl:element name="IMG">
    <xsl:attribute name="SRC"><xsl:value-of select="@file"/></xsl:attribute>
    <xsl:attribute name="HEIGHT"><xsl:value-of select="@y"/></xsl:attribute>
    <xsl:attribute name="WIDTH"><xsl:value-of select="@x"/></xsl:attribute>
    <xsl:attribute name="ALT"><xsl:value-of select="."/></xsl:attribute>
  </xsl:element>
</xsl:template>

Alternatively, the shorthand curly brackets can be used:

<xsl:template match="image">
  <IMG SRC="{@file}" HEIGHT="{@y}" WIDTH="{@x}" ALT="{text()}"/>
</xsl:template>

Conditional Constructs

XSLT provides two elements to select optional pieces of a template:

If statement:

<xsl:if test="not(position() = last())">
  <xsl:text>, </xsl:text>
</xsl:if>

Multiple choices:

<xsl:choose>
  <xsl:when test="@type='error'">
     <FONT color="red"><xsl:apply-templates/></FONT>
  </xsl:when>
  <xsl:when test="@type='warning'">
     <FONT color="yellow"><xsl:apply-templates/></FONT>
  </xsl:when>
  <xsl:otherwise>
    <xsl:apply-templates/>
  </xsl:otherwise>
</xsl:choose>

Sorting

Sorting specified in <apply-templates>:

Input:
  <people>
    <name>
      <first>John</first>
      <last>Smith</last> 
      <age>53</age>
    </name>
    <name>
      <first>Frank</first>
      <last>Furter</last>
      <age>35</age>
    </name>
  </people>

Template:
  <xsl:template match="people">
    <DIV>
      <xsl:apply-templates>
        <xsl:sort select="last"/>
        <xsl:sort data-type="number" select="age"/>
      </xsl:apply-templates>
    </DIV>
  </xsl:template>
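
The effect of the two sort keys can be approximated in Python with a tuple as the sort key: the first component plays the role of the primary key, the second that of the numeric secondary key. This is a sketch only; an XSLT processor may order text in a language-dependent way, whereas Python compares strings by code point.

```python
import xml.etree.ElementTree as ET

people = ET.fromstring(
    "<people>"
    "<name><first>John</first><last>Smith</last><age>53</age></name>"
    "<name><first>Frank</first><last>Furter</last><age>35</age></name>"
    "</people>"
)


def sort_key(name):
    # Primary key: <last> as text; secondary key: <age> as a number.
    return (name.findtext("last"), float(name.findtext("age")))


ordered = [n.findtext("first") for n in sorted(people.findall("name"), key=sort_key)]
print(ordered)
```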

Numbering

Numbering is specified in <template> with <number>:

The number is the position of the element within its list of sibling elements:

Input:                            Output:                  
  <people>                            
    <name>John Smith</name>           1) John Smith  
    <name>Frank Furter</name>         2) Frank Furter
  </people>

Template:
  <xsl:template match="name">
    <xsl:number/> <xsl:text>) </xsl:text>
    <xsl:apply-templates/>
  </xsl:template>

Formatting:

Template:
    <xsl:number format="a"/>. <xsl:apply-templates/>
Output:
    a. John Smith
    b. Frank Furter

Template:
    <xsl:number format="(i)"/> <xsl:apply-templates/>
Output:
    (i) John Smith
    (ii) Frank Furter

Advanced Numbering

Manipulating the counter:

Template:
    <xsl:number value="position()" format="1) "/>
Output:
    1) John Smith
    2) Frank Furter

Template:
    <xsl:number value="last() + 1 - position()" format="1) "/>
Output:
    2) John Smith
    1) Frank Furter

Element selection:

Count only those that have status different from ignore:
```
<xsl:number count="item[not(@status='ignore')]"/>
```
Count both normal and special:
```
<xsl:number count="normal | special"/>
```

Multipart numbering:

    <xsl:number count="div1"/>.
    <xsl:number count="div2"/>)

equivalently:

    <xsl:number level="multiple" count="div1 | div2" format="1.1)"/>

Document-wide numbering:

  <xsl:template match="table/title">
    <TITLE>
      <xsl:number level="any" count="table"/>
      <xsl:apply-templates/>
    </TITLE>
  </xsl:template>
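
A small sketch of multipart numbering with level="multiple". The input structure (doc, div1, div2, head) is an assumption made for illustration and is shown in the comment.

```xml
<!-- Assumed input:
  <doc>
    <div1><head>Intro</head>
      <div2><head>Scope</head></div2>
      <div2><head>Terms</head></div2>
    </div1>
    <div1><head>Method</head>
      <div2><head>Data</head></div2>
    </div1>
  </doc>
-->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- Top-level headings: number of the enclosing div1 -->
  <xsl:template match="div1/head">
    <xsl:number count="div1" format="1. "/>
    <xsl:value-of select="."/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Second-level headings: div1 number, then div2 number -->
  <xsl:template match="div2/head">
    <xsl:number level="multiple" count="div1 | div2" format="1.1) "/>
    <xsl:value-of select="."/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Under these assumptions the headings should be numbered 1., 1.1), 1.2), 2., 2.1).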

Linking with IDs

XML IDs can be used in XSLT:

The ID attribute must be declared in the DTD:
```
<!ATTLIST div name ID #REQUIRED>
```

IDs are accessed with the id() XPath function:

<xsl:template match="id('chp-intro')"> ... </xsl:template>

The argument of id() can be a whitespace-separated list of IDs, like an IDREFS value:

<xsl:template match="id('chp-intro chp-conc')"> ... </xsl:template>
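
Besides match patterns, id() can also be used in select expressions to follow a reference. A sketch, assuming a hypothetical <ref target="..."/> element and the input shown in the comment; the processor must read the DTD for id() to find anything.

```xml
<!-- Assumed input:
  <!DOCTYPE doc [
    <!ATTLIST div name ID #REQUIRED>
  ]>
  <doc>
    <div name="chp-intro"><head>Introduction</head></div>
    <div name="chp-conc"><head>Conclusion</head>
      <ref target="chp-intro"/>
    </div>
  </doc>
-->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- For every reference, print the heading of the element it points to -->
  <xsl:template match="/">
    <xsl:for-each select="//ref">
      <xsl:value-of select="id(@target)/head"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```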

Linking with Keys

A more flexible method is possible by using keys:

Keys do not have to be defined in a DTD, be stored in an attribute, be unique, or refer to only one element.
Keys are defined by a name, the elements that are referred to, and the part of them that is considered the identifier value:
```
<xsl:key name="Personnel" match="people/name" use="last"/>
```

Keys are accessed by using the key() XPath function:

<xsl:template match="key('Personnel', 'Smith')"> ... </xsl:template>

Keys are just a shorthand for a predicate, but XSLT software might process keys more efficiently.
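
A sketch of a key lookup in a select expression, assuming the <people> input from the Sorting slide (with <first> and <last> children). The equivalent predicate would be //people/name[last='Smith'].

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Index every people/name element by the value of its <last> child -->
  <xsl:key name="Personnel" match="people/name" use="last"/>

  <!-- Look up all persons whose last name is Smith -->
  <xsl:template match="/">
    <xsl:for-each select="key('Personnel', 'Smith')">
      <xsl:value-of select="first"/>
      <xsl:text> </xsl:text>
      <xsl:value-of select="last"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```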

Variables

Variables are named objects that hold values.

Defining the value:

<xsl:variable name="color">red</xsl:variable>

XSLT is a declarative language, so a value of a variable cannot be changed:

  <xsl:variable name="n">1</xsl:variable>
  <xsl:variable name="n">2</xsl:variable> <!-- WRONG! -->

However, a variable definition in a template shadows one made globally:

<xsl:stylesheet ...>
  <xsl:variable name="level">1</xsl:variable>
  <xsl:template match="*">
    <xsl:variable name="level">2</xsl:variable> <!-- OK -->
    ...
  </xsl:template>
</xsl:stylesheet>

Variables are referenced by prefixing their name with $:

The sky was
<FONT color="{$color}"><xsl:value-of select="$color"/></FONT>

Variables can also contain result-tree fragments:

<xsl:variable name="warning"><hi>Warning!</hi></xsl:variable>
...
<xsl:copy-of select="$warning"/>
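
The pieces above can be combined into one stylesheet. This is only a sketch: the P and B output elements are assumptions chosen for illustration.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Global variables: a string and a result-tree fragment -->
  <xsl:variable name="color">red</xsl:variable>
  <xsl:variable name="warning"><B>Warning!</B></xsl:variable>

  <xsl:template match="/">
    <P>
      <!-- copy-of inserts the fragment with its markup -->
      <xsl:copy-of select="$warning"/>
      <xsl:text> The sky was </xsl:text>
      <!-- {$color} in an attribute, value-of in element content -->
      <FONT color="{$color}"><xsl:value-of select="$color"/></FONT>
    </P>
  </xsl:template>
</xsl:stylesheet>
```

Using <xsl:value-of select="$warning"/> instead of <xsl:copy-of> would output only the text "Warning!" without the B element.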

Named Templates

Repetitive output structures can be encapsulated in named templates.

A named template is defined in the usual way but it is given a name:
```
<xsl:template name="line">
  <BR/><HR/><BR/>
</xsl:template>
```

It is invoked with <xsl:call-template>:

  <xsl:template match="chapter">
    <xsl:call-template name="line"/>
    <H1>New chapter</H1> <xsl:apply-templates/>
  </xsl:template>

Templates can have parameters:

  <xsl:template name="colorize">
    <xsl:param name="color">white</xsl:param>
     <FONT color="{$color}"> <xsl:apply-templates/>
     </FONT>
  </xsl:template>

  <xsl:template match="error">
    <xsl:call-template name="colorize">
      <xsl:with-param name="color">red</xsl:with-param>
    </xsl:call-template>
  </xsl:template>
  <xsl:template match="warning">
    <xsl:call-template name="colorize">
      <xsl:with-param name="color">yellow</xsl:with-param>
    </xsl:call-template>
  </xsl:template>
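
When <xsl:with-param> is omitted, the default given in <xsl:param> applies. A sketch of both cases; the note element is a hypothetical addition for illustration.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Named template with a parameter that defaults to white -->
  <xsl:template name="colorize">
    <xsl:param name="color">white</xsl:param>
    <FONT color="{$color}"><xsl:apply-templates/></FONT>
  </xsl:template>

  <!-- Parameter passed explicitly -->
  <xsl:template match="error">
    <xsl:call-template name="colorize">
      <xsl:with-param name="color">red</xsl:with-param>
    </xsl:call-template>
  </xsl:template>

  <!-- No parameter passed: the default value is used -->
  <xsl:template match="note">
    <xsl:call-template name="colorize"/>
  </xsl:template>
</xsl:stylesheet>
```

The context node does not change in <xsl:call-template>, so <xsl:apply-templates/> inside colorize processes the children of the calling element.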

Other XSLT Features

When the stylesheet would consist of only one template (for the root node), we can use the single-template shortcut, where the stylesheet is itself a literal result element:

<BOOK xsl:version="1.0"
       xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  ...
  <xsl:value-of .../> ...
</BOOK>

When we want to loop through elements directly, we can use direct processing:

<xsl:for-each select="row">
  ...
  <xsl:for-each select="cell">
    ...
  </xsl:for-each>
</xsl:for-each>
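
A sketch of the nested loop filled in, assuming a hypothetical table element that contains the row and cell elements and is rendered as an HTML table:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <xsl:template match="table">
    <TABLE>
      <!-- Outer loop: one TR per row -->
      <xsl:for-each select="row">
        <TR>
          <!-- Inner loop: one TD per cell; the context node is now the row -->
          <xsl:for-each select="cell">
            <TD><xsl:value-of select="."/></TD>
          </xsl:for-each>
        </TR>
      </xsl:for-each>
    </TABLE>
  </xsl:template>
</xsl:stylesheet>
```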

For a simple kind of debugging, we can use <xsl:message>:

<xsl:template match="//warning/*/para">
  <xsl:message>Template //warning/*/para is activated!</xsl:message>
  <xsl:apply-templates/>
</xsl:template>

XSLT Version 2.0

XSLT Version 2.0, W3C Working Draft; April 2002

What's new in XSLT V2.0:

not 100% backward compatible with XSLT V1.0;
many terminological and other changes in the specification;
a transformation can produce multiple result trees;
support for XPath V2.0 and stronger data typing;
facilities are introduced for grouping of nodes;
creation of user-defined functions within the stylesheet, that can be called from XPath expressions;
improved sorting;
an XHTML output method has been added.
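
As an illustration of the grouping facilities, here is a sketch that groups persons by last name. It assumes the <people> input from the Sorting slide and an XSLT 2.0 processor, and it uses the syntax of the later XSLT 2.0 specification, which may differ in detail from the working draft cited above.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="people">
    <!-- One group per distinct value of <last> -->
    <xsl:for-each-group select="name" group-by="last">
      <xsl:sort select="current-grouping-key()"/>
      <xsl:value-of select="current-grouping-key()"/>
      <xsl:text>: </xsl:text>
      <!-- All first names in the current group, comma-separated -->
      <xsl:value-of select="current-group()/first" separator=", "/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>
```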

XSL

Introduction to XSL

Extensible Stylesheet Language (XSL) Version 1.0; W3C Recommendation October 2001

XSL is a markup language suitable for formatting material to screen and paper;
XSL is a powerful and complex language with 51 formatting object types, such as blocks, inline areas, lists, tables, dynamic features and links. Formatting objects are configured using some of the 231 properties also specified.
Currently, only limited support for typesetting with XSL exists; a common approach is to convert XSL documents to TeX, and from there to e.g. PDF or PostScript.

Formatting Objects

XSL formatting instructions are XML elements containing the text they apply to;
An XSL formatting instruction is called a formatting object;
The namespace of formatting objects is http://www.w3.org/1999/XSL/Format; it is commonly mapped to the prefix fo.

For example:

An HTML element:
```
<P>A <B>bold</B> statement.</P>
```

An equivalent XSL element:

<fo:block>A <fo:wrapper font-weight="bold">bold</fo:wrapper> statement.</fo:block>

An XSLT stylesheet implementing the transformation:

<stylesheet version="1.0"
       xmlns="http://www.w3.org/1999/XSL/Transform"
       xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <template match="P">
    <fo:block><apply-templates/></fo:block>
  </template>
...
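
A fuller sketch of such a stylesheet, under the assumption that the input consists of P elements with optional B children directly below the document element. The page dimensions and the master name "only" are illustrative choices; here the xsl prefix is used instead of the default namespace.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <!-- Wrap the whole document in a minimal page layout -->
  <xsl:template match="/">
    <fo:root>
      <fo:layout-master-set>
        <fo:simple-page-master master-name="only"
            page-height="29.7cm" page-width="21cm"
            margin-top="2cm" margin-bottom="2cm"
            margin-left="2.5cm" margin-right="2.5cm">
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>
      <fo:page-sequence master-reference="only">
        <fo:flow flow-name="xsl-region-body">
          <xsl:apply-templates/>
        </fo:flow>
      </fo:page-sequence>
    </fo:root>
  </xsl:template>

  <!-- Paragraphs become blocks -->
  <xsl:template match="P">
    <fo:block><xsl:apply-templates/></fo:block>
  </xsl:template>

  <!-- Bold text becomes a wrapper with a font-weight property -->
  <xsl:template match="B">
    <fo:wrapper font-weight="bold"><xsl:apply-templates/></fo:wrapper>
  </xsl:template>
</xsl:stylesheet>
```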

Templates and Content

The root element of an XSL document is <root>; it contains two major sections: templates and content;
Templates (page masters) specify the characteristics of the pages to display or print;
Content is enclosed in page sequences, each making reference to a template.

<root>
  <layout-master-set>
    <simple-page-master master-name="front">
      <!-- TEMPLATE 1 -->
    </simple-page-master>
    <simple-page-master master-name="body">
      <!-- TEMPLATE 2 -->
    </simple-page-master>
  </layout-master-set>

  <page-sequence master-reference="front">
      <!-- CONTENT 1 -->
  </page-sequence>

  <page-sequence master-reference="body">
      <!-- CONTENT 2 -->
  </page-sequence>
</root>

An example

<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format" >
  <fo:layout-master-set>
    <fo:simple-page-master master-name="only" page-height="29.7cm" 
        page-width="21cm" margin-top="1cm" margin-bottom="2cm" 
        margin-left="2.5cm" margin-right="2.5cm">
      <fo:region-body margin-top="3cm"/>
      <fo:region-before extent="3cm"/>
      <fo:region-after extent="1.5cm"/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="only" initial-page-number="1">
    <fo:static-content flow-name="xsl-region-before">
      <fo:block text-align="end" font-size="10pt" font-family="serif" 
          line-height="14pt">
XML Recommendation - p. 
         <fo:page-number/>
      </fo:block>
    </fo:static-content>
    <fo:flow flow-name="xsl-region-body">
      <fo:block font-size="18pt" font-family="sans-serif" 
          line-height="24pt" space-after.optimum="15pt" 
          background-color="blue" color="white" text-align="center" 
          padding-top="0pt"> 
Extensible Markup Language (XML) 1.0 
      </fo:block>
     <fo:block font-size="16pt" font-family="sans-serif" 
         line-height="20pt" space-before.optimum="10pt" 
         space-after.optimum="10pt" text-align="start" padding-top="0pt">
Abstract
     </fo:block>
     <fo:block font-size="12pt" font-family="sans-serif" 
         line-height="15pt" space-after.optimum="3pt" text-align="start">
The Extensible Markup Language (XML) is a subset of SGML that is
completely described in this document. Its goal is to enable generic
SGML to be served, received, and processed on the Web in the way that
is now possible with HTML. XML has been designed for ease of
implementation and for interoperability with both SGML and HTML. For
further information go to
       <fo:basic-link external-destination="normal.pdf">normal.pdf</fo:basic-link>
     </fo:block>
...

Other XML Companion Recommendations

Other XML Related Recommendations

“The nice thing about standards is that there are so many of them.”

XML Information Set (Infoset): A set of definitions for use in specifications that need to refer to the information in an XML document.
XML Linking Language (XLink): A language that allows elements to be inserted into XML documents in order to create and describe sophisticated links between resources.
XML Pointer Language (XPointer): XPath-based language to be used as a fragment identifier for any URI-reference that locates an XML resource; supports addressing into the internal structures of XML documents.
XML Query: XPath-based query language, designed to be broadly applicable across many types of XML data sources.
Simple API for XML (SAX): SAX (not a W3C recommendation) was the first widely adopted API for XML in Java, and is a de facto standard; now there are versions for several other programming language environments.
Document Object Model (DOM): A platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents.
XHTML: This specification defines XHTML 1.0, a reformulation of HTML 4 as an XML 1.0 application.
