1.5.1. Introducing entities
XML provides a method of encoding and naming arbitrary parts of the
content of a document in a portable way. In XML the word
entity has a special sense: it means a named part of a marked
up document, irrespective of any structural considerations. An entity
might be a string of characters or a whole file of text. Entities are
declared in a DTD in the same way as elements or attributes, and they
are included in a document, or even in a DTD, using a construction
known as an entity reference.
Entities are divided into:
- General entities, used in the document instance, and further
  divided into:
  - internal entities
  - external entities
- Parameter entities, used in the DTD, also divided into:
  - internal parameter entities
  - external parameter entities
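The two internal kinds can be illustrated with a minimal sketch. The
element names (note, emph) and entity names (emphDecl, tei) are
hypothetical and chosen only for illustration. Within the internal DTD
subset a parameter entity reference may appear only between
declarations, which is why the parameter entity here holds a complete
declaration.

```xml
<?xml version="1.0"?>
<!DOCTYPE note [
  <!-- Internal parameter entity: declared with "%", referenced as
       %name; and usable only inside the DTD. -->
  <!ENTITY % emphDecl "<!ELEMENT emph (#PCDATA)>">
  %emphDecl;
  <!ELEMENT note (#PCDATA | emph)*>
  <!-- Internal general entity: referenced as &name; in the
       document instance. -->
  <!ENTITY tei "Text Encoding Initiative">
]>
<note>The <emph>&tei;</emph> publishes guidelines.</note>
```

After parsing, the content of emph is the replacement text of the tei
entity, "Text Encoding Initiative".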
1.5.3. Entities and characters
A special type of entity is the character entity, used for cases
where the system supports a character, but it cannot be entered
directly. Instead, a character reference is given, which starts with
&# followed by the decimal number of the character, or with &#x
followed by the hexadecimal number of the character, and ends with a
semicolon, e.g.:
Saarbr&#252;cken
Characters can also be encoded as general entities with descriptive
names. A number of such entity sets for XML (derived from SGML entity
sets) already exist, and can be imported into XML files.
<!-- iso-lat2.ent (initially distributed with DocBook XML DTD) -->
<!-- Derived from the corresponding ISO 8879 standard entity set
and Unicode character mappings provided by Sebastian Rahtz -->
<!ENTITY abreve "&#x0103;"> <!--LATIN SMALL LETTER A WITH BREVE-->
<!ENTITY Abreve "&#x0102;"> <!--LATIN CAPITAL LETTER A WITH BREVE-->
<!ENTITY amacr "&#x0101;"> <!--LATIN SMALL LETTER A WITH MACRON-->
<!ENTITY Amacr "&#x0100;"> <!--LATIN CAPITAL LETTER A WITH MACRON-->
...
With such definitions we can then write a Romanian text in 7-bit ASCII:
&Icirc;ntr-o zi senin&abreve; &scedil;i friguroas&abreve; de aprilie,
pe c&acirc;nd ceasurile b&abreve;teau ora treisprezece, Winston Smith,
cu b&abreve;rbia &icirc;nfundat&abreve; &icirc;n piept pentru a
sc&abreve;pa de v&acirc;ntul care-l lua pe sus, se strecur&abreve;
iute prin u&scedil;ile de sticl&abreve; ale Blocului Victoria,...
This would be displayed as:
Într-o zi senină şi friguroasă de aprilie,
pe când ceasurile băteau ora treisprezece, Winston Smith,
cu bărbia înfundată în piept pentru a
scăpa de vântul care-l lua pe sus, se strecură
iute prin uşile de sticlă ale Blocului Victoria,...
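An entity set of this kind is typically pulled into a document through
an external parameter entity. The following is a minimal sketch: the
relative path iso-lat2.ent, the parameter entity name ISOlat2 and the
element p are assumptions made for illustration, not part of the
entity set itself.

```xml
<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE p [
  <!ELEMENT p (#PCDATA)>
  <!-- External parameter entity pointing at the entity set;
       the relative path is an assumption. -->
  <!ENTITY % ISOlat2 SYSTEM "iso-lat2.ent">
  <!-- Referencing it reads the declarations into the DTD. -->
  %ISOlat2;
]>
<p>b&abreve;rbia</p>
```

A non-validating processor is not obliged to read external entities,
so in that case the reference &abreve; may be left unresolved.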
1.5.4. Predefined entities
Note that entity references (or character references) are also the
usual way to include the characters < and & in the text of XML
documents, as they have special meaning for the parser.
For well-formed XML documents (i.e. those without a DTD) the following
entities are taken as predefined by XML processors:
<!ENTITY lt "&#38;#60;"> <!-- less than, < -->
<!ENTITY gt "&#62;"> <!-- greater than, > -->
<!ENTITY amp "&#38;#38;"> <!-- ampersand, & -->
<!ENTITY apos "&#39;"> <!-- apostrophe, ' -->
<!ENTITY quot "&#34;"> <!-- quote, " -->
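As a brief illustration, the predefined entities can be used in a
document that has no DTD at all. The element name example and the
attribute note are hypothetical.

```xml
<?xml version="1.0"?>
<!-- No DTD: only the five predefined entities are available. -->
<example note="say &quot;hello&quot;">
  if (a &lt; b &amp;&amp; b &gt; 0) { print('it&apos;s fine'); }
</example>
```

Only &lt; and &amp; are strictly required in element content here;
&quot; is needed because the attribute value is delimited by double
quotes, while &gt; and &apos; are used for symmetry.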
1.5.5. External entities
External entity references are substituted by the contents of files.
External entities are referred to by SYSTEM and, optionally,
PUBLIC identifiers:
<!ENTITY Chap1 SYSTEM "P4X/p4chap2.xml">
<!ENTITY Chap2 SYSTEM "http://www.tei-c.org/P4X/p4chap2.xml">
<!ENTITY Chap3 PUBLIC "-//TEI//TEXT Guidelines Chapter on XML//EN"
"http://www.tei-c.org/P4X/p4chap2.xml">
- In XML, an external entity must have a SYSTEM identifier, the value
of which is a URI (Uniform Resource Identifier)
[a URI identifies a resource by
meta-information of any kind; in contrast, a URL locates a resource
on the net, which means that if you have a URL and the appropriate
protocol you can retrieve the resource.]
- Mapping from PUBLIC to SYSTEM
identifiers is performed via the catalog file,
which the XML processor must be aware of;
- PUBLIC identifiers have a formalised structure of four fields
separated by //:
  - the first designates the registering body (e.g. ISO); if none,
    the value is -
  - the second identifies the owner of the PUBLIC identifier
  - the third gives a) the type of the entity referred to
    (e.g. TEXT, DTD, ENTITIES) and
    b) a descriptive name of the entity
  - the fourth gives the ISO 639 language code for the human language
    in which the entity is written
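Applied to the Chap3 declaration above, the four fields are: - (no
registering body), TEI (the owner), TEXT plus the descriptive name
"Guidelines Chapter on XML", and EN (the language). The mapping from
this PUBLIC identifier to a SYSTEM identifier might be expressed as in
the following sketch, which assumes the OASIS XML Catalogs format (the
text above does not say which catalog format is meant) and a
hypothetical local path.

```xml
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Maps the PUBLIC identifier of Chap3 to a local copy;
       the local path is a hypothetical example. -->
  <public publicId="-//TEI//TEXT Guidelines Chapter on XML//EN"
          uri="P4X/p4chap2.xml"/>
</catalog>
```

A catalog-aware processor that meets the PUBLIC identifier would then
read the local file instead of fetching the SYSTEM identifier given in
the declaration.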
External entities are referenced in the document just as internal
ones are:
<body> &Chap1; &Chap2; &Chap3; </body>
1.5.6. Unparsed entities and notations
Entities may contain non-textual data, e.g. digitised pictures or
video.
When such entities are declared it is
essential to indicate that they contain data which an XML parser or
processor cannot handle in the same way as the surrounding data:
<!ENTITY fig1 SYSTEM "figure1.png" NDATA png>
<!NOTATION png PUBLIC
'-//TEI//NOTATION IETF RFC2083 Portable Network Graphics//EN'>
- NDATA keyword indicates the entity contains unparsed data
- png is the (unique) name of a defined notation, which
indicates to the processor how to treat the external entity
- the notation is defined with its own declaration and is given a
SYSTEM or PUBLIC identifier
The unparsed entity must be referenced in an ENTITY attribute value,
not in content:
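A minimal sketch of such a reference follows. It reuses the fig1
entity and png notation declared above; the element figure and its
attribute entity are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE figure [
  <!NOTATION png PUBLIC
    '-//TEI//NOTATION IETF RFC2083 Portable Network Graphics//EN'>
  <!ENTITY fig1 SYSTEM "figure1.png" NDATA png>
  <!-- Hypothetical element with an attribute of type ENTITY. -->
  <!ELEMENT figure EMPTY>
  <!ATTLIST figure entity ENTITY #REQUIRED>
]>
<!-- The attribute value is the bare entity name, not &fig1; -->
<figure entity="fig1"/>
```

The parser does not read figure1.png itself; it reports the entity
name, its identifiers and its notation to the application, which
decides how to handle the data.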