Text Encoding Initiative

The XML Version of the TEI Guidelines

Notes for TEI P4 Guidelines for Electronic Text Encoding and Interchange XML-compatible edition


1. TEI ED W69, available from the TEI website at http://www.tei-c.org/Vault/ED/edw69.htm.

2. TEI documents bear identifying numbers which indicate the provenance of the document (here simply ‘TEI’, in other cases the TEI work group number, e.g. ‘TEI AI5’), the type of document (here ‘U’ and ‘T’, meaning users' guide or users' manual and sample text(s)), and a sequential number. The TEI document number of the document in hand is TEI P4 (for TEI public proposal number 4).

3. These guidelines do not provide any other schema (XML Schema, RELAX NG, etc.) corresponding to the DTDs, although such may be provided at a later time.

4. As originally published in previous editions of the Guidelines, this chapter provided a gentle introduction to `just enough' SGML for anyone to understand how the TEI used that standard. Since then, the Gentle Guide seems to have taken on a life of its own independent of the Guidelines, having been widely distributed (and flatteringly imitated) on the web. In revising it for the present draft, the editors have therefore felt free to reduce considerably its discussion of SGML-specific matters, in favour of a simple presentation of how the TEI uses XML.

5. International Organization for Standardization, ISO 8879: Information processing – Text and office systems – Standard Generalized Markup Language (SGML), ([Geneva]: ISO, 1986).

6. World Wide Web Consortium: Extensible Markup Language (XML) 1.0, available from http://www.w3.org/TR/REC-xml

7. In the ‘continuous writing’ characteristic of manuscripts from the early classical period, words are written continuously with no intervening spaces or punctuation.

8. We do not here discuss in any detail the ways that a style sheet can be used or defined, nor do we discuss the increasingly popular W3C Stylesheet Languages. See http://www.w3.org/TR/xsl for the Extensible Stylesheet Language (XSL), and http://www.w3.org/TR/xslt for the XSL Transformations (XSLT) Language.

9. See Extensible Markup Language (XML) 1.0, Section 2.2 Characters.

10. ISO/IEC 10646-1993 Information Technology — Universal Multiple-Octet Coded Character Set (UCS)

11. See http://www.unicode.org/

12. In SGML (but not in XML) the name and the content model may be separated by an additional part of the declaration which specifies `omission rules' for the element concerned. These rules state whether or not start- and end-tags must be present for every occurrence of the element concerned: as noted above, such tag omission is not permitted in XML, and is not permitted in the TEI Interchange format.

13. Because the opening angle bracket has this special function in an XML document, special steps must be taken to use that character for other purposes (for example, as the mathematical less-than operator); see further 2.7.2 Entity references; in SGML (but not XML) different characters may be defined for use as any of the delimiting characters (the angle brackets, exclamation mark and solidus).
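
As a minimal sketch of the escaping this note describes, the following Python fragment replaces a literal less-than sign with the predefined entity reference &lt; and confirms that an XML parser recovers the original character. The sample sentence and the element name p are assumptions chosen only for illustration; they do not come from the Guidelines.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Hypothetical text containing the mathematical less-than operator.
raw = "if a < b then a is smaller"

# escape() rewrites "&", "<" and ">" as entity references.
escaped = escape(raw)

# Wrap the escaped text in an illustrative element and parse it back.
document = "<p>" + escaped + "</p>"
parsed = ET.fromstring(document)
```

After parsing, parsed.text holds the original string again, so the entity reference affects only the serialized form of the document.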

14. The example is taken from William Blake's Songs of innocence and experience (1794). The markup is designed for illustrative purposes and is not TEI-conformant.

15. This is not strictly true for empty elements, for which start- and end-tags can be combined, as further discussed below.

16. Note that this simple example has not addressed the problem of marking elements such as sentences explicitly; the implications of this are discussed below in section 2.5 Complicating the issue.

17. The DTD language described in the remainder of this section is neither the only way of representing such criteria, nor the most powerful. One important alternative is provided by another W3C Recommendation: the XML Schema language (http://www.w3.org/XML/Schema); another is provided by the OASIS Committee's specification for Relax NG (http://www.oasis-open.org/committees/relax-ng/). It is highly probable that future releases of these Guidelines will use such a language, in preference to, or as well as, a DTD.

18. In SGML (but not in XML) the name and the content model are separated by an additional part of the declaration which specifies minimization rules for the element concerned. Minimization (informally speaking, whether or not start- and end-tags must be present in every occurrence of the element concerned) is not permitted in XML, and is not recommended in the TEI Interchange format.

19. In XML, a single colon may also appear in a GI, where it has a special significance related to the use of namespaces, as further discussed in section 2.9.2 Namespaces. The characters defined by Unicode as combining characters and as extenders are also permitted. In SGML, the rules stated informally here may vary somewhat depending on the SGML declaration in force; in particular, it is not usually the case that upper and lower case letters are distinguished, although such usage is highly recommended for TEI Interchange. The present version of the Guidelines does not mandate this, for compatibility reasons, but this is likely to change in a subsequent release.

20. In SGML (but not XML), a third connector, the ampersand, is sometimes used, signifying that the components connected by it may appear in either order. Its use is not supported (or recommended) by the TEI interchange format of SGML.

21. It will not have escaped the astute reader that the fact that verse paragraphs need not start on a line boundary seriously complicates the issue; see further section 2.5 Complicating the issue.

22. In SGML, but not XML, it is possible to use a group of names instead of a single GI within an element declaration, so the three declarations could be combined like this:

<!ELEMENT (line|firstLine|secondLine) O O (#PCDATA)>
This is not however supported by the TEI Interchange Format.

23. The (good) rationale for these restrictions is beyond the scope of this tutorial, as are the consequences of attempting to evade them. The TEI content models all obey these constraints.

24. See Renear, A., Mylonas, E., Durand, D., Refining our notion of what text really is: the problem of overlapping hierarchies, in Ide and Hockey, eds., Research in Humanities Computing, OUP, 1996.

25. SGML (but not XML) provides a mechanism to define `concurrent' document structures, which is discussed in chapter 31 Multiple Hierarchies below; however, this is not widely implemented, and is not further discussed here.

26. In SGML, the quotation marks may be omitted in certain circumstances; however their use is required by the TEI interchange format.

27. As with content models, it is possible in SGML (but not in XML) to combine several attribute specifications together in a single declaration by supplying a list of element names instead of a single name; this is not however done in the current version of the TEI DTDs.

28. These parts are conventionally lined up in rows for human readability; the parser only requires that there be some kind of whitespace between them.

29. XML also permits representation of empty elements by an immediately adjacent start- and end-tag, thus

<poemRef target='Rose'></poemRef>
Neither form is by default permitted for elements declared as EMPTY in an SGML context, for which empty elements should be represented by a start-tag in isolation, unless the SGML declaration has been modified to permit the first XML style cited above. Conversion of the way empty elements are represented is thus usually necessary when processing SGML legacy data in an XML environment.
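
The equivalence of the two XML spellings can be checked with a short Python sketch. It assumes the standard library parser xml.etree.ElementTree and reuses the poemRef example from this note; it says nothing about SGML parsers, which are outside the scope of the sketch.

```python
import xml.etree.ElementTree as ET

# Combined start- and end-tag (XML empty-element syntax).
combined = ET.fromstring("<poemRef target='Rose'/>")

# Immediately adjacent start- and end-tag.
adjacent = ET.fromstring("<poemRef target='Rose'></poemRef>")

# An XML parser reports the same element either way.
same_name = combined.tag == adjacent.tag
same_attributes = combined.attrib == adjacent.attrib
both_empty = (
    combined.text is None and adjacent.text is None
    and len(combined) == 0 and len(adjacent) == 0
)
```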

30. In general, an external entity can be any data source available to the XML processor: files, results of database queries, results of calls to system functions, web pages — anything at all. System identifiers can use any method to name an entity which the XML parser's interface to its operating environment can use to elicit data from the environment.

31. In SGML (but not XML) the semicolon may be omitted if the entity reference is followed by whitespace; this is not recommended practice, and may be prohibited in future revisions of these Guidelines.

32. This restriction does not apply to SGML documents, which may employ conditional marked sections within the document instance. Such usage is not recommended where XML/SGML compatibility is a consideration.

33. This is explained in more detail in section 2.10.2 The DOCTYPE declaration below; the key point for our present purposes is that declarations in the DTD subset are always read before those in the external DTD file, and, as mentioned above in section 2.7.5 Parameter entities, the first declaration of a given entity is the one which counts.
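
The rule that the first declaration of an entity is the one which counts can be illustrated with a small Python sketch. It uses a general entity declared twice in the DTD subset, rather than the parameter entities and external DTD file discussed in the Guidelines, and the entity name edition is hypothetical; the parser is the expat-based one in the standard library.

```python
import xml.etree.ElementTree as ET

# The same (hypothetical) entity is declared twice in the DTD subset.
document = """<!DOCTYPE note [
  <!ENTITY edition "first declaration">
  <!ENTITY edition "second declaration">
]>
<note>&edition;</note>"""

root = ET.fromstring(document)

# The parser binds the entity to the earlier of the two declarations.
resolved = root.text
```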

34. The SGML Open catalog format is documented in SGML Open Technical Resolution 9401:1997, Entity Management, which is available from http://xml.coverpages.org/sotr9401-a2.html; the XML Catalog specification, also produced by OASIS, is available from their site at http://www.oasis-open.org/committees/entity/spec.html.

35. A parameter entity is an entity used only in markup declarations; references to parameter entities are delimited by a percent sign and a semicolon rather than the ampersand and semicolon used for general entity references. The entity TEI.core.ent, for example, would be referred to using the string %TEI.core.ent;. Parameter entities can also be used to control the inclusion or exclusion of marked sections of the document or DTD; the TEI DTD uses marked sections to handle the selection of different base and additional tag sets.

36. More exactly, these are the attributes of the element class global, to which all elements belong; for further discussion of attribute classes and ways in which attributes may be inherited and over-ridden, see section 3.7.1 Classes Which Share Attributes.

37. A dummy element class TEIform is defined in the reference section, solely for documentary purposes.

38. The colon is also by default a valid name character; however, it is reserved for a specific purpose in XML (to indicate namespace prefixes), and is not therefore generally recommended by these Guidelines, for compatibility reasons.

39. Validation checks that all IDREF values exist as id values on elements somewhere in the current SGML document. It is a requirement of the TEI scheme, not of SGML or XML, that the lang attribute point to a <language> element.

40. The TEIform attribute is based on the notion of architectural forms developed for HyTime (ISO 10744).

41. Because the details of their pointing mechanism differ, the members of the pointer class do not, however, share their pointing attributes.

42. Note that in this context, phrase means any string of characters, and can apply to individual words, parts of words, and groups of words indifferently; it does not refer only to linguistically motivated phrasal units. This may cause confusion for readers accustomed to applying the word in a more restrictive sense.

43. It is expected that after completion of the full text of these Guidelines, the TEI will prepare alternate sets of generic identifiers in languages other than English. It should be noted, however, that in the interests of simplicity parameter entities are used only for generic identifiers. Attribute names, standard attribute values, and parameter entity names are less easily modified.

44. Defined by ISO 8601: 2000(E), Data elements and interchange formats — Information interchange — Representation of dates and times ([Geneva]: International Organization for Standardization, 2000).

45. In general, the design goal has been to maintain backwards compatibility: any document conforming to the original (P3) SGML version of the DTD should also conform to an SGML DTD described in the present document (P4), and would conform to the same DTD expressed in XML if it further followed the rules of XML (case sensitivity, always quoting attribute values, etc.). It is not, however, guaranteed that a document conforming to the present DTD will also conform to the previous one.

46. Since the first publication of this chapter, many of its recommendations have been rendered obsolete or obsolescent by the development of ISO/IEC 10646 and the adoption of Unicode as the underlying character set for all XML documents. The chapter has undergone considerable revision to reflect these changes, but further substantial change is likely in the next release of these Guidelines.

47. This informal introduction is derived partly from an excellent tutorial on character code issues written by Jukka Korpela, available from http://www.cs.tut.fi/~jkorpela/chars.html, which includes a useful list of pointers to other introductory tutorial material. Definitive information on the topics discussed here is available from the Unicode Consortium's website at http://www.unicode.org/.

48. The terminology used is summarized in ISO/IEC 2022 (1994)

49. Abstract characters such as the diaeresis or umlaut symbol, which are combined with others to form new characters, are technically known as composing or combining characters.
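
A brief Python sketch, assuming the standard unicodedata module, shows a combining character at work: the letter u followed by U+0308 (COMBINING DIAERESIS) is composed into the single character U+00FC by canonical composition (NFC). The choice of the letter u is purely illustrative.

```python
import unicodedata

# Base letter followed by the combining diaeresis: two code points.
base_plus_mark = "u\u0308"

# Canonical composition (NFC) yields the single precomposed character.
composed = unicodedata.normalize("NFC", base_plus_mark)

# A non-zero combining class marks U+0308 as a combining character.
is_combining = unicodedata.combining("\u0308") != 0
```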

50. The way that a numerical value is actually represented as a sequence of bits in computer storage may vary: for example, the number 31 might be represented using 16 or 32 or even 64 bits, with different left-to-right ordering, or with different byte-groupings, on different hardware.
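
The variation described here can be made concrete with a short Python sketch using the standard struct module. It packs the number 31 into 16, 32 and 64 bits with explicit byte orders; the particular widths and orders shown are illustrative choices, not a statement about any specific hardware.

```python
import struct

value = 31

# 16 bits, most significant byte first (big-endian).
big_16 = struct.pack(">H", value)

# 16 bits, least significant byte first (little-endian).
little_16 = struct.pack("<H", value)

# The same number in 32 and 64 bits.
big_32 = struct.pack(">I", value)
little_64 = struct.pack("<Q", value)
```

In every case the abstract number is the same; only the stored byte sequence differs.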

51. See ISO/IEC 15285:1998 Information technology — An operational model for characters and glyphs.

52. For a historical survey, see Charles E. Mackenzie Coded character sets: history and development (Addison-Wesley, 1980); see also Tom Jennings' Annotated history of character codes at http://www.wps.com/texts/codes/.

53. This subset comprised only the following characters taken from the international reference version (IRV) of ISO 646

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
" % & ' ( ) * + , - . / : ; < = > ? _
The 1994 edition of these Guidelines recommended that (for interchange purposes) other characters should be represented with entity references, or with transliterations, documented in an accompanying Writing System Declaration.

54. This section gives only a very short overview of those parts of the Unicode standard relevant to the current discussion. For further and more precise information, the reader should consult the Unicode Consortium website or the book The Unicode Standard Version 3.0. This was the current major edition available in print at the time of writing, and is used in references. Two minor revisions have been made to this, which are documented at the web site of the Unicode Consortium; the version number of the online edition is 3.1.1.

55. Written by Martin Dürst and Asmus Freytag, this document is available from http://www.unicode.org/unicode/reports/tr20/tr20-5.html.

56. The character encoding standards defined by the ISO as `ISO/IEC 10646-1:2000' and the Unicode Consortium as the `Unicode Standard' are identical for most practical purposes: in all instances where we refer to either of the two standards below, the other is also meant to be included.

57. A compatibility character is a character included for compatibility with existing standards, even though it can be represented by other characters or combinations of characters already encoded in the Unicode Standard. See Unicode in XML and other Markup Languages, Section 4 for additional information.
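
As an illustration only, the following Python sketch takes one well-known compatibility character, U+FB01 (LATIN SMALL LIGATURE FI), and shows that compatibility normalization (NFKC) replaces it with the ordinary letters f and i, whereas canonical normalization (NFC) leaves it untouched. The choice of this ligature is an assumption made for the example; the Guidelines do not single it out.

```python
import unicodedata

# U+FB01 is a compatibility character: a ligature of "f" and "i".
ligature = "\ufb01"

# Canonical composition does not alter compatibility characters.
canonical = unicodedata.normalize("NFC", ligature)

# Compatibility composition replaces it with the plain letters.
compatible = unicodedata.normalize("NFKC", ligature)
```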

58. See Unicode Technical Report 15 at http://www.unicode.org/unicode/reports/tr15/.

59. Available at http://www.w3.org/TR/charmod: see section 4.2 Definitions for W3C Text Normalization.

60. The most widely used such entity set is to be found in Annex D to ISO 8879; it is also reproduced or summarized in most SGML textbooks, notably Charles F. Goldfarb, The SGML Handbook (Oxford: Clarendon Press, 1990). Entity sets appropriate for use with both SGML and XML are available from the TEI website.

61. Many reversible transliteration schemes were defined in pre-Unicode days, see for example ALA-LC Romanization Tables: Transliteration Schemes for Non-Roman Scripts, approved by the Library of Congress and the American Library Association, tables compiled and edited by Randall K. Barry (Washington: Library of Congress, 1991).

62. Thesaurus Linguæ Græcæ, Beta Manual (Irvine: TLG, [1988]). See also Luci Berkowitz and Karl A. Squitier, Thesaurus Linguæ Græcæ Canon of Greek Authors and Works 2nd edition (Oxford: Oxford University Press, 1986).

63. When SGML is in use, the lang attribute also implies a particular coded character set (as defined by the associated WSD); in the XML context however, no change in character encoding is implied by a change in the lang value.

64. Codes for the Representation of Names of Languages-Part 2: Alpha-3 Code, ([Geneva]: International Organization for Standardization, 1998). The list of language codes is also available from the Library of Congress, which is the registration authority for ISO 639-2: see http://lcweb.loc.gov/standards/iso639-2/langhome.html.

65. The SIL Ethnologue database at http://www.ethnologue.com/language_code_index.asp is a recommended alternative list of language identifiers.

66. For more information on this highly influential family of standards, first proposed in 1969 by the International Federation of Library Associations, see http://www.ifla.org/VII/s13/pubs/isbd.htm. On the relation between the TEI proposals and other standards for bibliographic description, see further section 5.7 Note for Library Cataloguers.

67. Michael Gorman and Paul W. Winkler, eds., Anglo-American Cataloguing Rules, Second Edition (Chicago: American Library Association; London: Library Association; Ottawa: Canadian Library Association, 1978).

68. Agencies compiling catalogues of machine-readable files are recommended to use available authority lists, such as the Library of Congress Name Authority List, for all common personal names.

69. In the case of a TEI corpus (23 Language Corpora), a <tagsDecl> in a corpus header will describe tag usage across the whole corpus, while one in an individual text header will describe tag usage for the individual text concerned.

70. On the milestone tag itself, what are here referred to as `variables' are identified by the combination of the ed and unit attributes.

71. In an SGML context, the external entity might alternatively be declared using the SUBDOC keyword to indicate that this entity contains SGML data which can be parsed using some other DTD than the current one. Since SUBDOC entities are not provided in XML, this is not recommended for general usage.

72. shy (`soft hyphen') is defined in the standard public entity set ISOnum; Unicode reserves code point 2010 for the hyphen, and 2011 for the `non-breaking' hyphen.
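
The code points named in this note can be looked up with the standard Python unicodedata module, as in the sketch below. It also includes U+00AD, the Unicode character to which the shy entity corresponds; that mapping is stated here as background rather than taken from the note itself.

```python
import unicodedata

# Soft hyphen, hyphen, and non-breaking hyphen.
characters = ("\u00ad", "\u2010", "\u2011")

# Map each code point (as a hexadecimal string) to its Unicode name.
names = {hex(ord(ch)): unicodedata.name(ch) for ch in characters}
```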

73. Although the way in which a spoken text is performed (for example, the voice quality, loudness, etc.) might be regarded as analogous to `highlighting' in this sense, these Guidelines recommend distinct elements for the encoding of such `highlighting' in spoken texts. See further section 11.2.6 Shifts.

74. The Oxford English Dictionary documents the phrase ‘to come down’ in the sense to bring or put down; esp. to lay down money; to make a disbursement as being in use, mostly in colloquial or humorous contexts, from at least 1700 to the latter half of the 19th century.

75. See, for example, Sociolinguistics/Soziolinguistik (An international handbook of the science of language and society. Ein internationales Handbuch zur Wissenschaft von Sprache und Gesellschaft) (Berlin, New York: De Gruyter, 1988), I, pp. 271 and 274.

76. In some contexts, the term ‘regularization’ has a narrower and more specific significance than that proposed here: the <reg> element may be used for any kind of regularization, including normalization, standardization, and modernization.

77. See chapter 14 Linking, Segmentation, and Alignment for a discussion of these elements and the extended syntax they provide for `hypertext' links.

78. Many encoders find it convenient to retain the line breaks of the original during data entry, to simplify proof-reading, but this may be done without inserting a tag for each line break of the original.

79. XML allows many more international characters in identifiers; the legal form of identifiers in SGML depends in part on the SGML declaration. With appropriate modifications in the declaration, other characters may be made legal in identifiers; this is allowed though not encouraged in TEI-conformant SGML documents.

80. For example, to distinguish ‘London’ as an author's name from ‘London’ as a place of publication or as a component of a title.

81. Among the bibliographic software systems and subsystems consulted in the design of the <biblStruct> structure were BibTeX, Scribe, and ProCite. The distinctions made by all three may be preserved in <biblStruct> structures, though the nature of their design prevents a simple one-to-one mapping from their data elements to TEI elements. For further information, see section 6.10.4 Relationship to Other Bibliographic Schemes.

82. American National Standard for Bibliographic References, ANSI Z39.29-1977 (New York: American National Standards Institute, 1977), p. 34 (sec. A.2.2.1).

83. The analysis is not wholly unproblematic: as the text of the standard points out, the first subordinate title is subordinate only to the parallel title in French, while the second is subordinate to both the English main title and the French parallel title, without this relationship being made clear, either in the markup given in the example or in the reference structure offered by the standard.

84. The BibTeX scheme is intentionally compatible with that of Scribe, although it omits some fields used by Scribe. Hence only one list of fields is given here.

85. This convention (corresponding with the idea that a type-set document may begin either with a ‘level 0’ or a ‘level 1’ heading) is provided for convenience and compatibility with some widely used formatting systems.

86. This decision should be recorded in the <sampling> element of the header.

87. As with all lists of `suggested values' for attributes, it is recommended that software written to handle TEI-conformant texts be prepared to recognize and handle these values when they occur, without limiting the user to the values in this list.

88. Definition of such a tag set remains a work item for the TEI; such tag sets for contemporary printed matter already exist or are being created within the publishing industry, for example the Majour (Modular Application for Journals) Project of the European Workgroup on SGML. See for example MAJOUR: Modular Application for Journals: DTD for Article Headers ([n.p.]: EWS, 1991).

89. For discussion of other attributes of this class, see 7.1.4 Partial and Composite Divisions.

90. The eccentric formatting of this example is a consequence of the fact that whitespace within the <seg> element is always retained: this is desirable for the lower level <seg> elements (those of type syll) but not for the higher level ones. Whitespace inside a start-tag, as here in the foot type <seg> elements, is always discarded and does not affect the result.

91. For clarity of presentation, we have in this example adopted the convention that white space outside the syllable tags is to be ignored; as noted above, this is not generally enforceable in XML and this convention is not therefore recommended for normal use.

92. This chapter has not yet been revised to reflect developments in the field of speech transcription and multimodal language annotation since its first publication. It is planned that a future revision of these Guidelines will include recommendations in these areas, which will in turn imply some revision of the features discussed here.

93. For a discussion of several of these see J. A. Edwards and M. D. Lampert, eds., Talking Language: Transcription and Coding of Spoken Discourse (Hillsdale, N.J.: Lawrence Erlbaum Associates, 1993); Stig Johansson, Encoding a Corpus in Machine-Readable Form, in Computational Approaches to the Lexicon: An Overview, ed. B. T. S. Atkins et al. (Oxford: Oxford University Press, forthcoming); and Stig Johansson et al. Working Paper on Spoken Texts, document TEI AI2 W1, 1991.

94. The original is a conversation between two children and their parents, recorded in 1987, and discussed in Brian MacWhinney, CHAT Manual ([Pittsburgh]: Dept of Psychology, Carnegie-Mellon University, 1988), pp. 87ff.

95. For the most part, the examples in this chapter use no sentence punctuation except to mark the rising intonation often found in interrogative statements; for further discussion, see section 11.3.3 Regularization of Word Forms.

96. For details see S. Boase, London-Lund Corpus: Example Text and Transcription Guide (London: Survey of English Usage, University College London, 1990).

97. The term was apparently first proposed by Bengt Loman and Nils Jørgensen, in Manual for analys och beskrivning av makrosyntagmer (Lund: Studentlitteratur, 1971), where it is defined as follows: ‘A text can be analysed as a sequence of segments which are internally connected by a network of syntactic relations and externally delimited by the absence of such relations with respect to neighbouring segments. Such a segment is a syntactic unit called a macrosyntagm’ (trans. S. Johansson).

98. Laura Gavioli and Gillian Mansfield, The Pixi Corpora (Bologna: Cooperativa Libraria Universitaria Editrice, 1990), p. 74.

99. We refer the reader to previous and current discussions of a common format for encoding dictionaries. For example, Robert A. Amsler and Frank W. Tompa, An SGML-Based Standard for English Monolingual Dictionaries, in Information in Text: Fourth Annual Conference of the U[niversity of] W[aterloo] Centre for the New Oxford English Dictionary October 26–28, 1988, Waterloo, Canada, pp. 61–79; Nicoletta Calzolari et al., Computational Model of the Dictionary Entry: Preliminary Report, Acquilex: Esprit Basic Research Action No. 3030, Six-Month Deliverable, Pisa, April 1990; John Fought and Carol Van Ess-Dykema, Toward an SGML Document Type Definition for Bilingual Dictionaries, TEI working paper TEI AIW20 (available from the TEI); Nancy Ide and Jean Veronis, Encoding Print Dictionaries, Computers and the Humanities 29: 167–195, 1995; Nancy Ide, Jacques Le Maitre, and Jean Veronis, Outline of a Model for Lexical Databases, (Information Processing and Management, 29, 2, 159–186, 1993); Nancy Ide, Jean Veronis, Susan Warwick-Armstrong, Nicoletta Calzolari, Principles for Encoding machine readable dictionaries, Proceedings of the Fifth EURALEX International Congress, EURALEX'92, University of Tampere, Finland; The DANLEX Group, Descriptive tools for electronic processing of dictionary data, in Lexicographica, Series Maior (Tübingen: Niemeyer, 1987); and A. Tutin and Jean Véronis, Electronic dictionary encoding: customizing the TEI Guidelines, in Proceedings of the Eighth Euralex International Congress, 1998.

100. It is unlikely that many conventional dictionaries will require smaller divisions, but all the usual division elements <div2> through <div7> may be used.

101. Each example taken from a real dictionary indicates its source using the following abbreviations for dictionary names:

C/R
Beryl T. Atkins et al., Collins Robert French-English English-French Dictionary (London: Collins, 1978, rpt. 1983)
CED
Collins English Dictionary
CP
Collins Pocket
DNT
Le Dictionnaire de Notre Temps, ed. Françoise Guerard (Paris: Hachette, 1990).
LDOCE
Longman Dictionary of Contemporary English
NPEG
The New Penguin English Dictionary (London: Penguin, 1986, rpt. 1987).
OALD
Oxford Advanced Learner's Dictionary of Current English, ed. A. S. Hornby with A. P. Cowie and A. C. Gimson (Oxford University Press, 1974).
PLC
Petit Larousse en Couleurs (Paris: Larousse, 1990).
PLI
Pequeño Larousse Ilustrado por Ramón García-Pelayo y Gross (Buenos Aires, Mexico, Paris: Ediciones Larousse, 1964).
PR
Le Petit Robert
SSSE
Simon and Schuster's International Dictionary English/Spanish Spanish/English ed. Tana de Gómez (New York: Simon and Schuster, 1973).
W7
Webster's 7th Collegiate
WNC
Webster's New Collegiate Dictionary (Springfield, Mass.: G. & C. Merriam Co., 1975).

To simplify the electronic presentation of this document on systems with limited character sets, most of the pronunciations are presented using the transliteration found in the electronic edition of the Oxford Advanced Learner's Dictionary. Also, the middle dot in quoted entries is rendered with a full stop, while within the sample transcriptions hyphenation and syllabification points are indicated with |, regardless of their rendition in the source text.

102. Complications of sequence caused by marginal or interlinear insertions and deletions, which are frequent in manuscripts, or by unconventional page layouts, as in concrete poetry, magazines with imaginative graphic designers, and texts about the nature of typography as a medium, typically do not occur in dictionaries, and so are not discussed here.

103. This is a slight oversimplification. Even in conservative transcriptions, it is common to omit page numbers, signatures of gatherings, running titles and the like. The simple description above also elides, for the sake of simplicity, the difficulties of assigning a meaning to the phrase ‘original sequence’ when it is applied to the printed characters of a source text; the ‘original sequence’ retained or recovered from a conservative transcription of the editorial view is, of course, the one established during the transcription by the encoder.

104. The omission of rendition text is particularly common in systems for document production; it is considered good practice there, since automatic generation of rendition text is more reliable and more consistent than attempting to maintain it manually in the electronic text.

105. Since its first publication, this chapter has been rendered obsolete in several respects, chiefly as a result of the publication of ISO 12200, and a variant of it (TBX) which has been recently adopted by LISA, the Localisation Industry Standard Association. Work is currently ongoing in the ISO community to define a generic platform for terminological markup (ISO CD 16642, TMF : Terminological Markup Framework), in the light of which it is anticipated that the recommendations of the present chapter will be substantially revised. Readers are cautioned in particular that the discussion below of `nested' and `flat' structures is now far removed from current practices in the terminological field. A major revision of this chapter is planned for the next edition of these Guidelines.

106. This document is reprinted in TermNet News, no 40, 1993, pp 5–64; copies are also available from Infoterm, z.Hd. Herrn Dr. Gerhard Budin, Heinestraße 38, Postfach No. 130, A-1021 Vienna, Austria.

107. In this example, as in the others, white space has been liberally used for the sake of legibility; in practice most actual encodings would use less white space.

108. ISO 10241, Preparation and layout of international terminology standards, 1993.

109. We use the term alignment as a special case of the more general notion of correspondence. Let A be a short form for ‘an element with its attribute id set to the value A’, and suppose that elements A1, A2 and A3 occur in that order and form one group, while elements B1, B2 and B3 occur in that order and form another group. Then a relation in which A1 corresponds to B1, A2 corresponds to B2 and A3 corresponds to B3 is an alignment. On the other hand, a relation in which A1 corresponds to B2, B1 to C2, and C1 to A2 is not an alignment.
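
One possible reading of this definition can be sketched in Python. The sketch assumes that an alignment is a correspondence which pairs members of exactly two groups and preserves the order of both; the function name is_alignment and this interpretation are hypothetical and are offered only as an illustration, not as the Guidelines' own formalization.

```python
def is_alignment(pairs, order_a, order_b):
    """Return True if the pairs link group A to group B in matching order."""
    position_a = {name: i for i, name in enumerate(order_a)}
    position_b = {name: i for i, name in enumerate(order_b)}

    # Every pair must link a member of group A with a member of group B.
    if any(a not in position_a or b not in position_b for a, b in pairs):
        return False

    # Walk the pairs in the order of group A; the B positions must rise.
    ordered = sorted(pairs, key=lambda pair: position_a[pair[0]])
    b_positions = [position_b[b] for _, b in ordered]
    return all(x < y for x, y in zip(b_positions, b_positions[1:]))


group_a = ["A1", "A2", "A3"]
group_b = ["B1", "B2", "B3"]

# The alignment given in the note.
aligned = is_alignment([("A1", "B1"), ("A2", "B2"), ("A3", "B3")], group_a, group_b)

# A crossing correspondence, which does not preserve order.
crossed = is_alignment([("A1", "B2"), ("A2", "B1")], group_a, group_b)

# The note's counter-example, which links elements across three groups.
cyclic = is_alignment([("A1", "B2"), ("B1", "C2"), ("C1", "A2")], group_a, group_b)
```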

110. The type attribute on the note is used to classify the notes using the typology established in the Advertisement to the work: ‘The Imitations of the Ancients are added, to gratify those who either never read, or may have forgotten them; together with some of the Parodies, and Allusions to the most excellent of the Moderns.’ In the source text, the text of the poem shares the page with two sets of notes, one headed ‘Remarks’ and the other ‘Imitations’.

111. Since no special element is provided for this purpose in the present version of these Guidelines, the information should be supplied as a series of paragraphs at the end of the <encodingDesc> element described in section 5.3 The Encoding Description.

112. HyTime is an international standard (ISO 10744) built on SGML. It provides facilities for representing both static and dynamic information for processing and interchange by hypertext and multimedia applications. See ISO/IEC 10744 Information Technology — Hypermedia/Time-based Structuring Language (HyTime) ([Geneva]: International Organization for Standardization, 1992).

113. The notation used for this formal grammar is that defined in chapter 39 Formal Grammar for the TEI-Interchange-Format Subset of SGML.

114. The details of this tree are defined as in XPath and XPointer.

115. Because it may be desirable to refer to comments or processing instructions that lie outside the document element, or to multiple top-level sibling elements in document fragments, XPath and XPointer use the term root slightly differently to refer to an abstract element one level higher. These Guidelines may be updated to use this definition for compatibility, or may add direct support for XPointer itself.

116. Strictly speaking, |n| (absolute value of n) children.

117. See section 15.3 Spans and Interpretations, where the text from which this fragment is taken is analyzed.

118. The corresp attribute is thus distinct from the target attribute in that it is understood to create a double, rather than a single, link. It is also distinct from the targets attribute in that the latter lists all the identifiers of the elements that are doubly linked, whereas the corresp attribute doubly links the element that bears it with the element(s) named in its value.

119. See William A. Gale and Kenneth W. Church, A program for aligning sentences in bilingual corpora, Computational Linguistics 19 (1993): 75–102, from which the example in the text is taken.

120. Our example uses the English translation of Charles Hoole (1659), and is taken from John E. Sadler, ed., John Amos Comenius Orbis Pictus: a facsimile of the first English edition of 1659 (Oxford: Oxford University Press, 1968) (The Juvenile Library).

121. This sample is taken from a conversation collected and transcribed for the British National Corpus.

122. See section 15.1 Linguistic Segment Categories for discussion of the <w> and <c> tags that can be used in the following examples instead of the <seg type="word"> and <seg type="character"> tags.

123. An alternative way of representing this problem is discussed in chapter 17 Certainty and Responsibility.

124. In this example, we have placed the <link> next to the tags that represent the alternants. It could also have been placed elsewhere in the document, perhaps within a <linkGrp>.

125. The variant readings are found in the commercial sheet music, the performance score, and the Broadway cast recording.

126. Or, as they are widely known, attribute-value pairs; this term should not be confused, however, with SGML or XML attributes and their values, which are similar in concept but distinct in their formal definitions.

127. The rule marks spaces left for the missing name in the manuscript.

128. See G. N. Leech and R. G. Garside, Running a Grammar Factory, in English Computer Corpora: Selected Papers and Research Guide, S. Johansson and A.-B. Stenström, eds. (Berlin: de Gruyter; New York: Mouton, 1991), pp. 15–32. This sentence and its analysis are reproduced by kind permission of the University of Lancaster's Unit for Computer Research on the English Language.

129. For the word-class tagging method used by Claws see I. Marshall, Choice of Grammatical Word Class without Global Syntactic Analysis: Tagging Words in the LOB Corpus, in Computers and the Humanities 17 (1983): 139–50. For an overview of the system see R. G. Garside, G. N. Leech, and G. R. Sampson, The Computational Analysis of English: a Corpus-Based Approach (Oxford: Oxford University Press, 1991).

130. We have replaced the Claws code $ for the ‘'s’ morpheme by GEN, as in the tag set used by the British National Corpus (see 16.10 Two Illustrations), and the code . for the final full stop by PUN.

131. For more information about the British National Corpus, see the website at http://www.hcu.ox.ac.uk/BNC/

132. Feature-structure, rather than feature-value, libraries should be used for housing collections of feature structures.

133. An SGML or XML DTD cannot however straightforwardly validate that values for features organized as sets are not repeated; such validation would have to be carried out by an application program. Our method of representing set, bag and list values also does not permit such values to be directly embedded within one another. In order to embed a set within a set, for example, one must specify the embedded set as the value of a feature of a feature-structure value of the including set. Fortunately, this is not as hard as it sounds: the embedding of a list within a list is illustrated in the second example below.

134. Unless the value is the <null> element; see below.

135. We say that one range is less than or equal to another if both the value and valueTo attributes of the first are less than or equal to the corresponding attributes of the second.
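
The ordering defined in this note can be illustrated with a short Python sketch. The class Range and the function range_le are hypothetical names introduced here; only the attribute names value and valueTo (written value_to below) come from the text, and numeric bounds are assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A range with a lower bound (value) and an upper bound (valueTo)."""
    value: float
    value_to: float


def range_le(first: Range, second: Range) -> bool:
    """True if both bounds of `first` are <= the matching bounds of `second`."""
    return first.value <= second.value and first.value_to <= second.value_to


low = Range(1, 5)
high = Range(2, 6)
wide = Range(1, 7)
```

On this definition low is less than or equal to high, but wide and high are not comparable in either direction, so the relation is only a partial ordering of ranges.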

136. Typically, there will be no need to use an encoding like this one as the value of a feature, since the <any> element is available for that purpose. However, in setting up the feature declaration for that feature, it may be necessary to use such an encoding, precisely so as to provide an interpretation for the use of the <any> element as the value of that feature.

137. From The Manere of Good Lyuynge, fol 126v of Bodleian MS Laud Misc. 517, plate 8(ii) in English Cursive Book Hands 1250–1500 by M. B. Parkes (Clarendon Press: Oxford, 1969).

138. On fol 65v of Bodleian MS. Rawlinson Poetry 32; in Parkes 12(ii).

139. De Nutrimento et Nutribili, Tractatus 1, fol 217r col b of Merton College Oxford MS O.2.1 (Parkes pl. 16).

140. De moribus et actis primorum Normannie ducum, in fol 4v of British Library MS Harley 3742, Parkes pl 6(i).

141. In Pierpont Morgan MA 3391 (Klinkenborg 123).

142. In Pierpont Morgan MA 1892 (Klinkenborg 129).

143. In Reykjavík, Lbs 1562 4to.

144. In Pierpont Morgan MA 310 (Klinkenborg 23).

145. The manuscript contains several other substitutions, ignored here for the sake of clarity.

146. In Klinkenborg 11.

147. In earlier versions of these Guidelines an attribute hand was used on the <hand> element to carry the same information as the existing id attribute. The hand attribute is retained in the current version of these Guidelines only for backwards compatibility and will be removed at the next release.

148. From the Wiltshire Record Office, Dean of Sarum Churchwardens' presentments, 1731, Hurst; the transcription was provided by Donald A. Spaeth.

149. From folio 52 recto of the Holkham manuscript of Chaucer's Canterbury Tales.

150. Codex Regius, ed. L. F. A. Wimmer and F. Jónsson (Copenhagen 1891).

151. In Pierpont Morgan MA 412 (Klinkenborg 15).

152. In Pierpont Morgan MA 310 (Klinkenborg 23).

153. For the sake of legibility in the example, long marks over vowels are omitted.

154. Strictly, a suitable value such as figurative should be added to the two place names which are presented periphrastically in the second example here, in order to preserve the distinction indicated by the choice of <rs> rather than <name> to encode them in the first version.

155. The treatment here is largely based on the characterizations of graph types in Gary Chartrand and Linda Lesniak, Graphs and Digraphs (Menlo Park, CA: Wadsworth, 1986).

156. That is, the three syntactic interpretations of the clause are mutually exclusive. The notion that the pertinents are in Argyll is clearly not inconsistent with the notion that both the land in Gallachalzie and the pertinents are in Argyll. The graph given here describes the possible interpretations of the clause itself, not the sets of inferences derivable from each syntactic interpretation, for which it would be convenient to use the facilities described in chapter 16 Feature Structures.

157. R. Jackendoff, X-Bar Syntax, 1977.

158. The symbols e and t denote special theoretical constructs (empty category and trace respectively), which need not concern us here.

159. In earlier editions of these Guidelines, formulaContent was defined by default as CDATA, which in SGML means that the only parsing carried out is to search for an end-tag; since XML does not include the CDATA element type (it is one of the very few features of SGML that, if used, makes correct parsing in the absence of a DTD intractable), the present edition of these Guidelines defines the content of formulaContent as (#PCDATA).

160. In this case additional redefinitions may also be needed to avoid name clashes with existing TEI elements. For further details see chapter 29 Modifying and Customizing the TEI DTD.

161. We do not show here how the MathML names are to be included in the TEI name space.

162. Since no special element is provided for this purpose by the current version of the Guidelines, such information should be provided as one or more distinct paragraphs at the end of the <encodingDesc> element described in section 5.3 The Encoding Description.

163. Schemes similar to that proposed here were developed in the 1960s and 1970s by researchers such as Hymes, Halliday, and Crystal and Davy, but have rarely been implemented; one notable exception is the pioneering work on the Helsinki Diachronic Corpus of English, on which see M. Kytö and M. Rissanen, The Helsinki Corpus of English Texts, in Corpus Linguistics: hard and soft, M. Kytö, O. Ihalainen, and M. Rissanen, eds. (Amsterdam: Rodopi, 1988).

164. It is particularly useful to define participants in a dramatic text in this way, since it enables the who attribute to be used to link <sp> elements to definitions for their speakers; see further section 10.2.2 Speeches and Speakers.

165. The present proposals do not support the encoding of different settings for the same participant. This is a subject for further work.

166. See in particular chapters 14 Linking, Segmentation, and Alignment, 15 Simple Analytic Mechanisms, and 16 Feature Structures.

167. For more information on UNIMARC, see Brian P. Holt, UNIMARC Manual (London, U.K.: IFLA Universal Bibliographic Control and International MARC Programme, British Library, 1987). For USMARC, see Walt Crawford, MARC for library use: understanding USMARC (Boston: G.K. Hall, 1989), USMARC format for bibliographic data, including content designation (Washington, D.C.: Library of Congress, 1987), and Deborah J. Byrne, MARC manual : understanding and using MARC records (Englewood, Colo.: Libraries Unlimited, Inc., 1991).

168. The primary function of the MARC record when it was first designed in the mid-1960s was to allow for the electronic distribution of cataloguing records in support of card production. See Henriette Avram, The MARC Pilot Project (Washington D.C.: Library of Congress, 1968), p. 3. For discussion of the relationship between the MARC record and the catalogue card, see Michael Gorman, After AACR2R: The Future of the Anglo-American Cataloging Rules, in Richard Smiraglia, ed., Origins, Content and Future of AACR2 Revised (Chicago: American Library Association, 1992).

169. This chapter may be substantially revised or withdrawn in the next edition of these Guidelines.

170. As defined by ISO 8601: 2000(E), Data elements and interchange formats — Information interchange — Representation of dates and times, section 5.2.1.1, extended format.
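
As a rough illustration, and assuming that the section cited covers the complete representation of a calendar date, whose extended format is YYYY-MM-DD, such values can be produced and read with Python's standard datetime module. The two helper names below are hypothetical and appear nowhere in these Guidelines.

```python
from datetime import date


def to_extended_format(d: date) -> str:
    """Write a calendar date as YYYY-MM-DD (assumed ISO 8601 extended format)."""
    return d.isoformat()


def from_extended_format(text: str) -> date:
    """Read a YYYY-MM-DD string back into a date; raises ValueError if malformed."""
    return date.fromisoformat(text)


written = to_extended_format(date(2002, 3, 1))
parsed = from_extended_format("1994-05-16")
```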

171. Dizionario di Abbreviature latine ed italiane per cura di Adriano Cappelli, 6th ed. (Milan: Ulrico Hoepli, 1979). This work on Latin abbreviations might be less convenient for the purpose than one concentrating on Old French, but it is more widely used than any other.

172. For a fuller discussion of the reasoning behind FSDs and for another complete example, see A rationale for the TEI recommendations for feature-structure markup, by D. Terence Langendoen and Gary F. Simons, in Computers and the Humanities 29 (1995).

173. SGML provides a feature known as SUBDOC, which allows a document using one DTD (the FSD) to be nested within another (the feature structure itself); this feature is not available in XML, and is therefore not recommended where usage of XML is intended.

174. In an SGML document, the SUBDOC keyword may be used in place of NDATA and the Notation name to tell the processor that the named file is a self-contained SGML document.

175. Fernando C. N. Pereira, Grammars and logics of partial information, SRI International Technical Note 420 (Menlo Park, CA: SRI International, 1987), and Stuart Shieber, An Introduction to Unification-based Approaches to Grammar, CSLI Lecture Notes 4 (Palo Alto, CA: Center for the Study of Language and Information, 1986).

176. Gerald Gazdar, Ewan Klein, Geoffrey Pullum, and Ivan Sag, Generalized Phrase Structure Grammar (Harvard University Press, 1985).

177. Some minor revisions have been made in the way that the tag set documentation is used in producing the current XML version of the Guidelines; these are not as yet reflected in the recommendations of the present chapter.

178. The recommendations in this chapter are likely to be substantially revised at the next release.

179. The definition of interchange format may be changed to eliminate the few remaining SGML features that are not in XML (primarily attribute minimization).

180. This is one of several abbreviations allowed by the SHORTTAG feature; the others (omission of attribute names under certain circumstances and omission of non-required attribute values) are allowed by the current release of the Guidelines, but users are cautioned that this may be changed at a subsequent release, in the interests of XML-conformance.

181. Some will regard such simplifications as useful ways of making it easier to develop software which accepts TEI-conformant documents; others will deplore the failure of such software to accept all TEI-conformant documents including those which extend the TEI DTD. In providing the notion of DTD extension for describing what documents are and are not accepted by such software, the TEI acts in the belief that such software will in fact be developed; it neither endorses nor deplores its construction or use.

182. See document TEI PC P1 ‘The Preparation of Text Encoding Guidelines.’

183. For an example of such a tool, see the TEI pizzachef at http://www.tei-c.org/pizza.html.

184. This chapter discusses issues related almost exclusively to the use of SGML-encoded TEI documents in interchange. XML-encoded TEI documents may be safely interchanged without formality over current networks, largely without concern for any of the issues discussed here. This chapter has not therefore been revised, and will probably be withdrawn or substantially modified at the next release.

185. This chapter will be substantially revised and expanded at the next release of these Guidelines.

186. As elsewhere in these Guidelines, empty elements are denoted with a penultimate slash character, which is the XML syntax; in SGML, either omit the slash or modify the SGML declaration to permit it via the NET delimiter.

187. This chapter makes extensive use of the TEI Extended Pointer Notation, and may therefore be revised to discuss use of XPath syntax as a preferable alternative at the next release.

188. Although the scripts run in opposite directions, they write numbers in the same direction; the usual view is that the numbers in Hebrew and Arabic run left to right, like those in Latin script, but it is also possible to claim that the numbers in Latin scripts run right to left, like those in Arabic and Hebrew. There is no single satisfactory answer to this question.

189. Tana de Gámez, ed., Simon and Schuster's International Dictionary (New York: Simon and Schuster, 1973).

190. This example presumes that local extensions to the TEI DTD have been made declaring the <gi> and <att> elements. It is hoped that future versions of these Guidelines will obviate the need for such an extension.

191. This chapter may be substantially revised or withdrawn in the next edition of these Guidelines.

192. This section is of relevance only to SGML expressions of the TEI DTDs. The grammar formally defined here is close to, but not identical with, the XML language. This section will be updated or removed in the next edition of these Guidelines.

193. This section has been retained for the most part unchanged since the original publication of the Guidelines in 1994; it is thus of largely historical interest only.

194. This Appendix contains (in reverse chronological order) the ‘Introductory Notes’ prefixed to each revision of the TEI Guidelines since its first publication in 1994.

195. Not all members listed were able to serve throughout the development of the Guidelines.

196. This Workgroup was jointly sponsored by the Association for History and Computing.



March 2002, edited by C M Sperberg-McQueen and Lou Burnard, XML conversion by Syd Bauman, Lou Burnard, Steven DeRose, and Sebastian Rahtz.
Copyright TEI Consortium 2003