Text Encoding Initiative
The XML Version of the TEI Guidelines
The recommendations in this chapter are likely to be substantially revised at the next release.
This chapter describes the areas in which these terms are defined and specifies their meaning. It also proposes other terms for related concepts and points out some dangers in the careless use or application of these terms.
The term TEI conformance does not apply to software: programs can be usefully described as accepting or validating TEI-conformant documents or some subset of TEI-conformant documents, but the TEI defines no required processing model against which software could be measured. Programs are thus not themselves conformant or non-conformant and should not be so described.
A TEI-local-processing-format document may be described as requiring DTD extensions if it modifies the TEI-supplied DTDs (or in the case of SGML, the SGML prolog) in any of the ways described under 28.3 Modifications to TEI Document Type Declarations.
A TEI-interchange-format document may be described as requiring DTD extensions if its DTD is modified in any of the ways described in section 28.3 Modifications to TEI Document Type Declarations.
The effective SGML declaration cannot be changed when using XML. When using SGML, the SGML declaration for TEI interchange documents may differ from that provided in TEI documentation in these ways:
The following portions of the SGML declaration may not be modified in TEI interchange documents:
A TEI-conformant document (whether for local processing or for interchange) may make any change to the TEI-supplied document type declarations which is allowed by SGML and the controlling SGML declaration. All such changes should be made (or at least it must be possible to make them) within the SGML DTD subset, by defining TEI DTD modifications files as described in chapter 29 Modifying and Customizing the TEI DTD, and embedding the modification files within the DTD subset of a document whose document type declaration refers to the unmodified TEI main DTD, as in the following fragment:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE TEI.2 PUBLIC "-//TEI P4//DTD Main Document Type//EN" "tei2.dtd" [
  <!ENTITY % TEI.extensions.ent PUBLIC
    "-//ProjectName//ENTITIES Local modifications to TEI main DTD//EN"
    "project.ent">
  <!ENTITY % TEI.extensions.dtd PUBLIC
    "-//ProjectName//DTD Local element types for TEI main DTD//EN"
    "project.dtd">
]>
For reasons of convenience, it may be desirable in practice to create a derived DTD in which the local modifications have been integrated with the TEI main DTD in a single file. If such a one-file DTD is desired, it should be derived automatically from the TEI DTD and the local modifications files using appropriate software, rather than derived by hand-editing the TEI DTD files, as hand-editing increases the chances of error and inconsistency between the DTD modifications files and the one-file DTD. Documents in the TEI interchange format must use the form shown above, with a reference to the unmodified main TEI DTD and declarations of the local modifications files.
The following must remain true of the DTD after modification:
Without requiring DTD extensions, therefore, any TEI document may:
This section is included for illustrative purposes only; it does not restrict the processing of TEI or other documents. It simply distinguishes a number of typical ways in which a project may choose to apply the TEI Guidelines to different kinds of processing.
First, data might be captured by keyboarding into a locally defined data capture format, or by scanning into a locally defined scanner-file format. From these initial forms, transducers might convert the files into a standard local storage format.
The local storage format might be the input format of some application program used frequently by the project. In this case, transducers might be necessary to prepare data for processing by other applications. Alternatively, the local storage format might be independent of the formats used by application programs; transducers would be needed to prepare data for any processing. Such an independent format is useful if the local storage format needs to contain more information than any single application can conveniently handle.
The local storage format might be SGML- or XML-conformant without being TEI-conformant, e.g. because it uses local DTDs instead of the standard TEI DTDs, or because it uses a TEI local processing format. Local software may be used to validate a TEI local-processing format, to transduce documents into the input formats needed by applications, and when appropriate to transform documents into the TEI interchange format for exchange with other sites.
Finally, the TEI interchange format may be used as a local storage format. This is not expected to be a very common practice, since most sites interested in TEI conformance are expected eventually to acquire markup-aware software, which has advantages of compactness or processing. In the absence of such software, however, some projects may find the TEI interchange format (or perhaps a restrictive variant of it) useful, because such a format can be relatively easy to parse with ad hoc software.
Whether or not the local storage format is strictly TEI-conformant, it may follow TEI-recommended practice in its selection of textual features to be marked up, in its tag names, in its documentation practices, etc.
Over the course of the project, analysis and processing may result in interim results which may be incorporated into the locally stored copy of the text so that the interim results can be used in later processing. This process of enrichment can be carried out either by manual editing of the documents using conventional text editors, or by application programs.
When a document is to be exchanged with another site using the TEI interchange format, it must first be transduced from the local storage format to TEI interchange form. If local documents are already TEI-conformant, this requires either no processing at all, or a relatively simple normalization which can be handled readily by the normalization facilities of most SGML parsers. If the local storage form is neither SGML-conformant nor XML, some transducer must be used to transform it into the TEI interchange format.
The TEI-interchange-format document must then be packed for shipping into the TEI packed interchange format, using a packing program. This program will gather the constituent parts (files) of a document into a single file, and ensure that the file contains no characters whose safe passage to the recipient of the data is endangered by the transmission path. If the ultimate recipient of the document is unknown, the set of safe characters is very small. The specific transmission character set however is independent of TEI conformance: any convenient set may be used where both parties agree. The packer will ensure that the transmission character set is properly identified.
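The check a packer performs before shipping can be sketched as follows. This is a minimal illustration only: the function name `find_unsafe_characters` and the set `CONSERVATIVE_SET` are hypothetical, and the set shown is an assumed example of a small transmission character set, not the subset defined elsewhere in the Guidelines. Any set agreed between the two parties could be passed instead.

```python
import string


def find_unsafe_characters(text, safe_chars):
    """Return (offset, character) pairs for every character not in safe_chars."""
    return [(i, ch) for i, ch in enumerate(text) if ch not in safe_chars]


# Assumption: an illustrative conservative set (letters, digits, space,
# newline, and a few punctuation marks). It is NOT the normative TEI subset.
CONSERVATIVE_SET = frozenset(
    string.ascii_letters + string.digits + " \n" + "\"%&'()*+,-./:;<=>?_"
)

sample = "<p>na\u00efve caf\u00e9</p>"
problems = find_unsafe_characters(sample, CONSERVATIVE_SET)
for offset, ch in problems:
    print(f"offset {offset}: U+{ord(ch):04X} is outside the transmission set")
```

A real packer would go on to replace each reported character with some agreed representation (for example an entity reference) rather than merely report it; that step depends on the interchange rules and is not shown here.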
When a document is received from another site using the TEI packed interchange format, it must first be unpacked into a TEI interchange-format document in the local character set. It may then be necessary to naturalize it by translating it into the local storage format; if the local format is TEI- or SGML-conformant, no processing is needed (although some SGML processors may offer a facility for suppressing omissible markup).
The notions of TEI interchange format and TEI packed interchange format are central to the exchange of documents using the TEI guidelines, whether the local storage format is TEI-conformant or not. The TEI interchange format and the TEI local-processing format may each be used as a local storage format, though the local storage format might well differ from either of these without materially affecting the use of TEI formats for interchange. The TEI interchange format being less flexible than the local-processing format, it is expected that sites using SGML-conformant software may use the latter, while sites without such software may prefer the former.
The notion of TEI recommended practice, it is hoped, will be relevant to decisions about what textual features should be recorded during data capture and will thus affect data-capture formats and the transducers which render captured files into the local storage format.
The TEI abstract structure may be useful in developing local non-SGML markup schemes for data capture or for processing with ad hoc application programs. It is strongly recommended that the TEI recommendations, as well as the TEI abstract structure, be used for such development as well.
Neither the character sets used for local processing nor those used for transmission of interchange documents are restricted by the definition of TEI conformance. For local processing, users will typically use the system character set of their local system or some modification thereof. For exchange with known partners, users should choose any convenient character set; typically the most convenient is the set of all characters which:
For blind exchange with unknown partners a conservative choice of transmission set is needed to ensure that characters arrive correctly. How conservative the choice need be depends on the medium of transmission. The ISO 646 subset defined in section 4.1.3 Characters and their encoding remains the only guaranteed safe set of characters for the regional and international networks most widely used, although larger character sets are increasingly coming into use. This is largely because silent and not always reversible translation between character sets remains a feature of transmissions across current disparate networks. At the present time (1993) therefore, only the ISO 646 subset is recommended for fully blind interchange, although the full complement of ISO character sets may be used successfully in some subdomains.
In transmission by disk or tape, however, no silent translation is likely to occur, and so larger sets may be successfully used in blind interchange. The primary danger is a failure of software in the receiving machine to process the characters correctly; at this time (1991), ASCII or 94-character U.S. EBCDIC appear to represent the largest safe choices; other national character sets may of course be used if good internal documentation is also provided.
Note that the transmission character set does not associate specific binary encodings with the characters in the set. In the technical sense, it is a character set, not a coded character set. This means that a document may undergo various automatic translations from one coded character set to another (notably, in the case of transmission over international networks, from ASCII to EBCDIC or vice versa) without leaving the transmission character set.
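The distinction between a character set and a coded character set can be illustrated with a short sketch. It assumes Python's standard `cp037` codec as a stand-in for EBCDIC (one of several EBCDIC code pages; the choice is an assumption made for illustration), and the helper name `recode` is hypothetical.

```python
def recode(data, source_encoding, target_encoding):
    """Translate bytes from one coded character set to another."""
    return data.decode(source_encoding).encode(target_encoding)


text = "TEI header"

ascii_bytes = text.encode("ascii")
# Assumption: cp037 stands in for whichever EBCDIC variant a network uses.
ebcdic_bytes = recode(ascii_bytes, "ascii", "cp037")
round_trip = recode(ebcdic_bytes, "cp037", "ascii")

# The byte values differ between the two codings...
print(ascii_bytes != ebcdic_bytes)
# ...but the characters themselves survive the translation unchanged.
print(round_trip.decode("ascii") == text)
```

The bytes change at each hop while the sequence of characters does not, which is the sense in which the document never leaves the transmission character set.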
For further discussion of the topics addressed in this section, reference should be made to chapter 30 Rules for Interchange.
The utility of various SGML constructs is discussed in section 2.2 of document TEI P1 version 1. The restrictions on SGML declarations and SGML usage in TEI interchange documents discussed above under 28.2 Modifications to TEI SGML Declaration are derived from that discussion. In the case of XML, no SGML declaration changes can be made.
The document type declaration provided by the TEI, whether in its SGML or its XML form, is intended to cover as wide a variety of document types and processing needs as proved feasible. It is impossible, however, for any finite list of text elements to cover every need of textual research and processing. As a result, extension of the TEI DTD has no effect on strict TEI conformance, as long as certain restrictions are observed; these have the effect of ensuring that later users of a file can easily see what changes have been made to the DTDs and what the new tags are intended to mean.
The requirement that all new or modified tags be documented, however, is formally verifiable only to a limited extent. It is possible for a program to verify that for every tag introduced in a DTD modification, a corresponding record exists in a Tag Set Declaration. It is impossible, however, to verify using formal means that the entry in the tag set declaration makes sense. Purely formal conformance measures, therefore, must be supplemented with human inspection of the documentation.
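The formally verifiable part of this requirement can be sketched as a simple set comparison. The sketch below rests on several assumptions: the element names are invented, the documented names are supplied as a plain set rather than read from an actual Tag Set Declaration, and the regular expression recognizes only simple `<!ELEMENT name ...>` declarations (not SGML name groups or names supplied through parameter entities).

```python
import re

# Matches the element name in a simple declaration such as
# <!ELEMENT marginalGloss (#PCDATA)>. Name groups are not handled.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([A-Za-z_][\w.\-]*)")


def declared_elements(dtd_text):
    """Return the set of element names declared in a DTD fragment."""
    return set(ELEMENT_DECL.findall(dtd_text))


def undocumented_tags(dtd_text, documented_names):
    """Return declared element names that have no documentation record."""
    return sorted(declared_elements(dtd_text) - set(documented_names))


# Hypothetical local modifications file and documentation records.
project_dtd = """
<!ELEMENT marginalGloss (#PCDATA)>
<!ELEMENT scribeHand (#PCDATA)>
"""
documented = {"marginalGloss"}

missing = undocumented_tags(project_dtd, documented)
print(missing)
```

A check of this kind confirms only that a record exists for each new tag. Whether the record says anything sensible still has to be judged by a human reader.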
The concept of DTD extension is introduced to allow the concise description of software which is designed to handle documents encoded using the published DTDs but which is not prepared to deal with tags not included there.
All sections of the TEI DTD are subject to modification by the user, except that a documentary header must be provided and distinguished from the text itself, and that documentary header must include tagged elements identifying the document encoded and those responsible for the encoding. This ensures that all TEI-conformant documents will have at least this bare minimum of accompanying documentation.
The basic design principles of the TEI require the notion of TEI conformance to be applicable to existing electronic documents if they are translated into a proper format, without requiring the insertion of information not captured in the initial preparation of the text.
At the same time, the TEI is charged with formulating advice to those engaged in the creation of new electronic texts and is required to distinguish what is actively recommended for general use from what is merely optional, provided for use by those engaged in a particular sort of work.
The notion of TEI recommended practice is introduced to allow the concise description of documents in which not only the requirements, but also the recommendations of the Guidelines are followed. It is hoped that while projects to convert existing electronic data may content themselves with achieving TEI conformance, projects to produce new electronic texts will produce documents following TEI recommended practice. To distinguish those projects which follow the TEI's recommendation to use SGML or XML markup from those which capture the same underlying textual features but do so using other markup, the notion of the TEI abstract model is introduced; it is this which another encoding can have in common with the TEI.
In exchanging texts for use by others, the goal of an interchange format is to ensure that the information encoded in an electronic version of a text can be correctly understood and processed by the recipient as well as by the originator of the text. To achieve this goal, the definition of TEI conformance offered here restricts markup in TEI-conformant documents to SGML or XML markup and to other properly declared notations. The latter are explicitly recommended for the encoding of tables, figures, etc. and so cannot reasonably be excluded. Since they do place a burden on the recipient for proper processing, the use of any such notation is defined to fall within the class of DTD extensions.
Because of the escape clause for graphics, etc., it is in principle possible to create a TEI-conformant document by embedding a document using any arbitrary markup into a driver file containing a TEI header and a declaration for the appropriate markup as a notation. Though it falls within the letter of TEI-conformant document interchange, such a practice falls outside its spirit.