Text Encoding Initiative
The XML Version of the TEI Guidelines
30 Rules for Interchange
This chapter discusses issues related almost exclusively to the use of SGML-encoded TEI documents in interchange. XML-encoded TEI documents may be safely interchanged without formality over current networks, largely without concern for any of the issues discussed here. This chapter has therefore not been revised, and will probably be withdrawn or substantially modified at the next release.
This chapter describes how interested parties can determine and agree on the proper format for the successful interchange of TEI-conformant documents over a given communications link, and how to translate from normal TEI form to the transmission format and back. It also includes recommendations for formats to be used when private arrangements cannot be made, in non-negotiated or `blind' interchange.
When the sender and receiver of a given text know each other's identity and can make appropriate special arrangements for the interchange of the text, the following procedures may be used to ensure the successful interchange of the text without information loss.
The sender and receiver together must first:
The transmission character set is defined as the set of characters in the sender's system character set(s) which survive transmission and are properly recognized in the recipient's system character set. It is therefore by definition a subset of both the sender's and the recipient's system character sets. The bit patterns used to represent the characters may differ in the two systems (e.g. one may use ASCII, the other EBCDIC) if the communications link performs the proper translations.
Current network standards allow — indeed, require — gateway nodes to translate material passing through the gateway from one coded character set into another, when the networks joined by the gateway use different coded character sets. Since there is no universally satisfactory translation among all coded character sets in common use, the transmission character set will normally be the subset which is satisfactorily translated by the gateways encountered in transit between the sender and the receiver of the data.
When material is transmitted on a physical storage medium (e.g. disk or tape), then those exchanging documents have far greater control over the data. If both partners use compatible systems, the transmission character set may be equivalent to their system character sets; otherwise, the transmission character set will include those characters which the recipient can successfully read into the local system character set from the media provided by the sender. For example, if a diskette created by an MS-DOS machine can be mailed to a Macintosh user and read directly by the recipient's system, then the transmission character set is likely to be ISO 646 IRV (equivalent to ANSI X3.4, or ASCII), which both machines have in common; if the recipient's disk-reading utilities are more sophisticated, however, then it may be possible to include some or all of the two machines' non-standard extended characters as well.
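The definition of the transmission character set suggests a simple experiment: send a probe text over the link and keep only the characters that arrive unchanged. The following is a minimal sketch of that idea, not a description of any existing tool. The function names and the simulated link are hypothetical, and the comparison is made on characters as decoded at each end, so a link that correctly translates bit patterns (for example ASCII to EBCDIC) is not penalized.

```python
def transmission_set(sent: str, received: str) -> set:
    """Return the characters that arrived unchanged at every position where they were sent."""
    if len(sent) != len(received):
        # A length change means the link inserted or dropped characters;
        # a position-by-position comparison is no longer meaningful.
        raise ValueError("probe and received text differ in length")
    survived = set(sent)
    for before, after in zip(sent, received):
        if before != after:
            survived.discard(before)
    return survived


def simulated_link(text: str) -> str:
    """Hypothetical stand-in for a real link: garbles everything outside 7-bit codes."""
    return "".join(ch if ord(ch) < 128 else "?" for ch in text)


probe = "abc[]{}#éàüöß"
safe = transmission_set(probe, simulated_link(probe))
print(sorted(safe))
```

In practice the probe would contain every character of the sender's system character set, and the received text would come from the recipient rather than from a simulation.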
After the transmission character set and entity character sets have been defined, the sender must prepare and transmit the document:
The translation of non-transmissible characters into the transmission entity set may be accomplished under automatic control, by software which reads the appropriate local writing system declarations, creates the necessary mapping tables, and packs the document for transmission.
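As an illustration only, the packing step might be sketched as follows. A plain dictionary stands in for the information that real software would read from the writing system declarations, and the entity names (`eacute`, `ouml`, `szlig`) are assumed to follow the ISO Latin 1 entity set; neither the names of the functions nor the table format come from the Guidelines.

```python
def build_pack_table(entity_names: dict, transmissible: set) -> dict:
    """Map each non-transmissible character to an entity reference such as '&eacute;'."""
    return {
        ch: "&" + name + ";"
        for ch, name in entity_names.items()
        if ch not in transmissible
    }


def pack(text: str, table: dict, transmissible: set) -> str:
    """Rewrite text so that it uses only transmissible characters."""
    out = []
    for ch in text:
        if ch in transmissible:
            out.append(ch)
        elif ch in table:
            out.append(table[ch])
        else:
            # Refuse to send a character that would be garbled in transit.
            raise ValueError("no entity declared for character %r" % ch)
    return "".join(out)


# Assumed for this example: the link passes 7-bit characters only.
seven_bit = {chr(code) for code in range(128)}
names = {"é": "eacute", "ö": "ouml", "ß": "szlig"}
table = build_pack_table(names, seven_bit)
packed = pack("Größe, café", table, seven_bit)
print(packed)
```

Raising an error for an undeclared character, rather than passing it through, keeps the sketch faithful to the goal of interchange without information loss.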
Upon receipt, the receiver must:
It is strongly recommended that when documents are interchanged they be accompanied by any writing system declarations and feature system declarations which are applicable. In TEI-conformant interchange, it is required that documents be accompanied by any applicable tag-set documentation files.
As a first simple example, consider an SGML document containing English, French, and German, to be transmitted from an IBM-compatible personal computer to a Macintosh, over a long-distance network connection. Uploading test files from the PC to the sender's local network node, sending the file via the network, and downloading the document to the Macintosh, reveal (let us assume) that while all the characters of ISO 646 IRV survive intact, the accented characters of French and German do not survive transmission. In this case, the transmission character set is composed of all the characters of 7-bit ISO 646. The entity set required to handle the non-transmissible characters will include the following (assuming they actually occur in the document):
When `packing' the file for transmission, the sender must replace the non-transmissible characters in the document with references to these entities. After this substitution, the document is a conforming document written entirely in the transmission character set, which can be sent over the communications link without any garbling or loss of information. Upon receipt, the recipient can replace the entity references with the specific coded characters used on the Macintosh to write French and German.
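The recipient's side of this exchange can be sketched in the same spirit. The example below is a minimal illustration under stated assumptions: the entity names are taken to be ISO Latin 1 names, the `local_chars` table is a hypothetical stand-in for whatever mapping the recipient's system provides, and references with no local equivalent are deliberately left in place.

```python
import re

# An entity reference: '&', a name starting with a letter, then ';'.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def unpack(text: str, local_chars: dict) -> str:
    """Replace known entity references with local characters; leave the rest untouched."""
    def replace(match):
        return local_chars.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(replace, text)


# Hypothetical table for a recipient whose system can display these three characters.
local_chars = {"eacute": "é", "ouml": "ö", "szlig": "ß"}
unpacked = unpack("Gr&ouml;&szlig;e, caf&eacute; &amp; th&egrave;", local_chars)
print(unpacked)
```

Leaving unknown references such as `&egrave;` untouched means the document can still be passed on later to a system that does support the character.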
As a second example, consider the same document being transmitted from a VAX running VMS to an IBM mainframe running VM/CMS. Here, the accented characters might be represented on the VAX using the coded character set ISO 8859-1, but since ISO 8859-1 is not always supported, it may be more likely that entity references will be used instead. Let us assume that the network path between the two machines accepts Latin characters and digits, and most punctuation, but garbles square brackets, braces, the hash mark, and the pounds-sterling symbol. In this case, the accented characters require no special work by the sender, since they are already in a network-safe form. Square brackets, etc., must however be replaced by entity references to lbr, rbr, etc. After this is done, the document is no longer conformant SGML, since the square brackets used in certain markup declarations will not be recognized. (It is important, therefore, for validation to be performed before the square brackets are replaced by entity references.)
Upon receipt, the document may be translated into a valid SGML document by replacing all references to lbr and rbr with the appropriate square brackets, etc. If the local system supports one of the IBM code pages with support for French and German characters, then the entity references to those characters may be replaced by the characters in the system character set. More commonly, the entities for French and German characters will be left in place.
As a third example, consider a document containing Greek, as well as Latin, German, French, and English. If the sender's system has a full Greek character set, but the recipient's does not, then the Greek characters must all be replaced either by references to entities or by transliterations into Latin characters (e.g. using the beta code transliteration developed for the Thesaurus Linguae Graecae). If the text is later transmitted to another system which does have a full Greek character set, the transliterated text or entity references may be translated, under control of the relevant writing system declarations, into the local Greek character set.
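To make the transliteration option concrete, here is a deliberately partial sketch of the sender's side. It covers only the 24 basic letters, with capitals marked by a preceding asterisk as in beta code, and it assumes lowercase Latin output; accents, breathings, and other diacritics are not handled and pass through unchanged, so it is not a complete beta code implementation. The function name is hypothetical.

```python
GREEK = "αβγδεζηθικλμνξοπρστυφχψω"
LATIN = "abgdezhqiklmncoprstufxyw"

# Lowercase letters map one to one; final sigma shares the code of sigma.
TO_BETA = dict(zip(GREEK, LATIN))
TO_BETA["ς"] = "s"
# Capitals are written as an asterisk followed by the lowercase code.
for greek, latin in zip(GREEK, LATIN):
    TO_BETA[greek.upper()] = "*" + latin


def to_beta(text: str) -> str:
    """Transliterate basic Greek letters; leave every other character as it is."""
    return "".join(TO_BETA.get(ch, ch) for ch in text)


print(to_beta("Λογος και θεος"))
```

A recipient with a full Greek character set would apply the inverse mapping, with additional rules (for example, choosing final sigma at the end of a word) that this sketch does not attempt.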
As a final example, consider a document written in Japanese, to be transmitted over a network within Japan, or over an international network to a recipient in Europe. Since networks within Japan transmit Japanese text without information loss, and common utilities may be used to recognize any of the existing coded character sets and translate into another, the transmission character set for interchange within Japan may be the same as the system character set. (If user-defined extensions are used in order to allow the encoding of kanji not present in the standard character sets, then these non-standard kanji may need to be replaced either by entity references or by `transliterations' into the standard character sets, and the description of the kanji themselves should accompany the documents in which they are used; the writing system declaration may be used for this purpose.)
When transmitting Japanese text outside Japan, the limitations of the networks at the time of transmission must be taken into account; it may be necessary to transliterate the text, or to replace non-transmissible characters with entity references.
In some cases, no negotiation between interchange partners is possible, because they do not know each other's identities. Since it is impossible to discover an appropriate transmission entity set by experiment, such interchange requires the use of extremely conservative assumptions about the frailties of network gateways.
Specifically, it is recommended that for non-negotiated interchange the following practices be adopted:
By imposing these restrictions, the recommendations ensure that documents interchanged in this way will be directly usable on a great variety of systems; moreover, since the allowed character sets are widely known and well documented, users of other systems will normally be able to adjust the documentation and data stream to their local systems without difficulty.
It should be noted that ISO 646 imposes a very restrictive and cumbersome encoding for researchers whose character sets have a large repertoire; it is strongly recommended, therefore, that such materials use XML and Unicode, or that arrangements for their interchange involve explicitly negotiated interchange formats wherever possible.
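Before sending a document blind, a sender may wish to confirm that it stays within a 7-bit repertoire. The sketch below assumes that staying within the 128 code positions of ISO 646 is a necessary condition for such interchange; it does not check the further restrictions a conservative subset would impose (some ISO 646 positions vary between national versions), and the function name is hypothetical.

```python
def characters_outside_7bit(text: str) -> list:
    """Return (line, column, character) for every character above code 127."""
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                problems.append((line_no, col_no, ch))
    return problems


sample = "<p>caf&eacute;</p>\n<p>Größe</p>"
report = characters_outside_7bit(sample)
for line_no, col_no, ch in report:
    print("line %d, column %d: %r" % (line_no, col_no, ch))
```

An empty report does not prove that a document will survive every gateway; it only rules out the most common cause of garbling.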
The rules given here for non-negotiated interchange are not guaranteed to succeed, and negotiation of interchange formats is therefore required, if any of the following apply:
The descriptions of document interchange in this chapter from time to time refer to software used to pack documents for interchange, or to unpack documents upon receipt. The descriptions do not characterize any specific existing software, but attempt to make clear how such software must work, in a general way. It is hoped that the descriptions will be useful to implementors of packing and unpacking software, but the full specification of such packing and unpacking software is beyond the scope of this chapter. All that can be attempted here is to describe some complications which may arise in the packing and unpacking of documents for interchange, of which implementors of such software should be aware. Most of these difficulties do not arise with XML, because XML requires that the character set be Unicode.