<character>

<character> defines one unit in a writing system, supplementing or overriding information provided in the base coded character sets, writing system declarations, and entity sets.

Attributes (In addition to global attributes)

class describes the function of the character using a prescribed classification.
Datatype: (lexical | punc | lexpunc | digit | space | DL | LD | dia | joiner | other)

Legal values are:

lexical character is used in writing words (lexical items) of the language (includes members of syllabaries and ideographic systems, as well as composite letter-plus-diacritic combinations)

punc character is a punctuation mark which does not appear within lexical items

lexpunc character can appear as a normal punctuation mark, but can also appear within a lexical item (and should usually, when occurring between two lexical characters, be treated as lexical—in English, hyphen and apostrophe are typically treated as members of this class)

digit character is an Arabic decimal numeral (0, 1, ... 9) (does not include superscript numbers, circled numbers, numeric dingbats, etc.)

space character represents some form of white space (space character, horizontal or vertical tab, newline, etc.)

DL character is a diacritic applying to the following lexical character

LD character is a diacritic applying to the preceding lexical character

dia character is a diacritic which is explicitly joined to a lexical character by a joiner character

joiner character is used to join a diacritic to the lexical character to which it applies (in some encoding schemes, the backspace control character may be used as a joiner; in others, a graphic character is used for the same function)

other character does not fall into any of the other classes (dingbats and other unusual characters fall here)

Default: lexical

Example:
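
As an illustrative sketch (not an example taken from the Guidelines), the class attribute might be used as follows. The descriptive wording is invented for illustration. Because the content of <form> is not specified in this section, the <desc> child shown inside each <form> is an assumption and serves only as a placeholder.

```xml
<!-- Hyphen: ordinary punctuation that can also occur inside a word -->
<character class="lexpunc">
  <desc>hyphen, as in "well-known"</desc>
  <form>
    <desc>placeholder: content of form is assumed, not specified here</desc>
  </form>
</character>

<!-- No class attribute given, so the declared default "lexical" applies -->
<character>
  <desc>the letter a</desc>
  <form>
    <desc>placeholder: content of form is assumed, not specified here</desc>
  </form>
</character>
```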

Note
The classification of characters provided by this attribute serves both informative and normative purposes: it helps identify the character being described, and the classification is used to define the meaning of the special character-class codes in the TEI extended pointer syntax described in chapter 14 Linking, Segmentation, and Alignment.

Example

Note
The notion of `characters' as units in a writing system is widespread, but not consistently defined; the <character> element should be used to identify whatever units the encoder wishes to distinguish as the meaningfully distinct graphic units of the writing system. In most cases, these will correspond to the units of coded character sets, but this is not a requirement: a-umlaut, for example, may be treated as one character or two, depending on the user's preference, regardless of how the coded character set in use treats it. In most cases, also, the units distinguished by the <character> element will be the `graphemic' units of the writing system in question; however, since experts disagree on whether items like umlaut (let alone a given set of Chinese characters with regional variations in China, Korea, and Japan) are best treated as distinct graphemes or not, the association of <character> elements with the graphemes of a writing system provides at most a heuristic device for making reasonable decisions, rather than a definitive, unambiguous test.

Different forms of the same `character' may be distinguished for whatever reason, as in the three-R example of chapter 4 Languages and Character Sets. In this case the different letter forms are distinguished by documenting them in different <form> elements; the fact that the different letter shapes do not make a lexical difference in the text may be expressed by grouping all three letter forms under the same <character> element. (Alternatively, the three forms may be treated as three distinct characters, for convenience or for whatever reason, by defining a distinct <character> element for each.)
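
The two groupings described above might be sketched as follows. This is a hypothetical illustration, not the three-R example from chapter 4: the descriptions are invented, and the <desc> children inside <form> are assumed placeholders, since the content of <form> is not specified in this section. Only the outer structure (optional descriptions, one or more forms, optional notes) follows the declaration of <character>.

```xml
<!-- Option 1: one character with three forms that make no lexical difference -->
<character class="lexical">
  <desc>the letter r</desc>
  <form><desc>first letter shape (placeholder)</desc></form>
  <form><desc>second letter shape (placeholder)</desc></form>
  <form><desc>third letter shape (placeholder)</desc></form>
  <note>The three shapes are grouped because they are not lexically distinct.</note>
</character>

<!-- Option 2: the same three shapes treated as three distinct characters -->
<character class="lexical">
  <desc>the letter r, first shape</desc>
  <form><desc>first letter shape (placeholder)</desc></form>
</character>
<character class="lexical">
  <desc>the letter r, second shape</desc>
  <form><desc>second letter shape (placeholder)</desc></form>
</character>
<character class="lexical">
  <desc>the letter r, third shape</desc>
  <form><desc>third letter shape (placeholder)</desc></form>
</character>
```

The first option records that the shapes are variants of one unit; the second keeps each shape separately addressable, at the cost of losing that grouping.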

Module Declared in file teiwsd2; Auxiliary tag set for Writing System Declarations

Data Description May contain an optional series of <desc> elements, followed by one or more <form> elements identifying different forms of the character, and then an optional series of <note> elements.

May contain desc form note

May occur within

Declaration
<!ELEMENT character %om.RO; (desc*, form+, note*)>
<!ATTLIST character
      %a.global;
      class (lexical | punc | lexpunc | digit | space | DL | LD | dia | joiner | other) "lexical">

See further 25.4.2 Exceptions in the WSD

Text Encoding Initiative

The XML Version of the TEI Guidelines