Text Encoding Initiative

The XML Version of the TEI Guidelines

<form>


<form> (letter form) identifies one letter form taken by a particular character in a writing system declaration.
Attributes (In addition to global attributes)
string gives the byte string used to encode the letter form in the text.
Datatype: CDATA
Values: any string of characters (often a single byte)
Default: #IMPLIED
Example:
<form string="a/">
  <desc>lowercase Greek alpha with acute accent</desc>
</form>
Note

If the character is encoded only using entity references, then the value of string should be '' (the empty string).

In coded character sets which use character-set shifting (e.g. JIS 0208), the string attribute should typically contain the required shift characters, in order to render the value unambiguous. In such a case, there is no expectation that every occurrence of the character will be immediately preceded by the shift sequence; processing software is responsible for understanding the shift mechanism and acting accordingly.

The same string value may not appear on more than one <form> element (except the empty string), unless each occurrence is associated with a different coded character set.
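
The uniqueness rule above can be checked mechanically. The following is a minimal sketch in Python, not part of the Guidelines: the function name `duplicate_strings`, the wrapper element `<wsd>` and the identifiers `cs1` and `cs2` are hypothetical, and the input is assumed to be well-formed XML rather than SGML with omitted tags.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def duplicate_strings(wsd_xml):
    """Return (string, codedCharSet) pairs carried by more than one <form>."""
    root = ET.fromstring(wsd_xml.strip())
    counts = Counter()
    for form in root.iter("form"):
        value = form.get("string")
        # A missing string attribute and the empty string are both exempt.
        if not value:
            continue
        # The same string is allowed again under a different coded character set.
        counts[(value, form.get("codedCharSet"))] += 1
    repeated = [key for key, n in counts.items() if n > 1]
    return sorted(repeated, key=lambda key: (key[0], key[1] or ""))

# Hypothetical sample data; <wsd> is only a wrapper for the illustration.
SAMPLE = """
<wsd>
  <form string="a/"><desc>lowercase Greek alpha with acute accent</desc></form>
  <form string="a/"><desc>a second form reusing the same string</desc></form>
  <form string="b" codedCharSet="cs1"><desc>first set</desc></form>
  <form string="b" codedCharSet="cs2"><desc>second set</desc></form>
  <form string=""><desc>encoded by entity reference only</desc></form>
  <form string=""><desc>also encoded by entity reference only</desc></form>
</wsd>
"""

print(duplicate_strings(SAMPLE))
```

In this sample only the repeated `a/` is reported: the two `b` forms name different coded character sets, and the empty strings are exempt.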

codedCharSet (coded character set) specifies which base coded character set the string value occurs in.
Datatype: IDREF
Values: a reference to the identifier of a <codedCharSet> element in the current writing system declaration.
Default: #IMPLIED
Example:

Note

If more than one <codedCharSet> is specified as a base component of the writing system declaration, then it is expected that character-set shifting is in use, as described in ISO 2022 or some equivalent. In this case, each <form> element which has a value for the string attribute should also identify, by means of the codedCharSet attribute, which coded character set actually contains the string in question. Proper shifting among character sets is the responsibility of the user.

entityStd (standard entity name) gives the name of one or more entities defined for this character form in some standard entity set(s).
Datatype: ENTITIES
Values: One or more valid SGML entity names declared in the document type definition of the WSD; the entity must also be included in an entity set mentioned in an <entitySet> declaration in the current writing system declaration or in some base writing system referred to by a <baseWsd> element.
Default: #IMPLIED
Example:
<form entityStd="thorn">
  <desc>lowercase Old English or Icelandic thorn</desc>
</form>
Note

If the same letter form is defined by more than one public entity set, more than one value may appear in this attribute.

The same entity name may not appear in the entityStd or entityLoc attributes of more than one <form> element.

entityLoc (local entity name) gives one or more entity names used locally for this character form.
Datatype: ENTITIES
Values: One or more valid SGML entity names declared in the document type definition of the WSD; the entity must also be included in an entity set mentioned in an <entitySet> declaration in the current writing system declaration or in some base writing system referred to by a <baseWsd> element.
Default: #IMPLIED
Example:
<form entityStd="thorn" entityLoc="t">
  <desc>lowercase Old English or Icelandic thorn</desc>
  <note>The standard entity name is <ident>thorn</ident>; the local entity
    <ident>t</ident> is used for brevity and legibility.</note>
</form>
Note

The same entity name may not appear in the entityStd or entityLoc attributes of more than one <form> element.
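
The constraint on entity names can be tested in a similar way. The following Python sketch is an illustration only: the function name `repeated_entity_names` and the wrapper element `<wsd>` are hypothetical, the input is assumed to be well-formed XML, and names are compared exactly as written.

```python
import xml.etree.ElementTree as ET

def repeated_entity_names(wsd_xml):
    """Return entity names used by entityStd or entityLoc on more than one <form>."""
    root = ET.fromstring(wsd_xml.strip())
    owners = {}  # entity name -> positions of the <form> elements that use it
    for position, form in enumerate(root.iter("form")):
        for attribute in ("entityStd", "entityLoc"):
            # ENTITIES values are whitespace-separated lists of names.
            for name in (form.get(attribute) or "").split():
                owners.setdefault(name, set()).add(position)
    return sorted(name for name, forms in owners.items() if len(forms) > 1)

# Hypothetical sample data; <wsd> is only a wrapper for the illustration.
SAMPLE = """
<wsd>
  <form entityStd="thorn" entityLoc="t"><desc>lowercase thorn</desc></form>
  <form entityLoc="thorn"><desc>another form reusing the name</desc></form>
  <form entityStd="eth" entityLoc="eth"><desc>lowercase eth</desc></form>
</wsd>
"""

print(repeated_entity_names(SAMPLE))
```

A name that occurs in both attributes of one and the same <form> element is not reported, since the rule concerns reuse across different <form> elements.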

ucs-4 (universal-character-set code) gives the position of the character form in the thirty-two bit ‘universal character set’ defined by ISO 10646.
Datatype: CDATA
Values: one or more sets of two or four two-digit hexadecimal numbers, each set giving a valid ISO 10646 code point for the character form; for legibility the two-digit hexadecimal numbers should be separated by hyphens. If more than one UCS-4 code is associated with a given character form, the codes should be separated by blanks. If the character form is associated with a sequence of UCS-4 codes (e.g. a base character followed by one or more non-spacing diacritics), then the components of the sequence should be separated by +.
Default: #IMPLIED
Example:

Note

The same UCS-4 code (or sequence) may not appear within more than one <character> element within the writing system declaration. It may however appear on several forms of the same character.

Multiple UCS-4 codes can be given for a single character; this allows sequences treated as distinct by ISO 10646 to be documented as referring to a single ‘character’ as defined by the WSD (e.g. ‘lowercase a-umlaut’ and ‘lowercase a’ plus ‘umlaut’).

If a single UCS-4 code is to be treated as relating to two distinct ‘characters’ as defined by the WSD (e.g. to reverse the effects of Han unification on some character), then one of the <character> elements should be associated with the UCS-4 code in the normal way, and the others should call attention to the relevant UCS-4 code by a comment in a <note> element.
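
The value syntax described for the ucs-4 attribute (blanks between alternative codes, + between the members of a sequence, hyphens between two-digit hexadecimal numbers) can be sketched as a small Python parser. This is an illustration under stated assumptions, not a normative definition: the function names `parse_code` and `parse_ucs4` are hypothetical, a code of two hexadecimal pairs is assumed to abbreviate a full four-pair code with leading zeros, and unhyphenated codes are accepted because the hyphens are only recommended for legibility.

```python
import re

_HEX_PAIR = re.compile(r"[0-9A-Fa-f]{2}")

def parse_code(text):
    """Convert one code such as '00-E4' or '00-00-00-E4' to an integer."""
    if "-" in text:
        pairs = text.split("-")
    else:
        # Assumption: hyphens are optional, so cut the text into pairs.
        pairs = [text[i:i + 2] for i in range(0, len(text), 2)]
    valid = len(pairs) in (2, 4) and all(_HEX_PAIR.fullmatch(p) for p in pairs)
    if not valid:
        raise ValueError(f"not a UCS-4 code: {text!r}")
    return int("".join(pairs), 16)

def parse_ucs4(value):
    """Return a list of alternatives, each a list of code points in sequence."""
    return [[parse_code(code) for code in alternative.split("+")]
            for alternative in value.split()]

# 'lowercase a-umlaut' given both as one code and as base letter plus diacritic.
alternatives = parse_ucs4("00-E4 00-61+03-08")
print(alternatives)
print(["".join(chr(point) for point in sequence) for sequence in alternatives])
```

Each inner list is one alternative encoding of the same character form; a list with more than one member is a sequence such as a base character followed by a non-spacing diacritic.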

Example

Note

The <form> element documents one form of a character; in most cases, there will be only one. If more than one form is given, in general, they are to be regarded as free variants of the character unless otherwise specified in the notes.

The distinction between <character> and <form> makes it possible to distinguish, in an encoding, among different letter forms (which may have historical, aesthetic, linguistic, or other significance) without having to claim that the different forms constitute different ‘characters’ in any normal sense. (Using the technical terms occasionally encountered, the <form> element can be used to record each allograph of a given character or grapheme.) The concepts of ‘character’ and ‘letter form’, however, vary from analyst to analyst; the decision to treat a given set of forms as a single character or as a set of characters is not always obvious, and may require the application of considerable learning and judgement. The <note> element should be used to record the reasoning behind any particularly difficult decision.

Module: Declared in file teiwsd2; Auxiliary tag set for Writing System Declarations
Data Description: May contain a series of description elements, optionally one or more figure elements showing the character form in question, and optionally a series of notes.
May contain: desc extFigure figure note
May occur within: dictScrap eg entry entryFree form hom re sense superEntry trans
Declaration
<!ELEMENT form %om.RO;  (desc+, (figure | extFigure)*, note*)> 
<!ATTLIST form  
      %a.global;
      string CDATA #IMPLIED
      codedCharSet IDREF #IMPLIED
      entityStd ENTITIES #IMPLIED
      entityLoc ENTITIES #IMPLIED
      ucs-4 CDATA #IMPLIED>
See further 25.4.2 Exceptions in the WSD
