- tag minimization: SGML provides several means of reducing the amount of
markup in a text, such as omission of start and end tags, short start and
end tags, and minimization of attribute values. For example, the following
definitions allow end-tag omission:
<!ELEMENT w - O (orth, pos, lem) >
<!ELEMENT orth - O (#PCDATA) >
<!ELEMENT pos - O (#PCDATA) >
<!ELEMENT lem - O (#PCDATA) >
The following is a full markup for the sentence fragment "The boat sinks...":
<s>
<w><orth>The</orth><pos>DET</pos><lem>the</lem></w>
<w><orth>boat</orth><pos>NNS</pos><lem>boat</lem></w>
<w><orth>sinks</orth><pos>VBZ</pos><lem>sink</lem></w>
...
</s>
With end tag omission this could be replaced by
<s>
<w><orth>The<pos>DET<lem>the
<w><orth>boat<pos>NNS<lem>boat
<w><orth>sinks<pos>VBZ<lem>sink
...
</s>
which in this case is a nearly 50% reduction in the number of characters.
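The saving can be checked by counting characters directly. The following Python sketch compares the two encodings of the three words above. The variable and function names are illustrative only, and the count assumes that line breaks, the "..." placeholder, and the enclosing <s> tags are left out, so the exact figure depends on those assumptions.

```python
# Fully tagged and end-tag-omitted encodings of the three <w> elements above.
full = [
    "<w><orth>The</orth><pos>DET</pos><lem>the</lem></w>",
    "<w><orth>boat</orth><pos>NNS</pos><lem>boat</lem></w>",
    "<w><orth>sinks</orth><pos>VBZ</pos><lem>sink</lem></w>",
]
minimized = [
    "<w><orth>The<pos>DET<lem>the",
    "<w><orth>boat<pos>NNS<lem>boat",
    "<w><orth>sinks<pos>VBZ<lem>sink",
]


def total_chars(lines):
    """Count characters, ignoring the line breaks between <w> elements."""
    return sum(len(line) for line in lines)


full_size = total_chars(full)
min_size = total_chars(minimized)
reduction = 1 - min_size / full_size
print(f"{full_size} -> {min_size} characters ({reduction:.0%} smaller)")
```

Under these assumptions the reduction comes out somewhat below one half. The proportion grows as the character data in each element gets shorter relative to its tags.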
- SGML entities: SGML allows string substitution via entity replacement.
Entity references can be used in place of any string, possibly including markup. So,
for example, a complex feature structure specification which occurs frequently in
the text can be replaced by an entity reference consisting of only a few characters.
The TEI feature structure
<fs type='word structure' id=vbidprx0sgp3>
<f name=category><sym value=verb></f>
<f name=mood><sym value=indic></f>
<f name=tense><sym value=pres></f>
<f name=auxiliary><minus></f>
<f name=agreement>
<fs type='agreement structure' id=sgp3>
<f name=number><sym value=sg></f>
<f name=person><sym value=3></f>
</fs>
</f>
</fs>
could be replaced by the entity reference &VBZ;. Analogous substitutions
for other word categories could yield the following encoding:
<s>
<w><orth>The&DET;<lem>the
<w><orth>boat&NNS;<lem>boat
<w><orth>sinks&VBZ;<lem>sink
...
</s>
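The effect of entity replacement can be imitated outside an SGML parser. The sketch below is only an illustration in Python, not a model of a conforming parser: the function name `expand_entities` and the table `ENTITIES` are hypothetical, only the &VBZ; entity from the text is defined, and replacement text is not rescanned for further references as a real SGML parser would do.

```python
import re

# Replacement text for &VBZ;, copied from the feature structure above.
# In SGML this would be declared in the DTD with an <!ENTITY ...> declaration.
VBZ_FS = """<fs type='word structure' id=vbidprx0sgp3>
<f name=category><sym value=verb></f>
<f name=mood><sym value=indic></f>
<f name=tense><sym value=pres></f>
<f name=auxiliary><minus></f>
<f name=agreement>
<fs type='agreement structure' id=sgp3>
<f name=number><sym value=sg></f>
<f name=person><sym value=3></f>
</fs>
</f>
</fs>"""

# Hypothetical entity table; &DET; and &NNS; are deliberately left undefined.
ENTITIES = {"VBZ": VBZ_FS}

# An entity reference: "&", a name, then ";".
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand_entities(text, entities):
    """Replace each &name; reference with its replacement text.

    References to undefined entities are left untouched.
    """
    def substitute(match):
        return entities.get(match.group(1), match.group(0))

    return ENTITY_REF.sub(substitute, text)


compact = "<w><orth>sinks&VBZ;<lem>sink"
expanded = expand_entities(compact, ENTITIES)
print(len(compact), "characters expand to", len(expanded))
```

The longer and more frequent the replaced structure, the larger the saving from writing only the reference in the text.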
- DATATAG feature: When certain tag sequences occur with regularity, it is
possible to declare a character that is to be interpreted as the end tag of an
element. For example, the following declarations specify that the character "|" can
be interpreted as the end tag for <orth> and <pos>:
<!ELEMENT w - O ([orth,"|"], [pos,"|"], lem) >
<!ELEMENT orth O O (#PCDATA) >
<!ELEMENT pos O O (#PCDATA) >
<!ELEMENT lem O O (#PCDATA) >
<orth>, <pos>, and <lem> are also defined so
as to allow omission of both the start and end tags. This yields the following
possible encoding:
<s>
<w>the|DET|the
<w>boat|NNS|boat
<w>sinks|VBZ|sink
...
</s>
If we also specify that the carriage return implies the end-tag of element
<w>, the encoding could be reduced even further to
<s>
the|DET|the
boat|NNS|boat
sinks|VBZ|sink
...
</s>
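To make explicit what the compact form stands for, the following Python sketch rebuilds the fully tagged markup from the pipe-separated lines. It only mimics the structure that a DATATAG-aware parser would infer from the declarations above. It is not an SGML parser, and the function names are hypothetical.

```python
def expand_word(line):
    """Rebuild the fully tagged <w> element implied by one 'orth|pos|lem' line."""
    # The first two "|" characters act as the end tags of <orth> and <pos>.
    orth, pos, lem = line.split("|")
    return f"<w><orth>{orth}</orth><pos>{pos}</pos><lem>{lem}</lem></w>"


def expand_sentence(compact):
    """Expand the content of one <s> element, one word per non-empty line."""
    words = [expand_word(line) for line in compact.splitlines() if line.strip()]
    return "<s>\n" + "\n".join(words) + "\n</s>"


compact = "the|DET|the\nboat|NNS|boat\nsinks|VBZ|sink"
print(expand_sentence(compact))
```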
- non-SGML notations: It is also possible to use private, less verbose non-SGML
schemes within elements or as attribute values. For example, the encoder could
decide to use a private notation within the <s> element in the example above. If
that notation uses the pipe sign as a separator between word, part of speech, and
lemma, the encoding would be exactly as given above. However, the DTD
would simply specify
<!ELEMENT s - - (#PCDATA) >
which means that the SGML parser will not process the content of the
<s> element in any way. The content would have to be processed by other
software. This is in contrast to the use of DATATAG above, where the SGML parser
(assuming the optional feature DATATAG is implemented) will understand and process
the content of the <s> element as a sequence of <w> elements, each consisting of
three elements.
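With a private notation, the interpretation that the parser would otherwise supply has to be written into the application. The Python sketch below shows one possible form of such "other software", assuming the parser (imitated here by a regular expression) hands over the content of each <s> element as an uninterpreted string. The `Word` record and the function name are hypothetical.

```python
import re
from typing import NamedTuple


class Word(NamedTuple):
    """One token in the private notation: orthographic form, tag, lemma."""
    orth: str
    pos: str
    lem: str


# Stand-in for the SGML parser: it delivers <s> content as plain character data.
S_CONTENT = re.compile(r"<s>(.*?)</s>", re.DOTALL)


def words_in_sentences(document):
    """Yield one list of Word records per <s> element found in the text."""
    for match in S_CONTENT.finditer(document):
        sentence = []
        for line in match.group(1).splitlines():
            fields = line.strip().split("|")
            if len(fields) != 3:
                continue  # blank or malformed line; the DTD cannot catch this
            sentence.append(Word(*fields))
        yield sentence


document = "<s>\nthe|DET|the\nboat|NNS|boat\nsinks|VBZ|sink\n</s>"
for sentence in words_in_sentences(document):
    print([word.lem for word in sentence])
```

Validation of the notation (here, the check for exactly three fields) also becomes the application's responsibility, since the DTD sees only #PCDATA.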