Talk given at
Tsujii Lab, 
University of Tokyo
Tuesday, 29 January 2002
This presentation is available at 
http://nl.ijs.si/et/talks/tei-genia/
Overview:
- TEI: history and organisation
- TEI in Asia
- The architecture of the TEI Guidelines
- The (dis)advantages of using TEI
- TEI core tagset: text and header
- TEI analysis: words, segments and clauses
- Genia lexicon: TEI dictionaries or terminological 
     databases?
- TEI feature structures
- Multiple hierarchies
TEI: history and organisation
- The Text Encoding Initiative was established in 1987 under
the joint sponsorship of the Association for Computers and the
Humanities, the Association for Computational Linguistics, and the
Association for Literary and Linguistic Computing.
- TEI became the only systematised attempt to develop a fully
general text encoding model and set of encoding conventions based upon
it, suitable for processing and analysis of any type of text, in any
language, and intended to serve the increasing range of existing (and
potential) applications and use.
- TEI Guidelines for Electronic Text Encoding and Interchange (TEI
P3) were first published in April 1994 in two substantial green
volumes.
- In May 1999, a revised edition of TEI P3 was produced,
correcting several typographic and other errors. This Revised reprint
is, at the time of writing, the official edition of TEI P3 (it is
also called the `P4beta') and is available online in various formats.
- In December 2000 the TEI Consortium was set up to maintain
and develop the TEI standard. The Consortium has executive offices in
Bergen, Norway, and hosts at four universities.  The Consortium is
managed by a Board of Directors, and its technical work is overseen by
an elected Council.  Lou Burnard is the European Editor and Syd Bauman is the
North American Editor. Institutions and individuals can become
Consortium members or subscribers, which gives them certain benefits inside
the Consortium.
- In June 2001, the TEI Consortium announced the availability of a
preliminary version of a major revision of TEI P3, the TEI P4,
the object of which would be to provide equal support for XML and SGML
applications using the TEI scheme.  The revisions needed to make TEI
P4 have been deliberately restricted to error correction only, with a
view to ensuring that documents conforming to TEI P3 will not become
illegal when processed with TEI P4.
- Many possibilities for other, more fundamental, changes have been
identified. With the establishment of the new TEI Council, it becomes
possible to agree on a programme of work to enhance and modify the
Guidelines more fundamentally over the coming years.  TEI P5
will be the next full revision of the Guidelines. No date has yet been
fixed for its appearance, but work on it will commence early in 2002.
TEI and Asia/Japan
While the TEI is an international effort, the participating nations are
almost exclusively in North America and Europe. All 50 members of the
Consortium are located there, as are all the members of the Board of
Directors and the Council, with the exception of Christian Wittern, of
Kyoto University.
Of the 88 projects listed on the TEI pages as using the Guidelines,
six are centred on Asian languages:
- Chinese Buddhist Electronic Text Association
 Contact: Christian Wittern, Kyoto University
- Chinese/Japanese/Korean-English Dictionary
 Contact: Charles Muller, Toyo Gakuen University, Chiba, Japan
- The Digital Dictionary of Buddhism
 Contact: Charles Muller, Toyo Gakuen University, Chiba, Japan
- Japanese/English Bilingual Corpus
 Contact: Francis Bond, NTT Communication Science Laboratories, Japan
- Japanese Map Task Dialogue Corpus
 Contact: Syun Tutiya, Chiba University, Japan
- Japanese Text Initiative
 Contact: Sachiko Iwabuchi, University of Virginia
Why isn't TEI used more in Asia? I see three possibilities:
- There is, in general, less work of the kind where TEI would be
helpful, i.e. fewer corpora are produced, and fewer (historical, classical)
texts are being made available electronically. This could be due
to differing funding priorities: in Europe there have been a number of
EU project calls focused on language resources, while the US seems to
fund many projects on digitisation of literary materials. However,
at least some Asian countries seem not to lag in this respect. Korea,
for example, has a number of projects aimed at building electronic
text databases (Jin-Dong Kim, personal communication).
- TEI is not known in the region. In the USA and Europe TEI has had
significant exposure, which is rather natural as this is where it was
conceived and where the work on it took place. TEI has also been
featured at various - mostly corpus oriented - events.
- Different character sets used in Asian languages and/or English
documentation and tag names make TEI unappealing in contrast to
locally produced encodings. And even though TEI/SGML/XML are 
able to handle arbitrary character sets, significant expertise is
needed to implement their input and display, while local software
offers ready-made solutions.
The architecture of TEI Guidelines
The Guidelines consist of:
- The 'syntax': the SGML/XML DTD set (download or bake with TEI Pizza Chef)
- The 'semantics': the documentation (in TEI and derived formats)
The TEI encoding scheme consists of a number of modules ('tagsets') or
DTD fragments.
The DTD fragments from which a specific TEI DTD is
constructed are:
- core DTD fragments:
  standard components of the TEI main DTD in all its forms; these are
  always included without any special action by the encoder;
The core consists of:
 
- core text tags:
  These include paragraph, <p>, highlighting 
  (<emph>, and rend attribute),
  quotation, <q>, names, numbers, dates,
  abbreviations, <term>, etc.
- the TEI header:
  describes an encoded work so that the text itself, its source, its
  encoding, and its revisions are all thoroughly documented
 
- base DTD fragments:
  basic building blocks for specific text types; exactly one base must
  be selected by the encoder (unless one of the combined bases is
  used);
Those interesting for GENIA are:
 
-  TEI.prose: the base tag set for prose
-  ?TEI.dictionaries: the base tag set for print dictionaries
-  ?TEI.terminology: the base tag set for terminological data files
-  ?TEI.general: the generic mixed-mode base tag set
 
- additional DTD fragments:
  extra tags useful for particular purposes. All additional tag sets
  are compatible with all bases and with each other; an encoder may
  therefore add them to the selected base in any combination desired.
Those interesting for GENIA are:
 
-  TEI.linking: tags for linking, segmentation, and alignment
-  TEI.analysis: tags for simple analytic mechanisms
-  ?TEI.fs: tags for feature structure analysis
-  ?TEI.nets: tags for graphs, digraphs, trees, and other networks
-  TEI.corpus: additional tags for language corpora
 
- user defined DTD fragments:
  give the possibility of extending / modifying / localising
  the Guidelines
TEI Lite
A particular parametrisation of TEI (a DTD), which implements a useful
`starter set', comprising the elements which almost every user should
know about. 
Some characteristics of TEI Lite: 
- includes most of the TEI `core' tag set;
- handles a reasonably wide variety of texts;
- is useful for the production of new documents as well as
  encoding of existing ones;
- is usable with a wide range of existing SGML software; 
- is derivable from the full TEI DTD using the extension mechanisms;
- is as small and simple as is consistent with the other goals. 
Parametrisation
Developing a version of GPML and XLiNo that is compatible with
the TEI Guidelines?
A preliminary SGML prolog:
<!DOCTYPE TEI.2  SYSTEM "tei2.dtd"  [
  <!ENTITY % TEI.general "INCLUDE">
  <!ENTITY % TEI.prose "INCLUDE">
  <!ENTITY % TEI.dictionaries "INCLUDE">
  <!ENTITY % TEI.terminology "INCLUDE">
  <!ENTITY % TEI.linking "INCLUDE">
  <!ENTITY % TEI.analysis "INCLUDE">
  <!ENTITY % TEI.fs "INCLUDE">
  <!ENTITY % TEI.corpus "INCLUDE">
]>
Encoding Corpora
Given the multiplicity of annotations, encoding corpora is not a simple task.
An overview is given in the presentation by L. Burnard, available at
http://users.ox.ac.uk/~tei/Presentations/TEIcorpus/.
To address this issue, especially for language engineering corpora,
the Corpus Encoding Specification (CES) was developed. It is much more
focused than TEI, but is not extensible or parametrisable.
CES has become a part of the recommendations of EAGLES, the
"Expert Advisory Group on Language Engineering Standards", as the
"Recommendations on corpus encoding".
It should be noted that the successor of EAGLES is the project ISLE,
International Standards for Language Engineering, which also contains
a working group on Computational Lexicons. However, the work of the
group seems to be focused on multilingual lexica, and, at the time of
this writing, they have not yet produced any publicly available
documents.
The (dis)advantages of using TEI
Advantages:
- Using a wide-coverage, well-designed (modular and extensible),
widely accepted and maintained architecture
- TEI has extensive documentation: Guidelines as well as papers and 
documentation of various projects
- Support: specific problems might have been encountered
before (the tei-l public discussion list)
- Various software already exists, and more is likely to become available
- Contributing to open standards and recommendations
Disadvantages:
- "Tag abuse":
 TEI might not have elements / attributes with the exact meaning we
require; results in a tendency to misuse tags for purposes they were
not meant for
- "Tag bloat": 
 being a general purpose recommendation, it can never be optimal for a
specific application; a custom developed DTD will be leaner, with fewer
(redundant) tags
- "TEI for humanities":
 TEI is perhaps least developed for "high level" NLP
applications: it is problematic for encoding ontologies, lexical
databases and feature structures
Possible solutions:
- Ignore TEI completely
 fast development of own DTD, but "reinventing the wheel"; also, low 
interchange value.
- Manage with "out of the box" TEI-Lite encoding
 Used to be done a lot, but now parametrisation is much simpler, so not
much sense in shoehorning project specifics into this particular
subset of TEI.
- Parametrise TEI, and tighten up DTD as project proceeds
 The "proper" way to apply TEI; provides a good development
environment. If necessary, when encoding is fixed, can develop
strict DTD which validates our documents.
- Start with own DTD, develop mapping to TEI for better
interchange
 The case in GENIA
For an "all bases covered" approach, cf. the paper by Gary F. Simons,
Using architectural processing to derive small, problem-specific
XML applications from large widely used SGML applications.
It offers a path to deriving a small, focused, project-specific,
TEI-compatible DTD; however, the technology it proposes is no longer
used very much, although the same effect could be achieved with XSLT.
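As a rough illustration of such a mapping (written in Python rather than XSLT, purely to keep the sketch self-contained), the function below renames elements and attributes of a project-specific encoding to TEI ones. The input names `sentence`, `cons` and `sem` are assumptions made for the example, not a statement of the actual GPML DTD; the target names `s`, `term` and `ana` are the TEI ones used in the text examples of this talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical project-specific names on the left, TEI names on the right.
ELEMENT_MAP = {"sentence": "s", "cons": "term"}
ATTRIBUTE_MAP = {("cons", "sem"): "ana"}


def to_tei(elem):
    """Return a copy of elem with elements and attributes renamed to TEI."""
    new = ET.Element(ELEMENT_MAP.get(elem.tag, elem.tag))
    for name, value in elem.attrib.items():
        new.set(ATTRIBUTE_MAP.get((elem.tag, name), name), value)
    new.text = elem.text   # text before the first child
    new.tail = elem.tail   # text following the element
    for child in elem:
        new.append(to_tei(child))
    return new


def convert(xml_text):
    """Parse a project-specific fragment and serialise its TEI mapping."""
    return ET.tostring(to_tei(ET.fromstring(xml_text)), encoding="unicode")


if __name__ == "__main__":
    print(convert('<sentence><cons sem="sem-003">RA</cons> is a morphogen.</sentence>'))
```

Keeping the mapping in two small tables means it can be tightened as the project encoding settles, much as one would maintain the templates of an XSLT stylesheet.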
TEI core and prose tagsets
The header consists of four parts:
-  a file description, <fileDesc>, containing a full
  bibliographical description of the computer file itself, from which
  proper bibliographic citation can be derived. The file description
  includes information about the source or sources from which the
  electronic text was derived.
-  an encoding description, <encodingDesc>, which describes the
  relationship between an electronic text and its source or sources.
  It allows for detailed description of whether (or how) the text was
  normalised during transcription, how the encoder resolved
  ambiguities in the source, what levels of encoding or analysis were
  applied, and similar matters.
-  a text profile, <profileDesc>, containing classificatory
  and contextual information about the text, such as its subject
  matter, the situation in which it was produced, the individuals
  described by or participating in producing it, and so forth.  Such a
  text profile is of particular use in highly structured composite
  texts such as corpora or language collections, where it is often
  highly desirable to enforce a controlled descriptive vocabulary or
  to perform retrievals from a body of text in terms of text type or
  origin. 
- a revision history, <revisionDesc>, which allows the encoder
  to provide a history of changes made during the development of the
  electronic text. The revision history is important for version
  control and for resolving questions about the history of a file. 
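The four-part structure can be sketched programmatically. The following Python fragment is only a skeleton: it creates the four header parts plus a title, and is not meant to be valid against the TEI DTD, which requires more content (for example in the file description). The function name `make_header` is invented for the illustration.

```python
import xml.etree.ElementTree as ET


def make_header(title):
    """Build a bare teiHeader skeleton with its four parts."""
    header = ET.Element("teiHeader")
    # File description: bibliographic description of the file itself.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    # Encoding description, text profile and revision history, left empty.
    for name in ("encodingDesc", "profileDesc", "revisionDesc"):
        ET.SubElement(header, name)
    return header


if __name__ == "__main__":
    print(ET.tostring(make_header("Sample abstract"), encoding="unicode"))
```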
An example of a header.
Text example
<body>
<div type="article">
  <head>Retinoic acid downmodulates erythroid differentiation and 
    <term ana="sem-000">GATA1 expression</term> in <term ana="sem-001">purified
    adult-progenitor culture</term>.
   </head>
   <bibl><xref>MEDLIN:94129004</xref></bibl>
   <div type="abstract">
     <p>
       <s><term ana="sem-002">All-trans retinoic acid</term> (<term
       ana="sem-003">RA</term>) is an important <term
       ana="sem-004">morphogen</term> in vertebrate development, a
       normal constituent in <term ana="sem-005">human adult
       blood</term> and is also involved in the control of cell growth
       and differentiation in <term ana="sem-006">acute promyelocytic
       leukemia</term>.</s>
...
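One benefit of such markup is that the annotated terms are easy to retrieve. As a minimal sketch, assuming an abridged and closed version of the sentence above, the following Python code lists each `<term>` together with the value of its `ana` attribute; the function name `list_terms` is invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Abridged from the example above so that the fragment is well-formed.
SAMPLE = """<s><term ana="sem-002">All-trans retinoic acid</term> (<term
ana="sem-003">RA</term>) is an important <term
ana="sem-004">morphogen</term> in vertebrate development.</s>"""


def list_terms(xml_text):
    """Return (ana, text) pairs for every term, in document order."""
    root = ET.fromstring(xml_text)
    pairs = []
    for term in root.iter("term"):
        # Collapse line breaks and repeated spaces inside the term.
        text = " ".join("".join(term.itertext()).split())
        pairs.append((term.get("ana"), text))
    return pairs


if __name__ == "__main__":
    for ana, text in list_terms(SAMPLE):
        print(ana, text)
```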
TEI.analysis: words, segments and clauses
The TEI module for simple linguistic
analysis contains elements for arbitrary (possibly nested)
segments of text, <seg>, as well as elements for words and clauses:
<div type="text">
  <p>
    <s>
      <cl ana="lex201 lex202" function="(AND lex201 lex202)">
        <term ana="lex203"><w ana="FW">hypo-</w></term>
        <w ana="CC">and</w>
        <term ana="lex204"><w ana="JJ">hyper</w></term>
        <term ana="lex205"><w ana="NN">cortisolism</w></term>
        <c>.</c>
      </cl>
    </s>
  </p>
</div>
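To show how such an analysis might be consumed, here is a small Python sketch that reads the sentence from the example above and returns each word with its tag, together with the lexicon pointers of the enclosing terms. It assumes, as the example suggests, that `ana` on `<w>` carries the part-of-speech tag and `ana` on `<term>` a lexicon entry id; the function names are invented for the illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<s>
  <cl ana="lex201 lex202" function="(AND lex201 lex202)">
    <term ana="lex203"><w ana="FW">hypo-</w></term>
    <w ana="CC">and</w>
    <term ana="lex204"><w ana="JJ">hyper</w></term>
    <term ana="lex205"><w ana="NN">cortisolism</w></term>
    <c>.</c>
  </cl>
</s>"""


def tagged_words(xml_text):
    """Return (word, tag) pairs for every <w>, in document order."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("ana")) for w in root.iter("w")]


def term_pointers(xml_text):
    """Map each term's lexicon pointer to the words it contains."""
    root = ET.fromstring(xml_text)
    return {t.get("ana"): [w.text for w in t.iter("w")] for t in root.iter("term")}


if __name__ == "__main__":
    print(tagged_words(SAMPLE))
    print(term_pointers(SAMPLE))
```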
Note: if the current GPML model is used, where text and lexica
co-exist in a document, then TEI.general must be used, which enables
the combination of several base tagsets.
Both candidate base tagsets are problematic for encoding a lexical
database or ontology:
- TEI.dictionaries is oriented towards printed dictionaries
- TEI.terminology is closer to what is needed for GENIA, but the
Guidelines themselves caution:
 "Since its first publication, this chapter has been rendered obsolete in several respects,
 chiefly as a result of the publication of ISO 12200, and a variant of it (TBX) which has been
 recently adopted by LISA, the Localisation Industry Standard Association. Work is
 currently ongoing in the ISO community to define a generic platform for terminological
 markup (ISO CD 16642, TMF : Terminological Markup Framework), in the light of which
 it is anticipated that the recommendations of the present chapter will be substantially
 revised. Readers are cautioned in particular that the discussion below of `nested' and `flat'
 structures is now far removed from current practices in the terminological field. A major
 revision of this chapter is planned for the next edition of these Guidelines."
 
TEI feature structures
Background:
A rationale for the TEI recommendations for feature-structure
markup, by D. Terence Langendoen and Gary F. Simons, Computers and the Humanities, 29, (1995).
The proposal is rather complicated and is composed of two parts:
- TEI Feature Structures (TEI.fs), an additional tagset for marking up
the text with feature structures, and
- the TEI Feature Structure Declaration (FSD), with a special DTD, for
defining feature values and names, their descriptions and constraints
on valid feature structures.
Disadvantages:
- Not many applications of these guidelines (any?)
- Tailored towards GPSG rather than HPSG: limited support for a type
hierarchy, no marking for reentrancy
- SGML version uses (unsupported) SUB-DOC for linking the FSD
with the annotated document
- No mechanisms for checking validity of specified feature structures
Below is an example of morphosyntactic tagging, from the ELAN corpus:
<seg id="ecmr.en.3663" corresp="ecmr.sl.3663">
<w ana="Dd" ctag="DT DT" lemma="the">The</w> 
<w ana="Afs" ctag="JJS JJS" lemma="high">highest</w> 
<w ana="Ncns" ctag="NN VB" lemma="pay">pay</w> 
<w ana="Ncns" ctag="NN NN" lemma="increase">increase</w> 
<w ana="Vais3s" ctag="VBD BEDZ" lemma="be">was</w> 
<w ana="Vmps" ctag="VBN VBN" lemma="record">recorded</w> 
<w ana="Sp" ctag="IN IN" lemma="in">in</w> 
<w ana="Ncns" ctag="NN NN" lemma="manufacture">manufacturing</w>
<c ctag=".">.</c>
</seg>                                                          
<fs type="Verb" id="Vmn"      select="en sl" feats="V1.m V2.n"></fs>
<fs type="Verb" id="Vmnp"     select="en" feats="V1.m V2.n V3.p"></fs>
<fs type="Verb" id="Vmp--dfp" select="sl" feats="V1.m V2.p V5.d V6.f V7.p"></fs>
<fs type="Verb" id="Vmp--dmp" select="sl" feats="V1.m V2.p V5.d V6.m V7.p"></fs>
<fs type="Verb" id="Vmp--dnp" select="sl" feats="V1.m V2.p V5.d V6.n V7.p"></fs>
<fs type="Verb" id="Vmp--pfp" select="sl" feats="V1.m V2.p V5.p V6.f V7.p"></fs>
<f select="bg cs en et hu ro sl" id="N1.p" name="Type"><sym value="proper"></f>
<f select="bg cs en ro sl" id="N2.m" name="Gender"><sym value="masculine"></f>
<f select="bg cs en et hu ro sl" id="N3.p" name="Number"><sym value="plural"></f>
<f select="cs hu sl" id="N4.a" name="Case"><sym value="accusative"></f>
<f select="cs hu sl" id="N4.d" name="Case"><sym value="dative"></f>
<f select="bg ro" id="N5.n" name="Definiteness"><sym value="no"></f>
<f select="bg ro" id="N5.y" name="Definiteness"><sym value="yes"></f>
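The indirection in this encoding (a word points to an `<fs>`, whose `feats` attribute points to `<f>` elements) can be resolved with a few lines of code. The Python sketch below makes several assumptions that are not in the corpus itself: the wrapping `<lib>` element and the `<fs>` with id `Npmp` are hypothetical, constructed so that it can point at three of the `<f>` definitions shown above, and the `<sym>` elements are written as XML empty elements so that a plain XML parser can read them.

```python
import xml.etree.ElementTree as ET

# <lib> and the Npmp entry are hypothetical; the <f> entries follow the
# definitions above, minus the select attribute.
LIBRARY = """<lib>
  <fs type="Noun" id="Npmp" feats="N1.p N2.m N3.p"/>
  <f id="N1.p" name="Type"><sym value="proper"/></f>
  <f id="N2.m" name="Gender"><sym value="masculine"/></f>
  <f id="N3.p" name="Number"><sym value="plural"/></f>
</lib>"""


def expand(xml_text, fs_id):
    """Resolve the feats pointers of one <fs> into a name -> value dict."""
    root = ET.fromstring(xml_text)
    features = {f.get("id"): f for f in root.iter("f")}
    for fs in root.iter("fs"):
        if fs.get("id") == fs_id:
            resolved = {}
            for ref in fs.get("feats", "").split():
                feature = features[ref]  # KeyError if the pointer dangles
                resolved[feature.get("name")] = feature.find("sym").get("value")
            return resolved
    raise KeyError(fs_id)


if __name__ == "__main__":
    print(expand(LIBRARY, "Npmp"))
```

A dangling pointer raises `KeyError`, which gives a crude consistency check over the feature library.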
Another example tries to implement a few definitions from HPSG, as given
in the formalism of the Attribute Logic Engine:
<!DOCTYPE teiFsd2 PUBLIC "-//TEI P4//DTD Auxiliary Document Type: 
    Feature System Declaration//EN">
     <teiFsd2>
        <teiHeader>
           <!-- The header is as for any TEI.2 document -->
        </teiHeader>
<!-- ALE HPSG:
  bot sub [bool, case, cat, c_inds, conx,...].
  bool sub [minus, plus].
    minus sub [].
    plus sub [].
-->
        <fsDecl type='bool'  baseType='bot'></fsDecl> 
        <fsDecl type='plus'  baseType='bool'></fsDecl>
        <fsDecl type='minus' baseType='bool'></fsDecl>
<!-- The preceding is currently illegal! -->
<!-- ALE HPSG:
    subst sub [adj, noun, prep, reltvzr, verb]
          intro [prd:bool, 
                 mod:mod_synsem].
...
      verb sub [] 
           intro [aux:bool, 
                  inv:bool, 
                  vform:vform].
-->
        <fsDecl type='verb' baseType='subst'>
           <fsDescr>Type definition for verbs</fsDescr>
           <fDecl name='vform'>
            <sym value='vform'/>                     <!--option 1-->
           </fDecl>
           <fDecl name='aux'>
             <vRange><fs type='bool'></fs></vRange> <!--option 2-->
           </fDecl>
           <fDecl name='inv'>
             <vRange>
               <vAlt><plus/><minus/></vAlt>         <!--option 3-->
             </vRange>
           </fDecl>
        </fsDecl>
     </teiFsd2>
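If `baseType` were allowed on `<fsDecl>` as proposed above, a type hierarchy could be read off the declaration and used for simple checks outside the DTD. The Python sketch below assumes exactly that extension, and restricts itself to the `bool` fragment of the hierarchy; the function names are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Only the bool fragment; baseType on fsDecl is the talk's proposed extension.
FSD = """<teiFsd2>
  <fsDecl type="bool" baseType="bot"/>
  <fsDecl type="plus" baseType="bool"/>
  <fsDecl type="minus" baseType="bool"/>
</teiFsd2>"""


def parent_map(fsd_text):
    """Map every declared type to its immediate supertype."""
    root = ET.fromstring(fsd_text)
    return {d.get("type"): d.get("baseType") for d in root.iter("fsDecl")}


def subsumes(general, specific, parent_of):
    """True if general equals specific or is one of its ancestors."""
    seen = set()
    current = specific
    while current is not None and current not in seen:
        if current == general:
            return True
        seen.add(current)            # guards against cyclic declarations
        current = parent_of.get(current)
    return False


if __name__ == "__main__":
    parents = parent_map(FSD)
    print(subsumes("bool", "plus", parents))
    print(subsumes("plus", "minus", parents))
```

A check of this kind is one way to approach the missing validity checking noted among the disadvantages, for instance by testing that the value of `aux` is subsumed by `bool`.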
Multiple hierarchies
Design of multiple hierarchies for GENIA annotation; multiple
hierarchies are a hot topic in the XML world.
Cf. the paper Implementing Concurrent Markup in XML.
One possibility is the use of stand-off markup, as advocated e.g. in
CES and by LTG, which can be implemented using XML XLink.
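The idea behind stand-off markup can be sketched without XLink: the base text carries only word ids, and each annotation layer lives in a separate document that points at those ids, so that layers may overlap freely. In the Python sketch below, the `<annotations>` and `<annot>` elements and their `type` and `targets` attributes are hypothetical names chosen for the illustration, and plain id references stand in for XLink/XPointer addressing.

```python
import xml.etree.ElementTree as ET

# Base text: words with ids only.
TEXT = """<s><w id="w1">hypo-</w> <w id="w2">and</w> <w id="w3">hyper</w>
<w id="w4">cortisolism</w></s>"""

# Separate annotation layer; the two terms share w4, which a single
# embedded hierarchy could not express directly.
TERMS = """<annotations>
  <annot type="term" targets="w1 w4"/>
  <annot type="term" targets="w3 w4"/>
</annotations>"""


def resolve(text_xml, annot_xml):
    """Return (type, [words]) for each stand-off annotation."""
    words = {w.get("id"): w.text for w in ET.fromstring(text_xml).iter("w")}
    resolved = []
    for annot in ET.fromstring(annot_xml).iter("annot"):
        ids = annot.get("targets").split()
        resolved.append((annot.get("type"), [words[i] for i in ids]))
    return resolved


if __name__ == "__main__":
    for kind, tokens in resolve(TEXT, TERMS):
        print(kind, tokens)
```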
XML and relatives
Links:
- XML: used in TEI P4
- XSLT: useful for converting GPML to TEI and for corpus renderings/visualisations
- XPath: used in XSLT
- XPointer: for connecting GENIA resources
Tomaž Erjavec, 
2002-02-12