TEI and GENIA

Talk given at Tsujii Lab, University of Tokyo

Tomaž Erjavec

Tuesday, 29 January 2002

This presentation is available at http://nl.ijs.si/et/talks/tei-genia/

Overview:

  1. TEI: history and organisation
  2. TEI in Asia
  3. The architecture of the TEI Guidelines
  4. The (dis)advantages of using TEI
  5. TEI core tagset: text and header
  6. TEI analysis: words, segments and clauses
  7. Genia lexicon: TEI dictionaries or terminological databases?
  8. TEI feature structures
  9. Multiple hierarchies




TEI: history and organisation


TEI and Asia/Japan

While the TEI is an international effort, the nations involved are almost exclusively in North America and Europe. All 50 members of the Consortium are located there, as are the members of the Board of Directors and the Council, with the sole exception of Christian Wittern of Kyoto University.

Of the 88 projects listed on the TEI pages as using the Guidelines, six are centred on Asian languages:

  1. Chinese Buddhist Electronic Text Association
    Contact: Christian Wittern, Kyoto University
  2. Chinese/Japanese/Korean-English Dictionary
    Contact: Charles Muller, Toyo Gakuen University, Chiba, Japan
  3. The Digital Dictionary of Buddhism
    Contact: Charles Muller, Toyo Gakuen University, Chiba, Japan
  4. Japanese/English Bilingual Corpus
    Contact: Francis Bond, NTT Communication Science Laboratories, Japan
  5. Japanese Map Task Dialogue Corpus
    Contact: Syun Tutiya, Chiba University, Japan
  6. Japanese Text Initiative
    Contact: Sachiko Iwabuchi, University of Virginia
Why isn't TEI used more in Asia? I see three possibilities:

The architecture of TEI Guidelines

The Guidelines consist of:
  1. The 'syntax': the SGML/XML DTD set (download or bake with TEI Pizza Chef)
  2. The 'semantics': the documentation (in TEI and derived formats)
The TEI encoding scheme consists of a number of modules ('tagsets') or DTD fragments.

The DTD fragments from which a specific TEI DTD is constructed are:

core DTD fragments
standard components of the TEI main DTD in all its forms; these are always included without any special action by the encoder;

The core consists of:

base DTD fragments
basic building blocks for specific text types; exactly one base must be selected by the encoder (unless one of the combined bases is used);

Those interesting for GENIA are:

additional DTD fragments
extra tags useful for particular purposes. All additional tag sets are compatible with all bases and with each other; an encoder may therefore add them to the selected base in any combination desired.

Those interesting for GENIA are:

user defined DTD fragments
give the possibility of extending / modifying / localising the Guidelines

TEI Lite

A particular parametrisation of TEI (a DTD) which implements a useful 'starter set', comprising the elements which almost every user should know about. Some characteristics of TEI Lite:

Parametrisation

Developing a version of GPML and XLiNo that is compatible with TEI guidelines?

A preliminary SGML prolog:

<!DOCTYPE TEI.2  SYSTEM "tei2.dtd"  [
  <!ENTITY % TEI.general "INCLUDE">
  <!ENTITY % TEI.prose "INCLUDE">
  <!ENTITY % TEI.dictionaries "INCLUDE">
  <!ENTITY % TEI.terminology "INCLUDE">
  <!ENTITY % TEI.linking "INCLUDE">
  <!ENTITY % TEI.analysis "INCLUDE">
  <!ENTITY % TEI.fs "INCLUDE">
  <!ENTITY % TEI.corpus "INCLUDE">
]>
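
While the parametrisation is still changing, a prolog like the one above can be generated from a list of tagset names. The following Python sketch is only an illustration: the function name `build_prolog` and the list `GENIA_TAGSETS` are hypothetical, and the entity names are simply those used in the prolog above.

```python
def build_prolog(tagsets, root="TEI.2", dtd="tei2.dtd"):
    """Return an SGML prolog that switches on the given TEI tagsets."""
    lines = [f'<!DOCTYPE {root}  SYSTEM "{dtd}"  [']
    for name in tagsets:
        # A tagset is enabled by setting its parameter entity to INCLUDE.
        lines.append(f'  <!ENTITY % TEI.{name} "INCLUDE">')
    lines.append("]>")
    return "\n".join(lines)


# The tagsets selected in the preliminary prolog above.
GENIA_TAGSETS = ["general", "prose", "dictionaries", "terminology",
                 "linking", "analysis", "fs", "corpus"]

if __name__ == "__main__":
    print(build_prolog(GENIA_TAGSETS))
```

Dropping a tagset then means removing one name from the list rather than editing the prolog by hand.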

Encoding Corpora

Given the multiplicity of annotations, encoding corpora is not a simple task. An overview is given in the presentation by L. Burnard, available at http://users.ox.ac.uk/~tei/Presentations/TEIcorpus/.

To address this issue, especially for corpora used in language engineering, the Corpus Encoding Specification (CES) was developed. It is much more focused than TEI, but is neither extensible nor parametrisable.

CES has become part of the recommendations of the Expert Advisory Group on Language Engineering Standards (EAGLES), as the "Recommendations on corpus encoding".

It should be noted that the successor of EAGLES is the ISLE project (International Standards for Language Engineering), which also includes a working group on Computational Lexicons. However, the work of the group seems to be focused on multilingual lexica and, at the time of writing, it has not yet produced any publicly available documents.


The (dis)advantages of using TEI

Advantages:

Disadvantages:

Possible solutions:
  1. Ignore TEI completely
    Fast development of own DTD, but "reinventing the wheel"; also, low interchange value.
  2. Manage with "out of the box" TEI-Lite encoding
    Used to be done a lot, but now parametrisation is much simpler, so not much sense in shoehorning project specifics into this particular subset of TEI.
  3. Parametrise TEI, and tighten up DTD as project proceeds
    The "proper" way to apply TEI; provides a good development environment. If necessary, when encoding is fixed, can develop strict DTD which validates our documents.
  4. Start with own DTD, develop mapping to TEI for better interchange
    The case in GENIA
For an "all bases covered" approach, cf. the paper by Gary F. Simons, "Using architectural processing to derive small, problem-specific XML applications from large widely used SGML applications". It offers a path to deriving a small, focused, project-specific, TEI-compatible DTD; the technology it proposes is no longer used very much, but the same effect could be achieved with XSLT.
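
Solution 4 largely amounts to renaming project-specific elements and attributes into their TEI counterparts. A minimal sketch of such a mapping follows, written in Python rather than XSLT. It assumes a GPML-like source in which terms are `<cons sem="...">` elements inside `<sentence>` elements; these source names and the mapping tables are assumptions made for illustration, not the actual GENIA conversion.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from project-specific names to TEI names.
ELEMENT_MAP = {"abstract": "div", "sentence": "s", "cons": "term"}
ATTRIBUTE_MAP = {"sem": "ana"}


def to_tei(elem):
    """Return a copy of elem with elements and attributes renamed."""
    new = ET.Element(ELEMENT_MAP.get(elem.tag, elem.tag))
    for key, value in elem.attrib.items():
        new.set(ATTRIBUTE_MAP.get(key, key), value)
    if elem.tag == "abstract":
        # TEI has no <abstract> element here, so record the role on the <div>.
        new.set("type", "abstract")
    new.text, new.tail = elem.text, elem.tail
    for child in elem:
        new.append(to_tei(child))
    return new


SOURCE = ('<abstract><sentence><cons sem="sem-004">morphogen</cons>'
          ' in vertebrate development.</sentence></abstract>')

tei = to_tei(ET.fromstring(SOURCE))
print(ET.tostring(tei, encoding="unicode"))
```

Keeping the mapping in two small tables makes it easy to review against the TEI documentation, and the same tables could later drive an XSLT stylesheet.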

TEI core and prose tagsets

The TEI Header

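The header consists of four parts: the file description (fileDesc), the encoding description (encodingDesc), the text profile (profileDesc) and the revision history (revisionDesc).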
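
Of the header parts, only the file description is mandatory, and it must contain a title statement, a publication statement and a source description. The Python sketch below checks a header for these parts. It is a rough structural check, not DTD validation; the function name `check_header` is hypothetical and all text in the sample header is placeholder content.

```python
import xml.etree.ElementTree as ET

# Paths that must be present, and optional top-level parts of the header.
REQUIRED = ["fileDesc", "fileDesc/titleStmt/title",
            "fileDesc/publicationStmt", "fileDesc/sourceDesc"]
OPTIONAL = ["encodingDesc", "profileDesc", "revisionDesc"]

# Placeholder header, for illustration only.
HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt><title>Sample annotated abstracts</title></titleStmt>
    <publicationStmt><p>Unpublished draft.</p></publicationStmt>
    <sourceDesc><p>Abstracts taken from MEDLINE.</p></sourceDesc>
  </fileDesc>
  <encodingDesc>
    <projectDesc><p>Terms are marked up and linked to a lexicon.</p></projectDesc>
  </encodingDesc>
</teiHeader>
"""


def check_header(header):
    """Return (missing required paths, optional parts that are present)."""
    missing = [path for path in REQUIRED if header.find(path) is None]
    present = [path for path in OPTIONAL if header.find(path) is not None]
    return missing, present


missing, present = check_header(ET.fromstring(HEADER))
print("missing:", missing)
print("optional parts present:", present)
```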

Text example

<body>
<div type="article">
  <head>Retinoic acid downmodulates erythroid differentiation and 
    <term ana="sem-000">GATA1 expression</term> in <term ana="sem-001">purified
    adult-progenitor culture</term>.
   </head>
   <bibl><xref>MEDLINE:94129004</xref></bibl>
   <div type="abstract">
     <p>
       <s><term ana="sem-002">All-trans retinoic acid</term> (<term
       ana="sem-003">RA</term>) is an important <term
       ana="sem-004">morphogen</term> in vertebrate development, a
       normal constituent in <term ana="sem-005">human adult
       blood</term> and is also involved in the control of cell growth
       and differentiation in <term ana="sem-006">acute promyelocytic
       leukemia</term>.</s>
...

TEI.ana: words, segments and clauses

The TEI module for simple linguistic analysis contains elements for arbitrary (possibly nested) segments of text, <seg>, as well as elements for words and clauses:
<div type="text">
  <p>
    <s>
      <cl ana="lex201 lex202" function="(AND lex201 lex202)">
        <term ana="lex203"><w ana="FW">hypo-</w></term>
        <w ana="CC">and</w>
        <term ana="lex204"><w ana="JJ">hyper</w></term>
        <term ana="lex205"><w ana="NN">cortisolism</w></term>
        <c>.</c>
      </cl>
    </s>
  </p>
</div>
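
An encoding of this kind is easy to flatten into a token table. The Python sketch below walks the sentence from the example above and reports, for each word or punctuation mark, its form, its tag, and the `ana` value of the enclosing `<term>`, if any. The function name `tokens` is hypothetical, and nested terms are not handled beyond taking the innermost one.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<s>
  <cl ana="lex201 lex202" function="(AND lex201 lex202)">
    <term ana="lex203"><w ana="FW">hypo-</w></term>
    <w ana="CC">and</w>
    <term ana="lex204"><w ana="JJ">hyper</w></term>
    <term ana="lex205"><w ana="NN">cortisolism</w></term>
    <c>.</c>
  </cl>
</s>
"""


def tokens(sentence):
    """Yield (form, tag, term_id) for every <w> and <c> in document order."""
    def walk(elem, term_id):
        for child in elem:
            if child.tag == "term":
                # Remember the lexicon pointer of the enclosing term.
                yield from walk(child, child.get("ana"))
            elif child.tag in ("w", "c"):
                yield child.text, child.get("ana"), term_id
            else:
                yield from walk(child, term_id)
    yield from walk(sentence, None)


rows = list(tokens(ET.fromstring(SAMPLE)))
for form, tag, term_id in rows:
    print(form, tag, term_id)
```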

Genia lexicon: TEI.dictionaries or TEI.terminology?

Note: if the current GPML model is used, where text and lexica co-exist in one document, then TEI.general must be used, which enables several base tagsets to be combined. Either tagset is problematic for encoding a lexical database or ontology:

TEI feature structures

Background: A rationale for the TEI recommendations for feature-structure markup, by D. Terence Langendoen and Gary F. Simons, Computers and the Humanities, 29, (1995).

The proposal is rather complicated and is composed of two parts:

  1. the TEI Feature Structures tagset (TEI.fs), an additional tagset for marking up the text with feature structures, and
  2. the TEI Feature Structure Declaration (FSD), with a special DTD, for defining feature names and values, their descriptions, and constraints on valid feature structures.

Disadvantages:

Below is an example of morphosyntactic tagging, from the ELAN corpus:
<seg id="ecmr.en.3663" corresp="ecmr.sl.3663">
<w ana="Dd" ctag="DT DT" lemma="the">The</w> 
<w ana="Afs" ctag="JJS JJS" lemma="high">highest</w> 
<w ana="Ncns" ctag="NN VB" lemma="pay">pay</w> 
<w ana="Ncns" ctag="NN NN" lemma="increase">increase</w> 
<w ana="Vais3s" ctag="VBD BEDZ" lemma="be">was</w> 
<w ana="Vmps" ctag="VBN VBN" lemma="record">recorded</w> 
<w ana="Sp" ctag="IN IN" lemma="in">in</w> 
<w ana="Ncns" ctag="NN NN" lemma="manufacture">manufacturing</w>
<c ctag=".">.</c>
</seg>                                                          

<fs type="Verb" id="Vmn"      select="en sl" feats="V1.m V2.n"></fs>
<fs type="Verb" id="Vmnp"     select="en" feats="V1.m V2.n V3.p"></fs>
<fs type="Verb" id="Vmp--dfp" select="sl" feats="V1.m V2.p V5.d V6.f V7.p"></fs>
<fs type="Verb" id="Vmp--dmp" select="sl" feats="V1.m V2.p V5.d V6.m V7.p"></fs>
<fs type="Verb" id="Vmp--dnp" select="sl" feats="V1.m V2.p V5.d V6.n V7.p"></fs>
<fs type="Verb" id="Vmp--pfp" select="sl" feats="V1.m V2.p V5.p V6.f V7.p"></fs>

<f select="bg cs en et hu ro sl" id="N1.p" name="Type"><sym value="proper"></f>
<f select="bg cs en ro sl" id="N2.m" name="Gender"><sym value="masculine"></f>
<f select="bg cs en et hu ro sl" id="N3.p" name="Number"><sym value="plural"></f>
<f select="cs hu sl" id="N4.a" name="Case"><sym value="accusative"></f>
<f select="cs hu sl" id="N4.d" name="Case"><sym value="dative"></f>
<f select="bg ro" id="N5.n" name="Definiteness"><sym value="no"></f>
<f select="bg ro" id="N5.y" name="Definiteness"><sym value="yes"></f>
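
The two libraries above work as lookup tables: the `ana` value of a word names an `<fs>`, whose `feats` attribute in turn names the `<f>` elements that spell out the attribute-value pairs. A minimal Python sketch of this expansion follows. The `<lib>` wrapper and the `Npmp` entry are assumptions, added so that the fragment is well-formed XML and uses only the noun features listed above; `load_library` and `expand` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# <lib> is only a wrapper; "Npmp" is an illustrative entry built from
# the noun features shown in the feature library above.
LIBRARY = """
<lib>
  <fs type="Noun" id="Npmp" feats="N1.p N2.m N3.p"/>
  <f id="N1.p" name="Type"><sym value="proper"/></f>
  <f id="N2.m" name="Gender"><sym value="masculine"/></f>
  <f id="N3.p" name="Number"><sym value="plural"/></f>
</lib>
"""


def load_library(xml_text):
    """Index the <fs> and <f> elements by their id attribute."""
    root = ET.fromstring(xml_text)
    features = {f.get("id"): (f.get("name"), f.find("sym").get("value"))
                for f in root.iter("f")}
    structures = {fs.get("id"): (fs.get("type"), fs.get("feats").split())
                  for fs in root.iter("fs")}
    return structures, features


def expand(tag, structures, features):
    """Resolve an `ana` value into its type and attribute-value pairs."""
    fs_type, feature_ids = structures[tag]
    return fs_type, dict(features[i] for i in feature_ids)


structures, features = load_library(LIBRARY)
print(expand("Npmp", structures, features))
```

The compact tag in the text thus stays short, while the full feature structure can be recovered whenever it is needed.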
Another example tries to implement a few definitions from HPSG, as given in the formalism of the Attribute Logic Engine (ALE):
<!DOCTYPE teiFsd2 PUBLIC "-//TEI P4//DTD Auxiliary Document Type: 
    Feature System Declaration//EN">

     <teiFsd2>
        <teiHeader>
           <!-- The header is as for any TEI.2 document -->
        </teiHeader>

<!-- ALE HPSG:
  bot sub [bool, case, cat, c_inds, conx,...].

  bool sub [minus, plus].
    minus sub [].
    plus sub [].
-->

        <fsDecl type='bool'  baseType='bot'></fsDecl> 
        <fsDecl type='plus'  baseType='bool'></fsDecl>
        <fsDecl type='minus' baseType='bool'></fsDecl>
<!-- The preceding is currently illegal! -->

<!-- ALE HPSG:
    subst sub [adj, noun, prep, reltvzr, verb]
          intro [prd:bool, 
                 mod:mod_synsem].
...
      verb sub [] 
           intro [aux:bool, 
                  inv:bool, 
                  vform:vform].
-->

        <fsDecl type='verb' baseType='subst'>
           <fsDescr>Type definition for verbs</fsDescr>
           <fDecl name='vform'>
            <sym value='vform'/>                     <!--option 1-->
           </fDecl>
           <fDecl name='aux'>
             <vRange><fs type='bool'></fs></vRange> <!--option 2-->
           </fDecl>
           <fDecl name='inv'>
             <vRange>
               <vAlt><plus/><minus/></vAlt>         <!--option 3-->
             </vRange>
           </fDecl>
        </fsDecl>
     </teiFsd2>
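
The `baseType` attributes encode the same type hierarchy as the ALE `sub` statements, so it can be read back from the declaration. The Python sketch below does this for the boolean types only. It reads nothing but the `type` and `baseType` attributes, and so sidesteps the question of whether such bare declarations are legal under the FSD DTD; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

FSD = """
<teiFsd2>
  <fsDecl type="bool" baseType="bot"/>
  <fsDecl type="plus" baseType="bool"/>
  <fsDecl type="minus" baseType="bool"/>
</teiFsd2>
"""


def base_types(fsd_root):
    """Map each declared type to its immediate base type."""
    return {d.get("type"): d.get("baseType") for d in fsd_root.iter("fsDecl")}


def ancestors(type_name, parents):
    """Return the chain of base types above type_name, nearest first."""
    chain = []
    while type_name in parents:
        type_name = parents[type_name]
        if type_name in chain:
            raise ValueError("cyclic baseType declarations")
        chain.append(type_name)
    return chain


def subsumes(general, specific, parents):
    """True if `general` is `specific` itself or one of its base types."""
    return general == specific or general in ancestors(specific, parents)


parents = base_types(ET.fromstring(FSD))
print(ancestors("plus", parents))
print(subsumes("bool", "minus", parents))
```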

Multiple hierarchies

The design of multiple hierarchies is an open issue for GENIA annotation, and a hot topic in the XML world.

Cf. the paper Implementing Concurrent Markup in XML.

One possibility is the use of stand-off markup, as advocated e.g. in CES and by LTG, which can be implemented using XLink.
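
A minimal sketch of the stand-off idea, in Python: the base text only identifies the tokens, and a separate annotation layer points at them, so that overlapping or discontinuous units need not nest. The `<spans>`, `<span>` and `<ptr>` elements are hypothetical and not taken from CES or any TEI tagset; only the `xlink:href` attribute is used here, whereas a full XLink encoding would need further attributes such as `xlink:type`.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xlink: prefix to the namespace URI.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Base text: tokens only, each with an identifier.
BASE = """
<s>
  <w id="w1">hypo-</w> <w id="w2">and</w>
  <w id="w3">hyper</w> <w id="w4">cortisolism</w>
</s>
"""

# Hypothetical stand-off layer: each <span> points at tokens of the base text.
TERMS = """
<spans xmlns:xlink="http://www.w3.org/1999/xlink" type="term">
  <span id="t1"><ptr xlink:href="#w1"/><ptr xlink:href="#w4"/></span>
  <span id="t2"><ptr xlink:href="#w3"/><ptr xlink:href="#w4"/></span>
</spans>
"""


def resolve(base_xml, layer_xml):
    """Return {span id: [token forms]} for a stand-off annotation layer."""
    forms = {w.get("id"): w.text for w in ET.fromstring(base_xml).iter("w")}
    result = {}
    for span in ET.fromstring(layer_xml).iter("span"):
        ids = [ptr.get(XLINK_HREF).lstrip("#") for ptr in span.iter("ptr")]
        result[span.get("id")] = [forms[i] for i in ids]
    return result


spans = resolve(BASE, TERMS)
print(spans)
```

The two terms share the token `w4`, and the first is discontinuous; neither fact can be expressed by nesting elements in a single hierarchy, but both are unproblematic once the annotation is kept apart from the text.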


XML and relatives

Links:
Tomaž Erjavec, 2002-02-12