TEI and GENIA

Talk given at Tsujii Lab, University of Tokyo

Tomaž Erjavec

Tuesday, 29 January 2002

This presentation is available at http://nl.ijs.si/et/talks/tei-genia/

Overview:

TEI: history and organisation
TEI in Asia
The architecture of the TEI Guidelines
The (dis)advantages of using TEI
TEI core tagset: text and header
TEI analysis: words, segments and clauses
Genia lexicon: TEI dictionaries or terminological databases?
TEI feature structures
Multiple hierarchies

TEI: history and organisation

The Text Encoding Initiative was established in 1987 under the joint sponsorship of the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing.
TEI became the only systematised attempt to develop a fully general text encoding model and set of encoding conventions based upon it, suitable for processing and analysis of any type of text, in any language, and intended to serve the increasing range of existing (and potential) applications and use.
TEI Guidelines for Electronic Text Encoding and Interchange (TEI P3) were first published in April 1994 in two substantial green volumes.
In May 1999, a revised edition of TEI P3 was produced, correcting several typographic and other errors. This Revised reprint is, at the time of writing, the official edition of TEI P3 (it is also called the `P4beta') and is available online in various formats.
In December 2000 the TEI Consortium was set up to maintain and develop the TEI standard. The Consortium has executive offices in Bergen, Norway, and hosts at four universities. The Consortium is managed by a Board of Directors, and its technical work is overseen by an elected Council. Lou Burnard is European editor and Syd Bauman is North American Editor. Institutions and individuals can become Consortium members or subscribers, which gives them certain benefits inside the consortium.
In June 2001, the TEI Consortium announced availability of a preliminary version of a major revision of TEI P3, the TEI P4, the object of which would be to provide equal support for XML and SGML applications using the TEI scheme. The revisions needed to make TEI P4 have been deliberately restricted to error correction only, with a view to ensuring that documents conforming to TEI P3 will not become illegal when processed with TEI P4.
Many possibilities for other, more fundamental, changes have been identified. With the establishment of the new TEI Council, it becomes possible to agree on a programme of work to enhance and modify the Guidelines more fundamentally over the coming years. TEI P5 will be the next full revision of the Guidelines. No date has yet been fixed for its appearance, but work on it will commence early in 2002.

TEI and Asia/Japan

While the TEI is an international effort, the nations in question are almost exclusively in North America and Europe. All 50 members of the Consortium are located there, as are all the members of the Board of Directors and the Council, with the exception of Christian Wittern of Kyoto University.

Of the 88 projects listed on the TEI pages as using the Guidelines, six are centred on Asian languages:

Chinese Buddhist Electronic Text Association
Contact: Christian Wittern, Kyoto University
Chinese/Japanese/Korean-English Dictionary
Contact: Charles Muller, Toyo Gakuen University, Chiba, Japan
The Digital Dictionary of Buddhism
Contact: Charles Muller, Toyo Gakuen University, Chiba, Japan
Japanese/English Bilingual Corpus
Contact: Francis Bond, NTT Communication Science Laboratories, Japan
Japanese Map Task Dialogue Corpus
Contact: Syun Tutiya, Chiba University, Japan
Japanese Text Initiative
Contact: Sachiko Iwabuchi, University of Virginia

Why isn't TEI used more in Asia? I see three possibilities:

There is, in general, less work of the kind where TEI would be helpful, i.e. fewer corpora are produced, and fewer (historical, classical) texts are being made available electronically. This could be due to differing funding priorities: in Europe there have been a number of EU project calls focused on language resources, while the US seems to fund many projects on digitisation of literary materials. However, at least some Asian countries seem not to lag in this respect. Korea, for example, has a number of projects aimed at building electronic text databases (Jin-Dong Kim, personal communication).
TEI is not known in the region. In the USA and Europe TEI has had significant exposure, which is rather natural as this is where it was conceived and where the work on it took place. TEI has also been featured at various, mostly corpus-oriented, events.
Different character sets used in Asian languages and/or English documentation and tag names make TEI unappealing in contrast to locally produced encodings. And even though TEI/SGML/XML are able to handle arbitrary character sets, significant expertise is needed to implement their input and display, while local software offers ready-made solutions.

The architecture of TEI Guidelines

The Guidelines consist of:

The 'syntax': the SGML/XML DTD set (download or bake with TEI Pizza Chef)
The 'semantics': the documentation (in TEI and derived formats)

The TEI encoding scheme consists of a number of modules ('tagsets') or DTD fragments.

The DTD fragments from which a specific TEI DTD is constructed are:

core DTD fragments

standard components of the TEI main DTD in all its forms; these are always included without any special action by the encoder;

The core consists of:

core text tags: These include paragraph, <p>, highlighting (<emph>, and rend attribute), quotation, <q>, names, numbers, dates, abbreviations, <term>, etc.
the TEI header: describes an encoded work so that the text itself, its source, its encoding, and its revisions are all thoroughly documented

base DTD fragments

basic building blocks for specific text types; exactly one base must be selected by the encoder (unless one of the combined bases is used);

Those interesting for GENIA are:

TEI.prose: the base tag set for prose
?TEI.dictionaries: the base tag set for print dictionaries
?TEI.terminology: the base tag set for terminological data files
?TEI.general: the generic mixed-mode base tag set

additional DTD fragments

extra tags useful for particular purposes. All additional tag sets are compatible with all bases and with each other; an encoder may therefore add them to the selected base in any combination desired.

Those interesting for GENIA are:

TEI.linking: tags for linking, segmentation, and alignment
TEI.analysis: tags for simple analytic mechanisms
?TEI.fs: tags for feature structure analysis
?TEI.nets: tags for graphs, digraphs, trees, and other networks
TEI.corpus: additional tags for language corpora

user defined DTD fragments

give the possibility of extending / modifying / localising the Guidelines

TEI Lite

A particular parametrisation of TEI (a DTD), which implements a useful `starter set', comprising the elements which almost every user should know about. Some characteristics of TEI Lite:

includes most of the TEI `core' tag set;
handles a reasonably wide variety of texts;
is useful for the production of new documents as well as encoding of existing ones;
is usable with a wide range of existing SGML software;
is derivable from the full TEI DTD using the extension mechanisms;
is as small and simple as is consistent with the other goals.

Parametrisation

Developing a version of GPML and XLiNo that is compatible with TEI guidelines?

A preliminary SGML prolog:

<!DOCTYPE TEI.2  SYSTEM "tei2.dtd"  [
  <!ENTITY % TEI.general "INCLUDE">
  <!ENTITY % TEI.prose "INCLUDE">
  <!ENTITY % TEI.dictionaries "INCLUDE">
  <!ENTITY % TEI.terminology "INCLUDE">
  <!ENTITY % TEI.linking "INCLUDE">
  <!ENTITY % TEI.analysis "INCLUDE">
  <!ENTITY % TEI.fs "INCLUDE">
  <!ENTITY % TEI.corpus "INCLUDE">
]>
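The prolog above switches tagsets on by declaring parameter entities as "INCLUDE". As a rough sketch of that mechanism, the Python function below assembles such a prolog from a list of base and additional tagsets. The function name `tei_prolog` and the simplified rule it applies (add TEI.general whenever more than one base is requested) are assumptions made for illustration; only the tagset names and the `tei2.dtd` system identifier come from the slides.

```python
# Base tagsets named in this talk; the real Guidelines define more.
BASES = {"TEI.prose", "TEI.dictionaries", "TEI.terminology"}
COMBINED_BASE = "TEI.general"  # needed when several bases are mixed


def tei_prolog(bases, additional, dtd="tei2.dtd"):
    """Return a DOCTYPE declaration that includes the given tagsets."""
    bases = list(bases)
    if not bases:
        raise ValueError("at least one base tagset must be selected")
    unknown = [b for b in bases if b not in BASES]
    if unknown:
        raise ValueError("not a known base tagset: " + ", ".join(unknown))

    names = []
    if len(bases) > 1:
        # Exactly one base is allowed unless a combined base is used.
        names.append(COMBINED_BASE)
    names.extend(bases)
    names.extend(additional)

    lines = ['<!DOCTYPE TEI.2 SYSTEM "' + dtd + '" [']
    lines += [f'  <!ENTITY % {name} "INCLUDE">' for name in names]
    lines.append("]>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(tei_prolog(["TEI.prose", "TEI.terminology"],
                     ["TEI.linking", "TEI.analysis"]))
```

Generating the prolog this way keeps the "one base, or a combined base" rule in one place while a project is still experimenting with its parametrisation.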

Encoding Corpora

Given the multiplicity of annotations, this is not a simple task. An overview is given in the presentation by L. Burnard, available at http://users.ox.ac.uk/~tei/Presentations/TEIcorpus/.

To address this issue, especially for corpora for language engineering, the CES (Corpus Encoding Specification) was developed. It is much more focused than TEI, but not extensible or parametrisable.

CES has become a part of the "Expert Advisory Group on Language Engineering Standards", EAGLES, as the "Recommendations on corpus encoding".

It should be noted that the successor of EAGLES is the project ISLE, International Standards for Language Engineering, which also contains a working group on Computational Lexicons. However, the work of the group seems to be focused on multilingual lexica, and, at the time of this writing, they have not yet produced any publicly available documents.

The (dis)advantages of using TEI

Advantages:

Using a wide-coverage, well-designed (modular and extensible), widely accepted and maintained architecture
TEI has extensive documentation: Guidelines as well as papers and documentation of various projects
Support: specific problems might have been encountered before (the tei-l public discussion list)
Various software already exists, and more is likely to become available
Contributing to open standards and recommendations

Disadvantages:

"Tag abuse":
TEI might not have elements / attributes with the exact meaning we require; results in a tendency to misuse tags for purposes they were not meant for
"Tag bloat":
being a general purpose recommendation, it can never be optimal for a specific application; a custom developed DTD will be leaner, with fewer (redundant) tags
"TEI for humanities"
maybe the least developed for "high level" NLP applications: is problematic for encoding ontologies and lexical databases, feature structures

Possible solutions:

Ignore TEI completely
fast development of own DTD, but "reinventing the wheel"; also, low interchange value.
Manage with "out of the box" TEI-Lite encoding
Used to be done a lot, but now parametrisation is much simpler, so not much sense in shoehorning project specifics into this particular subset of TEI.
Parametrise TEI, and tighten up DTD as project proceeds
The "proper" way to apply TEI; provides a good development environment. If necessary, when encoding is fixed, can develop strict DTD which validates our documents.
Start with own DTD, develop mapping to TEI for better interchange
The case in GENIA

For an "all bases covered" approach, cf. the paper by Gary F. Simons, Using architectural processing to derive small, problem-specific XML applications from large widely used SGML applications. It offers a path to derive a small, focused, project-specific, TEI-compatible DTD; however, the technology it proposes is no longer used very much, although the same effect could be achieved with XSLT.

TEI core and prose tagsets

The TEI Header

The header consists of four parts:

a file description, <fileDesc>, containing a full bibliographical description of the computer file itself, from which proper bibliographic citation can be derived. The file description includes information about the source or sources from which the electronic text was derived.
an encoding description, <encodingDesc>, which describes the relationship between an electronic text and its source or sources. It allows for detailed description of whether (or how) the text was normalised during transcription, how the encoder resolved ambiguities in the source, what levels of encoding or analysis were applied, and similar matters.
a text profile, <profileDesc>, containing classificatory and contextual information about the text, such as its subject matter, the situation in which it was produced, the individuals described by or participating in producing it, and so forth. Such a text profile is of particular use in highly structured composite texts such as corpora or language collections, where it is often highly desirable to enforce a controlled descriptive vocabulary or to perform retrievals from a body of text in terms of text type or origin.
a revision history, <revisionDesc>, which allows the encoder to provide a history of changes made during the development of the electronic text. The revision history is important for version control and for resolving questions about the history of a file.

An example of a header.
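The four-part structure described above can also be sketched programmatically. The Python fragment below builds a skeleton `<teiHeader>` with the standard library. The function name `make_header`, the placeholder text values and the particular child elements chosen inside each part are illustrative assumptions; the result is not validated against any TEI DTD.

```python
import xml.etree.ElementTree as ET


def make_header(title, source_note):
    """Build a skeleton TEI header with its four main parts."""
    header = ET.Element("teiHeader")

    # 1. File description: bibliographic description of the file and its source.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    publication = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication, "p").text = "Placeholder publication statement."
    source = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source, "p").text = source_note

    # 2. Encoding description: how the text relates to its source.
    encoding = ET.SubElement(header, "encodingDesc")
    ET.SubElement(encoding, "p").text = "Placeholder encoding notes."

    # 3. Text profile: classificatory and contextual information.
    profile = ET.SubElement(header, "profileDesc")
    lang_usage = ET.SubElement(profile, "langUsage")
    ET.SubElement(lang_usage, "language", id="en").text = "English"

    # 4. Revision history: changes made to the electronic text.
    revision = ET.SubElement(header, "revisionDesc")
    change = ET.SubElement(revision, "change")
    ET.SubElement(change, "date").text = "2002-01-29"
    ET.SubElement(change, "item").text = "Placeholder change note."

    return header


if __name__ == "__main__":
    hdr = make_header("Sample abstract collection", "Placeholder source note.")
    print(ET.tostring(hdr, encoding="unicode"))
```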

Text example

<body>
<div type="article">
  <head>Retinoic acid downmodulates erythroid differentiation and 
    <term ana="sem-000">GATA1 expression</term> in <term ana="sem-001">purified
    adult-progenitor culture</term>.
   </head>
   <bibl><xref>MEDLINE:94129004</xref></bibl>
   <div type="abstract">
     <p>
       <s><term ana="sem-002">All-trans retinoic acid</term> (<term
       ana="sem-003">RA</term>) is an important <term
       ana="sem-004">morphogen</term> in vertebrate development, a
       normal constituent in <term ana="sem-005">human adult
       blood</term> and is also involved in the control of cell growth
       and differentiation in <term ana="sem-006">acute promyelocytic
       leukemia</term>.</s>
...
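To show how such `<term>` markup might be consumed, the Python sketch below lists each term together with its `ana` pointer. Because the excerpt above is truncated, the sketch works on a shortened, closed copy of the first sentence; `SAMPLE` and the function name `terms_with_pointers` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the first sentence of the abstract.
SAMPLE = (
    '<div type="abstract"><p><s>'
    '<term ana="sem-002">All-trans retinoic acid</term> '
    '(<term ana="sem-003">RA</term>) is an important '
    '<term ana="sem-004">morphogen</term> in vertebrate development.'
    '</s></p></div>'
)


def terms_with_pointers(xml_text):
    """Return (ana pointer, term text) pairs in document order."""
    root = ET.fromstring(xml_text)
    pairs = []
    for term in root.iter("term"):
        # Collapse line breaks and runs of spaces inside the term.
        text = " ".join("".join(term.itertext()).split())
        pairs.append((term.get("ana"), text))
    return pairs


if __name__ == "__main__":
    for pointer, text in terms_with_pointers(SAMPLE):
        print(pointer, text)
```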

TEI.analysis: words, segments and clauses

The TEI module for simple linguistic analysis contains elements for arbitrary (possibly nested) segments of text, <seg>, as well as elements for words and clauses:

<div type="text">
  <p>
    <s>
      <cl ana="lex201 lex202" function="(AND lex201 lex202)">
        <term ana="lex203"><w ana="FW">hypo-</w></term>
        <w ana="CC">and</w>
        <term ana="lex204"><w ana="JJ">hyper</w></term>
        <term ana="lex205"><w ana="NN">cortisolism</w></term>
        <c>.</c>
      </cl>
    </s>
  </p>
</div>
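A short Python sketch of how this clause could be read back: it collects the word and punctuation tokens with their part-of-speech tags, and splits the multi-valued `ana` attribute on the clause. The constant `CLAUSE` repeats the `<cl>` element from the example; the function names are assumptions introduced here.

```python
import xml.etree.ElementTree as ET

CLAUSE = """<cl ana="lex201 lex202" function="(AND lex201 lex202)">
  <term ana="lex203"><w ana="FW">hypo-</w></term>
  <w ana="CC">and</w>
  <term ana="lex204"><w ana="JJ">hyper</w></term>
  <term ana="lex205"><w ana="NN">cortisolism</w></term>
  <c>.</c>
</cl>"""


def tagged_tokens(xml_text):
    """Return (token, tag) pairs for <w> and <c> elements in document order."""
    root = ET.fromstring(xml_text)
    return [(el.text, el.get("ana"))
            for el in root.iter()
            if el.tag in ("w", "c")]


def clause_pointers(xml_text):
    """Return the list of identifiers the clause's ana attribute points to."""
    return ET.fromstring(xml_text).get("ana", "").split()


if __name__ == "__main__":
    print(tagged_tokens(CLAUSE))
    print(clause_pointers(CLAUSE))
```

Punctuation marked with `<c>` carries no `ana` attribute in the example, so its tag comes back as `None`.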

Genia lexicon: TEI.dictionaries or TEI.terminology?

Note: if the current GPML model is used, where text and lexica co-exist in a document, then TEI.general must be used, which enables the combination of several base tagsets. Either is problematic for encoding a lexical database or ontology:

TEI.dictionaries is oriented towards printed dictionaries
TEI.terminology is closer to what is needed for GENIA, but:
Since its first publication, this chapter has been rendered obsolete in several respects, chiefly as a result of the publication of ISO 12200, and a variant of it (TBX) which has been recently adopted by LISA, the Localisation Industry Standard Association. Work is currently ongoing in the ISO community to define a generic platform for terminological markup (ISO CD 16642, TMF : Terminological Markup Framework), in the light of which it is anticipated that the recommendations of the present chapter will be substantially revised. Readers are cautioned in particular that the discussion below of `nested' and `flat' structures is now far removed from current practices in the terminological field. A major revision of this chapter is planned for the next edition of these Guidelines.

TEI feature structures

Background: A rationale for the TEI recommendations for feature-structure markup, by D. Terence Langendoen and Gary F. Simons, Computers and the Humanities, 29, (1995).

The proposal is rather complicated and is composed of two parts:

the TEI Feature Structures (TEI.fs) is an additional tagset for marking-up the text with feature structures, and
the TEI Feature Structure Declaration (FSD) with a special DTD, for defining feature values and names, their descriptions and constraints on valid feature structures.

Disadvantages:

Not many applications of these guidelines (any?)
Tailored towards GPSG rather than HPSG: limited support for a type hierarchy, no marking for reentrancy
SGML version uses (unsupported) SUB-DOC for linking the FSD with the annotated document
No mechanisms for checking validity of specified feature structures

Below is an example for morphosyntactic tagging, from ELAN corpus:

<seg id="ecmr.en.3663" corresp="ecmr.sl.3663">
<w ana="Dd" ctag="DT DT" lemma="the">The</w> 
<w ana="Afs" ctag="JJS JJS" lemma="high">highest</w> 
<w ana="Ncns" ctag="NN VB" lemma="pay">pay</w> 
<w ana="Ncns" ctag="NN NN" lemma="increase">increase</w> 
<w ana="Vais3s" ctag="VBD BEDZ" lemma="be">was</w> 
<w ana="Vmps" ctag="VBN VBN" lemma="record">recorded</w> 
<w ana="Sp" ctag="IN IN" lemma="in">in</w> 
<w ana="Ncns" ctag="NN NN" lemma="manufacture">manufacturing</w>
<c ctag=".">.</c>
</seg>                                                          

<fs type="Verb" id="Vmn"      select="en sl" feats="V1.m V2.n"></fs>
<fs type="Verb" id="Vmnp"     select="en" feats="V1.m V2.n V3.p"></fs>
<fs type="Verb" id="Vmp--dfp" select="sl" feats="V1.m V2.p V5.d V6.f V7.p"></fs>
<fs type="Verb" id="Vmp--dmp" select="sl" feats="V1.m V2.p V5.d V6.m V7.p"></fs>
<fs type="Verb" id="Vmp--dnp" select="sl" feats="V1.m V2.p V5.d V6.n V7.p"></fs>
<fs type="Verb" id="Vmp--pfp" select="sl" feats="V1.m V2.p V5.p V6.f V7.p"></fs>

<f select="bg cs en et hu ro sl" id="N1.p" name="Type"><sym value="proper"></f>
<f select="bg cs en ro sl" id="N2.m" name="Gender"><sym value="masculine"></f>
<f select="bg cs en et hu ro sl" id="N3.p" name="Number"><sym value="plural"></f>
<f select="cs hu sl" id="N4.a" name="Case"><sym value="accusative"></f>
<f select="cs hu sl" id="N4.d" name="Case"><sym value="dative"></f>
<f select="bg ro" id="N5.n" name="Definiteness"><sym value="no"></f>
<f select="bg ro" id="N5.y" name="Definiteness"><sym value="yes"></f>
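The indirection in this example (a word's `ana` points to an `<fs>`, whose `feats` attribute points to individual `<f>` elements) can be followed with a few lines of Python. The sketch below is only illustrative: the `<lib>` wrapper and the `<fs>` with id `Npmpa` are hypothetical, the `<f>` definitions are copied from the list above, and the SGML empty `<sym>` elements are written in XML form so that the standard parser accepts them.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini library; <lib> and the fs id "Npmpa" are invented here.
LIBRARY = """<lib>
  <fs type="Noun" id="Npmpa" feats="N1.p N2.m N3.p N4.a"/>
  <f id="N1.p" name="Type"><sym value="proper"/></f>
  <f id="N2.m" name="Gender"><sym value="masculine"/></f>
  <f id="N3.p" name="Number"><sym value="plural"/></f>
  <f id="N4.a" name="Case"><sym value="accusative"/></f>
</lib>"""


def expand(ana, library_xml):
    """Resolve an ana pointer to (type, {feature name: value})."""
    root = ET.fromstring(library_xml)
    # Index every feature definition by its id.
    features = {
        f.get("id"): (f.get("name"), f.find("sym").get("value"))
        for f in root.iter("f")
    }
    for fs in root.iter("fs"):
        if fs.get("id") == ana:
            pointers = fs.get("feats").split()
            return fs.get("type"), dict(features[p] for p in pointers)
    raise KeyError(ana)


if __name__ == "__main__":
    print(expand("Npmpa", LIBRARY))
```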

Another example tries to implement a few definitions from HPSG, as given in the formalism of the Attribute Logic Engine:

<!DOCTYPE teiFsd2 PUBLIC "-//TEI P4//DTD Auxiliary Document Type: 
    Feature System Declaration//EN">

     <teiFsd2>
        <teiHeader>
           <!-- The header is as for any TEI.2 document -->
        </teiHeader>

<!-- ALE HPSG:
  bot sub [bool, case, cat, c_inds, conx,...].

  bool sub [minus, plus].
    minus sub [].
    plus sub [].
-->

        <fsDecl type='bool'  baseType='bot'></fsDecl> 
        <fsDecl type='plus'  baseType='bool'></fsDecl>
        <fsDecl type='minus' baseType='bool'></fsDecl>
<!-- The preceding is currently illegal! -->

<!-- ALE HPSG:
    subst sub [adj, noun, prep, reltvzr, verb]
          intro [prd:bool, 
                 mod:mod_synsem].
...
      verb sub [] 
           intro [aux:bool, 
                  inv:bool, 
                  vform:vform].
-->

        <fsDecl type='verb' baseType='subst'>
           <fsDescr>Type definition for verbs</fsDescr>
           <fDecl name='vform'>
            <sym value='vform'/>                     <!--option 1-->
           </fDecl>
           <fDecl name='aux'>
             <vRange><fs type='bool'></fs></vRange> <!--option 2-->
           </fDecl>
           <fDecl name='inv'>
             <vRange>
               <vAlt><plus/><minus/></vAlt>         <!--option 3-->
             </vRange>
           </fDecl>
        </fsDecl>
     </teiFsd2>

Multiple hierarchies

Design of multiple hierarchies for GENIA annotation; a hot topic in the XML world.

Cf. the paper Implementing Concurrent Markup in XML.

One possibility: use of stand-off markup, as advocated in e.g. CES and by LTG, which can be implemented using XML XLink.
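The idea behind stand-off markup can be sketched in a few lines of Python: the base text only identifies its tokens, and a separate annotation layer points at them, so overlapping or discontinuous units do not have to nest. The element and attribute names (`annotations`, `span`, `targets`) are hypothetical, and plain ID references stand in for the XLink/XPointer machinery a real implementation would use.

```python
import xml.etree.ElementTree as ET

# Base document: tokens carry identifiers, nothing else.
BASE = (
    '<s id="s1"><w id="w1">hypo-</w> <w id="w2">and</w> '
    '<w id="w3">hyper</w> <w id="w4">cortisolism</w></s>'
)

# Separate annotation layer; the two terms share the token w4.
STANDOFF = """<annotations>
  <span type="term" targets="w1 w4"/>
  <span type="term" targets="w3 w4"/>
</annotations>"""


def resolve(base_xml, standoff_xml):
    """Return (type, [token texts]) for every stand-off span."""
    words = {w.get("id"): w.text
             for w in ET.fromstring(base_xml).iter("w")}
    resolved = []
    for span in ET.fromstring(standoff_xml).iter("span"):
        tokens = [words[target] for target in span.get("targets").split()]
        resolved.append((span.get("type"), tokens))
    return resolved


if __name__ == "__main__":
    for kind, tokens in resolve(BASE, STANDOFF):
        print(kind, tokens)
```

Here the coordinated terms share the head token "cortisolism", which a single inline hierarchy could not express without duplicating text.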

XML and relatives

Links:

XML: used in TEI P4
XSLT: useful for converting GPML to TEI and for corpus renderings/visualisations
XPath: used in XSLT
XPointer: for connecting GENIA resources

Tomaž Erjavec, 2002-02-12