The Text Encoding Initiative and the GENIA Corpus

Tomaž Erjavec
Department of Intelligent Systems
Jožef Stefan Institute
Ljubljana
Slovenia
January - July 2002:
Tsujii Laboratory
Department of Information Science
University of Tokyo
Tokyo

Talk given at

National Institute of Informatics

Tokyo

June 26, 2002

These slides can be found at
http://nl.ijs.si/et/talks/nii02/ and http://www-tsujii.is.s.u-tokyo.ac.jp/~et/talk-nii/

Abstract

The talk first introduces the Text Encoding Initiative, an international effort established in 1987 under the joint sponsorship of the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. TEI is the only systematised attempt to develop a fully general text encoding model and set of encoding conventions based upon it. It is suitable for processing and analysis of any type of text, in any language, and intended to serve the increasing range of existing (and potential) applications and uses.
We overview the history and organisation of the TEI and introduce its main achievement, the TEI Guidelines, a set of recommendations for text encoding based on SGML and, recently, on XML. We explain the modular and parametrisable architecture of the Guidelines and give some advantages and disadvantages of using TEI. Projects using TEI are mentioned, with special emphasis on those dealing with Asian languages.
The second part of the talk discusses the GENIA corpus, which is being compiled at the Tsujii Laboratory, Department of Information Science, University of Tokyo. The corpus consists of annotated abstracts taken from the National Library of Medicine's MEDLINE database. We discuss the encoding of the corpus in the GENIA markup language GPML, and its TEI incarnation, automatically derived via XSLT. Special emphasis is placed on developing a TEI parametrisation suitable for encoding the growing body of biomedical resources. The talk concludes with recent developments of the GENIA corpus and plans for the future.


Overview

  1. The Text Encoding Initiative
  2. Encoding the GENIA corpus in TEI
  3. Further annotation of GENIA with XML tools

TEI History: Establishment and Motivations

TEI History: Basics and First Drafts

TEI Guidelines: P3

What is XML?

TEI Guidelines: P4 (& P5)

Breaking news:
  • TEI P4, a major revision of TEI P3, was published on the WWW in early 2002.
  • In June 2002 P4 was also published in print, in two beautiful blue volumes. The press release is also available in Japanese.
  • TEI P4 addresses the following issues:
    • error correction, while maintaining backward compatibility;
    • support for XML (as well as SGML).
The future:
  • Many possibilities for other, more fundamental, changes have been identified, e.g. the utilisation of XLink & XPointer, XML Schemas, etc.
  • TEI P5 will be the next full revision of the Guidelines. No date has yet been fixed for its appearance.

The TEI Consortium

Projects Using the TEI

There are 88 listed and 75 “live” projects that have used the TEI, as given on the TEI project page. Some examples, with last significant update of the entry:

TEI in Japan

Six listed projects are centered on Asian languages:
  1. Chinese/Japanese/Korean-English Dictionary
    Contact: Charles Muller, Toyo Gakuen University, Chiba
  2. The Digital Dictionary of Buddhism
    Contact: Charles Muller, Toyo Gakuen University, Chiba
  3. Chinese Buddhist Electronic Text Association
    Contact: Christian Wittern, Kyoto University
  4. Japanese Text Initiative
    Contact: Sachiko Iwabuchi, University of Virginia
  5. (Japanese/English Bilingual Corpus)
    Contact: Francis Bond, NTT Communication Science Laboratories
  6. (Japanese Map Task Dialogue Corpus)
    Contact: Syun Tutiya, Chiba University

Why not more?

Why isn't TEI used more in Asia? I see three possibilities:
  • There is, in general, less work of the kind where TEI would be helpful, i.e. fewer corpora are produced, and fewer (historical, classical) texts are being made available electronically. Maybe this is due to differing funding priorities? In Europe, there have been a number of EU project calls focused on language resources, while the US funds many projects on digitisation of literary materials.
  • TEI is not known in the region. In the USA and Europe TEI has had significant exposure, which is rather natural as this is where it was conceived and where the work on it took place. TEI has also been featured at various - mostly corpus oriented - events.
  • Different character sets used in Asian languages and/or English documentation and tag names make TEI unappealing in contrast to locally produced encodings. This is more of a wild guess - after all, TEI/SGML is able to handle arbitrary character sets, and probably any interchange format will have its documentation in English...

The TEI Guidelines

Structure of the TEI DTD

The formal SGML/XML part of TEI comes as a set of DTD fragments or tag sets. A TEI DTD for a particular application is constructed by selecting an appropriate combination of such tag sets, which include:
core tag sets
standard components of the TEI main DTD in all its forms; these are always included without any special action by the encoder.
base tag sets
basic building blocks for specific text types; exactly one base must be selected by the encoder, unless one of the ‘combined’ bases is used.
additional tag sets
extra tags useful for particular purposes. These tag sets are compatible with all bases and with each other; an encoder may therefore add them to the selected base in any combination desired.
user defined tag sets
these extra tags give the possibility of extending and overriding the definitions provided in the TEI tag set.
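
The selection mechanism can be sketched in a few lines of Python. This is only an illustration: the helper name build_doctype and its parameters are made up for this sketch, the base names are those of the TEI base tag sets, and the rule enforced is the one stated above (exactly one base, unless a combined base is used).

```python
# Hypothetical helper: assemble a TEI P4 DOCTYPE from selected tag sets.
# Function and variable names are illustrative, not part of TEI.
SIMPLE_BASES = {"TEI.prose", "TEI.verse", "TEI.drama", "TEI.spoken",
                "TEI.dictionaries", "TEI.terminology"}
COMBINED_BASES = {"TEI.general", "TEI.mixed"}
PUBLIC_ID = "-//TEI P4//DTD Main Document Type//EN"
SYSTEM_ID = "http://www.tei-c.org/P4X/DTD/tei2.dtd"


def build_doctype(bases, additional=(), root="TEI.2"):
    """Return a DOCTYPE declaration that switches on the given tag sets."""
    bases = list(bases)
    unknown = [b for b in bases if b not in SIMPLE_BASES | COMBINED_BASES]
    if unknown:
        raise ValueError(f"unknown base tag set(s): {unknown}")
    combined = [b for b in bases if b in COMBINED_BASES]
    simple = [b for b in bases if b in SIMPLE_BASES]
    if len(combined) > 1:
        raise ValueError("at most one combined base may be selected")
    # Exactly one base, unless a combined base is used.
    if not combined and len(simple) != 1:
        raise ValueError("exactly one base is required without a combined base")
    # Core tag sets need no entry: they are always included.
    names = ["TEI.XML"] + combined + simple + list(additional)
    entities = "\n".join(f'  <!ENTITY % {n} "INCLUDE">' for n in names)
    return (f'<!DOCTYPE {root} PUBLIC "{PUBLIC_ID}"\n'
            f'    "{SYSTEM_ID}" [\n{entities}\n]>')


print(build_doctype(["TEI.prose"], ["TEI.linking", "TEI.analysis"]))
```

User defined tag sets are left out of the sketch; they would be added as further entity declarations pointing to local extension files.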

The Core Tagset

The core tagset, which is always available, consists of:
Core tags
Used in the text; for the most part in-line elements with no consistent internal structure, e.g. highlighting (<emph>), quotation (<q>), names (<name>), etc.
TEI header
Describes an encoded work so that the text itself, its source, its encoding, and its revisions are all thoroughly documented.

Base Tagsets

Only one base can be chosen, unless a combined base is also selected:
TEI.prose
the base tag set for prose
TEI.verse
the base tag set for verse
TEI.drama
the base tag set for drama
TEI.spoken
the base tag set for transcriptions of spoken texts
TEI.dictionaries
the base tag set for print dictionaries
TEI.terminology
the base tag set for terminological data files
TEI.general
the generic mixed-mode base tag set
TEI.mixed
the base tag set for free mixed-mode texts

Additional Tagsets

These tagsets represent additional interpretations of text, and an arbitrary number can be chosen:
TEI.linking
tags for linking, segmentation, and alignment
TEI.analysis
tags for simple analytic mechanisms
TEI.fs
tags for feature structure analysis
TEI.certainty
tags for indicating uncertainty and probability in the markup
TEI.transcr
tags for manuscripts, analytic bibliography, and transcription of primary sources
TEI.textcrit
tags for critical editions
TEI.names.dates
specialized tags for names and dates
TEI.nets
tags for graphs, digraphs, trees, and other networks
TEI.figures
tags for graphics, figures, illustrations, tables, and formulae
TEI.corpus
additional tags for language corpora

Examples of TEI Use

A newspaper story:

<div1 type="story">
  <head rend="large underlined" type="sub">
    President pledges safeguards for 2,400 British troops 
in Bosnia
  </head>
  <head rend="very large bold" type="main">
    Major agrees to enforced no-fly zone
  </head>
  <byline>
    By George Jones, Political Editor, in Washington
  </byline>
  <p>
    Greater Western intervention in the conflict in
former Yugoslavia was pledged by President Bush ...
  </p>
</div1>
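
Markup of this kind is easy to query. A minimal sketch, assuming Python's standard xml.etree.ElementTree and a shortened copy of the story above (the function name headlines is made up for the illustration):

```python
import xml.etree.ElementTree as ET

# Shortened copy of the newspaper story, without the body paragraph.
STORY = """<div1 type="story">
  <head rend="large underlined" type="sub">
    President pledges safeguards for 2,400 British troops
in Bosnia
  </head>
  <head rend="very large bold" type="main">
    Major agrees to enforced no-fly zone
  </head>
  <byline>
    By George Jones, Political Editor, in Washington
  </byline>
</div1>"""


def headlines(xml_text):
    """Map each <head> type ("main", "sub") to its whitespace-normalised text."""
    div = ET.fromstring(xml_text)
    return {h.get("type"): " ".join(h.text.split()) for h in div.findall("head")}


print(headlines(STORY)["main"])
```

The type attribute, not the typographic rend attribute, is what lets a program pick out the main headline.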

TEI.analysis Example


<seg id="orwl.en.24" corresp="orwl.sl.24">
  <s id="Oen.1.1.4.5">
    <q>
    <w ana="Af" lemma="big">Big</w> 
    <w ana="Ncms" lemma="brother">Brother</w> 
    <w ana="Vaip3s" lemma="be">is</w> 
    <w ana="Vmpp" lemma="watch">watching</w> 
    <w ana="Pp2" lemma="you">you</w>
    </q>
    <w ana="Dd" lemma="the">the</w> 
    <w ana="Ncns" lemma="caption">caption</w> 
    <w ana="Vmis" lemma="say">said</w>
    <w ana="Cs" lemma="while">while</w> 
    <w ana="Dd" lemma="the">the</w> 
    <w ana="Af" lemma="dark">dark</w> 
    <w ana="Ncnp" lemma="eye">eyes</w> 
    <w ana="Vmis" lemma="look">looked</w> 
    <w ana="Rmp" lemma="deep">deep</w> 
    <w ana="Sp" lemma="into">into</w> 
    <w ana="Np" lemma="winston">Winston</w>
    <w type="rsplit" ana="St" lemma="'s">'s</w> 
    <w ana="Ps3" lemma="own">own</w>
    <c ctag=".">.</c>
  </s>
</seg>
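
With word-level markup like this, extracting (word form, lemma, tag) triples takes a few lines. A sketch using Python's standard library on a shortened sentence; the function name tokens is an assumption of the example:

```python
import xml.etree.ElementTree as ET

# First words of the sentence above, plus the closing punctuation.
SENTENCE = """<s id="Oen.1.1.4.5">
  <q>
    <w ana="Af" lemma="big">Big</w>
    <w ana="Ncms" lemma="brother">Brother</w>
    <w ana="Vaip3s" lemma="be">is</w>
    <w ana="Vmpp" lemma="watch">watching</w>
    <w ana="Pp2" lemma="you">you</w>
  </q>
  <c ctag=".">.</c>
</s>"""


def tokens(xml_text):
    """Return (form, lemma, ana) for every <w>, however deeply nested."""
    s = ET.fromstring(xml_text)
    return [(w.text, w.get("lemma"), w.get("ana")) for w in s.iter("w")]


for form, lemma, ana in tokens(SENTENCE):
    print(form, lemma, ana, sep="\t")
```

Punctuation is skipped here because it is marked with <c> rather than <w>.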

TEI.fs Example


<fsLib>
  <fs type="Noun" id="Ncfda" feats="N1.c N2.f N3.d N4.a"/>
  <fs type="Noun" id="Ncfdd" feats="N1.c N2.f N3.d N4.d"/>
  <fs type="Noun" id="Ncfdg" feats="N1.c N2.f N3.d N4.g"/>
  ...
</fsLib>

<fLib>
  <f id="N1.c"  select="en ro sl cs bg et hu hr" name="Type">
    <sym value="common"/>
  </f>
  <f id="N1.p"  select="en ro sl cs bg et hu hr" name="Type">
    <sym value="proper"/>
  </f>
  <f id="N2.m"  select="en ro sl cs bg       hr" name="Gender">
    <sym value="masculine"/>
  </f>
  <f id="N2.f"  select="en ro sl cs bg       hr" name="Gender">
    <sym value="feminine"/>
  </f>
  <f id="N2.n"  select="en ro sl cs bg       hr" name="Gender">
    <sym value="neuter"/>
  </f>
  ...
</fLib>
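
The feats attribute of each <fs> points to features defined in the feature library. A minimal sketch of resolving these pointers, assuming Python's standard library; the wrapper element <libs> and the function name expand are made up for the example, and only features that appear in the excerpt above are used:

```python
import xml.etree.ElementTree as ET

# <libs> is a hypothetical wrapper so that both libraries parse as one document.
LIBS = """<libs>
<fsLib>
  <fs type="Noun" id="Ncf" feats="N1.c N2.f"/>
</fsLib>
<fLib>
  <f id="N1.c" name="Type"><sym value="common"/></f>
  <f id="N1.p" name="Type"><sym value="proper"/></f>
  <f id="N2.f" name="Gender"><sym value="feminine"/></f>
</fLib>
</libs>"""


def expand(xml_text, fs_id):
    """Resolve the feats pointers of one <fs> into {feature name: value}."""
    root = ET.fromstring(xml_text)
    feats = {f.get("id"): (f.get("name"), f.find("sym").get("value"))
             for f in root.iter("f")}
    for fs in root.iter("fs"):
        if fs.get("id") == fs_id:
            return dict(feats[ref] for ref in fs.get("feats").split())
    raise KeyError(fs_id)


print(expand(LIBS, "Ncf"))
```

The id "Ncf" is invented for the shortened example; the real entries carry two further features whose definitions are elided above.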

TEI Lite

TEI Lite is a particular parametrisation of TEI (a DTD), which implements a useful “starter set”, comprising the elements which almost every user should know about.
Some characteristics of TEI Lite:
  • includes most of the TEI “core” tag set, since this contains elements relevant to virtually all text types and all kinds of text-processing work;
  • handles a reasonably wide variety of texts, at the level of detail found in existing practice;
  • is useful for the production of new documents as well as encoding of existing ones;
  • is as small and simple as is consistent with the other goals;
  • has been translated into a number of languages, among them Japanese.

The Advantages of Using TEI

Quality
Using a wide-coverage, well-designed (modular and extensible), widely accepted and maintained architecture.
Documentation
TEI is documented in the Guidelines as well as papers and support documentation of various projects, e.g. “best practice guides”.
Training
Tutorials on TEI are given at various conferences and are actively promoted by the TEI Consortium.
Support
Specific problems might have been encountered before, and people exist that know how to solve them, e.g. on the tei-l public discussion list.
Software
Various software to process TEI already exists, and more is likely to become available.
Political correctness
Using TEI means contributing to open standards and recommendations.

The Disadvantages of Using TEI

Tag abuse
TEI might not have elements / attributes with the exact meaning we require; this results in a tendency to misuse tags for purposes they were not meant for.
Tag bloat
Being a general purpose recommendation, TEI can never be optimal for a specific application, i.e. a custom developed DTD will be leaner and have fewer (redundant) tags.
TEI for humanities
TEI is perhaps least developed for “high level” NLP applications: it is problematic for encoding ontologies and lexical databases, and there are not many applications of feature structures.

The GENIA Project

The GENIA Corpus

GPML and TEI

Implementing the Conversion

The TEI.GENIA DTD

<!DOCTYPE teiCorpus.2
    PUBLIC "-//TEI P4//DTD Main Document Type//EN" 
           "http://www.tei-c.org/P4X/DTD/tei2.dtd" [ 
  <!ENTITY % TEI.XML          "INCLUDE" >
  <!ENTITY % TEI.general      "INCLUDE">
  <!ENTITY % TEI.prose        "INCLUDE">
  <!ENTITY % TEI.dictionaries "INCLUDE">
  <!ENTITY % TEI.terminology  "INCLUDE">
  <!ENTITY % TEI.linking      "INCLUDE">
  <!ENTITY % TEI.analysis     "INCLUDE">
  <!ENTITY % TEI.fs           "INCLUDE">
  <!ENTITY % TEI.corpus       "INCLUDE">
  <!ENTITY % TEI.extensions.ent SYSTEM 'geniaex.ent'>
  <!ENTITY % TEI.extensions.dtd SYSTEM 'geniaex.dtd'>
]>

Overall Corpus Structure

<!DOCTYPE teiCorpus.2 SYSTEM "genia-tei.dtd">

<teiCorpus.2>
  <teiHeader type="corpus">*Corpus_header*</teiHeader>
  <TEI.2 id="*MEDLINE_ID*">
    <teiHeader type="text">*Article_header*</teiHeader>
    <text>
      <body>
        <div type="abstract">
           <head>*Title_of_article*</head>
           <p>*Abstract_of_article*</p>
        </div>
        <div type="ontology">*Local_ontology*</div>
        <div type="lexicon">*Local_lexicon*</div>
      </body>
    </text>
  </TEI.2>
  *More_articles*
</teiCorpus.2>
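
Given this structure, walking the corpus article by article is straightforward. A sketch with Python's standard library; the ids "M1" and "M2", the titles, and the function name abstracts are placeholders invented for the example, and the root element is spelled teiCorpus.2 as in the DOCTYPE declaration:

```python
import xml.etree.ElementTree as ET

# Toy corpus with two articles; ids and texts are placeholders.
CORPUS = """<teiCorpus.2>
  <teiHeader type="corpus"/>
  <TEI.2 id="M1">
    <teiHeader type="text"/>
    <text><body>
      <div type="abstract"><head>First title</head><p>First abstract.</p></div>
      <div type="ontology"/>
      <div type="lexicon"/>
    </body></text>
  </TEI.2>
  <TEI.2 id="M2">
    <teiHeader type="text"/>
    <text><body>
      <div type="abstract"><head>Second title</head><p>Second abstract.</p></div>
    </body></text>
  </TEI.2>
</teiCorpus.2>"""


def abstracts(xml_text):
    """Yield (id, title, abstract text) for each article in the corpus."""
    corpus = ET.fromstring(xml_text)
    for doc in corpus.findall("TEI.2"):
        div = doc.find("text/body/div[@type='abstract']")
        yield doc.get("id"), div.findtext("head"), div.findtext("p")


for doc_id, title, _ in abstracts(CORPUS):
    print(doc_id, title)
```

The same path expression with type='ontology' or type='lexicon' would select the other two divisions of an article.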

The TEI Header

To illustrate the information contained in the TEI header, we give below a part of the Encoding Description in the GENIA corpus header:
<encodingDesc>
  <projectDesc>
    <p>The GENIA project seeks to automatically extract ...</p>
  </projectDesc>
  <samplingDecl>
    <p>The corpus consists of abstracts found by ...</p>
  </samplingDecl>
  <tagsDecl>
    <tagUsage gi="body" occurs="670"></tagUsage>
    <tagUsage gi="cl" occurs="491"></tagUsage>
    <tagUsage gi="div" occurs="2010"></tagUsage>
    <tagUsage gi="entry" occurs="19472"></tagUsage>
    <tagUsage gi="form" occurs="19472"></tagUsage>
    <tagUsage gi="head" occurs="670"></tagUsage>
    <tagUsage gi="p" occurs="670"></tagUsage>
    <tagUsage gi="ptr" occurs="27305"></tagUsage>
    <tagUsage gi="s" occurs="5109"></tagUsage>
    <tagUsage gi="term" occurs="48906"></tagUsage>
    <tagUsage gi="termEntry" occurs="22707"></tagUsage>
    <tagUsage gi="tig" occurs="22707"></tagUsage>
    <tagUsage gi="xptr" occurs="14874"></tagUsage>
    <tagUsage gi="xr" occurs="19472"></tagUsage>
  </tagsDecl>
</encodingDesc>
<profileDesc>
  <langUsage>
    <language id="en">English</language>
    <language id="la">Latin</language>
  </langUsage>
</profileDesc>
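
A tag usage declaration like the one above need not be written by hand. The following sketch shows one way such counts could be produced with Python's standard library; it is not the procedure actually used for GENIA, and the function name tags_decl and the sample document are assumptions of the example:

```python
from collections import Counter
import xml.etree.ElementTree as ET

SAMPLE = "<body><div><head>Title</head><p>One.</p><p>Two.</p></div></body>"


def tags_decl(xml_text):
    """Count element names in a document and return a <tagsDecl> element."""
    root = ET.fromstring(xml_text)
    counts = Counter(el.tag for el in root.iter())  # includes the root itself
    decl = ET.Element("tagsDecl")
    for gi in sorted(counts):
        ET.SubElement(decl, "tagUsage", gi=gi, occurs=str(counts[gi]))
    return decl


for usage in tags_decl(SAMPLE):
    print(usage.get("gi"), usage.get("occurs"))
```

Sorting by element name gives the same alphabetical order as in the header excerpt.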

Text Annotation

<div type="abstract">
  <head>Retinoic acid downmodulates erythroid differentiation and 
    <term ana="SEM-94.000">GATA1 expression</term> in 
    <term ana="SEM-94.001">purified adult-progenitor culture</term>.
  </head>
  <p>
    <s>
      In 
      <cl ana="SEM-94.011 SEM-94.012" 
          function="(OR SEM-94.011 SEM-94.012)">
        <term ana="SEM-94.013">clonogenetic fetal calf serum</term>
        <term ana="SEM-94.014">-supplemented (FCS+)</term> 
        or 
        <term ana="SEM-94.015">-nonsupplemented (FCS-)</term> 
        <term ana="SEM-94.016">culture</term>
      </cl> 
      treated with saturating levels of 
      <term ana="SEM-94.018">interleukin-3</term> 
      (<term ana="SEM-94.019">IL-3</term>) 
      <term ana="SEM-94.020">granulocyte- macrophage 
         colony-stimulating factor</term> ...
    </s> 
    ...
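
The term annotation can be read back as pairs of pointer and term. A minimal sketch on the title of the abstract above, assuming Python's standard library; the function name terms is made up for the example:

```python
import xml.etree.ElementTree as ET

HEAD = """<head>Retinoic acid downmodulates erythroid differentiation and
    <term ana="SEM-94.000">GATA1 expression</term> in
    <term ana="SEM-94.001">purified adult-progenitor culture</term>.
</head>"""


def terms(xml_text):
    """Return (ana pointer, whitespace-normalised text) for every <term>."""
    root = ET.fromstring(xml_text)
    return [(t.get("ana"), " ".join("".join(t.itertext()).split()))
            for t in root.iter("term")]


for ana, text in terms(HEAD):
    print(ana, text, sep="\t")
```

itertext() is used so that terms containing further markup, or broken over several lines, still come out as one string.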

Further Annotation

Some next useful steps in annotation:
  • tokenisation into words and punctuation marks
  • part-of-speech tagging
  • annotation of named entities: abbreviations, names, chemical formulas, numbers, ...
  • identification of nominal compounds
  • shallow parsing

LTG XML Tools

Processing OHSUMED

The LTG XML tools have already been used to process a corpus of MEDLINE abstracts:

Using LTG XML Tools on GENIA

The LTG tools and the rulesets they have developed for processing OHSUMED (cf. HTML example) are currently being adapted for processing the GENIA corpus:
<SENTENCE>
<W C='W' P='DT'  C2='DD'>Some</W> 
<W C='W' P='VBN' C2='VVN' LM='convert'>converted</W>
<W C='W' P='IN'  C2='II'>from</W>
<W C='W' P='JJ'  C2='JJ'>ventricular</W>
<W C='W' P='NN'  C2='NN1' LM='fibrillation' VSTEM='fibrillate'>fibrillation</W>
<W C='W' P='TO'  C2='II'>to</W>
<W C='W' P='JJ'  C2='JJ' VSTEM='organize'>organized</W>
<W C='W' P='NNS' C2='NN2' LM='rhythm'>rhythms</W>
<W C='W' P='IN'  C2='II'>by</W>
<W C='HYW' P='JJ'>defibrillation-trained</W>
<W C='W' P='NN' C2='NN1' LM='ambulance'>ambulance</W>
<W C='W' P='NNS' C2='NN2' LM='technician'>technicians</W>
<PHR C='BR'>
  <W C='BR' P='(' C2='('>(</W>
  <W C='ABBR' P='NNS' C2='NP1'>EMT-Ds</W>
  <W C='BR' P=')' C2=')'>)</W>
</PHR>
<W C='W' P='MD' C2='VM' LM='will'>will</W>
<W C='W' P='VB' C2='VV0' LM='refibrillate'>refibrillate</W>
<W C='W' P='IN' C2='II'>before</W>
<W C='W' P='NN' C2='NN1' LM='hospital'>hospital</W>
<W C='W' P='NN' C2='NN1' LM='arrival' VSTEM='arrive'>arrival</W>
<W C='.' P='.'  C2='.'>.</W>
</SENTENCE>
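
Output of this kind could later be brought into the TEI encoding. The sketch below shows one conceivable mapping, written with Python's standard library; it is an assumption made for illustration, not the conversion used in the project. The P attribute is copied literally into ana and LM into lemma, whereas in a validating TEI document ana would have to point to the id of an interpretation element. <PHR> grouping is flattened and the remaining attributes are dropped.

```python
import xml.etree.ElementTree as ET

# A few tokens from the sentence above.
LTG = """<SENTENCE>
<W C='W' P='DT'  C2='DD'>Some</W>
<W C='W' P='VBN' C2='VVN' LM='convert'>converted</W>
<PHR C='BR'>
  <W C='ABBR' P='NNS' C2='NP1'>EMT-Ds</W>
</PHR>
<W C='.' P='.'  C2='.'>.</W>
</SENTENCE>"""


def ltg_to_tei(xml_text):
    """Map LTG <W> tokens to TEI-style <w> elements inside one <s>."""
    src = ET.fromstring(xml_text)
    s = ET.Element("s")
    for tok in src.iter("W"):          # iter() also reaches <W> inside <PHR>
        w = ET.SubElement(s, "w", ana=tok.get("P"))
        if tok.get("LM"):              # lemma only where LTG supplies one
            w.set("lemma", tok.get("LM"))
        w.text = tok.text
    return s


print(ET.tostring(ltg_to_tei(LTG), encoding="unicode"))
```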

Conclusions