The Text Encoding Initiative and the GENIA Corpus

Tomaž Erjavec
Department of Intelligent Systems
Jožef Stefan Institute
Ljubljana
Slovenia

January - July 2002:
Tsujii Laboratory
Department of Information Science
University of Tokyo
Tokyo

Talk given at

National Institute of Informatics

Tokyo

June 26, 2002

These slides can be found at
http://nl.ijs.si/et/talks/nii02/ and http://www-tsujii.is.s.u-tokyo.ac.jp/~et/talk-nii/

Abstract

The talk first introduces the Text Encoding Initiative, an international effort established in 1987 under the joint sponsorship of the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. TEI is the only systematised attempt to develop a fully general text encoding model and set of encoding conventions based upon it. It is suitable for processing and analysis of any type of text, in any language, and intended to serve the increasing range of existing (and potential) applications and use.

We overview the history and organisation of the TEI and introduce its main achievement, the TEI Guidelines, a set of recommendations for text encoding based on SGML and, recently, on XML. We explain the modular and parametrisable architecture of the Guidelines and give some advantages and disadvantages of using TEI. Projects using TEI are mentioned, with special emphasis on those dealing with Asian languages.

The second part of the talk discusses the GENIA corpus, which is being compiled at the Tsujii Laboratory, Department of Information Science, University of Tokyo. The corpus consists of annotated abstracts taken from the National Library of Medicine's MEDLINE database. We discuss the encoding of the corpus in the GENIA markup language GPML, and its TEI incarnation, automatically derived via XSLT. Special emphasis is placed on developing a TEI parametrisation suitable for encoding the growing body of biomedical resources. The talk concludes with recent developments of the GENIA corpus and plans for the future.

1. Overview
2. TEI History: Establishment and Motivations
3. TEI History: Basics and First Drafts
4. TEI Guidelines: P3
5. What is XML?
6. TEI Guidelines: P4 (& P5)
7. The TEI Consortium
8. Projects Using the TEI
9. TEI in Japan
10. Why not more?
11. The TEI Guidelines
12. Structure of the TEI DTD
13. The Core Tagset
14. Base Tagsets
15. Additional Tagsets
16. Examples of TEI Use
17. TEI.analysis Example
18. TEI.fs Example
19. TEI Lite
20. The Advantages of Using TEI
21. The Disadvantages of Using TEI
22. The GENIA Project
23. The GENIA Corpus
24. GPML and TEI
25. Implementing the Conversion
26. The TEI.GENIA DTD
27. Overall Corpus Structure
28. The TEI Header
29. Text Annotation
30. Further Annotation
31. LTG XML Tools
32. Processing OHSUMED
33. Using LTG XML Tools on GENIA
34. Conclusions

Overview

The Text Encoding Initiative
Encoding the GENIA corpus in TEI
Further annotation of GENIA with XML tools

TEI History: Establishment and Motivations

The Text Encoding Initiative was established in 1987 under the joint sponsorship of the:
- ACH: Association for Computers and the Humanities
- ACL: Association for Computational Linguistics
- ALLC: Association for Literary and Linguistic Computing.
Impetus for the project came from the humanities computing community, which sought a common encoding scheme for complex textual structures in order to:
- reduce the diversity of existing encoding practices,
- simplify processing by machine,
- encourage sharing of electronic texts.
It soon became apparent that a sufficiently flexible scheme could provide solutions for text encoding problems generally.

TEI History: Basics and First Drafts

TEI became the only systematized attempt to develop a fully general text encoding model and set of encoding conventions based upon it, suitable for processing and analysis of:
- any type of text,
- in any language,
- and intended to serve the increasing range of existing (and potential) applications and use.
SGML, the Standard Generalized Markup Language, an ISO standard, was chosen as the underlying standard for the TEI Guidelines.
The first draft of the TEI Guidelines for Electronic Text Encoding and Interchange, TEI P1, was published in 1990; the second draft, TEI P2, followed in 1993.

TEI Guidelines: P3

TEI P3, the first non-draft version, was published in 1994 in two substantial green volumes (1200pp); soon P3 also became available on the Web.
In the years since, TEI P3 has become the de facto standard for scholarly work with digital text.
In 1999, a revised edition of TEI P3 was produced, correcting several typographic and other errors.
But in 1998, Version 1.0 of the XML specification was published by the W3C, and soon became an unprecedented success.

What is XML?

XML is a definition of device-independent, system-independent methods of storing and processing texts in electronic form
XML is a metalanguage - a language for describing other languages - which lets you design your own customized markup languages for different types of documents
XML is a project of the World Wide Web Consortium (W3C), and the development of the specification is being supervised by their XML Working Group; hence, it is an open and non-proprietary specification
XML is a subset of SGML, the international standard metalanguage for text markup systems (ISO 8879)

TEI Guidelines: P4 (& P5)

Breaking news:

TEI P4, a major revision of TEI P3, was published on the WWW in early 2002.
In June 2002 P4 was also published in print, in two beautiful blue volumes. The press release is also available in Japanese.
TEI P4 addresses the following issues:
- error correction, while maintaining backward compatibility;
- support for XML (as well as SGML).

The future:

Many possibilities for other, more fundamental, changes have been identified, e.g. the utilisation of XLink & XPointer, XML Schemas, etc.
TEI P5 will be the next full revision of the Guidelines. No date has yet been fixed for its appearance.

The TEI Consortium

In December 2000 the TEI Consortium was set up to maintain and develop the TEI standard.
The Consortium is a non-profit corporation with executive offices in Bergen, Norway, and hosts at the University of Bergen, Brown University, Oxford University, and the University of Virginia.
The Consortium is managed by a Board of Directors, and its technical work is overseen by an elected Council. Lou Burnard is the European Editor and Syd Bauman is the North American Editor.
Institutions and individuals can become Consortium members or subscribers, which gives them certain benefits inside the Consortium.
The Consortium now has over 50 members, from small university research projects to major academic libraries and institutions.

Projects Using the TEI

There are 88 listed and 75 “live” projects that have used the TEI, as given on the TEI project page. Some examples, with last significant update of the entry:

American Numismatic Society [20 May 2002]
American Theological Library Association [18 April 2002]
The Legacy Tobacco Documents Library [6 February 2002]
Emblem Project Utrecht [1 February 2002]
Medieval Nordic Text Archive [30 January 2002]
Oxford Text Archive [22 January 2002]
African Languages Lexicon Project [21 January 2001]
British National Corpus [21 January 2002]
The World of Dante [21 January 2001]
Victorian Women Writers' Project [16 January 2002]
Henrik Ibsen's Writings [11 January 2002]
The Digital Dictionary of Buddhism [18 December 2001]
The English-Norwegian Parallel Corpus [18 December 2001]
The FIDA Corpus of Slovene Language [18 December 2001]
The Oslo Multilingual Corpus [18 December 2001]
Slovene-English Parallel Corpus [18 December 2001]
Multext-East [17 December 2001]

TEI in Japan

Six listed projects are centered on Asian languages:

Chinese/Japanese/Korean-English Dictionary
Contact: Charles Muller, Toyo Gakuen University, Chiba
The Digital Dictionary of Buddhism
Contact: Charles Muller, Toyo Gakuen University, Chiba
Chinese Buddhist Electronic Text Association
Contact: Christian Wittern, Kyoto University
Japanese Text Initiative
Contact: Sachiko Iwabuchi, University of Virginia
(Japanese/English Bilingual Corpus)
Contact: Francis Bond, NTT Communication Science Laboratories
(Japanese Map Task Dialogue Corpus)
Contact: Syun Tutiya, Chiba University

Why not more?

Why isn't TEI used more in Asia? I see three possibilities:

There is, in general, less work of the kind where TEI would be helpful, i.e. fewer corpora are produced, and fewer (historical, classical) texts are being made available electronically. Maybe this is due to differing funding priorities? In Europe, there have been a number of EU project calls focused on language resources, while the US funds many projects on digitisation of literary materials.
TEI is not known in the region. In the USA and Europe TEI has had significant exposure, which is rather natural as this is where it was conceived and where the work took place. TEI has also been featured at various - mostly corpus-oriented - events.
Different character sets used in Asian languages and/or English documentation and tag names make TEI unappealing in contrast to locally produced encodings. This is more of a wild guess - after all, TEI/SGML is able to handle arbitrary character sets, and probably any interchange format will have its documentation in English...

The TEI Guidelines

TEI Guidelines define a language for describing how texts are constructed and propose names for their components.
There are many such standard vocabularies in the industrial world (e.g. banking, aircraft maintenance, chemical modelling); TEI's achievement has been to try to do the same thing for textual and linguistic data.
TEI Guidelines consist of:
- the formal specification, consisting of a set of SGML/XML Document Type Definition (DTD) fragments; these DTDs specify the element grammar of valid TEI documents;
- the accompanying documentation, explaining the background and overall structure of TEI document modelling and describing the meaning of the formally defined elements.
The Guidelines follow Knuth's literate programming practice, where the source (itself written in TEI) contains the formal specification as well as the documentation.

Structure of the TEI DTD

The formal SGML/XML part of TEI comes as a set of DTD fragments or tag sets. A TEI DTD for a particular application is constructed by selecting an appropriate combination of such tag sets, which include:

core tag sets: standard components of the TEI main DTD in all its forms; these are always included without any special action by the encoder.
base tag sets: basic building blocks for specific text types; exactly one base must be selected by the encoder, unless one of the ‘combined’ bases is used.
additional tag sets: extra tags useful for particular purposes. These tag sets are compatible with all bases and with each other; an encoder may therefore add them to the selected base in any combination desired.
user defined tag sets: these extra tags give the possibility of extending and overriding the definitions provided in the TEI tag set.

The Core Tagset

The core tagset, which is always available, consists of:

Core tags: Used in the text; for the most part they are in-line elements with no consistent internal structure, e.g. highlighting (<emph>), quotation (<q>), names (<name>), etc.
TEI header: Describes an encoded work so that the text itself, its source, its encoding, and its revisions are all thoroughly documented.

Base Tagsets

Only one base can be chosen, unless a combined base is also selected:

TEI.prose: the base tag set for prose
TEI.verse: the base tag set for verse
TEI.drama: the base tag set for drama
TEI.spoken: the base tag set for transcriptions of spoken texts
TEI.dictionaries: the base tag set for print dictionaries
TEI.terminology: the base tag set for terminological data files
TEI.general: the generic mixed-mode base tag set
TEI.mixed: the base tag set for free mixed-mode texts

Additional Tagsets

These tagsets represent additional interpretations of text, and an arbitrary number can be chosen:

TEI.linking: tags for linking, segmentation, and alignment
TEI.analysis: tags for simple analytic mechanisms
TEI.fs: tags for feature structure analysis
TEI.certainty: tags for indicating uncertainty and probability in the markup
TEI.transcr: tags for manuscripts, analytic bibliography, and transcription of primary sources
TEI.textcrit: tags for critical editions
TEI.names.dates: specialized tags for names and dates
TEI.nets: tags for graphs, digraphs, trees, and other networks
TEI.figures: tags for graphics, figures, illustrations, tables, and formulae
TEI.corpus: additional tags for language corpora
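
The selection rules for tag sets (core always present, exactly one base unless a combined base is used, any number of additional tag sets) can be sketched as a small check that emits the parameter-entity declarations switching the tag sets on. This is only an illustrative sketch: the function name `tagset_entities` is hypothetical, the tag set names are those listed on the slides, and treating TEI.general and TEI.mixed as the combined bases is an assumption read off the base tag set list.

```python
# Tag set names as listed on the base and additional tag set slides.
BASES = {"TEI.prose", "TEI.verse", "TEI.drama", "TEI.spoken",
         "TEI.dictionaries", "TEI.terminology"}
# Assumption: these two are the "combined" bases mentioned on the slides.
COMBINED = {"TEI.general", "TEI.mixed"}
ADDITIONAL = {"TEI.linking", "TEI.analysis", "TEI.fs", "TEI.certainty",
              "TEI.transcr", "TEI.textcrit", "TEI.names.dates", "TEI.nets",
              "TEI.figures", "TEI.corpus"}


def tagset_entities(selected):
    """Return DTD parameter-entity declarations for the selected tag sets.

    Raises ValueError when the selection breaks the rule "exactly one
    base, unless a combined base is used". Hypothetical helper.
    """
    known = BASES | COMBINED | ADDITIONAL
    unknown = [name for name in selected if name not in known]
    if unknown:
        raise ValueError(f"unknown tag sets: {unknown}")
    bases = [name for name in selected if name in BASES]
    combined = [name for name in selected if name in COMBINED]
    if len(combined) > 1:
        raise ValueError("select at most one combined base")
    if not combined and len(bases) != 1:
        raise ValueError("select exactly one base, or add a combined base")
    # The core tag set needs no declaration: it is always included.
    return "\n".join(f'<!ENTITY % {name} "INCLUDE">' for name in selected)


# A simple selection: prose base plus the analysis tag set.
print(tagset_entities(["TEI.prose", "TEI.analysis"]))
```

Selecting two plain bases without a combined base, e.g. `["TEI.prose", "TEI.verse"]`, raises `ValueError` in this sketch.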

Examples of TEI Use

A newspaper story:


<div1 type="story">
  <head rend="large underlined" type="sub">
    President pledges safeguards for 2,400 British troops
    in Bosnia
  </head>
  <head rend="very large bold" type="main">
    Major agrees to enforced no-fly zone
  </head>
  <byline>
    By George Jones, Political Editor, in Washington
  </byline>
  <p>
    Greater Western intervention in the conflict in
    former Yugoslavia was pledged by President Bush ...
  </p>
</div1>

TEI.analysis Example


<seg id="orwl.en.24" corresp="orwl.sl.24">
  <s id="Oen.1.1.4.5">
    <q>
    <w ana="Af" lemma="big">Big</w> 
    <w ana="Ncms" lemma="brother">Brother</w> 
    <w ana="Vaip3s" lemma="be">is</w> 
    <w ana="Vmpp" lemma="watch">watching</w> 
    <w ana="Pp2" lemma="you">you</w>
    </q>
    <w ana="Dd" lemma="the">the</w> 
    <w ana="Ncns" lemma="caption">caption</w> 
    <w ana="Vmis" lemma="say">said</w>
    <w ana="Cs" lemma="while">while</w> 
    <w ana="Dd" lemma="the">the</w> 
    <w ana="Af" lemma="dark">dark</w> 
    <w ana="Ncnp" lemma="eye">eyes</w> 
    <w ana="Vmis" lemma="look">looked</w> 
    <w ana="Rmp" lemma="deep">deep</w> 
    <w ana="Sp" lemma="into">into</w> 
    <w ana="Np" lemma="winston">Winston</w>
    <w type="rsplit" ana="St" lemma="'s">'s</w> 
    <w ana="Ps3" lemma="own">own</w>
    <c ctag=".">.</c>
  </s>
</seg>
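
As a rough illustration of how such word-level annotation can be consumed, the sketch below reads the `<w>` elements of a shortened copy of the sentence with Python's standard `xml.etree.ElementTree` module. The function name `tokens` is hypothetical, and the sample is cut down to the quoted part of the sentence.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example sentence (quoted part only).
SAMPLE = """\
<s id="Oen.1.1.4.5">
  <w ana="Af" lemma="big">Big</w>
  <w ana="Ncms" lemma="brother">Brother</w>
  <w ana="Vaip3s" lemma="be">is</w>
  <w ana="Vmpp" lemma="watch">watching</w>
  <w ana="Pp2" lemma="you">you</w>
  <c ctag=".">.</c>
</s>
"""


def tokens(sentence_xml):
    """Return (form, lemma, ana) for every <w> element, in document order."""
    root = ET.fromstring(sentence_xml)
    return [(w.text, w.get("lemma"), w.get("ana")) for w in root.iter("w")]


for form, lemma, ana in tokens(SAMPLE):
    print(form, lemma, ana, sep="\t")
```

Punctuation is marked with `<c>` rather than `<w>`, so it is skipped here.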

TEI.fs Example


<fsLib>
  <fs type="Noun" id="Ncfda" feats="N1.c N2.f N3.d N4.a"/>
  <fs type="Noun" id="Ncfdd" feats="N1.c N2.f N3.d N4.d"/>
  <fs type="Noun" id="Ncfdg" feats="N1.c N2.f N3.d N4.g"/>
  ...
</fsLib>

<fLib>
  <f id="N1.c"  select="en ro sl cs bg et hu hr" name="Type">
    <sym value="common"/>
  </f>
  <f id="N1.p"  select="en ro sl cs bg et hu hr" name="Type">
    <sym value="proper"/>
  </f>
  <f id="N2.m"  select="en ro sl cs bg       hr" name="Gender">
    <sym value="masculine"/>
  </f>
  <f id="N2.f"  select="en ro sl cs bg       hr" name="Gender">
    <sym value="feminine"/>
  </f>
  <f id="N2.n"  select="en ro sl cs bg       hr" name="Gender">
    <sym value="neuter"/>
  </f>
  ...
</fLib>
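
The link between the two libraries can be made explicit with a small sketch: each `<fs>` lists feature identifiers in `feats`, and each identifier is looked up among the `<f>` elements. The function `expand` and the wrapping `<lib>` root element are hypothetical, introduced only to make the excerpt parseable. Features elided in the excerpt (here `N3.d` and `N4.a`) are reported as unresolved rather than guessed, and the `select` attribute is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Excerpt of the two libraries; <lib> is a hypothetical wrapper element.
LIBRARY = """\
<lib>
  <fsLib>
    <fs type="Noun" id="Ncfda" feats="N1.c N2.f N3.d N4.a"/>
  </fsLib>
  <fLib>
    <f id="N1.c" name="Type"><sym value="common"/></f>
    <f id="N1.p" name="Type"><sym value="proper"/></f>
    <f id="N2.f" name="Gender"><sym value="feminine"/></f>
  </fLib>
</lib>
"""


def expand(library_xml, fs_id):
    """Resolve the feats of one <fs> into {name: value}.

    Returns (resolved, missing), where missing lists the feature ids
    that have no <f> definition in the given library.
    """
    root = ET.fromstring(library_xml)
    features = {f.get("id"): (f.get("name"), f.find("sym").get("value"))
                for f in root.iter("f")}
    for fs in root.iter("fs"):
        if fs.get("id") == fs_id:
            resolved, missing = {}, []
            for ref in fs.get("feats").split():
                if ref in features:
                    name, value = features[ref]
                    resolved[name] = value
                else:
                    missing.append(ref)
            return resolved, missing
    raise KeyError(fs_id)


resolved, missing = expand(LIBRARY, "Ncfda")
print(resolved, missing)
```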

TEI Lite

TEI Lite is a particular parametrisation of TEI (a DTD), which implements a useful “starter set”, comprising the elements which almost every user should know about.

Some characteristics of TEI Lite:

includes most of the TEI “core” tag set, since this contains elements relevant to virtually all text types and all kinds of text-processing work;
handles a reasonably wide variety of texts, at the level of detail found in existing practice;
is useful for the production of new documents as well as encoding of existing ones;
is as small and simple as is consistent with the other goals;
has been translated into a number of languages, among them Japanese.

The Advantages of Using TEI

Quality: Using a wide-coverage, well-designed (modular and extensible), widely accepted and maintained architecture.
Documentation: TEI is documented in the Guidelines as well as papers and support documentation of various projects, e.g. “best practice guides”.
Training: Tutorials on TEI are given at various conferences and are actively promoted by the TEI Consortium.
Support: Specific problems might have been encountered before, and there are people who know how to solve them, e.g. on the tei-l public discussion list.
Software: Various software to process TEI already exists, and more is likely to become available.
Political correctness: Using TEI means contributing to open standards and recommendations.

The Disadvantages of Using TEI

Tag abuse: TEI might not have elements / attributes with the exact meaning we require; this results in a tendency to misuse tags for purposes they were not meant for.
Tag bloat: being a general purpose recommendation, TEI can never be optimal for a specific application, i.e. a custom developed DTD will be leaner, with fewer (redundant) tags.
TEI for humanities: TEI is perhaps least developed for “high level” NLP applications: it is problematic for encoding ontologies and lexical databases, and there are not many applications of feature structures.

The GENIA Project

The GENIA project is being undertaken at the Tsujii Laboratory, University of Tokyo, and is partially supported by the JSPS Research for the Future program.
GENIA seeks to automatically extract useful information from texts written by scientists to help overcome the problems caused by information overload.
The prototype domain of application is bio-medicine, in particular extracting event information about protein interactions.
The project is building a corpus, which, supported by other types of resources, is to be used as a testbed for the application domain.

The GENIA Corpus

The GENIA corpus consists of annotated abstracts taken from the National Library of Medicine's MEDLINE database.
The corpus includes local (abstract specific) lexica and ontologies.
The corpus is encoded in XML, using the GPML (GENIA Project Markup Language) DTD.
Version 1.1 was released in September 2001, consists of 670 abstracts (ca. 130,000 words), and is publicly available.
Version 1.1 is sentence segmented, and manually annotated for terms and, in some cases, clauses.
Version 2.0 is to be larger, and is to contain further linguistic and domain specific annotation, e.g. tokenisation, part-of-speech tags, (shallow) parsing, lexical information, etc.

GPML and TEI

The GENIA markup language is GPML, an XML DTD specifically defined for the GENIA corpus.
TEI is a well-designed and widely accepted XML architecture, which has often been used for annotating language corpora.
By re-coding the GPML corpus to TEI:
- GENIA can gain new insights into possible encoding practices (e.g. header information)
- the GENIA corpus might become better suited for interchange
- the TEI encoding can serve as a blueprint for a general standard for encoding bio-medical data, a rapidly expanding field.
With an automatic transform to TEI there is also no need to abandon the GPML format, which, as it has been crafted specially for GENIA, provides a tighter encoding than can be possible with the more general TEI.

Implementing the Conversion

The conversion process takes advantage of the fact that both the input (GPML) and output (TEI) are encoded in XML: it is an XSLT stylesheet.
XSLT (XSL Transformations), a W3C Recommendation, enables the declarative specification of transformations between XML documents.
There exist a number of free XSLT processors; the best current implementations of the specification seem to be Mike Kay's Saxon and Daniel Veillard's libxslt.

The TEI.GENIA DTD

<!DOCTYPE teiCorpus.2
    PUBLIC "-//TEI P4//DTD Main Document Type//EN" 
           "http://www.tei-c.org/P4X/DTD/tei2.dtd" [ 
  <!ENTITY % TEI.XML          "INCLUDE" >
  <!ENTITY % TEI.general      "INCLUDE">
  <!ENTITY % TEI.prose        "INCLUDE">
  <!ENTITY % TEI.dictionaries "INCLUDE">
  <!ENTITY % TEI.terminology  "INCLUDE">
  <!ENTITY % TEI.linking      "INCLUDE">
  <!ENTITY % TEI.analysis     "INCLUDE">
  <!ENTITY % TEI.fs           "INCLUDE">
  <!ENTITY % TEI.corpus       "INCLUDE">
  <!ENTITY % TEI.extensions.ent SYSTEM 'geniaex.ent'>
  <!ENTITY % TEI.extensions.dtd SYSTEM 'geniaex.dtd'>
]>

Overall Corpus Structure

<!DOCTYPE teiCorpus.2 SYSTEM "genia-tei.dtd">

<teiCorpus.2>
  <teiHeader type="corpus">*Corpus_header*</teiHeader>
  <TEI.2 id="*MEDLINE_ID*">
    <teiHeader type="text">*Article_header*</teiHeader>
    <text>
      <body>
        <div type="abstract">
           <head>*Title_of_article*</head>
           <p>*Abstract_of_article*</p>
        </div>
        <div type="ontology">*Local_ontology*</div>
        <div type="lexicon">*Local_lexicon*</div>
      </body>
    </text>
  </TEI.2>
  *More_articles*
</teiCorpus.2>
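
To show how this layout can be traversed, the following sketch lists, for each `<TEI.2>` article in a corpus, the types of its `<div>` sections. The function name `outline` and the article identifier `A1` are hypothetical placeholders (a real corpus would carry MEDLINE identifiers), and the headers are left empty.

```python
import xml.etree.ElementTree as ET

# Minimal one-article corpus following the skeleton above.
CORPUS = """\
<teiCorpus.2>
  <teiHeader type="corpus"/>
  <TEI.2 id="A1">
    <teiHeader type="text"/>
    <text>
      <body>
        <div type="abstract">
          <head>Title of article</head>
          <p>Abstract of article.</p>
        </div>
        <div type="ontology"/>
        <div type="lexicon"/>
      </body>
    </text>
  </TEI.2>
</teiCorpus.2>
"""


def outline(corpus_xml):
    """Map each article id to the list of its <div> types, in order."""
    root = ET.fromstring(corpus_xml)
    return {article.get("id"): [div.get("type") for div in article.iter("div")]
            for article in root.iter("TEI.2")}


print(outline(CORPUS))
```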

The TEI Header

To illustrate the information contained in the TEI header, we give below a part of the Encoding Description in the GENIA corpus header:

<encodingDesc>
  <projectDesc>
    <p>The GENIA project seeks to automatically extract ...</p>
  </projectDesc>
  <samplingDecl>
    <p>The corpus consists of abstracts found by ...</p>
  </samplingDecl>
  <tagsDecl>
    <tagUsage gi="body" occurs="670"></tagUsage>
    <tagUsage gi="cl" occurs="491"></tagUsage>
    <tagUsage gi="div" occurs="2010"></tagUsage>
    <tagUsage gi="entry" occurs="19472"></tagUsage>
    <tagUsage gi="form" occurs="19472"></tagUsage>
    <tagUsage gi="head" occurs="670"></tagUsage>
    <tagUsage gi="p" occurs="670"></tagUsage>
    <tagUsage gi="ptr" occurs="27305"></tagUsage>
    <tagUsage gi="s" occurs="5109"></tagUsage>
    <tagUsage gi="term" occurs="48906"></tagUsage>
    <tagUsage gi="termEntry" occurs="22707"></tagUsage>
    <tagUsage gi="tig" occurs="22707"></tagUsage>
    <tagUsage gi="xptr" occurs="14874"></tagUsage>
    <tagUsage gi="xr" occurs="19472"></tagUsage>
  </tagsDecl>
</encodingDesc>
<profileDesc>
  <langUsage>
    <language id="en">English</language>
    <language id="la">Latin</language>
  </langUsage>
</profileDesc>
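
Counts like those in `<tagsDecl>` need not be maintained by hand. As a sketch (not necessarily how the GENIA header was produced), the element counts can be derived from the encoded text itself. The function name `tag_usage` and the tiny sample document are assumptions for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Tiny stand-in for an encoded text body.
BODY = """\
<body>
  <div type="abstract">
    <head>A title</head>
    <p><s>First <term>sentence</term>.</s> <s>Second.</s></p>
  </div>
</body>
"""


def tag_usage(text_xml):
    """Return one <tagUsage> line per element name, sorted by name."""
    root = ET.fromstring(text_xml)
    # root.iter() visits the root element and all of its descendants.
    counts = Counter(element.tag for element in root.iter())
    return [f'<tagUsage gi="{gi}" occurs="{n}"></tagUsage>'
            for gi, n in sorted(counts.items())]


print("\n".join(tag_usage(BODY)))
```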

Text Annotation

<div type="abstract">
  <head>Retinoic acid downmodulates erythroid differentiation and 
    <term ana="SEM-94.000">GATA1 expression</term> in 
    <term ana="SEM-94.001">purified adult-progenitor culture</term>.
  </head>
  <p>
    <s>
      In 
      <cl ana="SEM-94.011 SEM-94.012" 
          function="(OR SEM-94.011 SEM-94.012)">
        <term ana="SEM-94.013">clonogenetic fetal calf serum</term>
        <term ana="SEM-94.014">-supplemented (FCS+)</term> 
        or 
        <term ana="SEM-94.015">-nonsupplemented (FCS-)</term> 
        <term ana="SEM-94.016">culture</term>
      </cl> 
      treated with saturating levels of 
      <term ana="SEM-94.018">interleukin-3</term> 
      (<term ana="SEM-94.019">IL-3</term>) 
      <term ana="SEM-94.020">granulocyte- macrophage 
         colony-stimulating factor</term> ...
    </s> 
    ...
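
A minimal sketch of reading this annotation back: collect every `<term>` together with its `ana` value, normalising the whitespace introduced by line breaks. The function name `terms` is hypothetical, the sample is the `<head>` of the abstract above, and what the `ana` values point to (presumably entries of the local ontology) is an assumption.

```python
import xml.etree.ElementTree as ET

HEAD = """\
<head>Retinoic acid downmodulates erythroid differentiation and
  <term ana="SEM-94.000">GATA1 expression</term> in
  <term ana="SEM-94.001">purified adult-progenitor culture</term>.
</head>
"""


def terms(xml_text):
    """Return (ana, text) for each <term>, with whitespace collapsed."""
    root = ET.fromstring(xml_text)
    return [(term.get("ana"), " ".join("".join(term.itertext()).split()))
            for term in root.iter("term")]


for ana, text in terms(HEAD):
    print(ana, text, sep="\t")
```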

Further Annotation

Some next useful steps in annotation:

tokenisation into words and punctuation marks
part-of-speech tagging
annotation of named entities: abbreviations, names, chemical formulas, numbers, ...
identification of nominal compounds
shallow parsing

LTG XML Tools

Edinburgh's Language Technology Group (LTG) has produced an XML toolchest for natural language processing.
The tools are modular with stream input/output, and can be combined together in a pipeline.
The release available runs on Sun/Solaris and is provided free of charge under a research license.
The main component of the toolchest is a general purpose cascaded transducer which processes an input stream deterministically and rewrites it according to a set of rules provided in a grammar file.
With the toolchest come various grammars, most importantly one to tokenise text. The grammars are accompanied by documentation which allows users to alter grammars to suit their own needs or develop new rule sets for particular purposes.
The system also contains two statistical components: a part-of-speech tagger and a sentence boundary disambiguator.

Processing OHSUMED

The LTG XML tools have already been used to process a corpus of MEDLINE abstracts:

Claire Grover, Colin Matheson, Andrei Mikheev and Marc Moens (2000): LT TTT - A Flexible Tokenisation Tool. In Proceedings of Second International Conference on Language Resources and Evaluation (LREC 2000).
Claire Grover and Alex Lascarides (2001): XML-Based Data Preparation for Robust Deep Parsing. ACL-EACL 2001.
Claire Grover, Ewan Klein, Alex Lascarides and Maria Lapata (2002): XML-based NLP Tools for Analysing and Annotating Medical Language. 2nd Workshop on NLP and XML (NLPXML-2002), Taipei, September 1, 2002. (CoLing Workshop)

Using LTG XML Tools on GENIA

The LTG tools and the rulesets developed for processing OHSUMED are currently being adapted for processing the GENIA corpus:

<SENTENCE>
<W C='W' P='DT'  C2='DD'>Some</W> 
<W C='W' P='VBN' C2='VVN' LM='convert'>converted</W>
<W C='W' P='IN'  C2='II'>from</W>
<W C='W' P='JJ'  C2='JJ'>ventricular</W>
<W C='W' P='NN'  C2='NN1' LM='fibrillation' VSTEM='fibrillate'>fibrillation</W>
<W C='W' P='TO'  C2='II'>to</W>
<W C='W' P='JJ'  C2='JJ' VSTEM='organize'>organized</W>
<W C='W' P='NNS' C2='NN2' LM='rhythm'>rhythms</W>
<W C='W' P='IN'  C2='II'>by</W>
<W C='HYW' P='JJ'>defibrillation-trained</W>
<W C='W' P='NN' C2='NN1' LM='ambulance'>ambulance</W>
<W C='W' P='NNS' C2='NN2' LM='technician'>technicians</W>
<PHR C='BR'>
  <W C='BR' P='(' C2='('>(</W>
  <W C='ABBR' P='NNS' C2='NP1'>EMT-Ds</W>
  <W C='BR' P=')' C2=')'>)</W>
</PHR>
<W C='W' P='MD' C2='VM' LM='will'>will</W>
<W C='W' P='VB' C2='VV0' LM='refibrillate'>refibrillate</W>
<W C='W' P='IN' C2='II'>before</W>
<W C='W' P='NN' C2='NN1' LM='hospital'>hospital</W>
<W C='W' P='NN' C2='NN1' LM='arrival' VSTEM='arrive'>arrival</W>
<W C='.' P='.'  C2='.'>.</W>
</SENTENCE>
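
One way such output could be carried over into the TEI encoding is sketched below: each LTG `<W>` becomes a TEI `<w>`, with the `P` tag stored in `ana` and the `LM` lemma in `lemma`. This mapping is purely an assumption for illustration, not the project's actual conversion. The function name `to_tei` is hypothetical, and the other LTG attributes (`C`, `C2`, `VSTEM`) are simply dropped.

```python
import xml.etree.ElementTree as ET

# Last four tokens of the sentence above.
LTG = """\
<SENTENCE>
<W C='W' P='IN' C2='II'>before</W>
<W C='W' P='NN' C2='NN1' LM='hospital'>hospital</W>
<W C='W' P='NN' C2='NN1' LM='arrival' VSTEM='arrive'>arrival</W>
<W C='.' P='.'  C2='.'>.</W>
</SENTENCE>
"""


def to_tei(ltg_xml):
    """Build a TEI-style <s> element from an LTG <SENTENCE> (assumed mapping)."""
    source = ET.fromstring(ltg_xml)
    sentence = ET.Element("s")
    for token in source.iter("W"):
        word = ET.SubElement(sentence, "w", {"ana": token.get("P")})
        if token.get("LM") is not None:
            # Only tokens that carry an LM attribute get a lemma.
            word.set("lemma", token.get("LM"))
        word.text = token.text
    return sentence


print(ET.tostring(to_tei(LTG), encoding="unicode"))
```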

Conclusions

TEI, the framework to annotate linguistic data
Applying TEI to the GENIA corpus
Further processing of the GENIA corpus using XML technology
