Abstract
The talk first introduces the Text Encoding Initiative, an
international effort established in 1987 under the joint sponsorship
of the Association for Computers and the Humanities, the Association
for Computational Linguistics, and the Association for Literary and
Linguistic Computing. TEI is the only systematised attempt to develop
a fully general text encoding model and set of encoding conventions
based upon it. It is suitable for processing and analysis of any type
of text, in any language, and intended to serve the increasing range
of existing (and potential) applications and uses.
We give an overview of the history and organisation of the TEI and
introduce its main achievement, the TEI Guidelines, a set of
recommendations for text encoding based on SGML and, more recently, on
XML. We explain the
modular and parametrisable architecture of the Guidelines and give
some advantages and disadvantages of using TEI. Projects using TEI are
mentioned, with special emphasis on those dealing with Asian
languages.
The second part of the talk discusses the GENIA corpus, which is being
compiled at the Tsujii Laboratory, Department of Information Science,
University of Tokyo. The corpus consists of annotated abstracts taken
from the National Library of Medicine's MEDLINE database. We discuss the
encoding of the corpus in the GENIA markup language GPML, and its TEI
incarnation, automatically derived via XSLT. Special emphasis is placed
on developing a TEI parametrisation suitable for encoding the growing
body of biomedical resources. The talk concludes with recent
developments of the GENIA corpus and plans for the future.
33. Using LTG XML Tools on GENIA
The LTG tools, and the rulesets LTG has developed for processing
OHSUMED (cf. HTML example), are currently being adapted for processing
the GENIA corpus:
<SENTENCE>
<W C='W' P='DT' C2='DD'>Some</W>
<W C='W' P='VBN' C2='VVN' LM='convert'>converted</W>
<W C='W' P='IN' C2='II'>from</W>
<W C='W' P='JJ' C2='JJ'>ventricular</W>
<W C='W' P='NN' C2='NN1' LM='fibrillation' VSTEM='fibrillate'>fibrillation</W>
<W C='W' P='TO' C2='II'>to</W>
<W C='W' P='JJ' C2='JJ' VSTEM='organize'>organized</W>
<W C='W' P='NNS' C2='NN2' LM='rhythm'>rhythms</W>
<W C='W' P='IN' C2='II'>by</W>
<W C='HYW' P='JJ'>defibrillation-trained</W>
<W C='W' P='NN' C2='NN1' LM='ambulance'>ambulance</W>
<W C='W' P='NNS' C2='NN2' LM='technician'>technicians</W>
<PHR C='BR'>
<W C='BR' P='(' C2='('>(</W>
<W C='ABBR' P='NNS' C2='NP1'>EMT-Ds</W>
<W C='BR' P=')' C2=')'>)</W>
</PHR>
<W C='W' P='MD' C2='VM' LM='will'>will</W>
<W C='W' P='VB' C2='VV0' LM='refibrillate'>refibrillate</W>
<W C='W' P='IN' C2='II'>before</W>
<W C='W' P='NN' C2='NN1' LM='hospital'>hospital</W>
<W C='W' P='NN' C2='NN1' LM='arrival' VSTEM='arrive'>arrival</W>
<W C='.' P='.' C2='.'>.</W>
</SENTENCE>
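
Output of this kind can be read back with any XML parser. The following is a minimal sketch in Python, not part of the LTG toolset: it uses only the standard library, works on a shortened excerpt of the sentence above, and assumes (from the example alone) that P holds a part-of-speech tag and LM a lemma. The function name read_tokens is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the sentence shown above (not the full sentence).
SAMPLE = """<SENTENCE>
<W C='W' P='DT' C2='DD'>Some</W>
<W C='W' P='VBN' C2='VVN' LM='convert'>converted</W>
<W C='W' P='IN' C2='II'>from</W>
<W C='W' P='JJ' C2='JJ'>ventricular</W>
<W C='W' P='NN' C2='NN1' LM='fibrillation' VSTEM='fibrillate'>fibrillation</W>
<PHR C='BR'>
<W C='BR' P='(' C2='('>(</W>
<W C='ABBR' P='NNS' C2='NP1'>EMT-Ds</W>
<W C='BR' P=')' C2=')'>)</W>
</PHR>
<W C='.' P='.' C2='.'>.</W>
</SENTENCE>"""


def read_tokens(xml_text):
    """Return one dict per <W> element, in document order.

    Assumption: P is a part-of-speech tag and LM a lemma; both readings
    are inferred from the example, not from LTG documentation.
    """
    root = ET.fromstring(xml_text)
    tokens = []
    # iter("W") also descends into <PHR>, so bracketed tokens are included.
    for w in root.iter("W"):
        form = w.text or ""
        tokens.append({
            "form": form,
            "pos": w.get("P"),
            # Fall back to the surface form when no LM attribute is given.
            "lemma": w.get("LM", form),
        })
    return tokens


tokens = read_tokens(SAMPLE)
for tok in tokens:
    print(tok["form"], tok["pos"], tok["lemma"], sep="\t")
```

Falling back to the surface form when LM is absent is a choice made for this sketch; the example does not say how the tools themselves treat tokens without a lemma.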