Annotating the GENIA corpus
From January to June 2002 I was a visiting researcher at Tsujii Laboratory,
Department of Information Science, University of Tokyo, where - apart
from having a very good time - I worked on the GENIA project.
GENIA seeks to develop information extraction techniques for scientific
texts using NLP technology. The key task addressed by GENIA is the
extraction of event information about protein interactions.
To this end - and to provide much-needed annotated
language resources for biomedical informatics - the project is
developing a corpus of MEDLINE abstracts, which is being marked up for
terms from a domain-specific ontology, as well as for other types of
linguistic knowledge.
The corpus is encoded in the GENIA Corpus Markup Language (GPML), which
is defined by an XML DTD.
The GENIA corpus has already been released in several versions, which
are freely available via the WWW.
My work focused on two aspects of the annotation of the GENIA corpus:
- (re)encoding of the corpus according to the TEI Guidelines,
- linguistic annotation of the corpus with the LTG toolkit.
The Text Encoding Initiative Guidelines and the GENIA Corpus
The purpose of this work was to suggest an encoding of the corpus
according to the Text Encoding
Initiative Guidelines P4, and specify a constructive mapping (an
XSLT transform) to this
encoding. The motivation for this re-encoding is that TEI is a
well-designed and widely accepted architecture, which has often been
used for annotating language corpora, and by porting to it, GENIA can
gain new insights into possible encoding practices and maybe make the
corpus better suited for interchange. As the transformation to TEI is
fully automatic, there is also no need to abandon the GPML format,
which, as it has been crafted specially for GENIA, provides a tighter
encoding than is possible with the more general TEI.
The following documents further describe the TEI parametrisation and
conversion process:
Below are our parametrisation of TEI and the conversion stylesheets:
- TEI parametrisation for GENIA:
- The XSLT stylesheets for converting the GENIA corpus to TEI:
To convert to TEI with xsltproc, do:
$ xsltproc genia30tei.xsl GENIAcorpus3.0.xml > GENIAcorpus3.0tei.xml
$ xsltproc gpml2tei.xsl GENIA.gpml > GENIAtei.xml
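To give a feel for what a fully automatic, constructive mapping of this kind does, here is a minimal sketch in Python rather than XSLT. It is not the actual stylesheet: the element and attribute correspondences (sentence to s, cons to term, sem to type) and the sample sentence are assumptions chosen for illustration, and the real mapping is the one defined in the XSLT stylesheets listed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical GPML-to-TEI correspondences, for illustration only;
# the real mapping is defined in the project's XSLT stylesheets.
TAG_MAP = {"sentence": "s", "cons": "term"}
ATTR_MAP = {"sem": "type"}


def to_tei(elem):
    """Return a copy of elem with tags and attributes renamed via the maps."""
    new = ET.Element(TAG_MAP.get(elem.tag, elem.tag))
    for name, value in elem.attrib.items():
        new.set(ATTR_MAP.get(name, name), value)
    # Preserve the running text before and after child elements.
    new.text, new.tail = elem.text, elem.tail
    for child in elem:
        new.append(to_tei(child))
    return new


# Invented sample input, not taken from the corpus.
sample = (
    '<sentence><cons sem="G#protein_molecule">IL-2</cons>'
    " binds its receptor.</sentence>"
)
converted = ET.tostring(to_tei(ET.fromstring(sample)), encoding="unicode")
print(converted)
```

Because every source element is rewritten by rule and nothing is discarded, the conversion can be re-run whenever the GPML source changes, which is why the original format does not need to be abandoned.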
Annotating the GENIA Corpus with LTG Tools
Further Links
Page last updated 2004-01-07,
et