Annotating the GENIA corpus

Tomaž Erjavec

From January to June 2002 I was a visiting researcher at Tsujii Laboratory, Department of Information Science, University of Tokyo, where - apart from having a very good time - I worked on the GENIA project.

GENIA seeks to develop information extraction techniques for scientific texts using NLP technology. The key task addressed by GENIA is the extraction of event information about protein interactions. To this end - and to provide much-needed annotated language resources for biomedical informatics - the project is developing a corpus of MEDLINE abstracts, which is being marked up for terms from a domain-specific ontology, as well as for other types of linguistic knowledge. The corpus is encoded in the GENIA Corpus Markup Language (GPML), which is defined as an XML DTD. The GENIA corpus has already been released in several versions, which are freely available via the WWW.

My work focused on two aspects of the annotation of the GENIA corpus:

  1. (re)encoding of the corpus according to the TEI Guidelines,
  2. linguistic annotation of the corpus with the LTG toolkit.

The Text Encoding Initiative Guidelines and the GENIA Corpus

The purpose of this work was to suggest an encoding of the corpus according to the Text Encoding Initiative Guidelines P4, and to specify a constructive mapping (an XSLT transform) to this encoding. The motivation for this re-encoding is that TEI is a well-designed and widely accepted architecture, which has often been used for annotating language corpora. By porting to it, GENIA can gain new insights into possible encoding practices, and the corpus may become better suited for interchange. As the transformation to TEI is fully automatic, there is also no need to abandon the GPML format, which, having been crafted specially for GENIA, provides a tighter encoding than is possible with the more general TEI.
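
To illustrate what a fully automatic, element-by-element mapping of this kind involves, here is a minimal sketch in Python rather than XSLT. It is not the actual GENIA stylesheet: the sample fragment, the element and attribute names on the source side, and the choice of TEI targets (div, head, p, s, term) are all assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical GPML-like fragment. The element and attribute names are
# assumptions for illustration and are not taken from the GENIA DTD.
GPML_SAMPLE = """<article>
  <title><sentence>Activation of <cons sem="protein">NF-kappa B</cons>.</sentence></title>
  <abstract><sentence>The <cons sem="protein">IL-2</cons> gene is expressed.</sentence></abstract>
</article>"""

# Hypothetical mapping from source element names to TEI element names.
TAG_MAP = {
    "article": "div",
    "title": "head",
    "abstract": "p",
    "sentence": "s",
    "cons": "term",
}

# Hypothetical mapping from source attribute names to TEI attribute names.
ATTR_MAP = {"sem": "type"}


def to_tei(elem):
    """Recursively copy elem, renaming tags and attributes via the maps.

    Text and tail strings are copied unchanged, so the running text of
    the document is preserved exactly.
    """
    new = ET.Element(TAG_MAP.get(elem.tag, elem.tag))
    for name, value in elem.attrib.items():
        new.set(ATTR_MAP.get(name, name), value)
    new.text = elem.text
    new.tail = elem.tail
    for child in elem:
        new.append(to_tei(child))
    return new


def convert(gpml_text):
    """Parse a source string and return the renamed tree as a string."""
    return ET.tostring(to_tei(ET.fromstring(gpml_text)), encoding="unicode")


if __name__ == "__main__":
    print(convert(GPML_SAMPLE))
```

Because the sketch only renames elements and attributes and never touches the text, the source file can remain the master copy and the TEI version can be regenerated at any time, which is the property the automatic transformation relies on.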

The following documents further describe the TEI parametrisation and conversion process:

Below are our parametrisation of TEI and the conversion stylesheets. To convert to TEI with xsltproc, run:
   $ xsltproc genia30tei.xsl GENIAcorpus3.0.xml > GENIAcorpus3.0tei.xml
   $ xsltproc gpml2tei.xsl GENIA.gpml > GENIAtei.xml

Annotating the GENIA Corpus with LTG Tools


Further Links


Page last updated 2004-01-07, et