Introductory course at ESSLLI 2002

Annotation of Language Resources

Lecture III.

Annotation Software

Tomaž Erjavec
Department of Intelligent Systems
Institute Jožef Stefan
Jamova 39, SI-1000 Ljubljana
Slovenia

Abstract

This lecture presents, for the most part, XML-related software, i.e. parsers, transformation engines and XML modules to use with general-purpose programming languages; a section is also devoted to editors, in particular Emacs. Other linguistic annotation software is also mentioned, in particular several corpus workbenches and statistical tool packages. The lecture concludes with a case study on annotating the GENIA Corpus with LTG tools.


1. Software

1.1. Processing XML

1.1.1. The SP suite

  • SP is the best known suite of basic SGML tools
  • written by James Clark, available from http://jclark.com/sp/
  • the SP programs are written in C++ and are freely available
  • maintenance has been taken over by an Open Source development team, hosted at http://openjade.sourceforge.net/
  • SP consists of the following programs:
    nsgmls
    validating parser
    sx
    converter from SGML to XML
    sgmlnorm
    markup normaliser
    spam
    markup stream editor
    spent
    interface to SP entity manager
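
For example, a document can be validated with nsgmls and down-translated to XML with sx (a minimal sketch; the file names are illustrative, and the document's DTD must be locatable):

    nsgmls -s document.sgml            # -s suppresses normal output, so only errors are reported
    sx document.sgml > document.xml    # emit an XML version of the SGML document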

1.1.2. Other SW by James Clark

Maintenance of software written by James Clark has mostly been taken over by open-source developers:
jade
an implementation of the style language part of DSSSL. Extensively used for formatting DocBook documents.
expat
the “XML Parser Toolkit”, a library for XML parsing in C. This parser is used to add XML support to Netscape 5 and Perl.
XP
a high-performance XML parser in Java
XT
a Java implementation of XSLT

1.1.3. Libxml2 & Libxslt

  • Libxml2 is the XML C library developed for the Gnome project by Daniel Veillard; its home page is at http://xmlsoft.org/
  • portable across operating systems, and a variety of language bindings make the library available in non-C environments, e.g. Perl
  • Libxml2 implements a number of existing standards related to markup languages: XML, XML Namespaces, XML Base, URI, XPath, HTML4, XPointer, XInclude, XML Catalogs, Canonical XML
  • A companion library is Libxslt, an XSLT C library
  • Libxslt includes xsltproc, a command line XSLT processing program; it is known to be one of the better implementations of XSLT
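
For example, xmllint, the command-line tool that comes with Libxml2, can validate a document, and xsltproc can apply a stylesheet to it (file names are illustrative):

    xmllint --valid --noout document.xml               # parse and validate, printing only errors
    xsltproc stylesheet.xsl document.xml > output.html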

1.1.4. Apache XML Project

  • The Apache Software Foundation provides support for the Apache community of open-source software projects.
  • One of their projects is the Apache XML Project, with the home page at http://xml.apache.org/
  • The goals of the Apache XML Project are to:
    • provide commercial-quality standards-based XML solutions that are developed in an open and cooperative fashion,
    • provide feedback to standards bodies (such as IETF and W3C) from an implementation perspective, and
    • be a focus for XML-related activities within Apache projects

1.1.5. Apache XML sub-projects

Sub-projects of the Apache XML Project focus on different aspects of XML:
Xerces
XML parsers in Java and C++ (with Perl and COM bindings)
Xalan
XSLT stylesheet processors, in Java and C++
Cocoon
XML-based web publishing, in Java
AxKit
XML-based web publishing, in mod_perl
FOP
XSL formatting objects, in Java
Xang
Rapid development of dynamic server pages, in JavaScript
SOAP
Simple Object Access Protocol
Batik
A Java based toolkit for Scalable Vector Graphics (SVG)
Crimson
A Java XML parser derived from the Sun Project X Parser
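
As an illustration, the Java version of Xalan can be run from the command line; the sketch below uses the Process driver class and the -IN/-XSL/-OUT options as documented for Xalan-Java, with illustrative file names:

    java org.apache.xalan.xslt.Process -IN document.xml -XSL stylesheet.xsl -OUT output.html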

1.1.6. XSLT processors

Several XSLT processors have already been mentioned: James Clark's XT (in Java), Apache Xalan (in Java and C++), and xsltproc, the command-line processor that comes with Libxslt (in C).

1.1.7. XML and Perl

  • The Perl programming language is popular for text processing: it has strong pattern matching and makes it possible to quickly write throw-away programs
  • Useful for up-translation to XML
  • Perl has a wealth of support for XML:
    • XML::Parser, the Perl interface to James Clark's expat parser
    • SAX, DOM, XPath
    • wrappers for Libxml2
  • Kip Hampton's articles on xml.com offer a wealth of information
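
As a sketch of what XML::Parser makes possible, the following throw-away one-liner prints an indented outline of a document's element structure (the file name is illustrative):

    perl -MXML::Parser -e '
      my $depth = 0;
      XML::Parser->new(Handlers => {
        Start => sub { print "  " x $depth++, $_[1], "\n" },  # $_[1] is the element name
        End   => sub { $depth-- },
      })->parsefile(shift);
    ' document.xml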

1.2. XML-aware editors

1.2.1. Emacs

  • A very powerful and extensible editor
  • Developed for Unix, but also works on Windows
  • Two versions: GNU Emacs and the XEmacs fork
  • Idiosyncratic and rather difficult to learn
  • The psgml mode customises Emacs for working with SGML and XML files
  • For (valid) XML documents psgml allows:
    • element and attribute insertion and completion
    • element-directed movement
    • hiding and folding elements
    • syntax highlighting
    • finding and reporting errors
  • psgml does not style the document

1.2.2. Other XML editors

Many XML editors exist, quite a few of them commercial:
  • jEdit: a programmer's text editor written in Java, with many plugins; XML and XSLT included; free under the GNU General Public License
  • XXE: XMLmind XML Editor, supports CSS; a free version is available
  • oXygen: XSL and XSLT support; costs $65 ($25 for academic use)
  • XMLwriter: incorporates XSLT, an IDE, ...; costs $40
  • XMLpro: includes the Near & Far DTD designer; costs $150
  • XMLspy: a powerful XML development environment (XSLT, Schema generation, IDE, ...); costs $400

1.2.3. Open Office

  • OpenOffice.org was initiated and is supported by Sun Microsystems
  • Mission Statement: To create, as a community, the leading international office suite that will run on all major platforms and provide access to all functionality and data through open-component based APIs and an XML-based file format.
  • OpenOffice.org includes import/export filters for various office productivity applications produced by Microsoft, e.g. Word.
  • KOffice has similar goals: it is a free, integrated office suite for KDE, the K Desktop Environment. It also uses XML as the common file format.

1.3. Corpus Workbenches

Many pointers are available from the Natural Language Software Registry and its successor, Language Technology World.

1.3.1. MATE

  • MATE stands for “Multilevel Annotation, Tools Engineering”.
  • The MATE EU project aimed to facilitate re-use of language resources by addressing the problems of creating, acquiring, and maintaining language corpora.
  • The MATE workbench is a program designed to aid in the display, editing and querying of annotated speech corpora. It can also be used for arbitrary sets of hyperlinked XML encoded files.
  • The workbench is written in Java and available under GPL.
  • It comprises an editing tool and (idiosyncratic) query and styling languages.
  • The current version is reported to have problems, but a new project, NITE (Natural Interactivity Tools Engineering), should produce a much improved version.

1.3.2. GATE

  • GATE is a software architecture for language engineering, in development since 1995 at the University of Sheffield.
  • GATE is made up of three elements:
    • an architecture describing how language processing systems are made up of components;
    • a framework (or class library, or SDK), written in Java and tested on Linux, Windows and Solaris;
    • a graphical development environment built on the framework.
  • Uses XML I/O and standoff markup.
  • Includes ANNIE, A Nearly-New Information Extraction system, which contains a tokeniser, gazetteer, sentence splitter, part-of-speech tagger, named-entity recognition grammars, and an orthographic co-reference module. Also has some IR features, e.g. indexing, clustering.

1.3.3. ALEMBIC

  • The ALEMBIC Workbench is a natural language engineering environment for the development of tagged corpora.
  • It makes it extremely easy to annotate textual data with fully customisable tagsets. Among the various methods used to expedite the tagging process is the application of machine learning to bootstrap the human annotation process.
  • It provides evaluation tools to analyse annotated data, whether it is for the purpose of assessing machine information extraction performance, or for measuring inter-annotator agreement for a particular corpus or task.
  • Uses SGML I/O.
  • Runs on Unix/Windows/Mac and is freely available.

1.3.4. AGTK: Annotation Graph Toolkit

  • AGTK is a C++ library implementing Annotation Graphs. It offers a common high-level application programming interface for representing the data and managing input/output, together with a common architecture for managing the interaction of multiple components. Various tools can be built on top of the Annotation Graph Toolkit.
  • Annotation Graphs (Bird and Liberman) are a formal framework for representing linguistic annotations of time series data. Annotation graphs abstract away from file formats, coding schemes and user interfaces, providing a logical layer for annotation systems.
  • AGs have been used mostly for time-series data, e.g. TableTrans for observational coding, using a spreadsheet whose rows are aligned to a signal; MultiTrans for transcribing multi-party communicative interactions recorded using multi-channel signals; InterTrans for creating interlinear text aligned to audio; and TreeTrans for creating and manipulating syntactic trees.

1.3.5. ATLAS

  • A project closely connected to AG is ATLAS, Architecture and Tools for Linguistic Analysis Systems, an initiative involving NIST, LDC and MITRE.
  • ATLAS addresses an array of application needs spanning corpus construction, evaluation infrastructure, and multi-modal visualization, and uses Annotation Graphs.
  • ATLAS is made up of four main components:
    • an annotation ontology,
    • an Application Programming Interface,
    • an interchange format for linguistic data and
    • MAIA, a type definition infrastructure.

1.4. Statistical tools

1.4.1. Part-of-Speech Taggers

  • TnT - A Statistical Part-of-Speech Tagger: Fast HMM tagger; good unknown word guessing; Eng/Ger pre-compiled models; Solaris and Linux.
  • Brill's Transformation-Based Learning Tagger: a symbolic tagger in C; flexible; can be slow. Other implementations are fnTBL (faster) and mu-TBL (a Prolog implementation).
  • TreeTagger: decision tree based tagger from Stuttgart; pre-compiled models for Eng/Ger/Fre/Ita; Solaris and Linux.
  • Maximum Entropy part of speech tagger: By Adwait Ratnaparkhi; downloadable (compiled) Java version; a sentence boundary detector is also available.
  • QTAG Part of speech tagger: An HMM-based Java POS tagger from Birmingham U. (Oliver Mason).
  • ICOPOST: C taggers by Ingo Schröder that implement maximum entropy, HMM trigram, and transformation-based learning. C source available under GNU public license.
  • LT POS and LT TTT: Edinburgh Language Technology Group tagger and text tokeniser (and sentence splitter). Binary only for Solaris. Doesn't allow you to train your own taggers.
  • TATOO, The ISSCO tagger: HMM tagger.
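
Most of these taggers are command-line programs; TnT, for instance, is run with a pre-compiled model and a text to tag, roughly as below (a sketch; the model and file names are illustrative, and the TnT manual gives the exact invocation):

    tnt models/wsj untagged.txt > tagged.tts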

1.4.2. CMU-Cambridge Language Modelling toolkit

  • CMU-Cambridge Language Modelling toolkit is a suite of UNIX software tools to facilitate the construction and testing of statistical (N-gram) language models.
  • Some of the tools are used to process general textual data into:
    • word frequency lists and vocabularies
    • (vocabulary-specific) word bi-gram and trigram counts and statistics
    • various Backoff bi-gram and trigram language models
  • Other tools use the resulting language models to compute:
    • perplexity; Out-Of-Vocabulary (OOV) rate
    • bi-gram- and trigram-hit ratios; distribution of Backoff cases
    • annotation of test data with language scores
  • Freely available but, by now, somewhat old (1999)
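
A typical session pipes a corpus through the tools, e.g. (a sketch with illustrative file names):

    text2wfreq < corpus.txt | wfreq2vocab > corpus.vocab
    text2idngram -vocab corpus.vocab < corpus.txt > corpus.idngram
    idngram2lm -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.arpa
    evallm -arpa corpus.arpa    # then, at the evallm prompt: perplexity -text test.txt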

1.4.3. SRILM

  • SRILM, the “SRI Language Modelling Toolkit” is a toolkit for building and applying statistical language models, primarily for use in speech recognition, statistical tagging and segmentation.
  • SRILM consists of the following components:
    • A set of C++ class libraries implementing language models, supporting data structures and miscellaneous utility functions.
    • A set of executable programs built on top of these libraries to perform standard tasks such as training LMs and testing them on data, tagging or segmenting text, etc.
    • A collection of miscellaneous scripts facilitating minor related tasks.
  • SRILM runs on UNIX and Windows platforms and can be downloaded free of charge under an "open source community license".
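
For example, a trigram model can be trained with ngram-count and evaluated with ngram (file names are illustrative):

    ngram-count -order 3 -text train.txt -lm model.lm
    ngram -lm model.lm -ppl test.txt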

1.4.4. NSP

  • Ted Pedersen's “N-gram Statistics Package”
  • NSP allows you to identify word n-grams that appear in large corpora using standard tests of association such as:
    • Fisher's exact test,
    • log likelihood ratio,
    • Pearson's chi-squared test,
    • Dice Coefficient.
  • NSP has been designed to allow a user to add their own tests with minimal effort.
  • NSP is written in Perl, and the source code is distributed under the GNU CopyLeft.
  • The same author also offers SENSEVAL related tools for Word-Sense Disambiguation: SenseTools
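
In outline, n-grams are first counted with count.pl and then ranked with statistic.pl under a chosen measure; the sketch below assumes the log-likelihood measure library ll.pm and illustrative file names:

    count.pl --ngram 2 bigrams.cnt corpus.txt    # count word bigrams in the corpus
    statistic.pl ll.pm bigrams.ll bigrams.cnt    # rank them by log-likelihood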

1.4.5. Bow

  • Andrew McCallum's “Bag Of Words” toolkit
  • Bow is a Toolkit for Statistical Language Modelling, Text Retrieval, Classification and Clustering.
  • Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modelling and information retrieval programs. The current distribution includes the library, as well as front-ends for
    • document classification (rainbow),
    • document retrieval (arrow),
    • document clustering (crossbow).
  • Distributed in source code under the GNU Library General Public License (LGPL).
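
As an illustration, the rainbow front-end first indexes training documents (one directory per class) and can then run held-out classification trials; the directory names below are illustrative:

    rainbow --index spam/ ham/         # build a model, one directory per class
    rainbow --test-set=0.4 --test=3    # 3 trials, holding out 40% of the documents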

1.5. LTG Tools and the GENIA Corpus: a Case Study

1.5.1. LTG XML Tools

  • Edinburgh's Language Technology Group, LTG, has produced an XML toolchest for natural language processing.
  • LT XML tools are modular with stream input/output, and are typically combined together in a pipeline.
  • LT TTT, the “LTG Text Tokenisation Tools”, are separate from the other XML tools, although they use the XML-handling API provided with the LT XML library
  • One of the great strengths of the LTG tools is that they are aware of the XML structure of the file. This means that the tools can be instructed to process only certain elements from the input; the LTG tools - most importantly the tokeniser - also preserve whitespace from the input.

1.5.2. LT TTT

  • The main component of the LT TTT system is a program called fsgmatch. This is a general purpose cascaded transducer which processes an input stream deterministically and rewrites it according to a set of rules provided in a grammar file. It can be used to alter the input in a variety of ways, although the grammars provided with the LT TTT system are all used simply to add mark-up information.
  • With LT TTT come various grammars, most importantly one to tokenise text. The grammars are accompanied by documentation which allows users to alter them to suit their own needs or to develop new rule sets for particular purposes.
  • The LT TTT system also contains components where the rules result from machine learning: the first is a part-of-speech tagger which assigns syntactic category labels to words; the second is a sentence boundary disambiguator which determines whether a full-stop is part of an abbreviation or a marker of a sentence boundary.
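
Since the tools read and write streams, they are typically combined in a pipeline. The sketch below is purely illustrative: the grammar file names and the ltpos program name are assumptions, and the actual command-line syntax is given in the LT TTT documentation.

    # hypothetical pipeline: tokenise, POS-tag, then mark sentence boundaries
    fsgmatch tokenise.gr < paper.xml | ltpos | fsgmatch sentences.gr > annotated.xml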

1.5.3. LTG Tools and the GENIA Corpus: a Case Study

In the rest of this lecture we review the work done on processing a specific corpus with the LTG toolset. This is documented in the report Annotating the GENIA Corpus with LTG Tools.