Institut für Informationsverarbeitung
Geisteswissenschaftliche Fakultät
Karl-Franzens-Universität Graz
Academic year 2006/2007

Annotating language data

Tomaž Erjavec

Page http://nl.ijs.si/et/teach/graz06/annotation/ last updated 2006-12-02
Summary: The course discusses linguistic annotation in corpora. Many annotated corpora already exist, as do tools to annotate them for a variety of languages. Defining and producing such annotations is interesting from a linguistic perspective and also enables further research and useful applications, e.g. complex concordance searches, multilingual and term lexicon extraction, or word-sense tagging. The course surveys various levels of annotation, primarily part-of-speech, syntactic and lexical-semantic annotation. It addresses various automatic and semi-automatic methods for corpus annotation, with special attention given to statistical and machine learning methods. The discussion is exemplified by considering existing annotated corpora and tools for producing and using annotations. The course should enable students to understand the theoretical and practical issues involved in the linguistic analysis of corpora, and to use such corpora for research.
Related course: Standards for digital encoding

Timetable "Annotating language data" 2006/2007

Lectures and lab sessions are on Fridays 3pm-5.30pm (3 x 45 minutes + breaks). Consultations are in the breaks between the lectures or by appointment.

Week Date Topics Lecture Lab session Assignment
1 3/11/06 Introduction: computer corpora
2 10/11/06 Basic linguistic annotation: tokenisation and morphosyntax Corpus concordances on the Internet: Assignment 1
3 17/11/06 Syntax: syntactic formalisms, treebanks Information on student projects
4 24/11/06 Lexical semantics: word-senses and word-sense disambiguation, WordNet Assignment 2
5 1/12/06 More annotation, Web as corpus project presentations

Assessment and Due Dates

The course score is computed on the basis of:

Acknowledgements

A big Thank You to Sabine Schulte im Walde and Heike Zinsmeister for allowing me to use their course materials Introduction to Corpus Resources, Annotation and Access, given as a Foundational Course at ESSLLI 2006, the 18th European Summer School in Logic, Language and Information. Thanks also to Manfred Pinkal for allowing the use of the "Dolphin document Wordnet Exercise", from his and A. Koller's "Semantic Theory" 2005 class given at the Computational Linguistics and Phonetics Department at Saarland University.