Graz Uni 2006/2007: T. Erjavec: Annotating language data

Summary: The course discusses linguistic annotation in corpora. Many annotated corpora already exist, as do tools for annotating them in a variety of languages. Defining and producing such annotations is interesting from a linguistic perspective, and it also enables further research and useful applications, e.g. complex concordance searches, multilingual and term lexicon extraction, or word-sense tagging. The course surveys various levels of annotation, primarily part-of-speech, syntactic and lexical-semantic annotation. It addresses automatic and semi-automatic methods for corpus annotation, with special attention given to statistical and machine learning methods. The discussion is illustrated with existing annotated corpora and with tools for producing and using annotations. The course should enable students to understand the theoretical and practical issues involved in the linguistic analysis of corpora, and to use such corpora for research.

Related course: Standards for digital encoding

Timetable "Annotating language data" 2006/2007

Lectures and lab sessions are on Fridays 3pm-5.30pm (3 x 45 minutes + breaks). Consultations are in the breaks between the lectures or by appointment.

Week	Date	Topics	Lecture	Lab session	Assignment
1	3/11/06	Introduction: computer corpora	Lecture: Slides .ppt, handout .pdf Slides based on: S. Schulte im Walde, H. Zinsmeister: ESSLLI 2006 course, part 1: Introduction
2	10/11/06	Basic linguistic annotation: tokenisation and morphosyntax	S. Schulte im Walde, H. Zinsmeister: ESSLLI 2006 course, part 2: Tokenisation and Morpho-Syntactic Annotation	Corpus concordances on the Internet: Leeds Internet Corpora; Searching the BNC with VIEW	Assignment 1
3	17/11/06	Syntax: syntactic formalisms, treebanks	Slides .ppt, handout .pdf Slides based on: S. Schulte im Walde, H. Zinsmeister: ESSLLI 2006 course, part 4: Syntactic Annotation	Exploring TreeBanks: Using TIGERsearch	Information on student projects
4	24/11/06	Lexical semantics: word senses and word-sense disambiguation, WordNet	Slides .ppt, handout .pdf Slides based on: S. Schulte im Walde, H. Zinsmeister: ESSLLI 2006 course, part 5: Semantic Annotation	Using TIGERsearch, cont. Exploring WordNet: SIMS 202 Assignment 4, WordNet Exercise (1)	Assignment 2
5	1/12/06	More annotation, Web as corpus	S. Schulte im Walde, H. Zinsmeister: ESSLLI 2006 course: More Levels of Corpus Annotation; Web as corpus	Making a Web corpus: BootCat; cf. also BootCat exercises	Project presentations

Assessment and Due Dates

The course score is computed on the basis of:

Assignments (30%): two assignments, each to be handed in one week, or at most two weeks, after it is received.
Project (70%): practical work plus a written report, formatted as a standard conference paper. The project work is to be presented at the last lecture (1.12.2006), and the report is to be handed in by the end of the term (1.2.2007) at the latest.
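
As a worked illustration of the weighting above, the following minimal Python sketch combines the two components into a course score. Only the 30% and 70% weights come from this page. The 0-100 scale, the equal weighting of the two assignments, and the name course_score are assumptions made for the example.

```python
# Weights taken from the assessment scheme above.
ASSIGNMENT_WEIGHT = 0.30
PROJECT_WEIGHT = 0.70


def course_score(assignment_scores, project_score):
    """Return the weighted course score.

    Assumptions (not stated on the course page): every score is on a
    0-100 scale and the assignments count equally among themselves.
    """
    if not assignment_scores:
        raise ValueError("at least one assignment score is required")
    assignment_mean = sum(assignment_scores) / len(assignment_scores)
    return ASSIGNMENT_WEIGHT * assignment_mean + PROJECT_WEIGHT * project_score


if __name__ == "__main__":
    # Hypothetical scores: assignments 80 and 90, project 70.
    print(course_score([80, 90], 70))
```

With these hypothetical scores the assignments contribute 0.30 x 85 = 25.5 and the project contributes 0.70 x 70 = 49, so the project dominates the result.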

Acknowledgements

A big thank you to Sabine Schulte im Walde and Heike Zinsmeister for allowing me to use their course materials, Introduction to Corpus Resources, Annotation and Access, given as a Foundational Course at ESSLLI 2006, the 18th European Summer School in Logic, Language and Information. Thanks also to Manfred Pinkal for allowing the use of the "Dolphin document Wordnet Exercise" from his and A. Koller's "Semantic Theory" 2005 class, given at the Computational Linguistics and Phonetics Department at Saarland University.

Institut für Informationsverarbeitung
Geisteswissenschaftliche Fakultät
Karl-Franzens-Universität Graz

Academic year 2006/2007

Annotating language data

Tomaž Erjavec
