Summary:
The course discusses linguistic annotation in corpora.
Many annotated corpora already exist, as do tools to annotate them for a variety of languages.
Defining and producing such annotations is interesting from a linguistic perspective,
and it also enables further research and useful applications, e.g. complex concordance searches,
extraction of multilingual and terminological lexicons, or word-sense tagging.
The course surveys various levels of annotation, primarily part-of-speech, syntactic and lexical-semantic annotation.
It addresses various automatic and semi-automatic methods for corpus annotation,
with special attention given to statistical and machine learning methods.
The discussion is illustrated with existing annotated corpora and with tools
for producing and using annotations.
The course should enable students to understand the theoretical and practical issues involved in
linguistic analysis of corpora, and to use such corpora for research.
Lectures and lab sessions are on Fridays 3pm-5.30pm (3 x 45 minutes + breaks).
Consultations are in the breaks between the lectures or by appointment.
| Week | Date | Topics | Lecture | Lab session | Assignment |
|------|------|--------|---------|-------------|------------|
| 1 | 3/11/06 | Introduction: computer corpora | | | |
| 2 | 10/11/06 | Basic linguistic annotation: tokenisation and morphosyntax | | Corpus concordances on the Internet | Assignment 1 |
| 3 | 17/11/06 | Syntax: syntactic formalisms, treebanks | | | Information on student projects |
| 4 | 24/11/06 | Lexical semantics: word-senses and word-sense disambiguation, WordNet | | | Assignment 2 |
| 5 | 1/12/06 | More annotation, Web as corpus | | | Project presentations |
Assessment and Due Dates
The course score is computed on the basis of:
- Assignments (30%): two assignments, each to be handed in one week, at most two weeks, after it is received.
- Project (70%): consists of the practical work and a written report, formatted as a standard conference paper. The project work is to be presented at the last lecture (1.12.2006), and the report is to be handed in by the end of the term (1.2.2007) at the latest.
Acknowledgements
A big Thank You to Sabine Schulte im Walde and Heike Zinsmeister for allowing me to use their course materials, "Introduction to Corpus Resources, Annotation and Access", given as a Foundational Course at ESSLLI 2006, the 18th European Summer School in Logic, Language and Information.
Thanks also to Manfred Pinkal for allowing the use of the "Dolphin document Wordnet Exercise", from his and A. Koller's "Semantic Theory" 2005 class given at the Computational Linguistics and Phonetics Department at Saarland University.