Course
Advanced Language Technologies
Programme "Information and Communication Technologies"
Research Area "Knowledge Technologies"
Jožef Stefan International Postgraduate School
Winter 2009 / Spring 2010
Lecturer
Course timetable and materials
October 21st, 2009, 15:15 - 18:00, MPŠ
- Introduction to LT
[PPT]
[PDF]
March 31st, 2010, 15:15 - 18:00, MPŠ
- Computer corpora and morphosyntactic tagging
[PPT]
[PDF]
Assessment
Seminar work consisting of an experiment (to be determined in
consultation with the lecturer) and an accompanying report (3,000 words)
describing the problem, the approach taken to solve it,
related work, and the evaluation of the results.
Suggestions for seminar topics
- Train and test the Brill tagger on the JOS corpus.
- Make an analysis of the JOS treebank (an example is
here) and try to train and test the MALT parser on it.
- Use the Slovene WordNet for various tasks.
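As a rough illustration of what "train and test" means for a tagging experiment, the sketch below trains a most-frequent-tag baseline on one set of tagged sentences and measures per-token accuracy on a held-out set. It is not the Brill tagger and does not read the JOS corpus: the function names, the tag labels, and the toy sentences are all hypothetical and stand in for real corpus data. A baseline of this kind is commonly reported next to the tagger being evaluated.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag and a global default tag."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]
    return lexicon, default_tag


def tag_words(words, lexicon, default_tag):
    """Tag known words from the lexicon; unknown words get the default tag."""
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]


def accuracy(gold_sentences, lexicon, default_tag):
    """Per-token accuracy of the baseline against gold-standard tags."""
    correct = total = 0
    for sentence in gold_sentences:
        words = [w for w, _ in sentence]
        predicted = tag_words(words, lexicon, default_tag)
        for (_, gold), (_, pred) in zip(sentence, predicted):
            correct += gold == pred
            total += 1
    return correct / total if total else 0.0


# Hypothetical toy data; a real experiment would load corpus files instead.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dog", "NOUN"), ("chases", "VERB"), ("cat", "NOUN")],
]
TEST = [
    [("the", "DET"), ("cat", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("bird", "NOUN"), ("sings", "VERB")],
]

lexicon, default_tag = train_baseline(TRAIN)
score = accuracy(TEST, lexicon, default_tag)
print(f"Baseline accuracy on held-out data: {score:.3f}")
```

The essential point is that the test sentences are never seen during training, so unknown words (here "bird" and "sings") show how the tagger copes with out-of-vocabulary tokens.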
Literature list
- The main textbook for the field is:
Daniel Jurafsky, James H. Martin.
Speech and Language Processing: An Introduction to Natural
Language Processing, Computational Linguistics and Speech
Recognition. Prentice-Hall, 2000.
Contents:
I. Words,
II. Syntax,
III. Semantics,
IV. Pragmatics,
V. Multilingual Processing.
- All slides accompanying the lectures are available on the Web (links
next to the lectures above)
- Supplementary reading for the course topics consists of the following papers:
- Machine Learning of Morphosyntactic Structure:
Lemmatising Unknown Slovene Words.
Tomaž Erjavec and Sašo Džeroski.
Applied Artificial Intelligence, 18(1), pp. 17-40, 2004.
- A Machine Learning Approach to Automatic
Functor Assignment in the Prague Dependency Treebank.
Zdeněk Žabokrtský, Petr Sgall, Sašo Džeroski.
In Proceedings of the Third International Conference on Language Resources
and Evaluation, LREC'02.
- MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications,
Lexicons and Corpora.
Tomaž Erjavec.
In Proceedings of the Fourth International Conference on Language Resources
and Evaluation, LREC'04.
- Slovenian Text-to-Speech Synthesis for Speech
User Interfaces.
Jerneja Žganec Gros, Aleš Mihelič, Nikola Pavešić, Mario Žganec, Stanislav Gruden.
Proceedings of the Third World Enformatika Conference, WEC 2005.
- The VoiceTRAN Speech-to-Speech
Communicator. Jerneja Žganec Gros, France Mihelič, Tomaž
Erjavec, and Špela Vintar. Proceedings of the 8th International
Conference on Text, Speech and Dialogue, TSD 2005. (Lecture notes in
computer science, Lecture notes in artificial intelligence,
3658. Berlin: Springer)
- Digitisation of Literary
Heritage Using Open Standards.
Tomaž Erjavec, Matija Ogrin. In Proceedings of eChallenges 2005,
19 - 21 October 2005, Ljubljana.
- The following books are also available:
Available datasets:
Last updated 2010-03-31,
et