Presentation at
Tsujii Lab,
University of Tokyo
Tuesday, 22 January 2002
This presentation is available at
http://nl.ijs.si/et/talks/tsujiilab/
Overview:
IJS AI and NLP
My work and history
Education:
- 1984 BSc, Computer Science, University of Ljubljana
- 1990 MSc, Computer Science, University of Ljubljana
- 1992 MSc, Cognitive Science, University of Edinburgh
- 1997 PhD, Computer Science, University of Ljubljana
"Unification, Inheritance and Paradigms in the Morphology of Natural Languages"
Supervisors: Ivan Bratko (Ljubljana), Ewan Klein (Edinburgh)
International projects:
- ILD
UK SALT project (1994-1996)
- The Integrated Language Database
(RA, Centre for Cognitive Science, 1994)
- MULTEXT-EAST
Copernicus Joint Project COP 106 (1996-1997):
- Multilingual Texts and Corpora for Eastern and Central European Languages
- ELAN
MLIS EU Project (1998-1999):
- European Language Activity Network
(local page)
- TELRI
Copernicus Concerted Action (1999-2001, 1995-1997):
- Trans-European Language Resources Infrastructure II
(local page)
- CONCEDE
Copernicus Joint Project (1998-2000):
- Consortium for Central European Dictionary Encoding
(local page)
Slovene projects:
- Ministry of Information Society Project (2001)
- Localisation of Open Source Spell Checkers ispell and aspell
(collaborator)
- MZT
L2-0461-0106 (1998-2001)
- Development of Digital Publishing with Distance Learning Support
(project leader)
- MZT
T2-0409 (1998-2000)
- Speech Corpora and Tools for the Slovenian Language
- FIDA (1996-1999)
- Reference corpus of the Slovene Language
(TEI/SGML consulting)
- GNUsl (1995-)
- A GNU effort for the Slovene Language
(server maintenance, resource contribution)
Summer school teaching:
Functions:
- President of SDJT, the Slovenian Language Technologies Society
- Advisory board member of EACL, the European Chapter of the Association for Computational Linguistics
- Council member of the TEI, the Text Encoding Initiative Consortium
- Editorial board member of IJCL, the International Journal of Corpus Linguistics (Benjamins publishers)
- Commissioning editor for CHum, the journal "Computers and the Humanities" (Kluwer Academic Publishers)
Research interests
Before 1995 (Edinburgh, PhD):
- Morphology, relation to syntax and phonology
- Head-driven Phrase Structure Grammar
- Inheritance hierarchies, typed feature structures, lexical rules
- Finite state automata and transducers for morphology and phonology
Since (EU projects):
- Corpus linguistics and "language technologies" for the Slovene language:
- Collection and annotation of Slovene and multilingual corpora
- Development of other language resources (lexica, tagsets)
- Ensuring Web and "open source" availability of these resources
- Support for localisation of open source software
- Tools for annotation:
- most work on part-of-speech tagging for Slovene
- also application of segmenters, tokenisers, aligners and bilingual lexicon extractors
- Markup languages
- Corpus Encoding Standard use and development
- Design of EAGLES and MULTEXT compliant part-of-speech tagsets
- Use of TEI in a number of projects / resources
- Familiar with XML, XSLT, RDF, XML Schemas... (but still a lot to learn here!)
- Learning Language in Logic:
Work in Tokyo
Work will concentrate on the GENIA corpus and resources.
TEI encoding
Developing a version of GPML and XLiNo that is compatible with TEI guidelines.
A preliminary SGML prolog:
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!-- base tag sets, combined via TEI.general -->
  <!ENTITY % TEI.prose "INCLUDE">
  <!ENTITY % TEI.dictionaries "INCLUDE">
  <!ENTITY % TEI.terminology "INCLUDE">
  <!ENTITY % TEI.general "INCLUDE">
  <!-- additional tag sets -->
  <!ENTITY % TEI.linking "INCLUDE">
  <!ENTITY % TEI.analysis "INCLUDE">
  <!ENTITY % TEI.fs "INCLUDE">
  <!ENTITY % TEI.corpus "INCLUDE">
]>
Multiple Hierarchies
Design of multiple hierarchies for GENIA annotation; this is a hot topic in the XML world, see e.g. the paper "Implementing Concurrent Markup in XML".
One possibility: use of stand-off markup, as advocated in e.g. CES, which can be implemented using XML XLink.
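The stand-off idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element names (`w`, `layer`, `span`), the attributes (`from`, `to`, `cat`) and the sample sentence are hypothetical, and plain id references stand in for real XLink/XPointer links. The base text holds only tokens; each annotation hierarchy lives in its own layer and points back at token ids, so the layers do not have to nest inside each other.

```python
import xml.etree.ElementTree as ET

# Base document: tokens only, each with a unique id (hypothetical sample).
BASE = """<text>
  <w id="w1">IL-2</w>
  <w id="w2">gene</w>
  <w id="w3">expression</w>
  <w id="w4">requires</w>
  <w id="w5">NF-kappa</w>
  <w id="w6">B</w>
</text>"""

# Two independent stand-off layers; each span names its first and last token.
SYNTAX = """<layer type="syntax">
  <span cat="NP" from="w1" to="w3"/>
  <span cat="VP" from="w4" to="w6"/>
</layer>"""

TERMS = """<layer type="term">
  <span cat="DNA" from="w1" to="w2"/>
  <span cat="protein" from="w5" to="w6"/>
</layer>"""


def resolve(base_xml, layer_xml):
    """Return (layer type, category, covered text) for every span in a layer."""
    tokens = ET.fromstring(base_xml).findall("w")
    # Map token id -> position, so id ranges can be turned into slices.
    position = {w.get("id"): i for i, w in enumerate(tokens)}
    layer = ET.fromstring(layer_xml)
    resolved = []
    for span in layer.findall("span"):
        start = position[span.get("from")]
        end = position[span.get("to")]
        words = [w.text for w in tokens[start:end + 1]]
        resolved.append((layer.get("type"), span.get("cat"), " ".join(words)))
    return resolved


if __name__ == "__main__":
    for layer_xml in (SYNTAX, TERMS):
        for row in resolve(BASE, layer_xml):
            print(row)
```

Because the base text is never modified, further layers can be added later without touching the existing ones.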
Transformations
Using XSLT (with XPath and XPointer) to implement various corpus renderings (visualisations).
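As a rough illustration of what such a rendering does, here is a stand-in sketch in Python rather than XSLT, since an XSLT processor is not part of the Python standard library. It uses the limited XPath subset of `xml.etree.ElementTree` to select annotated terms and turns a sentence into a small HTML fragment. The element and attribute names (`s`, `term`, `sem`, `w`) and the sample sentence are assumptions made for the example, not the actual GENIA markup.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical annotated sentence: words, some grouped into typed terms.
SAMPLE = """<s>
  <term sem="protein"><w>NF-kappa</w><w>B</w></term>
  <w>activates</w>
  <term sem="DNA"><w>IL-2</w><w>gene</w></term>
</s>"""


def terms_of_type(sentence_xml, sem):
    """Select the text of all terms with a given semantic class via XPath."""
    sentence = ET.fromstring(sentence_xml)
    found = sentence.findall(".//term[@sem='%s']" % sem)
    return [" ".join(w.text for w in term.findall("w")) for term in found]


def render_html(sentence_xml):
    """Render a sentence as HTML, with terms in bold and their class as a tooltip."""
    sentence = ET.fromstring(sentence_xml)
    parts = []
    for child in sentence:  # direct children are either <w> or <term>
        if child.tag == "term":
            words = " ".join(w.text for w in child.findall("w"))
            title = escape(child.get("sem"), quote=True)
            parts.append('<b title="%s">%s</b>' % (title, escape(words)))
        else:
            parts.append(escape(child.text))
    return "<p>" + " ".join(parts) + "</p>"


if __name__ == "__main__":
    print(terms_of_type(SAMPLE, "protein"))
    print(render_html(SAMPLE))
```

An XSLT stylesheet would express the same mapping declaratively, with one template per element type instead of the `if`/`else` branch.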
Machine learning
Work on ILP learning or literature-based discovery on MEDLINE abstracts, together with Saso Dzeroski.
Also use of other information sources connected to MEDLINE, i.e. MeSH and UMLS.
Tomaž Erjavec,
2002-01-22