Standards for Language Encoding
Here you can find the materials for the one-week foundational course
given at the 23rd European Summer School in Logic, Language and Information
(which took place on 1-12 August 2011) in Ljubljana.
The course deals with digital encoding of language data, an
increasingly important area due to the growing production and
interchange of annotated language resources. Computational linguistics
has experienced a shift towards machine learning and statistics-based
approaches on the one hand, and towards empirical evaluation of
experimental results on the other; for both, language resources are
needed, and in order to use and produce them, an awareness of
standards for their encoding is necessary.
- Lecture I: Introduction; Unicode
- We start with the concept of standardisation, giving a brief
history, the benefits and problems associated with using standard
solutions, and the main standardisation bodies. The issues of
character encoding are presented next, i.e. what character sets are, a
brief history of their evolution, and a more detailed explanation of
the Unicode standard.
- Lecture II: XML
- We address the basic standard for language encoding, XML, briefly
give its motivation and structure, and then move to related standards
that give XML its power: XML namespaces, schema languages, XPath,
XSLT, XQuery, XForms, etc. We also briefly mention various software
tools that enable direct XML processing.
- Lecture III: TEI
- The Text Encoding Initiative (TEI) Guidelines have a special place
among the many language resource annotation schemes. While used more in
the digital library world, they also offer possibilities for encoding
language resources for computational linguistics. Recently, these
possibilities have been taken up by various EU initiatives, e.g. the
CLARIN initiative. The course will introduce the TEI Guidelines,
explain their structure, and give some examples, esp. as regards the
encoding of corpora and feature structures.
- Lecture IV: ISO
- This lecture introduces ISO, the International Organization for
Standardization. We cover the structure of ISO and how standards are
developed, and then introduce some standards relevant for language
resources, in particular those for the encoding of dates and times and
for the identification of languages. We then take a closer look at the
work of ISO TC 37, which is involved in developing standards for
language resources. The course will give an overview of these
standards, in particular the Linguistic Annotation Framework (LAF), Data
Category Registries (DCR), the Lexical Markup Framework (LMF), the
Morpho-syntactic Annotation Framework (MAF), and the Syntactic
Annotation Framework (SynAF).
- Lecture V: Metadata; Semantic Web
- We discuss the encoding of metadata, esp. Dublin Core, but also the
TEI header and initiatives to share metadata, such as OLAC. Finally,
we also mention the more important Semantic Web related initiatives,
i.e. RDF, RDFS, and OWL.
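As a rough illustration of how the topics of Lectures I to IV meet in practice, the following Python sketch parses a small TEI-style fragment, reads its language code from the standard xml:lang attribute, selects the word elements with a namespace-aware path expression, and shows that a Unicode string can have more UTF-8 bytes than characters. The sample sentence, the function names describe and utf8_length, and the use of a lemma attribute on the word elements are assumptions made for this example; they are not taken from the course materials.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment: a Slovene sentence ("sl" is the
# ISO 639-1 code for Slovene) with one <w> element per word.
SAMPLE = (
    '<text xmlns="http://www.tei-c.org/ns/1.0" xml:lang="sl">'
    '<s><w lemma="čas">Čas</w> <w lemma="bežati">beži</w></s>'
    '</text>'
)

# Prefix-to-URI mapping used in the path expression below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# ElementTree exposes xml:lang under the fixed XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def describe(xml_source):
    """Return the language code and a list of (word form, lemma) pairs."""
    root = ET.fromstring(xml_source)
    words = [
        (w.text, w.get("lemma"))
        for w in root.findall(".//tei:w", TEI_NS)
    ]
    return root.get(XML_LANG), words


def utf8_length(text):
    """Return the number of bytes needed to store text in UTF-8."""
    return len(text.encode("utf-8"))
```

Here describe(SAMPLE) yields the language code "sl" together with the word form and lemma of each word, while utf8_length("čas") is 4 even though the string has only 3 characters, because "č" (U+010D) occupies two bytes in UTF-8.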
Page last updated 2011-08-14.