Course Materials

Standards for Language Encoding

Tomaž Erjavec

ESSLLI 2011

Here you can find the materials for the one-week foundational course given at the 23rd European Summer School in Logic, Language and Information, which took place in Ljubljana on 1-12 August 2011.

The course deals with digital encoding of language data, an increasingly important area due to the growing production and interchange of annotated language resources. Computational linguistics has experienced a shift towards machine learning and statistics-based approaches on the one hand, and towards empirical evaluation of experimental results on the other; for both, language resources are needed, and in order to use and produce them, an awareness of standards for their encoding is necessary.

Lecture I: Introduction; Unicode
We start with the concept of standardisation, giving a brief history, the benefits and problems associated with using standard solutions, and the main standardisation bodies. The issues of character encoding are presented next, i.e. what character sets are, a brief history of their evolution, and a more detailed explanation of the Unicode standard.
Lecture II: XML
We address the basic standard for language encoding, XML, briefly give its motivation and structure, and then move to related standards that give XML its power: XML namespaces, schema languages, XPath, XSLT, XQuery, XForms, etc. We also briefly mention various software tools that enable direct XML processing.
Lecture III: TEI
The Text Encoding Initiative Guidelines, TEI, have a special place among the plethora of approaches to language resource annotation; while used more in the digital library world, they also offer possibilities for encoding language resources for computational linguistics; recently, these possibilities have been taken up by various EU initiatives, e.g. the CLARIN initiative. The course will introduce the TEI Guidelines, explain their structure, and give some examples, esp. as regards encoding of corpora and feature-structures.
Lecture IV: ISO
This lecture introduces ISO, the International Organization for Standardization. We cover the structure of ISO and how standards are developed, and then introduce some standards relevant for language resources, in particular those for the encoding of dates and times and for the identification of languages. We then take a closer look at the work of ISO TC 37, which is involved in developing standards for language resources. The course will give an overview of these standards, in particular the Linguistic Annotation Framework (LAF), Data Category Registries (DCR), the Lexical Markup Framework (LMF), the Morpho-syntactic Annotation Framework (MAF), and the Syntactic Annotation Framework (SynAF).
Lecture V: Metadata; Semantic Web
We discuss encoding of meta-data, esp. Dublin Core, but also the TEI header and initiatives to share meta-data, such as OLAC. Finally, we also mention the more important Semantic Web related initiatives, i.e. RDF, RDFS, and OWL.
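
As a small illustration of the topics of Lectures I and II, the following Python sketch inspects the Unicode properties of a character and extracts attribute values from an XML fragment using only the standard library. It is not taken from the course materials: the function names `describe` and `lemmas`, the sample sentence, and the element and attribute names (`s`, `w`, `c`, `lemma`, `msd`) are assumptions made for this example, loosely modelled on TEI-style corpus encoding.

```python
import unicodedata
import xml.etree.ElementTree as ET


def describe(text):
    """Return (character, code point, Unicode name, UTF-8 byte length) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), len(ch.encode("utf-8")))
        for ch in text
    ]


# Hypothetical annotated sentence; the markup is illustrative, not a validated TEI document.
SAMPLE = """<s xml:lang="sl">
  <w lemma="Tomaž" msd="Npmsn">Tomaž</w>
  <w lemma="pisati" msd="Vmpr3s">piše</w>
  <c>.</c>
</s>"""


def lemmas(xml_text):
    """Collect the lemma attribute of every <w> child of the root element."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a limited subset of XPath; "./w" selects direct <w> children.
    return [w.get("lemma") for w in root.findall("./w")]


if __name__ == "__main__":
    for row in describe("ž"):
        print(row)
    print(lemmas(SAMPLE))
```

Here the letter ž is reported as code point U+017E, which occupies two bytes in UTF-8, and `lemmas(SAMPLE)` returns the two lemma values in document order. A full XPath, XSLT or XQuery processor would be needed for anything beyond such simple path expressions.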

Page last updated 2011-08-14, et