Next: What is SGML? Up: Standards for Language Encoding Previous: Contents

Course Description

Recent years have seen a rapid growth of interest in the acquisition, distribution and utilisation of large-scale natural language resources. For text-based resources, such as text corpora and lexica, the de facto encoding standard has become the Text Encoding Initiative (TEI) Guidelines, which are based on the ISO Standard Generalized Markup Language (SGML). The course introduces SGML and TEI and gives examples of their use in the encoding of natural language resources.

The course starts with an introduction to the SGML standard, covering the motivation and principles behind SGML and the structure of SGML document type definitions, entities and documents. Some of the better-known SGML document types and related standards are presented, as well as the expected future development of the standard (XML). Conversion to and from SGML documents is discussed, and some freely available tools for processing SGML documents are presented.

The TEI Guidelines are described next. We give the structure of conformant TEI documents, the document types covered by TEI, and the method of applying and parameterizing TEI for various text types and specific projects. The markup of text corpora is covered in more depth, especially for the purposes of language engineering. We discuss the Corpus Encoding Standard, an application of TEI, and give an overview of TEI corpus and text headers and of structural and linguistic markup. Case studies of existing corpora are given, concentrating on multilingual and parallel corpora. Finally, we introduce the encoding of machine-readable dictionaries and mention the issues in moving from the so-called editorial view of a dictionary to its lexical view.

The course should give a good grounding in TEI and enable students to construct or exploit TEI-annotated language resources.

This component of the ESSLLI'99 programme is sponsored by the European Chapter of the Association for Computational Linguistics (EACL).

Tomaz Erjavec
1/9/2000