Migrating Language Resources from SGML to XML: the Text Encoding Initiative Recommendations

Presentation at LREC'04

Syd Bauman, Alejandro Bia, Lou Burnard, Tomaž Erjavec, Christine Ruotolo, Susan Schreibman

Wednesday 26th May 2004



1. Background

1.1. Text Encoding Initiative

  • 1980s: standardisation of computer encoding of language resources
  • Text Encoding Initiative, established in 1987
  • Relation to LREC: encoding of corpora (TEI & CES)
  • SGML 1986 ... TEI P3 1994 ... 2000 XML!

1.2. Benefits of migration

  • dusting: re-examination of encoding practices, validation
  • scarcity of SGML-aware / abundance of XML-based software and tools
  • XML Namespaces, XPath, XSLT, XML Schemas, XPointer, XLink, XQuery, ...
XML is a subset of SGML - still, conversion might not be trivial!

1.3. Migration and TEI

  • Substantial body of resources already encoded in TEI
  • TEI P3 SGML 1994 ... 2002 TEI P4 XML (+SGML)
  • P5 (2005) no longer backward compatible!
  • 2002: TEI Consortium establishes a Task Force on SGML to XML migration

1.4. Migration TF reports

Now at "Final Draft" status:
  • TEI MI W02: Strategic considerations in migration of TEI documents from SGML to XML
  • TEI MI W03: Practical Guide to migration of TEI documents from SGML to XML
  • TEI MI W04: Technical Checklist for TEI/SGML documents
  • TEI MI W06: Migration case studies for nine projects: British National Corpus, MULTEXT-East Multilingual Corpus, Corpus of Middle English Prose and Verse, Japanese Text Initiative, Women Writers Project, Thomas MacGreevy Archive, Documenting the American South, Victorian Women Writers Project, and the Thesaurus Musicarum Italicarum.

2. Two Reports

2.1. TEI MI W02: Strategic considerations

Intended for administrators and project managers:
  1. Motivation, Opportunities, and Challenges
    reasons for migrating
  2. Areas of Migration
    instances, DTD, catalog, processing environment
  3. General Recommendations
    planning, workflow design: allocating resources, automating conversion, verifying results
  4. Special Considerations in Migration
    degrees of migration complexity: from easy to forward-looking
  5. Appendix: Potential Impact of Future Versions of the Guidelines on Migration Issues
    changes likely to appear in P5

2.2. TEI MI W03: the Practical Guide

Written primarily for the technical staff:
  • solutions to specific conversion problems
  • augmented by Migration Case Studies
  • covers obtaining the XML DTD and modifying the processing environment
  • bulk devoted to instance conversion:
    • recommended workflow
    • conversion tools
    • conversion of SDATA entities

2.3. Migration workflow

Instance:
  1. convert the documents to well-formed XML
  2. case normalisation
  3. validation
DTD (TEI):
  • unextended: switch SYSTEM identifier to P4:
    <!DOCTYPE TEI.2 SYSTEM "http://www.tei-c.org/P4X/DTD/tei2.dtd" [...]>
  • extended: convert extensions only
  • for a single-file DTD, use the Pizza Chef
  • broken TEI: you're on your own...
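For the unextended case, the DOCTYPE switch can be scripted. The following is a minimal sketch, not part of the Task Force reports: the function name `switch_doctype` and the regular expression are illustrative assumptions, and it only handles declarations that use double-quoted identifiers.

```python
import re

# System identifier of the P4 XML DTD, as given on the slide above.
P4_DTD = "http://www.tei-c.org/P4X/DTD/tei2.dtd"

# Matches the start of a TEI.2 DOCTYPE with either a PUBLIC identifier
# (optionally followed by a system identifier) or a SYSTEM identifier.
# Assumption: identifiers are double-quoted.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+TEI\.2\s+'
    r'(?:PUBLIC\s+"[^"]*"(?:\s+"[^"]*")?|SYSTEM\s+"[^"]*")',
    re.IGNORECASE,
)

def switch_doctype(text: str) -> str:
    """Point the first TEI.2 DOCTYPE at the P4 XML DTD.

    Anything after the external identifier (such as an internal
    subset in square brackets) is left untouched.
    """
    replacement = f'<!DOCTYPE TEI.2 SYSTEM "{P4_DTD}"'
    return DOCTYPE_RE.sub(lambda m: replacement, text, count=1)
```

The internal subset is deliberately not rewritten here, since any extension declarations it contains need the separate treatment described for extended DTDs.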

2.4. Instance conversion tools

  • SGML to well-formed XML: (o)sx
    osx -d OUTPUT_DIRECTORY -xno-nl-in-tag -xlower -xno-expand-external -xno-expand-internal INPUT.sgml
  • Prettyprinting: xmllint
  • Case conversion, default attributes: tei2tei.xsl
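When many files have to be converted, the osx call above can be wrapped in a small script. This is a sketch under stated assumptions: the helper names `osx_command` and `convert` are hypothetical, osx is assumed to be installed and on the PATH, and the converted document is assumed to arrive on standard output, so it is captured and saved explicitly.

```python
import subprocess
from pathlib import Path

# Flags copied verbatim from the osx command line shown above.
OSX_FLAGS = [
    "-xno-nl-in-tag",
    "-xlower",
    "-xno-expand-external",
    "-xno-expand-internal",
]

def osx_command(sgml_path, output_dir):
    """Build the osx argument list for one SGML input file."""
    return ["osx", "-d", str(output_dir), *OSX_FLAGS, str(sgml_path)]

def convert(sgml_path, output_dir, xml_path):
    """Run osx on one file and save what it writes to standard output.

    Returns the osx exit status; diagnostics go to stderr as usual.
    Assumption: osx is available on the PATH.
    """
    result = subprocess.run(osx_command(sgml_path, output_dir),
                            stdout=subprocess.PIPE)
    Path(xml_path).write_bytes(result.stdout)
    return result.returncode
```

Pretty-printing with xmllint and case conversion with tei2tei.xsl would follow as separate steps on the saved file.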

2.5. SDATA entities

  • "specific entity references" not available in XML
  • In SGML: <!ENTITY amacron SDATA "[amacron]">
  • In XML: <!ENTITY amacron "ā">
  • Or, simpler, just ā
Three cases:
  • SDATA maps to Unicode char: tables
  • Assigning code points from the private use area of Unicode (PUA)
  • Using TEI markup constructs
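The first of the three cases, where an SDATA entity maps to a Unicode character, is a table lookup. A minimal sketch, assuming a hand-built table: `SDATA_TO_CODEPOINT` and `convert_declaration` are hypothetical names, and the table holds only illustrative entries rather than a full mapping.

```python
import re

# Illustrative lookup table: SDATA replacement text -> Unicode code point.
# A real migration would load a complete table instead.
SDATA_TO_CODEPOINT = {
    "[amacron]": 0x0101,  # LATIN SMALL LETTER A WITH MACRON
    "[emacron]": 0x0113,  # LATIN SMALL LETTER E WITH MACRON
}

# Matches a declaration such as: <!ENTITY amacron SDATA "[amacron]">
SDATA_DECL_RE = re.compile(r'<!ENTITY\s+(\S+)\s+SDATA\s+"([^"]*)"\s*>')

def convert_declaration(decl, table=SDATA_TO_CODEPOINT):
    """Rewrite one SDATA entity declaration as an XML entity declaration.

    Non-SDATA declarations are returned unchanged. Returns None when the
    SDATA string is not in the table, i.e. when one of the other two
    cases (private use area or TEI markup) has to be decided by hand.
    """
    match = SDATA_DECL_RE.fullmatch(decl.strip())
    if match is None:
        return decl
    name, sdata = match.groups()
    codepoint = table.get(sdata)
    if codepoint is None:
        return None
    return f'<!ENTITY {name} "&#x{codepoint:04X};">'
```

Here the replacement text is written as a numeric character reference, which is equivalent to placing the character itself in the declaration and stays readable in ASCII-only tools.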

2.6. Migrating TEI DTD extensions to XML

  • If TEI is not enough, the DTD can be modified in a number of well-defined ways, and the documents are still "TEI conformant"
  • This modification involves creating two extension files with the parameter entity, element, and attribute re-definitions
  • The section describes how to convert them to XML:
    • gives some general remarks
    • describes a sample DTD modification that covers the most important issues
    • outlines a recommended migration procedure and demonstrates the key steps using the example

3. Conclusions

3.1. The corpus sample cases: BNC and MULTEXT-East

  • BNC: large, SGML minimization features, validation
  • MULTEXT-East: varied, conversion from CES (via XCES)
  • Thursday 27th May 2004 18:25-19:45 Sala A
    O33-TW : Morphosyntactic Corpora & Tools
    ...
    MULTEXT-East Version 3 : Multilingual Morphosyntactic Specifications, Lexicons & Corpora

3.2. It's now or never..

The paper has presented the reports of the TEI Task Force on SGML to XML migration, which provide detailed instructions for migrating TEI P3 (SGML) documents and DTDs to TEI P4 (XML). The reports are meant primarily to serve TEI P3 resource holders; however, they are, in the main, relevant for any SGML to XML conversion project.
So, if you have language resources encoded in SGML, go to
http://www.tei-c.org/Activities/MI/