Next: The Text Encoding Initiative Up: Processing SGML Previous: DTD induction

Processing of SGML corpora

LT NSL, developed by the Language Technology Group at the Human Communication Research Centre, University of Edinburgh, was designed for direct processing of very large SGML-marked text collections, known as corpora. The requirements of this kind of processing exceed those of most publishing and information-management environments in some respects and fall short of them in others. A typical corpus used in linguistic processing can contain millions of words, with a markup density exceeding one pair of tags per word, but there is no requirement for rendering, and little expectation that anyone will actually read what is processed.

LT NSL is an integrated set of SGML querying and manipulation tools, together with a C-language application program interface (API) designed to ease the writing of C programs that manipulate SGML documents. The API is based on the idea of using 'normalised' SGML (i.e. an expanded, easily parsable subset of SGML) as a data format for passing structured textual information between programs. It defines a powerful query language that makes it easy to access, either from the shell or from within a program, those parts of an SGML document that are of interest. Both event-based and (sub-)tree-based views of SGML documents are supported.
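To make the event-based view more concrete, the following is a minimal sketch in plain C of how a program might walk a normalised document as a stream of start-tag, end-tag and text events. It does not use the LT NSL API: the names Event, next_event and count_elements are hypothetical and were chosen for this illustration only. It also assumes a much stricter input than real normalised SGML, namely that every tag is written out in full and that the characters '<' and '>' never occur in text or attribute values.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical event kinds for this sketch; not LT NSL's actual types. */
typedef enum { EV_START, EV_END, EV_TEXT } EventKind;

typedef struct {
    EventKind kind;
    const char *data;  /* points into the input; not NUL-terminated */
    size_t len;        /* length of the element name or of the text run */
} Event;

/* Read the next event at *cursor and advance the cursor past it.
   Returns 1 if an event was read, 0 at end of input or on a tag
   that is never closed.
   Assumption: '<' and '>' occur only as tag delimiters. */
static int next_event(const char **cursor, Event *ev)
{
    const char *p = *cursor;

    if (*p == '\0')
        return 0;

    if (*p == '<') {
        const char *close = strchr(p, '>');
        if (close == NULL)
            return 0;                      /* malformed: stop reading */
        if (p[1] == '/') {
            ev->kind = EV_END;
            ev->data = p + 2;
        } else {
            ev->kind = EV_START;
            ev->data = p + 1;
        }
        /* The element name runs up to the first space or '>'. */
        ev->len = strcspn(ev->data, " >");
        *cursor = close + 1;
    } else {
        ev->kind = EV_TEXT;
        ev->data = p;
        ev->len = strcspn(p, "<");         /* text runs up to the next tag */
        *cursor = p + ev->len;
    }
    return 1;
}

/* A very small "query": count the start tags of elements called name. */
static size_t count_elements(const char *doc, const char *name)
{
    Event ev;
    size_t n = 0;
    size_t name_len = strlen(name);

    while (next_event(&doc, &ev)) {
        if (ev.kind == EV_START && ev.len == name_len &&
            strncmp(ev.data, name, name_len) == 0)
            n++;
    }
    return n;
}
```

For example, calling count_elements on the string <s><w pos="DT">the</w> <w pos="NN">cat</w></s> with the name w visits two matching start tags. A (sub-)tree view could be layered on the same stream by collecting the events between a start tag and its matching end tag into a node, which is one reason an expanded, fully explicit form of the markup is convenient as an interchange format.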


Tomaz Erjavec
1/9/2000