Talk given at
Tsujii Lab,
University of Tokyo
Tuesday, 29 January 2002
This presentation is available at
http://nl.ijs.si/et/talks/tei-genia/
Overview:
- TEI: history and organisation
- TEI in Asia
- The architecture of the TEI Guidelines
- The (dis)advantages of using TEI
- TEI core tagset: text and header
- TEI analysis: words, segments and clauses
- GENIA lexicon: TEI dictionaries or terminological databases?
- TEI feature structures
- Multiple hierarchies
TEI: history and organisation
- The Text Encoding Initiative was established in 1987 under
the joint sponsorship of the Association for Computers and the
Humanities, the Association for Computational Linguistics, and the
Association for Literary and Linguistic Computing.
- TEI became the only systematised attempt to develop a fully
general text encoding model and set of encoding conventions based upon
it, suitable for processing and analysis of any type of text, in any
language, and intended to serve the increasing range of existing (and
potential) applications and uses.
- TEI Guidelines for Electronic Text Encoding and Interchange (TEI
P3) were first published in April 1994 in two substantial green
volumes.
- In May 1999, a revised edition of TEI P3 was produced,
correcting several typographic and other errors. This Revised reprint
is, at the time of writing, the official edition of TEI P3 (it is
also called the `P4beta') and is available online in various formats.
- In December 2000 the TEI Consortium was set up to maintain
and develop the TEI standard. The Consortium has executive offices in
Bergen, Norway, and hosts at four universities. The Consortium is
managed by a Board of Directors, and its technical work is overseen by
an elected Council. Lou Burnard is the European Editor and Syd Bauman is
the North American Editor. Institutions and individuals can become
Consortium members or subscribers, which gives them certain
benefits within the Consortium.
- In June 2001, the TEI Consortium announced the availability of a
preliminary version of a major revision of TEI P3, the TEI P4,
the object of which would be to provide equal support for XML and SGML
applications using the TEI scheme. The revisions needed to make TEI
P4 have been deliberately restricted to error correction only, with a
view to ensuring that documents conforming to TEI P3 will not become
illegal when processed with TEI P4.
- Many possibilities for other, more fundamental, changes have been
identified. With the establishment of the new TEI Council, it becomes
possible to agree on a programme of work to enhance and modify the
Guidelines more fundamentally over the coming years. TEI P5
will be the next full revision of the Guidelines. No date has yet been
fixed for its appearance, but work on it will commence early in 2002.
TEI and Asia/Japan
While the TEI is an international effort, the nations involved are
almost exclusively in North America and Europe. All 50 members of the
Consortium are located there, as are all the members of the Board of
Directors and the Council, with the exception of Christian Wittern, of
Kyoto University.
Of the 88 projects listed on the TEI pages as using the Guidelines,
six are centred on Asian languages:
- Chinese Buddhist Electronic Text Association
Contact: Christian Wittern, Kyoto University
- Chinese/Japanese/Korean-English Dictionary
Contact: Charles Muller, Toyo Gakuen University, Chiba, Japan
- The Digital Dictionary of Buddhism
Contact: Charles Muller, Toyo Gakuen University, Chiba, Japan
- Japanese/English Bilingual Corpus
Contact: Francis Bond, NTT Communication Science Laboratories, Japan
- Japanese Map Task Dialogue Corpus
Contact: Syun Tutiya, Chiba University, Japan
- Japanese Text Initiative
Contact: Sachiko Iwabuchi, University of Virginia
Why isn't TEI used more in Asia? I see three possibilities:
- There is, in general, less work of the kind where TEI would be
helpful, i.e. fewer corpora are produced, and fewer (historical, classical)
texts are being made available electronically. This could be due
to differing funding priorities: in Europe there have been a number of
EU project calls focused on language resources, while the US seems to
fund many projects on digitisation of literary materials. However,
at least some Asian countries seem not to lag in this respect. Korea,
for example, has a number of projects aimed at building electronic
text databases (Jin-Dong Kim, personal communication).
- TEI is not known in the region. In the USA and Europe TEI has had
significant exposure, which is rather natural as this is where it was
conceived and where the work on it took place. TEI has also been
featured at various (mostly corpus-oriented) events.
- Different character sets used in Asian languages and/or English
documentation and tag names make TEI unappealing in contrast to
locally produced encodings. And even though TEI/SGML/XML are
able to handle arbitrary character sets, significant expertise is
needed to implement their input and display, while local software
offers ready-made solutions.
The architecture of TEI Guidelines
The Guidelines consist of:
- The 'syntax': the SGML/XML DTD set (download or bake with the TEI Pizza Chef)
- The 'semantics': the documentation (in TEI and derived formats)
The TEI encoding scheme consists of a number of modules ('tagsets') or
DTD fragments.
The DTD fragments from which a specific TEI DTD is constructed are:
- core DTD fragments:
  standard components of the TEI main DTD in all its forms; these are
  always included without any special action by the encoder.
  The core consists of:
  - core text tags:
    these include paragraph, <p>, highlighting (<emph>, and the rend
    attribute), quotation, <q>, names, numbers, dates, abbreviations,
    <term>, etc.
  - the TEI header:
    describes an encoded work so that the text itself, its source, its
    encoding, and its revisions are all thoroughly documented
- base DTD fragments:
  basic building blocks for specific text types; exactly one base must
  be selected by the encoder (unless one of the combined bases is
  used).
  Those interesting for GENIA are:
  - TEI.prose: the base tag set for prose
  - ?TEI.dictionaries: the base tag set for print dictionaries
  - ?TEI.terminology: the base tag set for terminological data files
  - ?TEI.general: the generic mixed-mode base tag set
- additional DTD fragments:
  extra tags useful for particular purposes. All additional tag sets
  are compatible with all bases and with each other; an encoder may
  therefore add them to the selected base in any combination desired.
  Those interesting for GENIA are:
  - TEI.linking: tags for linking, segmentation, and alignment
  - TEI.analysis: tags for simple analytic mechanisms
  - ?TEI.fs: tags for feature structure analysis
  - ?TEI.nets: tags for graphs, digraphs, trees, and other networks
  - TEI.corpus: additional tags for language corpora
- user defined DTD fragments:
  give the possibility of extending, modifying or localising
  the Guidelines
TEI Lite
A particular parametrisation of TEI (a DTD), which implements a useful
`starter set', comprising the elements which almost every user should
know about.
Some characteristics of TEI Lite:
- includes most of the TEI `core' tag set;
- handles a reasonably wide variety of texts;
- is useful for the production of new documents as well as
encoding of existing ones;
- is usable with a wide range of existing SGML software;
- is derivable from the full TEI DTD using the extension mechanisms;
- is as small and simple as is consistent with the other goals.
Parametrisation
Developing a version of GPML and XLiNo that is compatible with the
TEI Guidelines?
A preliminary SGML prolog:
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
<!ENTITY % TEI.general "INCLUDE">
<!ENTITY % TEI.prose "INCLUDE">
<!ENTITY % TEI.dictionaries "INCLUDE">
<!ENTITY % TEI.terminology "INCLUDE">
<!ENTITY % TEI.linking "INCLUDE">
<!ENTITY % TEI.analysis "INCLUDE">
<!ENTITY % TEI.fs "INCLUDE">
<!ENTITY % TEI.corpus "INCLUDE">
]>
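A prolog of this shape can also be generated mechanically, which may be convenient while the choice of tagsets is still changing. The following is only a sketch: the function name `tei_prolog` and the list `GENIA_TAGSETS` are hypothetical, and the DTD file name is simply taken from the prolog above.

```python
def tei_prolog(tagsets, dtd="tei2.dtd", root="TEI.2"):
    """Return an SGML DOCTYPE declaration that switches on the given TEI tagsets."""
    lines = [f'<!DOCTYPE {root} SYSTEM "{dtd}" [']
    for name in tagsets:
        # A tagset is selected by setting its parameter entity to INCLUDE.
        lines.append(f'<!ENTITY % {name} "INCLUDE">')
    lines.append("]>")
    return "\n".join(lines)


# Hypothetical selection, mirroring the preliminary prolog in the slides.
GENIA_TAGSETS = [
    "TEI.general", "TEI.prose", "TEI.dictionaries", "TEI.terminology",
    "TEI.linking", "TEI.analysis", "TEI.fs", "TEI.corpus",
]

prolog = tei_prolog(GENIA_TAGSETS)
print(prolog)
```

Dropping a name from the list removes the corresponding tagset, which is one way to tighten the DTD as the project proceeds.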
Encoding Corpora
Given the multiplicity of annotations, not a simple task.
An overview is given in the presentation by L. Burnard, available at
http://users.ox.ac.uk/~tei/Presentations/TEIcorpus/.
To address this issue, especially for corpora for language engineering,
the Corpus Encoding Specification (CES) was developed. It is much more
focused than TEI, but it is not extensible or parametrisable.
CES has become a part of EAGLES, the
"Expert Advisory Group on Language Engineering Standards", as the
"Recommendations on corpus encoding".
It should be noted that the successor of EAGLES is the ISLE project,
International Standards for Language Engineering, which also contains
a working group on Computational Lexicons. However, the work of the
group seems to be focused on multilingual lexica, and, at the time of
this writing, they have not yet produced any publicly available
documents.
The (dis)advantages of using TEI
Advantages:
- Using a wide-coverage, well-designed (modular and extensible),
widely accepted and maintained architecture
- TEI has extensive documentation: Guidelines as well as papers and
documentation of various projects
- Support: specific problems might have been encountered
before (the tei-l public discussion list)
- Various software already exists, and more is likely to become available
- Contributing to open standards and recommendations
Disadvantages:
- "Tag abuse":
TEI might not have elements / attributes with the exact meaning we
require; results in a tendency to misuse tags for purposes they were
not meant for
- "Tag bloat":
being a general purpose recommendation, it can never be optimal for a
specific application; a custom-developed DTD will be leaner and have
fewer (redundant) tags
- "TEI for humanities":
TEI is perhaps least developed for "high level" NLP applications; it
is problematic for encoding ontologies, lexical databases and
feature structures
Possible solutions:
- Ignore TEI completely
fast development of own DTD, but "reinventing the wheel"; also, low
interchange value.
- Manage with "out of the box" TEI-Lite encoding
Used to be done a lot, but now parametrisation is much simpler, so not
much sense in shoehorning project specifics into this particular
subset of TEI.
- Parametrise TEI, and tighten up DTD as project proceeds
The "proper" way to apply TEI; provides a good development
environment. If necessary, when encoding is fixed, can develop
strict DTD which validates our documents.
- Start with own DTD, develop mapping to TEI for better
interchange
The case in GENIA
For an "all bases covered" approach, cf. the paper by Gary F. Simons,
"Using architectural processing to derive small, problem-specific
XML applications from large widely used SGML applications".
It offers a path to derive a small, focused, project-specific
TEI-compatible DTD; however, the technology it proposes is no longer
used very much, although the same effect could be achieved with XSLT.
TEI core and prose tagsets
The header consists of four parts:
- a file description, <fileDesc>, containing a full
bibliographical description of the computer file itself, from which
proper bibliographic citation can be derived. The file description
includes information about the source or sources from which the
electronic text was derived.
- an encoding description, <encodingDesc>, which describes the
relationship between an electronic text and its source or sources.
It allows for detailed description of whether (or how) the text was
normalised during transcription, how the encoder resolved
ambiguities in the source, what levels of encoding or analysis were
applied, and similar matters.
- a text profile, <profileDesc>, containing classificatory
and contextual information about the text, such as its subject
matter, the situation in which it was produced, the individuals
described by or participating in producing it, and so forth. Such a
text profile is of particular use in highly structured composite
texts such as corpora or language collections, where it is often
highly desirable to enforce a controlled descriptive vocabulary or
to perform retrievals from a body of text in terms of text type or
origin.
- a revision history, <revisionDesc>, which allows the encoder
to provide a history of changes made during the development of the
electronic text. The revision history is important for version
control and for resolving questions about the history of a file.
An example of a header.
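As a rough sketch of the four-part structure described above (this is not the header example from the talk), the skeleton can be built with Python's standard `xml.etree.ElementTree`. The function name `minimal_header` and all text values are hypothetical; the three parts after `<fileDesc>` are left empty as placeholders and would need real content before the header could validate against a TEI DTD.

```python
import xml.etree.ElementTree as ET


def minimal_header(title, publication_note, source_note):
    """Build a skeletal <teiHeader> with its four parts in order."""
    header = ET.Element("teiHeader")

    # File description: title, publication and source statements.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    publication = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication, "p").text = publication_note
    source = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source, "p").text = source_note

    # Placeholders for the encoding description, text profile
    # and revision history (empty here, to be filled in per project).
    for name in ("encodingDesc", "profileDesc", "revisionDesc"):
        ET.SubElement(header, name)
    return header


# Hypothetical values, for illustration only.
header = minimal_header(
    "Sample annotated abstract",
    "Unpublished working file.",
    "Derived from a bibliographic database record.",
)
print(ET.tostring(header, encoding="unicode"))
```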
Text example
<body>
<div type="article">
<head>Retinoic acid downmodulates erythroid differentiation and
<term ana="sem-000">GATA1 expression</term> in <term ana="sem-001">purified
adult-progenitor culture</term>.
</head>
<bibl><xref>MEDLINE:94129004</xref></bibl>
<div type="abstract">
<p>
<s><term ana="sem-002">All-trans retinoic acid</term> (<term
ana="sem-003">RA</term>) is an important <term
ana="sem-004">morphogen</term> in vertebrate development, a
normal constituent in <term ana="sem-005">human adult
blood</term> and is also involved in the control of cell growth
and differentiation in <term ana="sem-006">acute promyelocytic
leukemia</term>.</s>
...
TEI.analysis: words, segments and clauses
The TEI module for simple linguistic
analysis contains elements for arbitrary (possibly nested)
segments of text, <seg>, as well as elements for words and clauses:
<div type="text">
<p>
<s>
<cl ana="lex201 lex202" function="(AND lex201 lex202)">
<term ana="lex203"><w ana="FW">hypo-</w></term>
<w ana="CC">and</w>
<term ana="lex204"><w ana="JJ">hyper</w></term>
<term ana="lex205"><w ana="NN">cortisolism</w></term>
<c>.</c>
</cl>
</s>
</p>
</div>
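To give an idea of how such an encoding can be consumed, the sketch below reads the clause from the example with Python's standard `xml.etree.ElementTree` and lists the tagged words and the terms. The function names `tagged_words` and `term_map` are hypothetical, and the sample is reduced to the `<s>` element.

```python
import xml.etree.ElementTree as ET

SAMPLE = """\
<s>
  <cl ana="lex201 lex202" function="(AND lex201 lex202)">
    <term ana="lex203"><w ana="FW">hypo-</w></term>
    <w ana="CC">and</w>
    <term ana="lex204"><w ana="JJ">hyper</w></term>
    <term ana="lex205"><w ana="NN">cortisolism</w></term>
    <c>.</c>
  </cl>
</s>
"""


def tagged_words(sentence):
    """Return (word form, tag) pairs for every <w>, in document order."""
    return [(w.text, w.get("ana")) for w in sentence.iter("w")]


def term_map(sentence):
    """Map the ana pointer of each <term> to the words it contains."""
    return {
        term.get("ana"): " ".join(w.text for w in term.iter("w"))
        for term in sentence.iter("term")
    }


sentence = ET.fromstring(SAMPLE)
words = tagged_words(sentence)
terms = term_map(sentence)
print(words)
print(terms)
```

Because the words stay ordinary `<w>` elements whether or not they sit inside a `<term>`, the part-of-speech layer can be read without knowing anything about the term layer.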
Note: if the current GPML model is used, where text and lexica
co-exist in a document, then TEI.general must be used, which enables
the combination of several base tagsets.
Either base tagset, TEI.dictionaries or TEI.terminology, is problematic
for encoding a lexical database or ontology:
- TEI.dictionaries is oriented towards printed dictionaries
- TEI.terminology is closer to what is needed for GENIA, but the
Guidelines themselves caution:
Since its first publication, this chapter has been rendered obsolete in several respects,
chiefly as a result of the publication of ISO 12200, and a variant of it (TBX) which has been
recently adopted by LISA, the Localisation Industry Standard Association. Work is
currently ongoing in the ISO community to define a generic platform for terminological
markup (ISO CD 16642, TMF : Terminological Markup Framework), in the light of which
it is anticipated that the recommendations of the present chapter will be substantially
revised. Readers are cautioned in particular that the discussion below of `nested' and `flat'
structures is now far removed from current practices in the terminological field. A major
revision of this chapter is planned for the next edition of these Guidelines.
TEI feature structures
Background:
"A rationale for the TEI recommendations for feature-structure markup",
by D. Terence Langendoen and Gary F. Simons, Computers and the Humanities, 29 (1995).
The proposal is rather complicated and is composed of two parts:
- TEI Feature Structures (TEI.fs), an additional tagset for marking up
the text with feature structures, and
- the TEI Feature System Declaration (FSD), with a special DTD, for
defining feature values and names, their descriptions and constraints
on valid feature structures.
Disadvantages:
- Not many applications of these guidelines (any?)
- Tailored towards GPSG rather than HPSG: limited support for a type
hierarchy, no marking for reentrancy
- SGML version uses (unsupported) SUB-DOC for linking the FSD
with the annotated document
- No mechanisms for checking validity of specified feature structures
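Below is an example of morphosyntactic tagging, from the ELAN corpus:
the annotated text, followed by excerpts from the feature structure
and feature libraries it points to: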
<seg id="ecmr.en.3663" corresp="ecmr.sl.3663">
<w ana="Dd" ctag="DT DT" lemma="the">The</w>
<w ana="Afs" ctag="JJS JJS" lemma="high">highest</w>
<w ana="Ncns" ctag="NN VB" lemma="pay">pay</w>
<w ana="Ncns" ctag="NN NN" lemma="increase">increase</w>
<w ana="Vais3s" ctag="VBD BEDZ" lemma="be">was</w>
<w ana="Vmps" ctag="VBN VBN" lemma="record">recorded</w>
<w ana="Sp" ctag="IN IN" lemma="in">in</w>
<w ana="Ncns" ctag="NN NN" lemma="manufacture">manufacturing</w>
<c ctag=".">.</c>
</seg>
<fs type="Verb" id="Vmn" select="en sl" feats="V1.m V2.n"></fs>
<fs type="Verb" id="Vmnp" select="en" feats="V1.m V2.n V3.p"></fs>
<fs type="Verb" id="Vmp--dfp" select="sl" feats="V1.m V2.p V5.d V6.f V7.p"></fs>
<fs type="Verb" id="Vmp--dmp" select="sl" feats="V1.m V2.p V5.d V6.m V7.p"></fs>
<fs type="Verb" id="Vmp--dnp" select="sl" feats="V1.m V2.p V5.d V6.n V7.p"></fs>
<fs type="Verb" id="Vmp--pfp" select="sl" feats="V1.m V2.p V5.p V6.f V7.p"></fs>
<f select="bg cs en et hu ro sl" id="N1.p" name="Type"><sym value="proper"></f>
<f select="bg cs en ro sl" id="N2.m" name="Gender"><sym value="masculine"></f>
<f select="bg cs en et hu ro sl" id="N3.p" name="Number"><sym value="plural"></f>
<f select="cs hu sl" id="N4.a" name="Case"><sym value="accusative"></f>
<f select="cs hu sl" id="N4.d" name="Case"><sym value="dative"></f>
<f select="bg ro" id="N5.n" name="Definiteness"><sym value="no"></f>
<f select="bg ro" id="N5.y" name="Definiteness"><sym value="yes"></f>
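The pointer chain in this kind of encoding (a word's `ana` names an `<fs>`, whose `feats` name individual `<f>` elements) can be followed with a few lines of Python. This is a sketch under stated assumptions: the library below is written in XML syntax rather than the SGML shown above, the `<lib>` wrapper and the `<fs id="Npmp">` entry are hypothetical and made up to match the noun features listed, and the function names are mine.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-library; only the three <f> definitions are modelled on the excerpt.
LIBRARY = """\
<lib>
  <fs type="Noun" id="Npmp" feats="N1.p N2.m N3.p"/>
  <f id="N1.p" name="Type"><sym value="proper"/></f>
  <f id="N2.m" name="Gender"><sym value="masculine"/></f>
  <f id="N3.p" name="Number"><sym value="plural"/></f>
</lib>
"""


def load_library(xml_text):
    """Index feature structures and features by their id attribute."""
    root = ET.fromstring(xml_text)
    structs = {
        fs.get("id"): (fs.get("type"), fs.get("feats").split())
        for fs in root.iter("fs")
    }
    feats = {
        f.get("id"): (f.get("name"), f.find("sym").get("value"))
        for f in root.iter("f")
    }
    return structs, feats


def expand(tag, structs, feats):
    """Resolve a tag such as a word's ana value into (type, {name: value})."""
    fs_type, feat_ids = structs[tag]
    return fs_type, dict(feats[feat_id] for feat_id in feat_ids)


structs, feats = load_library(LIBRARY)
expanded = expand("Npmp", structs, feats)
print(expanded)
```

A tag with no entry in the library raises a `KeyError`, which gives a crude check that every tag used in the text is actually declared.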
Another example tries to implement a few definitions from HPSG, as given
in the formalism of the Attribute Logic Engine:
<!DOCTYPE teiFsd2 PUBLIC "-//TEI P4//DTD Auxiliary Document Type:
Feature System Declaration//EN">
<teiFsd2>
<teiHeader>
<!-- The header is as for any TEI.2 document -->
</teiHeader>
<!-- ALE HPSG:
bot sub [bool, case, cat, c_inds, conx,...].
bool sub [minus, plus].
minus sub [].
plus sub [].
-->
<fsDecl type='bool' baseType='bot'></fsDecl>
<fsDecl type='plus' baseType='bool'></fsDecl>
<fsDecl type='minus' baseType='bool'></fsDecl>
<!-- The preceding is currently illegal! -->
<!-- ALE HPSG:
subst sub [adj, noun, prep, reltvzr, verb]
intro [prd:bool,
mod:mod_synsem].
...
verb sub []
intro [aux:bool,
inv:bool,
vform:vform].
-->
<fsDecl type='verb' baseType='subst'>
<fsDescr>Type definition for verbs</fsDescr>
<fDecl name='vform'>
<sym value='vform'/> <!--option 1-->
</fDecl>
<fDecl name='aux'>
<vRange><fs type='bool'></fs></vRange> <!--option 2-->
</fDecl>
<fDecl name='inv'>
<vRange>
<vAlt><plus/><minus/></vAlt> <!--option 3-->
</vRange>
</fDecl>
</fsDecl>
</teiFsd2>
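If `baseType` were allowed on every `<fsDecl>` as in the first three declarations above (which, as noted in the comment, is not currently legal), a type hierarchy could be read off the declarations and queried outside the DTD. The sketch below assumes exactly that; the function names `parent_types` and `is_subtype` are hypothetical.

```python
import xml.etree.ElementTree as ET

# The three bool-related declarations from the example, in XML syntax.
DECLS = """\
<teiFsd2>
  <fsDecl type="bool" baseType="bot"/>
  <fsDecl type="plus" baseType="bool"/>
  <fsDecl type="minus" baseType="bool"/>
</teiFsd2>
"""


def parent_types(xml_text):
    """Map each declared type to its baseType."""
    root = ET.fromstring(xml_text)
    return {decl.get("type"): decl.get("baseType") for decl in root.iter("fsDecl")}


def is_subtype(type_name, ancestor, parent_of):
    """Follow the baseType chain upwards; True if ancestor is reached."""
    seen = set()
    current = type_name
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)  # guards against accidental cycles in the declarations
        current = parent_of.get(current)
    return False


parent_of = parent_types(DECLS)
print(is_subtype("plus", "bot", parent_of))
print(is_subtype("bool", "plus", parent_of))
```

This only covers single inheritance, since one `baseType` attribute can name one parent.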
Multiple hierarchies
Design of multiple hierarchies for GENIA annotation; a hot topic in the XML world.
Cf. the paper "Implementing Concurrent Markup in XML".
One possibility: use of stand-off markup, as advocated e.g. in
CES and by LTG, which can be implemented using XML XLink.
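The idea behind stand-off markup can be sketched without any XLink machinery: the base text carries only identified tokens, and each annotation layer lives separately and points at token ids, so layers may overlap in ways a single XML tree cannot express. Everything below (token ids, layer names, span types) is hypothetical and uses plain Python data in place of linked XML documents.

```python
# Base layer: tokens with ids, in text order (dicts keep insertion order).
TOKENS = {"w1": "hypo-", "w2": "and", "w3": "hyper", "w4": "cortisolism"}
ORDER = list(TOKENS)

# Two independent annotation layers that refer to the base only by id.
LAYERS = {
    "layer_a": [{"type": "span-a", "from": "w1", "to": "w3"}],
    "layer_b": [{"type": "span-b", "from": "w2", "to": "w4"}],
}


def span_ids(span):
    """Return the token ids covered by a span, endpoints included."""
    start, end = ORDER.index(span["from"]), ORDER.index(span["to"])
    return ORDER[start:end + 1]


def span_text(span):
    """Return the text covered by a span."""
    return " ".join(TOKENS[token_id] for token_id in span_ids(span))


def crosses(a, b):
    """True if two spans overlap without one containing the other."""
    ids_a, ids_b = set(span_ids(a)), set(span_ids(b))
    return bool(ids_a & ids_b) and not (ids_a <= ids_b or ids_b <= ids_a)


span_a = LAYERS["layer_a"][0]
span_b = LAYERS["layer_b"][0]
print(span_text(span_a))
print(span_text(span_b))
print(crosses(span_a, span_b))
```

The two spans cross, so they could not both be encoded as nested elements around the same words; kept in separate layers they do not interfere.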
XML and relatives
Links:
- XML: used in TEI P4
- XSLT: useful for converting GPML to TEI and for corpus renderings/visualisations
- XPath: used in XSLT
- XPointer: for connecting GENIA resources
Tomaž Erjavec,
2002-02-12