Introductory course at ESSLLI 2002

Annotation of Language Resources

Lecture IV.

TEI and other Language Encoding Recommendations

Tomaž Erjavec
Department of Intelligent Systems
Institute Jožef Stefan
Jamova 39, SI-1000 Ljubljana
Slovenia

Abstract

This lecture presents the XML-based Text Encoding Initiative Guidelines and other language encoding recommendations. TEI can be used to annotate a wide variety of language resources. We present the history, organisation and architecture of TEI and illustrate it with applications to multilingual corpora, lexical databases and feature structures. We also discuss other encoding recommendations: first some language engineering standards that came about as a result of EU projects, namely EAGLES/ISLE with (X)CES, and then a few lexicon exchange initiatives, namely MARTIF, TMX and OLIF.

1. Language Encoding Recommendations
- 1.1. The Text Encoding Initiative
  - 1.1.1 TEI history: establishment and motivations
  - 1.1.2 TEI P3, P4 & P5
  - 1.1.3 The TEI consortium
  - 1.1.4 Projects using the TEI
  - 1.1.5 Structure of the TEI DTD
  - 1.1.6 Core tagset
  - 1.1.7 The TEI Header
  - 1.1.8 Base tagsets
  - 1.1.9 Additional tagsets
  - 1.1.10 Invocation of the TEI DTD
  - 1.1.11 Default text structure: top level
  - 1.1.12 Default text structure: divisions
  - 1.1.13 Examples of TEI use
  - 1.1.14 TEI.analysis example
  - 1.1.15 TEI.fs example
  - 1.1.16 TEI.dictionaries example
  - 1.1.17 Parameter entities in TEI
  - 1.1.18 Global attributes
  - 1.1.19 Model classes
  - 1.1.20 Local modifications
  - 1.1.21 TEI Lite
  - 1.1.22 TEI software
- 1.2. European Language Engineering Standards
  - 1.2.1 EAGLES
  - 1.2.2 Corpus Encoding Standard
  - 1.2.3 XCES
  - 1.2.4 ISLE
- 1.3. Lexicon Exchange Initiatives
  - 1.3.1 Lexicon Exchange Initiatives: Introduction
  - 1.3.2 Terminology Interchange
  - 1.3.3 A sample MARTIF terminological entry
  - 1.3.4 Translation memory interchange
  - 1.3.5 A sample TMX document
  - 1.3.6 Lexicon Interchange
  - 1.3.7 Design of OLIF
  - 1.3.8 A sample OLIF lexicon entry

1. Language Encoding Recommendations

1.1. The Text Encoding Initiative

TEI is a complex application of SGML/XML used to annotate a wide variety of resources; we illustrate with applications to multilingual corpora, lexical databases and feature structures.

1.1.1. TEI history: establishment and motivations

The Text Encoding Initiative was established in 1987 under the joint sponsorship of the:
- ACH: Association for Computers and the Humanities
- ACL: Association for Computational Linguistics
- ALLC: Association for Literary and Linguistic Computing.
The impetus for the project came from the humanities computing community, which sought a common encoding scheme for complex textual structures in order to reduce the diversity of existing encoding practices, simplify processing by machine, and encourage the sharing of electronic texts. But it soon became apparent that a sufficiently flexible scheme could provide solutions for text encoding problems generally.
TEI became the only systematised attempt to develop a fully general text encoding model and set of encoding conventions based upon it, suitable for processing and analysis of any type of text, in any language, and intended to serve the increasing range of existing (and potential) applications and use.
SGML was chosen as the underlying standard for the TEI Guidelines.
The first draft of the TEI Guidelines for Electronic Text Encoding and Interchange, TEI P1, was published in 1990.
The second draft, TEI P2, followed in 1993.

1.1.2. TEI P3, P4 & P5

The third version of the TEI Guidelines for Electronic Text Encoding and Interchange (TEI P3) was published in 1994 in two substantial green volumes (1200pp) and soon also on the Web.
In 1999, a revised edition of TEI P3 was produced (also called the “P4beta”), which corrected several typographic and other errors.
TEI P4, a major revision of TEI P3, was published on the Web in early 2002 and, in June 2002, also in print, this time as two blue volumes.
TEI P4 addresses the following issues:
- equal support for XML and SGML applications using the TEI scheme;
- error correction while maintaining backward compatibility: documents conforming to TEI P3 will not become invalid when processed with TEI P4.
Many possibilities for other, more fundamental, changes have been identified. TEI P5 will be the next full revision of the Guidelines. No date has yet been fixed for its appearance.

1.1.3. The TEI consortium

In December 2000 the TEI Consortium was set up to maintain and develop the TEI standard.
The Consortium is a non-profit corporation with executive offices in Bergen, Norway, and with hosts at the University of Bergen, Brown University, Oxford University, and the University of Virginia.
The Consortium is managed by a Board of Directors, and its technical work is overseen by an elected Council. Lou Burnard is the European editor and Syd Bauman the North American editor.
Institutions and individuals can become Consortium members or subscribers, which gives them certain benefits inside the consortium.

1.1.4. Projects using the TEI

Currently there are 75 projects known to have used the TEI, as given on the TEI project page. Below are some examples, with last significant update of the entry:

American Numismatic Society [20 May 2002]
American Theological Library Association [18 April 2002]
The Legacy Tobacco Documents Library [6 February 2002]
Early Canada Online [4 February 2002]
Emblem Project Utrecht [1 February 2002]
Medieval Nordic Text Archive [30 January 2002]
Oxford Text Archive [22 January 2002]
African Languages Lexicon Project [21 January 2001]
British National Corpus [21 January 2002]
The World of Dante [21 January 2001]
Victorian Women Writers' Project [16 January 2002]
Henrik Ibsen's Writings [11 January 2002]
The Digital Dictionary of Buddhism [18 December 2001]
The English-Norwegian Parallel Corpus [18 December 2001]
The FIDA Corpus of Slovene Language [18 December 2001]
The Oslo Multilingual Corpus [18 December 2001]
Slovene-English Parallel Corpus [18 December 2001]
Multext-East [17 December 2001]

1.1.5. Structure of the TEI DTD

The TEI encoding scheme consists of a number of modules (“tagsets”) or DTD fragments. The DTD fragments from which the main TEI DTD is constructed are classified as follows (the “Chicago Pizza Model”):

Core DTD fragments: Standard components of the TEI main DTD in all its forms; these are always included without any special action by the encoder.
Base DTD fragments: Basic building blocks for specific text types; exactly one base must be selected by the encoder (unless one of the combined bases is used).
Additional DTD fragments: Extra tags useful for particular purposes. All additional tag sets are compatible with all bases and with each other; an encoder may therefore add them to the selected base in any combination desired.

1.1.6. Core tagset

The core tagset, which is always available, consists of:

Core tags: used in the text itself; these are, for the most part, in-line elements with no consistent internal structure, e.g. highlighting (<emph>), quotation (<q>), names (<name>), etc. This class also includes the paragraph (<p>), the list (<list>), etc., and some simple linking, editorial and bibliographical elements.
TEI header: Describes an encoded work so that the text itself, its source, its encoding, and its revisions are all thoroughly documented.

1.1.7. The TEI Header

The TEI header gives the meta-data on the TEI document and consists of four main parts (only the first is obligatory):

<fileDesc>: file description, containing a full bibliographical description of the computer file itself; it includes information about the source or sources (<sourceDesc>) from which the electronic text was derived.
<encodingDesc>: encoding description, which describes the relationship between an electronic text and its source or sources: it allows for detailed description of whether (or how) the text was normalised during transcription, how the encoder resolved ambiguities in the source, what levels of encoding or analysis were applied, etc.
<profileDesc>: text profile, containing classificatory and contextual information about the text, e.g. its subject matter, the individuals described by or participating in producing it, etc. It is of particular use in structured composite texts such as corpora, where it is often desirable to enforce a controlled descriptive vocabulary or to perform retrievals from a body of text in terms of text type or origin.
<revisionDesc>: revision history, which allows the encoder to provide a history of changes made during the development of the electronic text. It is important for version control and for resolving questions about the history of a file.
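As a rough illustration of the obligatory part, a header containing only a <fileDesc> might look as follows. This is a minimal sketch rather than a recommended header: the title and the prose descriptions are invented, and the three children shown (<titleStmt>, <publicationStmt>, <sourceDesc>) are the components a file description is expected to contain. The optional parts would follow <fileDesc> in the order listed above.

```xml
<teiHeader>
  <!-- fileDesc is the only obligatory part of the header -->
  <fileDesc>
    <titleStmt>
      <!-- hypothetical title, for illustration only -->
      <title>A sample text: electronic version</title>
    </titleStmt>
    <publicationStmt>
      <p>Unpublished; prepared as an illustration.</p>
    </publicationStmt>
    <sourceDesc>
      <p>No source: the text was created in electronic form.</p>
    </sourceDesc>
  </fileDesc>
  <!-- encodingDesc, profileDesc and revisionDesc, if used, go here -->
</teiHeader>
```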

1.1.8. Base tagsets

Only one base can be chosen, unless a mixed-mode tagset is also selected:

TEI.prose: the base tag set for prose
TEI.verse: the base tag set for verse
TEI.drama: the base tag set for drama
TEI.spoken: the base tag set for transcriptions of spoken texts
TEI.dictionaries: the base tag set for print dictionaries
TEI.terminology: the base tag set for terminological data files
TEI.general: the generic mixed-mode base tag set
TEI.mixed: the base tag set for free mixed-mode texts

1.1.9. Additional tagsets

These tagsets represent additional interpretations of the text, and any number of them can be chosen:

TEI.linking: tags for linking, segmentation, and alignment
TEI.analysis: tags for simple analytic mechanisms
TEI.fs: tags for feature structure analysis
TEI.certainty: tags for indicating uncertainty and probability in the markup
TEI.transcr: tags for manuscripts, analytic bibliography, and transcription of primary sources
TEI.textcrit: tags for critical editions
TEI.names.dates: specialised tags for names and dates
TEI.nets: tags for graphs, digraphs, trees, and other networks
TEI.figures: tags for graphics, figures, illustrations, tables, and formulae
TEI.corpus: additional tags for language corpora

1.1.10. Invocation of the TEI DTD

Using one base and several toppings:


<!DOCTYPE TEI.2 SYSTEM "http://www.tei-c.org/P4X/DTD/tei2.dtd" [
  <!ENTITY % TEI.XML      'INCLUDE'> <!--enable XML processing-->  
  <!ENTITY % TEI.prose    'INCLUDE'> <!--base tag set for prose --> 
  <!ENTITY % TEI.analysis 'INCLUDE'> <!--linguistic analysis-->    
  <!ENTITY % TEI.linking  'INCLUDE'> <!--pointer mechanisms-->     
]>

A more complicated example:


<!DOCTYPE teiCorpus.2 
  PUBLIC "-//TEI P4//DTD Main Document Type//EN" "tei2.dtd" [
  <!-- bases -->
  <!ENTITY % TEI.general      "INCLUDE"> <!--generic mixed base-->
  <!ENTITY % TEI.prose        "INCLUDE">
  <!ENTITY % TEI.dictionaries "INCLUDE">
  <!ENTITY % TEI.terminology  "INCLUDE">
  <!-- additional -->
  <!ENTITY % TEI.linking      "INCLUDE">
  <!ENTITY % TEI.analysis     "INCLUDE">
  <!ENTITY % TEI.fs           "INCLUDE">
  <!ENTITY % TEI.corpus       "INCLUDE">
  <!ENTITY % TEI.XML          "INCLUDE" >
  <!-- extensions -->
  <!ENTITY % TEI.extensions.ent SYSTEM 'geniaex.ent'>
  <!ENTITY % TEI.extensions.dtd SYSTEM 'geniaex.dtd'>
]>

1.1.11. Default text structure: top level

The overall structure of a unitary text:


<TEI.2>
  <teiHeader> <!-- ... --> </teiHeader>
  <text>
    <front> <!-- front matter of text, if any. --> </front>
    <body>  <!-- body of text goes here. --> </body>
    <back>  <!-- back matter of text, if any. -->   </back>
  </text>
</TEI.2>

The overall structure of composite text:


<TEI.2>
  <teiHeader> <!-- ... --> </teiHeader>
  <text>
    <front> <!-- front matter of composite text. --> </front>
    <group>
      <text> <!-- first unitary text -->  </text>
      <text> <!-- second unitary text --> </text>
    </group>
    <back> <!-- back matter of composite text. -->   </back>
  </text>
</TEI.2>

1.1.12. Default text structure: divisions

Using unnumbered divisions:


<body>
  <div type="part" n="1">
    <div type="chapter" n="1"><!--text of part 1, chapter 1--></div>
    <div type="chapter" n="2"><!--text of part 1, chapter 2--></div>
  </div>
  <div type="part" n="2">
    <div n="1" type="chapter"><!--text of part 2, chapter 1--></div>
    <div n="2" type="chapter"><!--text of part 2, chapter 2--></div>
  </div>
</body>

Using numbered divisions:


<body>
  <div0 type="Part" n="1">     
    <div1 type="Chapter" n="1"><!--text of part 1, chapter 1--></div1>     
    <div1 type="Chapter" n="2"><!--text of part 1, chapter 2--></div1>     
  </div0>
  <div0 type="Part" n="2">
    <div1 type="Chapter" n="1"><!--text of part 2, chapter 1--></div1>     
    <div1 type="Chapter" n="2"><!--text of part 2, chapter 2--></div1> 
  </div0>
</body>

1.1.13. Examples of TEI use

A newspaper story:


<div type="story">
  <head rend="large underlined" type="sub">
    President pledges safeguards ...</head>
  <head rend="very large bold" type="main">
    Major agrees to enforced no-fly zone</head>
  <byline>
    By George Jones, Political Editor, in Washington</byline>
  <p>Greater Western intervention in the conflict in
former Yugoslavia was pledged by President Bush ...</p>
</div>

Front matter:


<front>
  <titlePage>
    <docTitle>
      <titlePart type="main">Is There a Text in This Class?</titlePart>
      <titlePart type="sub">The Authority of Interpretive..</titlePart>
    </docTitle>
    <docAuthor>Stanley Fish</docAuthor>
    <docImprint>
      <publisher>Harvard University Press</publisher>
      <pubPlace>Cambridge, Massachusetts</pubPlace>
      <pubPlace>London, England</pubPlace></docImprint>
  </titlePage></front>

1.1.14. TEI.analysis example


<seg id="orwl.en.24" corresp="orwl.sl.24">
  <s id="Oen.1.1.4.5">
    <c type="open" ctag='"'>"</c>
    <w ana="Af" lemma="big">Big</w> 
    <w ana="Ncms" lemma="brother">Brother</w> 
    <w ana="Vaip3s" lemma="be">is</w> 
    <w ana="Vmpp" lemma="watch">watching</w> 
    <w ana="Pp2" lemma="you">you</w>
    <c ctag='"'>"</c> 
    <w ana="Dd" lemma="the">the</w> 
    <w ana="Ncns" lemma="caption">caption</w> 
    <w ana="Vmis" lemma="say">said</w>
    <w ana="Cs" lemma="while">while</w> 
    <w ana="Dd" lemma="the">the</w> 
    <w ana="Af" lemma="dark">dark</w> 
    <w ana="Ncnp" lemma="eye">eyes</w> 
    <w ana="Vmis" lemma="look">looked</w> 
    <w ana="Rmp" lemma="deep">deep</w> 
    <w ana="Sp" lemma="into">into</w> 
    <w ana="Np" lemma="winston">Winston</w>
    <w type="rsplit" ana="St" lemma="'s">'s</w> 
    <w ana="Ps3" lemma="own">own</w>
    <c ctag=".">.</c>
  </s>
</seg>

1.1.15. TEI.fs example


<fsLib>
  <fs type="Noun" id="Ncfda" select="sl" feats="N1.c N2.f N3.d N4.a"/>
  <fs type="Noun" id="Ncfdd" select="sl" feats="N1.c N2.f N3.d N4.d"/>
  <fs type="Noun" id="Ncfdg" select="sl" feats="N1.c N2.f N3.d N4.g"/>
  ...
</fsLib>

<fLib>
  <f id="N1.c"  select="en ro sl cs bg et hu hr" name="Type">
    <sym value="common"/>
  </f>
  <f id="N1.p"  select="en ro sl cs bg et hu hr" name="Type">
    <sym value="proper"/>
  </f>
  <f id="N2.m"  select="en ro sl cs bg       hr" name="Gender">
    <sym value="masculine"/>
  </f>
  <f id="N2.f"  select="en ro sl cs bg       hr" name="Gender">
    <sym value="feminine"/>
  </f>
  <f id="N2.n"  select="en ro sl cs bg       hr" name="Gender">
    <sym value="neuter"/>
  </f>
  ...
</fLib>

1.1.16. TEI.dictionaries example


<entry key="add">
  <form type="hw"><orth>add</orth> <pron>&amp;d.</pron></form>
  <hom>
    <gramGrp><pos>vtr</pos></gramGrp>
    <sense>
      <trans><tr>dodati</tr></trans>
      <eg>
        <quote>&ldquo;...&rdquo; he added angrily</quote>
        <trans><tr>&ldquo;...&rdquo; je dodal jeznorito</tr></trans>
      </eg></sense>
    <sense>
      <form type="variant">
        (<lbl type="preference">also</lbl> 
        <orth type="variant">add together</orth>)</form>
      <trans>
        <usg type="label">Math</usg> 
        <tr>se&scaron;teti</tr></trans>
        <eg>
          <quote>add the two figures (together)</quote>
          <trans><tr>se&scaron;tej ti dve &scaron;tevili</tr></trans></eg>
    </sense></hom>
  <hom>
    <gramGrp><pos>vtr</pos><pos>vi</pos></gramGrp>
    <trans><tr>pove&ccaron;ati</tr>, <tr>povi&scaron;ati</tr></trans>
    <eg>
      <quote>there's no need to add to our difficulties</quote>
      <trans><tr>res ni treba &scaron;e pove&ccaron;ati na&scaron;ih te&zcaron;av</tr></trans>
    </eg>
  </hom>
  <sense orig="idioms">
    <eg orig="idiom">
      <quote>I might add</quote>
      <trans><usg type="label">informal</usg> <tr>vrh vsega</tr></trans>
    </eg>
    <xr><ref>insult</ref> <ref>injury</ref> <ref>fuel</ref> <ref>fire</ref></xr>
  </sense>
...

1.1.17. Parameter entities in TEI

The TEI DTDs use parameter entities for several purposes:

to specify tag omissibility information within an SGML DTD, or alternatively to omit such information in an XML DTD;
to identify what base tag set should be used for a document;
to identify what additional tag sets should be included;

to include or exclude the declaration of each element, e.g.


<!ENTITY % hi           'INCLUDE' >
<!ENTITY % distinct     'IGNORE' >

to specify the name of each element, e.g.


<!ENTITY % n.p        'o'>
<!ENTITY % n.soCalled 'takoImenovan'>

to define sets of attributes shared by given classes of elements
to define classes of elements which can occur at same locations in content models

1.1.18. Global attributes

Global attributes are defined in parameter entities and included in attribute lists:

a.global: global attributes for all elements:
- id: provides a unique identifier for the element bearing the ID value
- n: gives a number (or other label) for an element, which is not necessarily unique within the document
- lang: indicates the language of the element content, usually using a two- or three-letter code from ISO 639
- rend: indicates how the element in question was rendered or presented in the source text
a.analysis: additional global attributes for the analysis tag set
a.linking: additional global attributes for the linking tag set
a.terminology: additional global attributes for the terminology base
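To make the four a.global attributes concrete, the fragment below attaches them to a few elements. It is only a sketch: the identifiers and the text are invented, type on <div> is an element-specific attribute rather than a global one, and in TEI P4 the value of lang is expected to match the id of a <language> element declared in the header, which is assumed here and not shown.

```xml
<!-- id: unique in the document; n: a label that need not be unique -->
<div id="ch1" n="1" type="chapter" lang="en">
  <p id="ch1.p1" n="1">The caption read
    <!-- rend records how the words appeared in the source -->
    <hi rend="caps">Big Brother is watching you</hi>.</p>
  <p id="ch1.p2" n="2">The dark eyes looked deep into his own.</p>
</div>
```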

1.1.19. Model classes

When the members of a class are structurally similar and can appear at the same kinds of structural locations in the document, they are grouped together into an m-class (or “model class”):


<!ENTITY % x.bibl '' >
<!ENTITY % m.bibl '%x.bibl; bibl | biblFull | biblStruct' >

In the local modifications, the model class can be extended via the corresponding x. parameter entity:


<!ENTITY % x.bibl 'my.bib |' >
<!ELEMENT my.bib (...)>

1.1.20. Local modifications

There are four kinds of modification that can be made to the TEI DTD:

deletion of elements:


TEI.extensions.ent: <!ENTITY % gi 'IGNORE'>

renaming of elements:


TEI.extensions.ent: <!ENTITY % n.gi 'newname'>

extension of classes:


TEI.extensions.ent: <!ENTITY % x.class 'gi |'>

TEI.extensions.dtd: <!ELEMENT gi ...>
                    <!ATTLIST gi ...>

modification of content models and attribute lists:


TEI.extensions.ent: <!ENTITY % gi 'IGNORE'>

TEI.extensions.dtd: <!ELEMENT gi ...>
                    <!ATTLIST gi ...>

1.1.21. TEI Lite

TEI Lite is a particular parametrisation of TEI (a DTD), which implements a useful “starter set”, comprising the elements which almost every user should know about.

Some characteristics of TEI Lite:

includes most of the TEI “core” tag set, since this contains elements relevant to virtually all text types and all kinds of text-processing work;
handles a reasonably wide variety of texts, at the level of detail found in existing practice;
is useful for the production of new documents as well as encoding of existing ones;
is usable with a wide range of existing SGML software;
is derivable from the full TEI DTD using the extension mechanisms described in the TEI Guidelines;
is as small and simple as is consistent with the other goals.

1.1.22. TEI software

As TEI is an application of XML, generic XML software can be used to process it. However, certain programs have been written especially for TEI:

TEI Pizza Chef: Allows one to parametrise TEI via a Web site, which then produces a one-file DTD encapsulating the parametrisation.
Rahtz's TEI XSL Stylesheets: A customisable suite of XSLT stylesheets for formatting TEI documents. Outputs HTML or PDF.
C. M. Sperberg-McQueen's XSL TEI Stylesheets: A simple set of XSLT stylesheets for formatting TEI documents. Outputs HTML.

1.2. European Language Engineering Standards

1.2.1. EAGLES

EAGLES, the Expert Advisory Group on Language Engineering Standards, was an EU initiative in 1996, which aimed to accelerate the provision of standards for:
- very large-scale language resources (such as text corpora, computational lexicons and speech corpora);
- means of manipulating such knowledge, via computational linguistic formalisms, mark up languages and various software tools;
- means of assessing and evaluating resources, tools and products.
The work was carried out by five working groups (WGs):
- Text Corpora
- Computational Lexicons
- Grammar Formalisms
- Evaluation
- Spoken Language
The result of their work was the EAGLES Guidelines, a collection of language engineering recommendations, e.g.
- Recommendations on corpus typology
- Recommendations on text typology
- Recommendations on corpus encoding
- Recommendations for the morphosyntactic annotation of corpora
- Recommendations on syntactic annotation of corpora
- Synopsis and comparison of morphosyntactic phenomena encoded in lexicons and corpora
- Recommendations on subcategorization
- Study of the relation between tagsets and taggers
- Lexicon architecture
- Computational lexicons methodology task
- Evaluation of Natural Language Processing Systems
- EAGLES Handbook on Spoken Language Systems

1.2.2. Corpus Encoding Standard

The EAGLES Recommendations on corpus encoding resulted in CES, the Corpus Encoding Standard, which is an SGML DTD; it is a particular parameterisation (and modification) of the TEI P3.
CES specifies a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and typographic information) as well as general architecture (so as to be maximally suited for use in a text database). It also provides encoding specifications for linguistic annotation, together with a data architecture for linguistic corpora.
CES has been used in a number of corpus projects, to a large extent because it is simpler to use and understand than the full TEI.
CES recommends stand-off annotation for linguistic analyses.
For the encoding of primary data the CES identifies three levels of encoding:
- Level 1: The minimum encoding level required for CES conformance, requiring markup for gross document structure (major text divisions), down to the level of the paragraph, conformant to the cesDoc DTD.
- Level 2: This level requires that paragraph level elements are correctly marked, and (where possible) the function of rendition information at the sub-paragraph level is determined and elements marked accordingly.
- Level 3: This is the most restrictive and refined level of markup for primary data. It places additional constraints on the encoding of s-units and quoted dialogue, and demands more sub-paragraph level tagging.

1.2.3. XCES

XCES is the XML version of the Corpus Encoding Standard.
XCES currently includes XML Schemas for validation, and some XSLT scripts for rendering to HTML.
XCES is being used in the American National Corpus (ANC).

1.2.4. ISLE

ISLE, the International Standards for Language Engineering, is a continuation of EAGLES, and is at the same time a project and a set of co-ordinated activities in the Human Language Technology (HLT) field.
The aim of ISLE is to develop HLT standards within an international framework, in the context of the EU-US International Research Cooperation initiative. Its objectives are to support national projects, HLT RTD projects and the language technology industry in general by developing, disseminating and promoting de facto HLT standards and guidelines for language resources, tools and products.
ISLE Working Groups:
- Computational Lexicons
- Natural Interaction and Multimodality
- Evaluation
The EAGLES/ISLE metadata initiative aims to propose a standard for meta-data descriptions of multimedia/multimodal language resources. With such a standard it should become possible to create a browsable and searchable universe of such resources on the Internet.

1.3. Lexicon Exchange Initiatives

1.3.1. Lexicon Exchange Initiatives: Introduction

There is another, less academic strand of encoding standardisation initiatives, concerned with:

terminological databases
translation memories
machine translation lexica
general computer lexica

The impetus for this work comes from:

the localisation industry
computer aided translation industry
machine translation industry

This work has strong industry support: Xerox, Microsoft, IBM, Systran, Trados, etc.

The various initiatives are being developed under the auspices of:

International Organization for Standardization, ISO
Localisation Industry Standards Association, LISA
Open Lexicon Interchange Format Consortium, OLIF
Various European HLT projects: OTELO, SALT

A lot of information on this work can be obtained from the “Translation, Theory, and Technology” homepage, ttt.org.

Note that (published) applications seem to be scarce...

1.3.2. Terminology Interchange

ISO TC 37: ISO Technical Committee on Terminology and other language resources
ISO TC 37/SC 3: Subcommittee on Computer applications for terminology
ISO TC 37/SC 3/WG 3: Working Group on Data interchange
ISO TC 37/SC 4: Newly established SC on Language resource management

Standards by TC 37/SC 3:

MARTIF (ISO 12200:1999, MAchine-Readable Terminology Interchange Format): an SGML DTD designed to facilitate the negotiated interchange of structured terminological data among various applications, system environments, and hardware platforms
TMF (ISO/DIS 16642:2002, Computer applications in terminology -- Terminological markup framework): an abstract XML(schema,link)-based framework that provides a “definition of underlying structures and mechanisms needed for the computer representation of terminological data” and “Independence with regards any specific format”

1.3.3. A sample MARTIF terminological entry


<termEntry>
  <descripGrp>
    <descrip type='subjectField'> appearance of materials </descrip>
    <note> treated in DIN under paper and cardboard </note>
  </descripGrp>

  <note> The in-house working group for Optics is slated to finalize
    this entry by 1995-12-15. </note>

  <ntig lang='en'>
    <termGrp>
      <term> opacity </term>
      <termNote type='pos'> n </termNote>
    </termGrp>
    <descripGrp>
      <descrip type='definition'> degree of obstruction to the 
        transmission of visible light </descrip> 
      <ptr type='sourceIdentifier' target='ASTM.E284'> 
    </descripGrp>
    <descripGrp>
      <descrip type='figure'> Degrees of Opacity </descrip>
      <note> The chart provides graphic images illustrating various 
        degrees of opacity. </note> 
      <ptr type='figure' target='f357'> 
    </descripGrp>
    <adminGrp>
      <admin type='responsibility'> ASTM E12 </admin>
    </adminGrp>
  </ntig>

  <ntig lang='de'>
    <termGrp>
      <term> Opazit&auml;t </term>
      <termNote type='pos'> n </termNote>
      <termNote type='gender'> f </termNote>
    </termGrp>
    <descripGrp>
      <descrip type='definition'> Ma&szlig; f&uuml;r die
      Lichtundurchl&auml;ssigkeit </descrip>
      <ref type='sourceIdentifier' target='DIN-6730-1992-08'>p.5</ref>
    </descripGrp> 
    <adminGrp> 
      <admin type='responsibility'>
        Normenausschu&szlig; Papier und Pappe (NPa) im DIN Deutsches 
        Institut f&uuml;r Normung e.V.</admin> 
    </adminGrp> 
  </ntig>

</termEntry>

1.3.4. Translation memory interchange

LISA (Localisation Industry Standards Association) was founded in 1990 as a non-profit association joining the globalisation, internationalisation, localisation, and translation business communities.
TMX (Translation Memory eXchange) is a specification (XML DTD) to allow easier exchange of translation memory data between tools and/or translation vendors with little or no loss of critical data during the process.
TMX is defined in two parts:
- container format specification: for the higher-level elements that provide information about the file as a whole and about entries (a multilingual entry is a translation unit, <tu>, composed of monolingual segments);
- specification of a low-level meta-markup format for the content of a segment of translation-memory text.
TMX offers two levels of implementation:
- Level 1 (Plain Text Only) - Support for the container only. The data inside each <seg> element is plain text. This level is sufficient when the data does not have inline codes.
- Level 2 (Content Markup) - Support for both container and content. Tools supporting TMX Level 2 can re-create the translated version of an original document by using only the TMX document.
TMX Version 1.3 (August 2001);
TMX Version 1.4 -- draft (June 2002)
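The difference between the two implementation levels can be sketched on a single segment. The fragments below are illustrative only: the sentence and the HTML bold codes are invented, and they assume the TMX inline elements <bpt> and <ept> (the beginning and end of a paired code, matched through the i attribute) from the content markup part of the specification.

```xml
<!-- Level 1: the segment is plain text; the inline codes are not carried over -->
<tuv xml:lang="EN">
 <seg>Press the Start button.</seg>
</tuv>

<!-- Level 2: the native inline codes are preserved inside bpt/ept pairs -->
<tuv xml:lang="EN">
 <seg>Press the <bpt i="1">&lt;b&gt;</bpt>Start<ept i="1">&lt;/b&gt;</ept> button.</seg>
</tuv>
```

With the Level 2 form a tool can restore the bold formatting in the translated document; with Level 1 only the text itself can be reused.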

1.3.5. A sample TMX document


<?xml version="1.0"?>
<!-- Example of TMX document -->
<tmx version="1.4">
 <header
  creationtool="XYZTool" creationtoolversion="1.01-023"
  datatype="PlainText"   segtype="sentence"
  adminlang="en-us"      srclang="EN"
  o-tmf="ABCTransMem"
  creationdate="20020101T163812Z" creationid="ThomasJ"
  changedate="20020413T023401Z"   changeid="Amity"
  o-encoding="iso-8859-1"
 >
  <note>This is a note at document level.</note>
  <prop type="RTFPreamble">{\rtf1\ansi\tag etc...{\fonttbl}</prop>
  <ude name="MacRoman" base="Macintosh">
   <map unicode="#xF8FF" code="#xF0" ent="Apple_logo" subst="[Apple]"/>
  </ude>
 </header>
 <body>
  <tu 
   tuid="0001"
   datatype="Text" usagecount="2"
   lastusagedate="19970314T023401Z"
  >
   <note>Text of a note at the TU level.</note>
   <prop type="x-Domain">Computing</prop>
   <prop type="x-Project">P&#x00E6;gasus</prop>
   <tuv
    xml:lang="EN"
    creationdate="19970212T153400Z" creationid="BobW"
   >
    <seg>data (with a non-standard character: &#xF8FF;).</seg>
   </tuv>
   <tuv xml:lang="FR-CA">
    <prop type="Origin">MT</prop>
    <seg>donn&#xE9;es (avec un caract&#xE8;re non standard: &#xF8FF;).</seg>
   </tuv>
  </tu>
  <tu 
   tuid="0002"
   srclang="*all*"
  >
   <prop type="Domain">Cooking</prop>
   <tuv xml:lang="EN">
    <seg>menu</seg>
   </tuv>
   <tuv xml:lang="FR-CA">
    <seg>menu</seg>
   </tuv>
   <tuv xml:lang="FR-FR">
    <seg>menu</seg>
   </tuv>
  </tu>
 </body>
</tmx>

1.3.6. Lexicon Interchange

OTELO (Common Access to Translation Services): EU project (1996-1999). Objective: ease integration of tools and functions that help translating. Subtask: “specifying a common format for text and lexical resources, with mechanisms for handling other current document formats, and adapting a range of NLP systems to accept these formats”.
OLIF (Open Lexicon Interchange Format) Consortium: Joint effort of a group of major NLP technology suppliers, corporate users of NLP, and research institutions. OLIF is based on the OTELO and Aventinus projects, and is a part of the SALT framework.
SALT (Standards-based Access to multilingual Lexicons and Terminologies): EU project, whose goal is to combine OLIF (lexical databases) and MARTIF (terminological databases) into a new kind of database, XLT (eXchange format for Lex/Term-data).

1.3.7. Design of OLIF

OLIF is to be a user-friendly vehicle for exchanging terminological and lexical data: it is XML-compliant and offers support for NLP systems, such as machine translation, by providing coverage of a wide and detailed range of linguistic features.

- The current official version of OLIF is V2.0, published in February 2002.
- A lexical entry is meant to mimic a feature-value representation.
- The content model has been kept very flat, with almost no use made of attributes.
- There is some support for user extensions à la TEI, i.e. via the “%x.” parameter entity.
- One of the main achievements of OLIF seems to be its extensive list of inflectional classes and grammatical categories for five EU languages.
A lexicon is divided into the header and body; the body is composed of lexical entries. The element classes of an OLIF v.2 entry are:
- monolingual: defines monolingual data; each OLIF entry may contain only one monolingual group
- cross-reference: defines cross-reference relations between the given entry and other entries in the lexicon in the same language
- transfer: defines transfer relations between the given entry and other entries in different languages

1.3.8. A sample OLIF lexicon entry


<entry EntryUserId="2312">
  <mono MonoUserId="2311">
    <keyDC>
      <canForm>Briefkurs</canForm> 
      <language>de</language> 
      <ptOfSpeech>noun</ptOfSpeech> 
      <subjField>gac-fi</subjField> 
      <semReading>b</semReading> 
    </keyDC>
    <monoDC>
      <monoAdmin>
        <syllabification>brief-kurs</syllabification> 
        <entryFormation>cmp</entryFormation> 
        <originator>FISHERF</originator> 
        <adminStatus>ver</adminStatus> 
        <entrySource>sapterm</entrySource> 
        <company>sap</company> 
      </monoAdmin>
      <monoMorph>
        <morphStruct>brief:kurs</morphStruct> 
        <inflection>like Tisch</inflection> 
        <head>kurs</head> 
        <gender>m</gender> 
      </monoMorph>
      <monoSyn>
        <synType>cnt</synType> 
      </monoSyn>
      <monoSem>
        <semType>meas</semType> 
      </monoSem>
    </monoDC>
    <generalDC>
      <updater>HANSENPOU</updater> 
      <modDate>1999-28-01</modDate> 
      <usage>online</usage> 
      <note>online-A</note> 
    </generalDC>
  </mono>
  <transfer>
    <keyDC>
      <canForm>bank selling rate</canForm> 
      <language>en</language> 
      <ptOfSpeech>noun</ptOfSpeech> 
      <subjField>gac-fi</subjField> 
      <semReading>b</semReading> 
    </keyDC>
    <equival>full</equival> 
  </transfer>
</entry>