
Morphosyntactic Tagging

COP project 106 MULTEXT-East Deliverable D2.3 F-- Tagging

In addition to the cesDoc encoding, the Orwell corpus is also available as a tokenised and morphosyntactically tagged cesAna document. For more information on the morphosyntactic descriptions (MSDs) with which the text was annotated, see the MULTEXT-East Deliverable D1.1 F. The lexica used in the annotation process are described in Deliverable D1.2. The cesAna DTD is explained in the CES documentation.

To arrive at the tokenised and tagged cesAna Orwell, the following steps were performed:

1. the cesDoc version was simplified and converted to cesAna encoding;
2. the text was tokenised;
3. the tokens were annotated with lexical (i.e. ambiguous) MSDs, lemmas and tags;
4. the lexical information was disambiguated.

To explain the structure of the final documents, first consider a fragment of the English cesDoc Orwell:

<!DOCTYPE text PUBLIC "-//CES//DTD cesDoc//EN">
<text>
  <body lang=en id=Oen>
    <div id="Oen.1" type=part n=1>
      <div id="Oen.1.1" type=chapter n=1>
        <p id="Oen.1.1.1">
          <s id="Oen.1.1.1.1">
            It was a bright cold day in April, 
            and the clocks were striking thirteen.
          </s>
          <s id="Oen.1.1.1.2">
            <name type=person>Winston Smith</name>,
            his chin nuzzled into his breast in an effort to escape the 
            vile wind, slipped quickly through the glass doors of
            <name type=place>Victory Mansions</name>,
            though not quickly enough to prevent a swirl of gritty dust 
            from entering along with him.
          </s>
        </p>
...

The gross document structure of a cesDoc document is different from the cesAna one. In the Orwell corpus the following relations exist between the two:

the TEXT is encoded as CHUNKLIST
the BODY is encoded as CHUNK
the DIV tags are omitted
the QUOTE tags are omitted
the P-level elements are encoded as PAR elements:
- P is PAR, with implied TYPE;
- the HEAD elements can be omitted,
  if present they are encoded as PAR TYPE=HEAD
- LIST and POEM elements can be omitted,
  if present they are encoded as PAR TYPE=LIST and TYPE=POEM respectively
the S-level elements are encoded as S elements:
- S is S, with implied TYPE;
- if ITEM and L are present, they are marked as TYPE=ITEM and TYPE=L.
- P-level and S-level IDs are referred to in the FROM attribute of PAR and S.
the Q tags are omitted
other cesDoc (sub-S level) tags such as DATE, NAME, ABBR, etc., are encoded as values of the CLASS attribute of the TOKen element.
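
The correspondences listed above can be written down as a small lookup table. The following Python sketch is only an illustration and not part of the project's conversion tools: the names `STRUCTURAL`, `OMITTED` and `cesana_target` are hypothetical, and the TYPE values for CHUNKLIST and CHUNK are assumed from the cesAna example in this section rather than from the list.

```python
from typing import Optional, Tuple

# Structural cesDoc elements and their cesAna counterparts, transcribed from
# the list above as (cesAna element, TYPE value); None means TYPE is implied.
# TYPE=TEXT and TYPE=BODY are assumptions taken from the cesAna example.
STRUCTURAL = {
    "TEXT": ("CHUNKLIST", "TEXT"),
    "BODY": ("CHUNK", "BODY"),
    "P": ("PAR", None),
    "HEAD": ("PAR", "HEAD"),   # may also be omitted
    "LIST": ("PAR", "LIST"),   # may also be omitted
    "POEM": ("PAR", "POEM"),   # may also be omitted
    "S": ("S", None),
    "ITEM": ("S", "ITEM"),
    "L": ("S", "L"),
}

# Tags that are dropped in the conversion.
OMITTED = {"DIV", "QUOTE", "Q"}


def cesana_target(cesdoc_tag: str) -> Optional[Tuple[str, Optional[str], Optional[str]]]:
    """Return (element, TYPE, CLASS) for a cesDoc tag, or None if it is omitted."""
    tag = cesdoc_tag.upper()
    if tag in OMITTED:
        return None
    if tag in STRUCTURAL:
        element, type_value = STRUCTURAL[tag]
        return (element, type_value, None)
    # Any other (sub-S level) tag, e.g. DATE, NAME or ABBR, is kept only
    # as the value of the CLASS attribute of a TOK element.
    return ("TOK", None, tag)
```

For instance, `cesana_target("name")` gives `("TOK", None, "NAME")`, while `cesana_target("div")` gives `None`.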

An example of the cesAna derived from the above cesDoc, marked according to these conventions, follows:

<!DOCTYPE cesAna PUBLIC "-//CES//DTD cesAna//EN">
<cesAna version="4.6">
  <chunkList type=TEXT>
   <chunk  type=BODY>
     <par from="Oen.1.1.1">
      <s from="Oen.1.1.1.1">
        It was a bright cold day in April, 
        and the clocks were striking thirteen.
      </s>
      <s from="Oen.1.1.1.2">
        Winston Smith,
        his chin nuzzled into his breast in an effort to escape the 
        vile wind, slipped quickly through the glass doors of
        Victory Mansions,
        though not quickly enough to prevent a swirl of gritty dust 
        from entering along with him.
      </s>
     </par>
...

At the S level the documents have been tokenised according to the lexical resources of the language, and the tokens are encoded as TOKen elements. Tokens are either 'normal' words, compounds, separable parts of words ('clitics'), or punctuation marks. They are distinguished by the value of the token's TYPE attribute; the values used are WORD for words and PUNCT for punctuation marks. The word or punctuation mark itself is contained in the ORTH element. The punctuation tokens are annotated with (unambiguous) corpus tags, which are identical across the languages of MULTEXT-East; for a description of the tags used, see the MULTEXT-East Deliverable D1.1 F, Section 2.5.1. The following example illustrates this markup:

 <chunkList type=TEXT lang=en>
  <chunk type=BODY lang=en>
   <par from='Oen.1.1.1'>               
    <s from='Oen.1.1.1.1'>              
     <tok type=WORD><orth>It</orth></tok>
     <tok type=WORD><orth>was</orth></tok>
     <tok type=WORD><orth>a</orth></tok>
     <tok type=WORD><orth>bright</orth></tok>
     <tok type=WORD><orth>cold</orth></tok>
     <tok type=WORD><orth>day</orth></tok>
     <tok type=WORD><orth>in</orth></tok>
     <tok type=WORD><orth>April</orth></tok>
     <tok type=PUNCT><orth>,</orth><ctag>COMMA</ctag></tok>
     <tok type=WORD><orth>and</orth></tok>
     <tok type=WORD><orth>the</orth></tok>
     <tok type=WORD><orth>clocks</orth></tok>
     <tok type=WORD><orth>were</orth></tok>
     <tok type=WORD><orth>striking</orth></tok>
     <tok type=WORD><orth>thirteen</orth></tok>
     <tok type=PUNCT><orth>.</orth><ctag>PERIOD</ctag></tok>
    </s>
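
Markup of this shape can be generated mechanically once a token list is available. The Python sketch below is a rough illustration only. It assumes a naive regular-expression tokeniser, whereas the corpus was tokenised against the lexical resources of each language, and it knows only the two punctuation tags that appear in the example above (COMMA and PERIOD). The function names are hypothetical.

```python
import re

# Only the two punctuation tags visible in the example; the full tag set
# is described in Deliverable D1.1 F and is not reproduced here.
PUNCT_CTAGS = {",": "COMMA", ".": "PERIOD"}


def naive_tokenise(text):
    """Split text into runs of word characters and single punctuation marks.

    This is an assumption for illustration; it does not handle compounds
    or clitics the way a lexicon-driven tokeniser would.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def tok_element(token):
    """Render one token as a <tok> element in the style of the example."""
    if token in PUNCT_CTAGS:
        return "<tok type=PUNCT><orth>{}</orth><ctag>{}</ctag></tok>".format(
            token, PUNCT_CTAGS[token])
    if re.fullmatch(r"\w+", token):
        return "<tok type=WORD><orth>{}</orth></tok>".format(token)
    raise ValueError("no corpus tag known for punctuation mark {!r}".format(token))


def tokenised_sentence(text, sentence_id):
    """Wrap the tokens of one sentence in an <s from=...> element."""
    lines = ["<s from='{}'>".format(sentence_id)]
    lines += [" " + tok_element(t) for t in naive_tokenise(text)]
    lines.append("</s>")
    return "\n".join(lines)
```

Calling `tokenised_sentence("in April, and", "Oen.1.1.1.1")` yields an `<s>` element with three WORD tokens and one PUNCT token carrying the COMMA tag.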

The word tokens are annotated both with ambiguous lexical information, and with context-dependent, disambiguated information. The former is contained in the <lex> elements of the token, the latter in the <disamb> element(s). Both elements contain the <base> (lemma) of the token, its morphosyntactic description <msd>, and (depending on the language) its corpus tag, <ctag>, as illustrated in the following example:

     <tok type=WORD>
      <orth>glass</orth>
      <disamb><base>glass</base><msd>Afp</msd><ctag>ADJE</ctag></disamb>
      <lex><base>glass</base><msd>Afp</msd><ctag>ADJE</ctag></lex>
      <lex><base>glass</base><msd>Ncns</msd><ctag>NN</ctag></lex>
     </tok>
     <tok type=WORD>
      <orth>doors</orth>
      <disamb><base>door</base><msd>Ncnp</msd><ctag>NNS</ctag></disamb>
      <lex><base>door</base><msd>Ncnp</msd><ctag>NNS</ctag></lex>
     </tok>

It should be noted that the correct annotation is given both in the <disamb> element and in (one of) the <lex> elements. The <lex> elements of a token thus represent its ambiguity class. In general, however, there may be more than one <disamb> element for a token, in cases where the tagger or the human annotator could not decide how to disambiguate. In this case each <disamb> element appears among the <lex> elements as well.
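
The property that every <disamb> analysis also occurs among the <lex> analyses can be checked with a few lines of code. The following Python sketch is a minimal illustration: it assumes that the analyses are laid out exactly as in the example above (no whitespace inside a <disamb> or <lex> element) and uses a regular expression, since the unquoted SGML attribute values would not be accepted by an XML parser. The names `analyses` and `disamb_in_ambiguity_class` are hypothetical.

```python
import re

# Matches one <disamb> or <lex> element with its <base>, <msd> and an
# optional <ctag>, in the compact layout used in the example.
ANALYSIS_RE = re.compile(
    r"<(disamb|lex)><base>(.*?)</base><msd>(.*?)</msd>"
    r"(?:<ctag>(.*?)</ctag>)?</\1>"
)


def analyses(tok_sgml):
    """Return (disamb, lex): two lists of (base, msd, ctag) triples."""
    disamb, lex = [], []
    for kind, base, msd, ctag in ANALYSIS_RE.findall(tok_sgml):
        triple = (base, msd, ctag or None)  # ctag is language dependent
        (disamb if kind == "disamb" else lex).append(triple)
    return disamb, lex


def disamb_in_ambiguity_class(tok_sgml):
    """True if every <disamb> analysis also appears among the <lex> ones."""
    disamb, lex = analyses(tok_sgml)
    return all(d in lex for d in disamb)


GLASS = """<tok type=WORD>
 <orth>glass</orth>
 <disamb><base>glass</base><msd>Afp</msd><ctag>ADJE</ctag></disamb>
 <lex><base>glass</base><msd>Afp</msd><ctag>ADJE</ctag></lex>
 <lex><base>glass</base><msd>Ncns</msd><ctag>NN</ctag></lex>
</tok>"""
```

For the `glass` token shown earlier, `disamb_in_ambiguity_class(GLASS)` is true, and the ambiguity class returned by `analyses(GLASS)` has two members.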

GI           EN       BG       CS       ET       HU       RO       SL
par       1,286    1,322    1,297    1,266    1,303    1,343    1,288
s         6,701    6,682    6,751    6,478    6,768    6,521    6,689
tok     118,102  101,173  100,358   94,906   98,426  118,063  107,770
orth    118,102  101,173  100,358   94,906   98,426  118,063  107,770
disamb  187,526   86,020   79,862   75,433   80,705  101,508   90,792
lex     214,404  156,002  214,368  147,542  111,945  189,695  187,562
base    401,930  242,022  294,230  222,975  192,650  291,203  278,354
msd     401,930  156,002  294,230  222,975  192,650  291,203  278,354
ctag    416,035  257,175   20,496   94,906   98,426  307,758   16,978

Tag usage in Orwell's "1984"
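
Because every <tok> contains one <orth>, and every <disamb> and <lex> contains one <base> and one <msd>, the counts in the table should satisfy a few simple relations. The Python sketch below checks them on the figures exactly as printed above; the variable and function names are hypothetical, and the sketch only tests the table and says nothing about the corpus files themselves.

```python
LANGS = ["EN", "BG", "CS", "ET", "HU", "RO", "SL"]

# Counts copied from the table above, one list entry per language.
COUNTS = {
    "tok":    [118102, 101173, 100358, 94906, 98426, 118063, 107770],
    "orth":   [118102, 101173, 100358, 94906, 98426, 118063, 107770],
    "disamb": [187526, 86020, 79862, 75433, 80705, 101508, 90792],
    "lex":    [214404, 156002, 214368, 147542, 111945, 189695, 187562],
    "base":   [401930, 242022, 294230, 222975, 192650, 291203, 278354],
    "msd":    [401930, 156002, 294230, 222975, 192650, 291203, 278354],
}


def checks(lang):
    """Return three booleans for one language column:
    tok == orth, disamb + lex == base, msd == base."""
    i = LANGS.index(lang)
    c = {name: values[i] for name, values in COUNTS.items()}
    return (
        c["tok"] == c["orth"],
        c["disamb"] + c["lex"] == c["base"],
        c["msd"] == c["base"],
    )


if __name__ == "__main__":
    for lang in LANGS:
        print(lang, checks(lang))
```

On the printed figures the first two relations hold for all seven languages. The third holds everywhere except BG, where the msd count equals the lex count rather than the base count; the table does not say whether this reflects the corpus or a slip in the table.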

As can be seen, the cesAna documents produced in the project are maximal in terms of contained data (ambiguity classes) and annotation (annotations as elements rather than attribute values, no tag minimisation). As the documents are furthermore encoded with SGML entities rather than the 8-bit ISO character sets, the resulting files are rather large. However, the intention is to provide these resources for interchange and to make them as self-contained as possible.

The next sections give the <cesHeader> of the seven cesAna documents.
