
Morphosyntactic Tagging

COP project 106 MULTEXT-East Deliverable D2.3 F-- Tagging

In addition to the cesDoc encoding, the Orwell corpus is also available as a tokenised and morphosyntactically tagged cesAna document. For more information on the morphosyntactic descriptions (MSDs) with which the text was annotated, see the MULTEXT-East Deliverable D1.1 F. The lexica used in the annotation process are described in the Deliverable D1.2. The cesAna DTD is explained in the CES documentation.

To arrive at the tokenised and tagged cesAna Orwell, the following steps were performed:

1. the cesDoc version was simplified and converted to cesAna encoding;
2. the text was tokenised;
3. the tokens were annotated with lexical, i.e. ambiguous, MSDs, lemmas and tags;
4. the lexical information was disambiguated.

To explain the structure of the final documents, first consider a fragment of the English cesDoc Orwell:

<!DOCTYPE text PUBLIC "-//CES//DTD cesDoc//EN">
<text>
  <body lang=en id=Oen>
    <div id="Oen.1" type=part n=1>
      <div id="Oen.1.1" type=chapter n=1>
        <p id="Oen.1.1.1">
          <s id="Oen.1.1.1.1">
            It was a bright cold day in April, 
            and the clocks were striking thirteen.
          </s>
          <s id="Oen.1.1.1.2">
            <name type=person>Winston Smith</name>,
            his chin nuzzled into his breast in an effort to escape the 
            vile wind, slipped quickly through the glass doors of
            <name type=place>Victory Mansions</name>,
            though not quickly enough to prevent a swirl of gritty dust 
            from entering along with him.
          </s>
        </p>
...

The gross document structure of a cesDoc document is different from the cesAna one. In the Orwell corpus the following relations exist between the two:

The following example shows the cesAna document derived from the above cesDoc, marked up according to these conventions:

<!DOCTYPE cesAna PUBLIC "-//CES//DTD cesAna//EN">
<cesAna version="4.6">
  <chunkList type=TEXT>
   <chunk  type=BODY>
     <par from="Oen.1.1.1">
      <s from="Oen.1.1.1.1">
        It was a bright cold day in April, 
        and the clocks were striking thirteen.
      </s>
      <s from="Oen.1.1.1.2">
        Winston Smith,
        his chin nuzzled into his breast in an effort to escape the 
        vile wind, slipped quickly through the glass doors of
        Victory Mansions,
        though not quickly enough to prevent a swirl of gritty dust 
        from entering along with him.
      </s>
     </par>
...
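
The correspondence visible in the two fragments above (a cesDoc <p id=...> becomes a cesAna <par from=...>, <s id=...> becomes <s from=...>, and inline <name> markup is dropped while its text is kept) can be sketched in Python. This is only an illustration inferred from the fragments shown here, not the conversion tool used in the project; the function name simplify_to_cesana and the regular expressions are assumptions, and other cesDoc elements are not handled.

```python
import re


def simplify_to_cesana(cesdoc_fragment: str) -> str:
    """Rewrite paragraph and sentence markup as seen in the fragments above.

    Hypothetical helper: it handles only <p>, <s> and <name>; any other
    cesDoc element (e.g. <div>) is left untouched.
    """
    out = cesdoc_fragment
    # <p id="X"> -> <par from="X">, and </p> -> </par>
    out = re.sub(r'<p id=("[^"]*")>', r'<par from=\1>', out)
    out = out.replace('</p>', '</par>')
    # <s id="X"> -> <s from="X">
    out = re.sub(r'<s id=("[^"]*")>', r'<s from=\1>', out)
    # Drop <name ...> and </name> tags but keep the enclosed text
    out = re.sub(r'</?name\b[^>]*>', '', out)
    return out


fragment = (
    '<p id="Oen.1.1.1"><s id="Oen.1.1.1.2">'
    '<name type=person>Winston Smith</name>, his chin nuzzled'
    '</s></p>'
)
print(simplify_to_cesana(fragment))
```

The from attribute simply carries over the id of the corresponding cesDoc element, so that each cesAna paragraph and sentence can be traced back to its source.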

At the S level the documents have been tokenised according to the lexical resources of the language, and the tokens are encoded as TOKen elements. Tokens are either 'normal' words, compounds, separable parts of words ('clitics'), or punctuation marks. They are distinguished by the value of the token's TYPE attribute. The values used are WORD for words, and PUNCT for punctuation marks. The word or punctuation mark is contained in the ORTH element. The punctuation tokens are annotated with (unambiguous) corpus tags, which are identical across the languages of MULTEXT-East; for a description of the tags used, see the MULTEXT-East Deliverable D1.1 F, Section 2.5.1. The following example illustrates this markup:

 <chunkList type=TEXT lang=en>
  <chunk type=BODY lang=en>
   <par from='Oen.1.1.1'>               
    <s from='Oen.1.1.1.1'>              
     <tok type=WORD><orth>It</orth></tok>
     <tok type=WORD><orth>was</orth></tok>
     <tok type=WORD><orth>a</orth></tok>
     <tok type=WORD><orth>bright</orth></tok>
     <tok type=WORD><orth>cold</orth></tok>
     <tok type=WORD><orth>day</orth></tok>
     <tok type=WORD><orth>in</orth></tok>
     <tok type=WORD><orth>April</orth></tok>
     <tok type=PUNCT><orth>,</orth><ctag>COMMA</ctag></tok>
     <tok type=WORD><orth>and</orth></tok>
     <tok type=WORD><orth>the</orth></tok>
     <tok type=WORD><orth>clocks</orth></tok>
     <tok type=WORD><orth>were</orth></tok>
     <tok type=WORD><orth>striking</orth></tok>
     <tok type=WORD><orth>thirteen</orth></tok>
     <tok type=PUNCT><orth>.</orth><ctag>PERIOD</ctag></tok>
    </s>
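
As a rough illustration of this markup, the following Python sketch renders a sentence as a sequence of TOK elements. It is not the project's tokeniser: the regular-expression split and the function name to_tok_elements are assumptions, compounds and clitics are not modelled, and only the two punctuation tags attested in the example above (COMMA and PERIOD) are known to it.

```python
import re

# Only these two punctuation tags are attested in the example above;
# the full inventory is given in Deliverable D1.1 F, Section 2.5.1.
PUNCT_CTAGS = {",": "COMMA", ".": "PERIOD"}


def to_tok_elements(sentence: str) -> list[str]:
    """Naive word/punctuation split; no compounds or clitics are handled."""
    lines = []
    for piece in re.findall(r"\w+|[^\w\s]", sentence):
        if piece in PUNCT_CTAGS:
            lines.append(
                f"<tok type=PUNCT><orth>{piece}</orth>"
                f"<ctag>{PUNCT_CTAGS[piece]}</ctag></tok>"
            )
        elif re.fullmatch(r"\w+", piece):
            lines.append(f"<tok type=WORD><orth>{piece}</orth></tok>")
        else:
            # Refuse to guess a corpus tag that the example does not show.
            raise ValueError(f"no corpus tag known for {piece!r}")
    return lines


sentence = ("It was a bright cold day in April, "
            "and the clocks were striking thirteen.")
for line in to_tok_elements(sentence):
    print(line)
```

For this sentence the sketch yields fourteen WORD tokens and two PUNCT tokens, matching the element sequence shown above.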

The word tokens are annotated both with ambiguous lexical information, and with context-dependent, disambiguated information. The former is contained in the <lex> elements of the token, the latter in the <disamb> element(s). Both elements contain the <base> (lemma) of the token, its morphosyntactic description <msd>, and (depending on the language) its corpus tag, <ctag>, as illustrated in the following example:

     <tok type=WORD>
      <orth>glass</orth>
      <disamb><base>glass</base><msd>Afp</msd><ctag>ADJE</ctag></disamb>
      <lex><base>glass</base><msd>Afp</msd><ctag>ADJE</ctag></lex>
      <lex><base>glass</base><msd>Ncns</msd><ctag>NN</ctag></lex>
     </tok>
     <tok type=WORD>
      <orth>doors</orth>
      <disamb><base>door</base><msd>Ncnp</msd><ctag>NNS</ctag></disamb>
      <lex><base>door</base><msd>Ncnp</msd><ctag>NNS</ctag></lex>
     </tok>

It should be noted that the correct annotation is given both in the <disamb> element and in (one of) the <lex> elements. The <lex> elements of a token thus represent its ambiguity class. In general, however, there may be more than one <disamb> element for a token, in cases where the tagger or the human annotator could not decide how to disambiguate. In this case each <disamb> element appears among the <lex> elements as well.
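
The property described here, namely that every <disamb> annotation of a token also occurs among its <lex> annotations, can be checked mechanically. The sketch below is one possible way to do so in Python. It uses the standard library's lenient html.parser as a stand-in for an SGML parser, because the examples use unquoted attribute values that a strict XML parser would reject; the class name TokReader and the function name disamb_within_lex are hypothetical.

```python
from html.parser import HTMLParser


class TokReader(HTMLParser):
    """Collect the base/msd/ctag values of each <disamb> and <lex> per <tok>."""

    def __init__(self):
        super().__init__()
        self.tokens = []      # one dict per <tok>
        self._reading = None  # "disamb" or "lex" while inside one
        self._field = None    # "base", "msd" or "ctag" while inside one

    def handle_starttag(self, tag, attrs):
        if tag == "tok":
            self.tokens.append({"disamb": [], "lex": []})
        elif tag in ("disamb", "lex"):
            self._reading = tag
            self.tokens[-1][tag].append({})
        elif tag in ("base", "msd", "ctag") and self._reading:
            self._field = tag

    def handle_endtag(self, tag):
        if tag in ("disamb", "lex"):
            self._reading = None
        elif tag == self._field:
            self._field = None

    def handle_data(self, data):
        if self._reading and self._field:
            self.tokens[-1][self._reading][-1][self._field] = data


def disamb_within_lex(sgml: str) -> bool:
    """True if every <disamb> annotation also appears as a <lex> annotation."""
    reader = TokReader()
    reader.feed(sgml)
    reader.close()
    return all(d in tok["lex"] for tok in reader.tokens for d in tok["disamb"])


sample = """
<tok type=WORD>
 <orth>glass</orth>
 <disamb><base>glass</base><msd>Afp</msd><ctag>ADJE</ctag></disamb>
 <lex><base>glass</base><msd>Afp</msd><ctag>ADJE</ctag></lex>
 <lex><base>glass</base><msd>Ncns</msd><ctag>NN</ctag></lex>
</tok>
"""
print(disamb_within_lex(sample))
```

Punctuation tokens carry their <ctag> directly under <tok>, outside any <lex> or <disamb>, so the sketch ignores them.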

GI           EN       BG       CS       ET       HU       RO       SL
par       1,286    1,322    1,297    1,266    1,303    1,343    1,288
s         6,701    6,682    6,751    6,478    6,768    6,521    6,689
tok     118,102  101,173  100,358   94,906   98,426  118,063  107,770
orth    118,102  101,173  100,358   94,906   98,426  118,063  107,770
disamb  187,526   86,020   79,862   75,433   80,705  101,508   90,792
lex     214,404  156,002  214,368  147,542  111,945  189,695  187,562
base    401,930  242,022  294,230  222,975  192,650  291,203  278,354
msd     401,930  156,002  294,230  222,975  192,650  291,203  278,354
ctag    416,035  257,175   20,496   94,906   98,426  307,758   16,978

Tag usage in Orwell's "1984"
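
Some rows of this table can be cross-checked against the markup described earlier: each token holds one <orth>, and each <lex> and <disamb> holds one <base> and one <msd>, so one would expect orth = tok and base = msd = disamb + lex. The Python sketch below performs that arithmetic on the figures as printed. The expected equalities are an inference from the description of the markup, not a statement made in the deliverable, and the names LANGS, COUNTS and mismatches are introduced here for illustration only.

```python
LANGS = ["EN", "BG", "CS", "ET", "HU", "RO", "SL"]

# Figures copied from the table above, one list entry per language.
COUNTS = {
    "tok":    [118102, 101173, 100358, 94906, 98426, 118063, 107770],
    "orth":   [118102, 101173, 100358, 94906, 98426, 118063, 107770],
    "disamb": [187526, 86020, 79862, 75433, 80705, 101508, 90792],
    "lex":    [214404, 156002, 214368, 147542, 111945, 189695, 187562],
    "base":   [401930, 242022, 294230, 222975, 192650, 291203, 278354],
    "msd":    [401930, 156002, 294230, 222975, 192650, 291203, 278354],
}


def mismatches(expected, actual):
    """Return the languages for which the two rows disagree."""
    return [lang for lang, e, a in zip(LANGS, expected, actual) if e != a]


disamb_plus_lex = [d + l for d, l in zip(COUNTS["disamb"], COUNTS["lex"])]

print("orth vs tok:         ", mismatches(COUNTS["tok"], COUNTS["orth"]))
print("base vs disamb + lex:", mismatches(disamb_plus_lex, COUNTS["base"]))
print("msd vs disamb + lex: ", mismatches(disamb_plus_lex, COUNTS["msd"]))
```

On the printed figures the first two checks report no mismatches, while the third reports BG, where the msd count (156,002) equals the lex count rather than the base count. The table itself gives no explanation for this difference.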

As can be seen, the cesAna documents produced in the project are maximal both in terms of the data they contain (ambiguity classes) and in terms of annotation (annotations as elements rather than attribute values, no tag minimisation). As the documents are furthermore encoded with SGML entities rather than the 8-bit ISO character sets, the resulting files are rather large. However, the intention is to provide these resources in a form that is suitable for interchange and as self-contained as possible.

The next sections give the <cesHeader> of the seven cesAna documents.



 