In addition to the cesDoc encoding, the Orwell corpus is also available as a tokenised and morphosyntactically tagged cesAna document. For more information on the morphosyntactic descriptions (MSDs) with which the text was annotated, see the MULTEXT-East Deliverable D1.1 F. The lexica used in the annotation process are described in Deliverable D1.2. The cesAna DTD is explained in the CES documentation.
To arrive at the tokenised and tagged cesAna Orwell the following steps have been performed:
To explain the structure of the final documents, first consider a fragment of the English cesDoc Orwell:
<!DOCTYPE text PUBLIC "-//CES//DTD cesDoc//EN">
<text>
<body lang=en id=Oen>
<div id="Oen.1" type=part n=1>
<div id="Oen.1.1" type=chapter n=1>
<p id="Oen.1.1.1">
<s id="Oen.1.1.1.1">
It was a bright cold day in April, and the clocks were striking thirteen.
</s>
<s id="Oen.1.1.1.2">
<name type=person>Winston Smith</name>, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of <name type=place>Victory Mansions</name>, though not quickly enough to prevent a swirl of gritty dust from entering along with him.
</s>
</p>
...
The gross document structure of a cesDoc document is different from the cesAna one. In the Orwell corpus the following relations exist between the two:
The following is an example of the cesAna derived from the above cesDoc, marked up according to these conventions:
<!DOCTYPE cesAna PUBLIC "-//CES//DTD cesAna//EN">
<cesAna version="4.6">
<chunkList type=TEXT>
<chunk type=BODY>
<par from="Oen.1.1.1">
<s from="Oen.1.1.1.1">
It was a bright cold day in April, and the clocks were striking thirteen.
</s>
<s from="Oen.1.1.1.2">
Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.
</s>
</par>
...
At the S level the documents have been tokenised according to the lexical resources of the language, and the tokens are encoded as TOKen elements. Tokens are either 'normal' words, compounds, separable parts of words ('clitics'), or punctuation marks. They are distinguished by the value of the token's TYPE attribute. The values used are WORD for words and PUNCT for punctuation marks. The word or punctuation mark itself is contained in the ORTH element. The punctuation tokens are annotated with (unambiguous) corpus tags, which are identical across the languages of MULTEXT-East; for a description of the tags used, see the MULTEXT-East Deliverable D1.1 F, Section 2.5.1. The following example illustrates this markup:
<chunkList type=TEXT lang=en>
<chunk type=BODY lang=en>
<par from='Oen.1.1.1'>
<s from='Oen.1.1.1.1'>
<tok type=WORD><orth>It</orth></tok>
<tok type=WORD><orth>was</orth></tok>
<tok type=WORD><orth>a</orth></tok>
<tok type=WORD><orth>bright</orth></tok>
<tok type=WORD><orth>cold</orth></tok>
<tok type=WORD><orth>day</orth></tok>
<tok type=WORD><orth>in</orth></tok>
<tok type=WORD><orth>April</orth></tok>
<tok type=PUNCT><orth>,</orth><ctag>COMMA</ctag></tok>
<tok type=WORD><orth>and</orth></tok>
<tok type=WORD><orth>the</orth></tok>
<tok type=WORD><orth>clocks</orth></tok>
<tok type=WORD><orth>were</orth></tok>
<tok type=WORD><orth>striking</orth></tok>
<tok type=WORD><orth>thirteen</orth></tok>
<tok type=PUNCT><orth>.</orth><ctag>PERIOD</ctag></tok>
</s>
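As a rough illustration of how token markup of this kind might be read, the following Python sketch extracts the TYPE, the ORTH content and, for punctuation, the corpus tag of each token. It is not part of the MULTEXT-East tools: the name `iter_tokens` and the regular expressions are hypothetical, and the sketch assumes that each token is written out with explicit start and end tags, as in the example above (no tag minimisation).

```python
import re

# Hypothetical helper, not part of the MULTEXT-East distribution.
# Assumes fully tagged <tok>...</tok> elements with an optionally quoted TYPE value.
TOK_RE = re.compile(
    r"<tok\s+type=[\"']?(?P<type>\w+)[\"']?\s*>\s*"
    r"<orth>(?P<orth>.*?)</orth>(?P<rest>.*?)</tok>",
    re.DOTALL,
)
CTAG_RE = re.compile(r"<ctag>(.*?)</ctag>", re.DOTALL)


def iter_tokens(sgml):
    """Yield (type, orth, ctag) triples; ctag is only reported for PUNCT tokens."""
    for match in TOK_RE.finditer(sgml):
        ctag = None
        if match.group("type") == "PUNCT":
            found = CTAG_RE.search(match.group("rest"))
            ctag = found.group(1) if found else None
        yield match.group("type"), match.group("orth"), ctag


sample = """
<tok type=WORD><orth>in</orth></tok>
<tok type=WORD><orth>April</orth></tok>
<tok type=PUNCT><orth>,</orth><ctag>COMMA</ctag></tok>
"""
tokens = list(iter_tokens(sample))
```

A regular expression is used here only because the unquoted attribute values of the SGML examples are not well-formed XML; with a real SGML parser and the cesAna DTD the same information could be obtained more robustly.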
The word tokens are annotated both with ambiguous lexical information, and with context-dependent, disambiguated information. The former is contained in the <lex> elements of the token, the latter in the <disamb> element(s). Both elements contain the <base> (lemma) of the token, its morphosyntactic description <msd>, and (depending on the language) its corpus tag, <ctag>, as illustrated in the following example:
<tok type=WORD>
<orth>glass</orth>
<disamb><base>glass</base><msd>Afp</msd><ctag>ADJE</ctag></disamb>
<lex><base>glass</base><msd>Afp</msd><ctag>ADJE</ctag></lex>
<lex><base>glass</base><msd>Ncns</msd><ctag>NN</ctag></lex>
</tok>
<tok type=WORD>
<orth>doors</orth>
<disamb><base>door</base><msd>Ncnp</msd><ctag>NNS</ctag></disamb>
<lex><base>door</base><msd>Ncnp</msd><ctag>NNS</ctag></lex>
</tok>
Note that the correct annotation is given both in the <disamb> element and in (one of) the <lex> elements. The <lex> elements of a token thus represent its ambiguity class. In general, however, there may be more than one <disamb> element for a token, in cases where the tagger or human annotator could not decide how to disambiguate. In this case each <disamb> element appears among the <lex> elements as well.
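The relation between <disamb> and <lex> described here can be checked mechanically. The Python sketch below collects the (base, msd, ctag) triples of one token and tests whether every disambiguated analysis also occurs in the ambiguity class. The names `analyses` and `disamb_in_ambiguity_class` are hypothetical, and the sketch assumes that <base>, <msd> and the optional <ctag> appear in that order, as in the example above.

```python
import re

# Hypothetical helpers, not part of the MULTEXT-East distribution.
# Assumes <base>, <msd> and an optional <ctag> occur in this order.
ANALYSIS_RE = re.compile(
    r"<(?P<kind>disamb|lex)>\s*<base>(?P<base>.*?)</base>\s*<msd>(?P<msd>.*?)</msd>"
    r"(?:\s*<ctag>(?P<ctag>.*?)</ctag>)?\s*</(?P=kind)>",
    re.DOTALL,
)


def analyses(tok_sgml):
    """Return (disamb, lex): two lists of (base, msd, ctag) triples for one <tok>."""
    disamb, lex = [], []
    for match in ANALYSIS_RE.finditer(tok_sgml):
        triple = (match.group("base"), match.group("msd"), match.group("ctag"))
        (disamb if match.group("kind") == "disamb" else lex).append(triple)
    return disamb, lex


def disamb_in_ambiguity_class(tok_sgml):
    """True if every <disamb> analysis also appears among the <lex> analyses."""
    disamb, lex = analyses(tok_sgml)
    return all(triple in lex for triple in disamb)


glass = """<tok type=WORD> <orth>glass</orth>
<disamb><base>glass</base><msd>Afp</msd><ctag>ADJE</ctag></disamb>
<lex><base>glass</base><msd>Afp</msd><ctag>ADJE</ctag></lex>
<lex><base>glass</base><msd>Ncns</msd><ctag>NN</ctag></lex>
</tok>"""
disamb, lex = analyses(glass)
ok = disamb_in_ambiguity_class(glass)
```

For the "glass" token the ambiguity class has two members (adjective and noun), and the single <disamb> analysis is one of them, so the check succeeds.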
GI | EN | BG | CS | ET | HU | RO | SL |
---|---|---|---|---|---|---|---|
par | 1,286 | 1,322 | 1,297 | 1,266 | 1,303 | 1,343 | 1,288 |
s | 6,701 | 6,682 | 6,751 | 6,478 | 6,768 | 6,521 | 6,689 |
tok | 118,102 | 101,173 | 100,358 | 94,906 | 98,426 | 118,063 | 107,770 |
orth | 118,102 | 101,173 | 100,358 | 94,906 | 98,426 | 118,063 | 107,770 |
disamb | 187,526 | 86,020 | 79,862 | 75,433 | 80,705 | 101,508 | 90,792 |
lex | 214,404 | 156,002 | 214,368 | 147,542 | 111,945 | 189,695 | 187,562 |
base | 401,930 | 242,022 | 294,230 | 222,975 | 192,650 | 291,203 | 278,354 |
msd | 401,930 | 156,002 | 294,230 | 222,975 | 192,650 | 291,203 | 278,354 |
ctag | 416,035 | 257,175 | 20,496 | 94,906 | 98,426 | 307,758 | 16,978 |
Tag usage in Orwell's "1984"
As can be seen, the cesAna documents produced in the project are maximal in terms of contained data (ambiguity classes) and annotation (annotations as elements rather than attribute values, no tag minimisation). As the documents are furthermore encoded with SGML entities rather than 8-bit ISO character sets, the resulting files are rather large. However, the intention is to provide these resources for interchange, in a form that is as self-contained as possible.
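Element counts of the kind given in the table above could be obtained by counting start tags per generic identifier. The following Python sketch is one way to do this; it is not the procedure actually used in the project. The name `count_elements` is hypothetical, and the sketch assumes that no tag minimisation is used and that literal `<` characters occur only in markup, since the text itself is encoded with SGML entities.

```python
import re
from collections import Counter

# Hypothetical helper, not part of the MULTEXT-East distribution.
# Matches start tags only: end tags begin with "</" and declarations with "<!".
START_TAG_RE = re.compile(r"<([A-Za-z][\w.-]*)(?=[\s>])")


def count_elements(sgml):
    """Count start tags by generic identifier (case-insensitive, as in SGML)."""
    return Counter(name.lower() for name in START_TAG_RE.findall(sgml))


sample = (
    "<s from='Oen.1.1.1.1'> "
    "<tok type=WORD><orth>It</orth></tok> "
    "<tok type=PUNCT><orth>.</orth><ctag>PERIOD</ctag></tok> "
    "</s>"
)
counts = count_elements(sample)
```

In this small sample the `tok` and `orth` counts coincide, as they do in every column of the table, since each token contains exactly one ORTH element.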
The next sections give the <cesHeader> of the seven cesAna documents.