Next: Morphosyntactic Tagging Up: Sentence Alignment Previous: Sentence Alignment

Corpus Encoding

Each of the six pairwise alignments with English has been encoded as a separate SGML document, with the <cesAlign> root element, and stored in a separate file.

All system identifiers (i.e. filenames) are encapsulated in the MULTEXT-East catalog, which is structured according to the SGML Open Technical Resolution 9401:1997. There the alignment documents are given the following PUBLIC identifiers:

     -//MTE//DOCUMENT ALIGN 1984//BG EN
     -//MTE//DOCUMENT ALIGN 1984//CS EN
     -//MTE//DOCUMENT ALIGN 1984//ET EN
     -//MTE//DOCUMENT ALIGN 1984//HU EN
     -//MTE//DOCUMENT ALIGN 1984//RO EN
     -//MTE//DOCUMENT ALIGN 1984//SL EN

It should be noted that the two-language fields of the above identifiers (e.g. BG EN) may not be valid, as SGML OTR 9401:1997 makes no provision for multilingual documents.
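As an illustration of how the identifiers above are built, the following minimal Python sketch splits one of them into the fields of an SGML formal public identifier (owner, public text class, description, language). The function name `split_public_id` and the returned dictionary keys are assumptions made for this example only; they are not part of the MULTEXT-East distribution.

```python
def split_public_id(fpi):
    """Split an SGML formal public identifier on its '//' field separators."""
    prefix, owner, text, language = fpi.split("//")
    # The third field is the public text class followed by a description.
    text_class, _, description = text.partition(" ")
    return {
        "registered": prefix,          # "-" marks an unregistered owner
        "owner": owner,
        "class": text_class,
        "description": description,
        "languages": language.split(),  # two codes here, e.g. SL and EN
    }


info = split_public_id("-//MTE//DOCUMENT ALIGN 1984//SL EN")
print(info["owner"], info["class"], info["languages"])
```

The language field is split on whitespace only because these particular identifiers carry two language codes; a standard identifier would hold a single code there.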

The <cesAlign> documents do not contain the primary data, but only links to S-level elements, expressed as pairs of ID references to the parallel S-units of the two aligned cesDoc documents.

The MULTEXT-East alignment documents are structured as follows. They currently do not have a header. The root element is <cesAlign>, which contains a <linkList> element. This element can contain any number of link group (<linkGrp>) elements. In MULTEXT-East each document has only one link group, of type=body, which encompasses the complete bodies of the two aligned documents. The link group contains the actual links between the S-elements; each <link> gives its ID references as the value of its xtargets attribute. This value is composed of two sequences of IDs separated by a semicolon, and the IDs within each sequence are separated by spaces.
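The attribute value format just described can be decoded with a few lines of Python. This is only a sketch under the stated format (two space-separated ID sequences around a single semicolon); the function name `parse_xtargets` is hypothetical.

```python
def parse_xtargets(value):
    """Return (source_ids, target_ids) from a link's attribute value."""
    source, sep, target = value.partition(";")
    if not sep:
        raise ValueError("missing ';' separator in %r" % value)
    # str.split() with no argument drops the padding spaces around ';'
    # and yields an empty list for an empty sequence.
    return source.split(), target.split()


print(parse_xtargets("Osl1.2 Osl1.3 ; Oen1.2"))
print(parse_xtargets("Osl1.4 ; "))
```

An empty list on either side corresponds to a sentence that has no counterpart in the other language.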

The following hypothetical Slovene-English Orwell alignment illustrates the overall structure of a MULTEXT-East alignment document; each link shows one of the possible alignment types (one, many, zero):

<!DOCTYPE cesAlign PUBLIC "-//CES//DTD cesAlign//EN">
<cesAlign version="4.1">
  <linkList id="Oslen">
    <linkGrp id="Oslen.1" type="body" targtype="s" domains="Osl Oen">
      <link xtargets="Osl1.1 ; Oen1.1">
      <link xtargets="Osl1.2 Osl1.3 ; Oen1.2">
      <link xtargets="Osl1.4 ; ">
    </linkGrp>
  </linkList>
</cesAlign>

As can be seen, the only link group in the link list is of type body, its target type is s, and its domains are the Slovene and English Orwell. The first link represents a 1-1 alignment, the second a 2-1 alignment, and the third a 1-0 alignment.
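The alignment types of the three links can also be derived mechanically. The sketch below uses a regular expression rather than an XML parser, on the assumption that the <link> elements are written SGML-style without end tags, which a strict XML parser would reject. The names `SAMPLE`, `LINK_RE` and `alignment_types` are hypothetical and the sample is a shortened copy of a link group like the one above.

```python
import re

# Shortened, hypothetical link group in the style of the example.
SAMPLE = """
<linkGrp id="Oslen.1" type="body" targtype="s" domains="Osl Oen">
  <link xtargets="Osl1.1 ; Oen1.1">
  <link xtargets="Osl1.2 Osl1.3 ; Oen1.2">
  <link xtargets="Osl1.4 ; ">
</linkGrp>
"""

# Matches <link ...> start tags only; <linkGrp> and <linkList> do not
# match because "link" must be followed by whitespace.
LINK_RE = re.compile(r'<link\s+xtargets="([^"]*)"', re.IGNORECASE)


def alignment_types(sgml_text):
    """Return one 'n-m' label per link, counting IDs on each side of ';'."""
    labels = []
    for value in LINK_RE.findall(sgml_text):
        source, _, target = value.partition(";")
        labels.append("%d-%d" % (len(source.split()), len(target.split())))
    return labels


print(alignment_types(SAMPLE))  # ['1-1', '2-1', '1-0']
```

Counting the labels over a whole alignment file gives a quick profile of how often sentences were merged, split or left untranslated.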

To simplify ID reference parsing, each ID in the first sequence is followed by exactly one space, while each ID in the second sequence is preceded by one space.


Multext-East