Practical Presentation of a "Vanilla" Aligner*

Pernilla Danielsson and Daniel Ridings
Språkbanken
Institutionen för svenska språket
Göteborgs universitet
S-412 98 Göteborg SWEDEN
GU-ISS-97-2
ISSN 1401-5919
pedant@svenska.gu.se

Introduction

In our presentation we are going to concentrate on plain, "vanilla", tools for aligning and presenting texts together with their translations. Our criterion for designating these tools as "plain" is that they can, in many cases, be used on a standard IBM-compatible PC with a 386 processor or higher, and that they are free. This is a conscious attempt on our part to continue the discussion of public domain generic tools by Tomaz Erjavec [3].

We will be discussing the following aspects of aligning with Church and Gale's widespread implementation of the Dynamic Time Warping algorithm from 1993 [4]:

The relevant tools are found after each item above. All of them are readily accessible for academic users and many of them are included with operating systems.

Preprocessing: caveats with SGML

There is at least one good aligner that is SGML compatible [1], but the Church and Gale program is not. This program wants to work with basically two things: paragraphs and sentences. One rapidly feels uneasy about calling things sentences, so we prefer to refer to "chunks," that is, units that a translator is liable to translate in one sweep. Such a unit could be a heading, which may or may not be a sentence, or a plain sentence within a paragraph. Most of our work deals with technical literature and official texts with many "itemized" paragraphs such as the one above. It is less than satisfying to set out sentence boundaries in such material, and a translator is not likely to translate a whole itemized paragraph in one sweep, but will take it piece by piece.

In the Republic there is a very special problem that encoders should be aware of. All Greek editions of Plato print numbers in the margin with letters of the alphabet, usually "a" to "e", in between. These numbers are what we use to refer to sections of Plato's work, much the same way as one refers to the Bible by book, chapter and verse. They are standardized and accepted as reference by all who work with Plato.

If we take a closer look at the numbers and try to figure out what they really stand for it will not be immediately obvious. They do not occur at paragraph breaks, so they are not "paragraphs", nor do they occur at page breaks with any perceptible regularity. They are simply a remnant of publishing history. The first printed edition of Plato was by a French scholar, Stephanus. The numbering in modern editions of Plato reflects the page breaks in Stephanus' edition. These page breaks could and did occur in the middle of paragraphs, sentences and dialogues. They reflect the structure of the first printed book but not the structure of the text being printed.

Unfortunately few editors of critical editions have ever seen this editio princeps and fewer, if any, translators have. Why is this important? There is no guarantee that the numbers in editions can be placed more accurately than on the level of a line of printed text, and the lines in different editions do not correspond to each other. Therefore, the "scope" of the numbers is imprecise. In addition, they do not reflect the structure of the text itself, since they do not necessarily fall on any particular linguistic boundary. It is well known that SGML has problems with non-hierarchic units. There will therefore be a need for a consensus among the TELRI participants: someone will have to decide exactly where the pages start and end.

As already mentioned, Church & Gale's program is highly dependent upon the units paragraph and sentence. Paragraphs are not found in the Greek manuscripts from which modern critical editions are derived, so the division into paragraphs found in the various editions, and thereby in the various translations, often reflects individual tastes and the typographical fashions current in the various countries.

The most expedient solution would probably be that the partners agree upon one text, in one language that everyone can understand, to serve as the text containing the canonical "structure."

Conversion

Besides requiring paragraphs and sentences this aligner places other demands as well. It requires a little more preparation for individual texts, but is, on the other hand, one of the very few language independent aligners freely available. Other aligners that use individual words to find correct links between two texts are of necessity dependent upon linguistic facts about both languages. This aligner assumes that a translation of a unit in one language, for example a sentence, will be represented by a unit that is approximately of the same length in another language. "Length," in this context, is simply a character count, not a word count. This is an assumption that proves true in a remarkable number of cases.
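The length assumption can be made concrete with a small sketch. It is written in Python for brevity and is not taken from the aligner itself: the function names are our own, whitespace is assumed not to count towards length, and the two constants are the values reported in the published description [4], which may need adjusting for other language pairs.

```python
import math

# Assumed constants, following the published description of the method:
# expected number of target characters per source character, and the
# variance of that ratio per source character.
C = 1.0
S2 = 6.8


def char_length(chunk):
    """Length of a chunk in characters (assumption: whitespace is ignored)."""
    return sum(1 for ch in chunk if not ch.isspace())


def delta(len1, len2, c=C, s2=S2):
    """Normalised length difference between a chunk and a candidate
    translation. Values near 0 mean the two lengths agree well."""
    if len1 == 0:
        return 0.0 if len2 == 0 else float("inf")
    return (len2 - len1 * c) / math.sqrt(len1 * s2)
```

A chunk and its translation would be compared with delta(char_length(source), char_length(target)); the further the value lies from zero, the less plausible the pairing.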

Church & Gale's implementation works with two input files, one file for each language. The text in each file is divided up so that there is one word or marker per line. That is:

One
word
or
marker
per
line.
.EOS
.EOP
In the above, ".EOS" stands for "end-of-sentence" and ".EOP" stands for "end-of-paragraph." The actual markers used are arbitrary and are provided on the command line when the program is executed, i.e.:
align -D '.EOP' -d '.EOS' republic.en republic.de
This format can be created by calling upon simple programs in UNIX or by simple scripts written in Awk or Perl, both of which are available in the public domain for MS-DOS.* If one is using UNIX, the following command (translate) will do the trick:
tr ' ' '\n' <republic.en.txt >republic.en
That searches for all spaces and replaces them with an end of line character. A simple equivalent in AWK would be:
awk '{for (i=1; i <= NF; i++) print $i}' <republic.en.txt >republic.en
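For readers without these tools at hand, the same conversion can be sketched in Python. The function name is our own. Like the AWK version, it treats any run of whitespace as a single separator, so repeated spaces do not produce empty lines.

```python
def one_word_per_line(text):
    """Return text with every whitespace-separated token on its own line."""
    tokens = text.split()  # splits on any run of spaces, tabs or newlines
    return "\n".join(tokens) + ("\n" if tokens else "")
```

The result can be written to a file such as republic.en in the usual way.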
A simple Perl script that reads through a normalized SGML file and creates input files for the aligner can be found below. The script converts all paragraphs within a <BODY> element to the one word per line format, slices the input into sentences, adds the end of paragraph code and strips away all SGML. It is not intended to be perfect. At the workshop we were encouraged to demonstrate a simple, "vanilla" segmenter and this one is as good a starting point as any.
#!/usr/bin/perl
# A KISS (Keep It Simple and Stupid) version of a
# sentence slicer.  Another Pedantic antic brought to you
# by Pernilla Danielsson and Daniel Ridings.
# Do what you want with it.

while(<>) {
        # Skip headers and everything else and just segment
        # the <BODY> element.
        &GotBody if (/BODY/);
}

sub GotBody {
        # We're not _real_ interested in the BODY itself, but
        # rather the <P>'s in it.
        while (<>) {
                chomp;
                &GotP if (/<P>/);        
                last if (/<\/BODY>/);
        }
}

sub GotP {
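        $OurP = $_;
        # Read in all the text and concatenate all the lines of
        # a paragraph to one single line. A space is put between
        # the lines so that the last word of one line does not
        # run into the first word of the next.
        if (!/<\/P>/) {
                while (<>) {
                        chomp;
                        $OurP = $OurP . " " . $_;
                        last if (/<\/P>/);
                }
        }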
        # Now we have the whole paragraph as one line. Look
        # for sentence ending punctuation marks and stick in
        # an .EOS after each one.
        $OurP =~ s/([.!?])([ "']+)([(A-ZÅÄÖ])/$1$2 .EOS $3/g; 
        # Get rid of SGML
        $OurP =~ s/<[A-Z\/]+>//g;
        # Split up the paragraph into an array of words.
        @OurP = split(/ /, $OurP, 9999);
        # Print out one word per line. The length test is needed
        # in order to get rid of the double spaces resulting
        # from inserting .EOS
        foreach $word (@OurP) {
                print "$word\n" if (length($word));
                $EOStest = $word eq ".EOS";
        }
        # We're finished.
        if ($EOStest eq "1") {
                print ".EOP\n";
        } else {
                print ".EOS\n.EOP\n";
        }
}
These files, with one word per line, need not necessarily be saved and will not be needed again once the alignment has been performed.

Now follows a sample from the Intergovernmental Conference in Turin in order to exemplify the output. The output will be found in a file having the same name as the file for L1 but with the extension ".al":

*** Länk: 1-1 ***
The European Council began its proceedings by exchanging ideas
with Mr Klaus Hänsch, President of the European Parliament,
on the main subjects for discussion at this meeting. .EOS

Der Europäische Rat hat zunächst einen Gedankenaustausch
mit dem Präsidenten des Europäischen Parlaments, Herrn
Klaus Hänsch, über die wichtigsten auf dieser Tagung
zur Erörterung anstehenden Themen geführt. .EOS

The output of the alignment process is first a line providing information about the relationship between the units: 1-1 means one sentence (chunk) was aligned to one sentence in the translation, 2-1 means that two sentences are found in the first language of the pair and they are represented by one sentence in the second language. All in all the possibilities are 1-1, 0-1, 1-0, 2-1, 1-2 and 2-2. It would have been possible to have alignments of 3-1 and 1-3, but that would require removing the possibility for 0-1 and 1-0, since the program would not be able to decide if a sweep of pairs was 3-1 or if it was 2-1 followed by 1-0, etc. After the line with the alignment information there are two lines, one line for each language.*
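How a program can choose among these six link types may be easier to see in a sketch. The following Python function performs a dynamic programming search over the chunk lengths of the two texts. It is only an illustration: the cost used here (absolute difference in length plus a fixed penalty per link type) is a crude stand-in of our own for the probabilistic measure of the real program, and the penalty values are invented for the example.

```python
# Link types as (source chunks, target chunks). The penalties are
# illustrative assumptions, not the values used by the aligner.
LINK_PENALTY = {
    (1, 1): 0, (1, 0): 40, (0, 1): 40,
    (2, 1): 10, (1, 2): 10, (2, 2): 15,
}


def align(src_lens, tgt_lens):
    """Return the cheapest sequence of links as strings such as '2-1'."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    # cost[i][j]: cheapest way to align the first i source chunks with
    # the first j target chunks; back[i][j] remembers the last link used.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in LINK_PENALTY.items():
                if i + di > n or j + dj > m:
                    continue
                len1 = sum(src_lens[i:i + di])
                len2 = sum(tgt_lens[j:j + dj])
                candidate = cost[i][j] + abs(len1 - len2) + penalty
                if candidate < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = candidate
                    back[i + di][j + dj] = (di, dj)
    # Walk back from the end of both texts to recover the links.
    links = []
    i, j = n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        links.append(f"{di}-{dj}")
        i, j = i - di, j - dj
    return links[::-1]
```

With source lengths 50, 30 and 40 and target lengths 82 and 41, the first two source chunks together match the first target chunk far better than either does alone, so the search settles on a 2-1 link followed by a 1-1 link.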

The following excerpt shows what a 1-2 relationship can look like:

*** Länk: 1-2 ***
Finally, the European Council invites the Conference, which should
finalize its work in about one year, to adopt a general
and consistent vision throughout its work :
its aim is to meet the needs and expectations of our citizens,
while advancing the process of European construction and
preparing the Union for its future enlargement. .EOS

Abschließend bittet der Europäische Rat die Konferenz,
die ihre Arbeiten in etwa einem Jahr abschließen sollte, sich
während ihrer gesamten Arbeiten von einer umfassenden und
konsequenten Vision leiten zu lassen. Ihr Ziel ist es, den
Bedürfnissen und Erwartungen unserer Bürger gerecht
zu werden, dabei den Prozeß der europäischen Einigung
voranzubringen und die Union auf ihre künftige Erweiterung
vorzubereiten. .EOS

There is a certain amount of proof-reading that is required after performing the alignment but it is fairly easy to isolate the problem areas.

The original program did not write out the alignment information, nor did it produce one output file, but rather two. Having two separate files to work with instead of only one makes it slightly more inconvenient to check the results and look for possible errors.

In our experience the program gets it right more than 95% of the time. When it does go wrong, it is usually because it settles on a 0-1 alignment (or a 1-0) where the correct alignment would be, for example, a 3-1 or a 1-3. Therefore, by combining the output in one file and by providing each alignment pair with "administrative" information, one can search for all instances of "0-" or "-0" and pinpoint the problem areas. This is a process that typically takes ten or fifteen minutes for a text of around 50 or 60 pages if one language is truly a translation of the other. Problems arise when both texts are translations of yet another language, and in particular when they are one or two generations removed from the original. This happens with documents coming from EU offices that do not have "official" status.
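The search for "0-" and "-0" can be done in any editor, but a few lines of code will list the suspect links directly. The sketch below is in Python and assumes header lines of the form shown in the samples ("*** Länk: 1-1 ***"); the function name is our own.

```python
import re

# Header lines are assumed to look like "*** Länk: 2-1 ***", as in the
# samples. The word before the link type is not checked.
HEADER = re.compile(r"^\*\*\*\s*\S+\s+(\d+)-(\d+)\s*\*\*\*\s*$")


def suspicious_links(lines):
    """Yield (line_number, link) for every link with an empty side."""
    for number, line in enumerate(lines, start=1):
        match = HEADER.match(line)
        if match and "0" in match.groups():
            yield number, f"{match.group(1)}-{match.group(2)}"
```

The argument can be any sequence of lines, for instance an open alignment file, and the line numbers returned tell the proof-reader where to look.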

Postprocessing

There are various ways of dealing with the resulting output file and we will only touch on some of them here. Some are quite involved, though fruitful, and the interested reader can refer to some of our other work [2].

In a text like the Republic that contains only running text, paragraph after paragraph, with no headings or the like, one could easily match up the two full texts using ID attributes on s-units. If one has a series like 1-1, 1-1, 2-1, 1-1 then one knows that the first sentence in one text matches the first in the other (ID=EN1 is matched with ID=DE1), the same goes for the next pair (ID=EN2 and ID=DE2), the third and fourth sentences in English match the third in German, and so on. This is assuming that the alignment text has been converted from SGML with tools like those found in the NSL package from Edinburgh [5] or something similar. This could then result in something like the following, illustrating a 1-1 alignment.

<P ID=TURSE.1>
<SEG ID=SEEN1 DOC=pedant>
<SEG FROM='id (TURSE.1.1)' ID=Se1></SEG>
<SEG FROM='id (TUREN.1.1)' ID=En1></SEG>
</SEG>
</P>
A 2-1 alignment would look like this:
<P ID=MADSE.249>
<SEG ID=321 DOC=pedant>
<SEG FROM='id (MADSE.249.1)' TO='id (MADSE.249.2)' ID=Se321></SEG>
<SEG FROM='id (MADEN.249.1)' ID=En321></SEG>
</SEG>
</P>
Such IDs point back into the original SGML texts in the corpus and can be "resolved" into the relevant languages using the various tools that come with NSL.
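The bookkeeping behind such links, turning a series of link types into pairs of sentence IDs, is simple enough to sketch. The Python function below is only an illustration with names of our own choosing; it numbers the sentences of each text consecutively from 1 and uses the prefixes EN and DE as in the example of running text above.

```python
def pair_ids(links, src_prefix="EN", tgt_prefix="DE"):
    """Turn link types such as '2-1' into lists of sentence IDs.

    Returns a list of (source_ids, target_ids) tuples.
    """
    pairs = []
    src_next, tgt_next = 1, 1  # number of the next unused sentence
    for link in links:
        n_src, n_tgt = (int(part) for part in link.split("-"))
        src_ids = [f"{src_prefix}{k}"
                   for k in range(src_next, src_next + n_src)]
        tgt_ids = [f"{tgt_prefix}{k}"
                   for k in range(tgt_next, tgt_next + n_tgt)]
        pairs.append((src_ids, tgt_ids))
        src_next += n_src
        tgt_next += n_tgt
    return pairs
```

For the series 1-1, 1-1, 2-1, 1-1 the third tuple pairs EN3 and EN4 with DE3, and the fourth pairs EN5 with DE4.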

Accessing the results

We have already touched upon accessing the data provided by the alignment process in the last section. Those methods can be fruitful but would require more time than we have available.

Another way of accessing information is to use yet another free and readily available environment, namely, the WWW. This is, in fact, one of the ways we make some of our own material accessible to others at the faculty and to others outside the university.

We use a combination of "FORMS" for WWW, Perl for cgi-scripts and a free (for academic purposes) "mini" SQL database called mSQL (Mini SQL, see http://Hughes.com.au). The process is fairly simple and automatic.

First we have our "givens": a file consisting of a series of three lines (alignment info, L1 sample and L2 sample). We read through this file and save indexing information. Before we read the first three lines we know we are at position 0 (zero) of the file. We read the three lines and tokenize the input; we know we can find them again if we save the byte position (zero) of the file together with the tokenized input in database tables. We have one table for the unique words that we search on. Besides some administrative information, this table contains the orthographical word and a key into another table consisting of one row for every occurrence of the orthographical word. This second table also contains the byte position in the text file, which is the one we saved in the tokenization process. The tables appear as follows:

$ relshow pedant

Database = pedant

Table    = Swedish_graph

 +-----------------+----------+--------+----------+-----+
 |     Field       |   Type   | Length | Not Null | Key |
 +-----------------+----------+--------+----------+-----+
 | id              | int      | 4      | N        | N   |
 | sw_graph        | char     | 32     | Y        | Y   |
 +-----------------+----------+--------+----------+-----+

Table    = Swedish_English_occurrences

 +-----------------+----------+--------+----------+-----+
 |     Field       |   Type   | Length | Not Null | Key |
 +-----------------+----------+--------+----------+-----+
 | id              | int      | 4      | Y        | Y   |
 | sw_g            | int      | 4      | N        | N   |
 | sw_tp           | int      | 4      | N        | N   |
 | en_tp           | int      | 4      | N        | N   |
 | ainfo           | char     | 3      | N        | N   |
 +-----------------+----------+--------+----------+-----+
If we get a "hit" in the table containing orthographical words we move on to the table recording the occurrences. There we pick up the byte pointer into the file. We move the file pointer to that position and read three lines. The first line will contain the alignment information. The second line will contain L1, which in the simplest case will also contain the word we were searching for, and the third line will contain L2, where a translation equivalent of the word we looked up will be found.
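The indexing and lookup just described can be sketched without a database. In the Python illustration below an ordinary dictionary stands in for the two mSQL tables, the function names are our own, and the text file is assumed to be encoded in Latin-1 with each record occupying exactly three lines.

```python
import re


def build_index(stream):
    """Index a binary stream of three-line records (info, L1, L2).

    Returns a dict mapping each lower-cased L1 word to the byte
    positions of the records in which it occurs.
    """
    index = {}
    while True:
        position = stream.tell()  # byte position of this record
        record = [stream.readline() for _ in range(3)]
        if not record[2]:  # fewer than three lines left
            break
        l1_text = record[1].decode("latin-1")
        for word in set(re.findall(r"\w+", l1_text.lower())):
            index.setdefault(word, []).append(position)
    return index


def lookup(stream, index, word):
    """Return the (info, L1, L2) records whose L1 line contains word."""
    hits = []
    for position in index.get(word.lower(), []):
        stream.seek(position)  # move the file pointer to the record
        hits.append(tuple(
            stream.readline().decode("latin-1").rstrip("\n")
            for _ in range(3)))
    return hits
```

The stream must be opened in binary mode so that the positions are true byte offsets; a file opened with mode "rb" or an in-memory byte buffer will both do.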

In practice we do this a little differently, for the sake of automation. The difference concerns reading through the original alignment file and saving the byte positions. Such a process would only work the first time. The second time around and thereafter are different. We read three lines, tokenize, move to the end of the text file that all alignments are being saved in, take the byte position and then concatenate the new alignment file to the end of the previously processed texts. That way we can align, run a script to update the database and text file and we are finished.
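The append step can be sketched as follows, again in Python and with names of our own. The function notes the current size of the collected text file, appends the new alignment file to it, and returns the size it noted; the byte position of any record in the new material is then that value plus the record's position within the new file.

```python
import os
import shutil


def append_alignment(corpus_path, new_path):
    """Append new_path to corpus_path and return the byte position
    at which the new material starts in the combined file."""
    base = os.path.getsize(corpus_path) if os.path.exists(corpus_path) else 0
    with open(new_path, "rb") as new, open(corpus_path, "ab") as corpus:
        shutil.copyfileobj(new, corpus)  # copy the bytes unchanged
    return base
```

Because the files are copied byte for byte, positions saved for earlier texts remain valid after each append.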

There is a Perl package for reading the mSQL database. We therefore use Perl both for running our postprocessing tasks and for providing the web page with cgi-scripts. An example can be seen below in figure 1.

  
Figure 1: In this example we have experimented with automatically isolating English translation equivalents.

References

1
P Bonhomme and L Romary.
The Lingua parallel concordancing project: Managing multilingual texts for educational purpose.
In Proceedings of Language Engineering, June 1995.

2
Pernilla Danielsson and Daniel Ridings.
Annotating parallel texts with the NSL library.
1996.

3
Tomaz Erjavec.
Public domain generic tools: an overview.
In Proceedings of the First European TELRI Seminar: Language Resources for Language Technology, pages 37-48, 1996.
15-16 September 1995, Tihany, Hungary.

4
William A. Gale and Kenneth W. Church.
A program for aligning sentences in bilingual corpora.
Computational Linguistics, 19(1):75-102, 1993.

5
Henry Thompson, Steve Finch, and David McKelvie.
The normalised SGML library (NSL).
Multext LRE Project 62-050, The Language Technology Group, November 1995.


Footnotes

...Aligner
Presentation held at the TELRI Workshop in alignment and exploitation of texts in Ljubljana, Feb. 1-2, 1997.

...MS-DOS.
The GNU version of AWK is called GAWK.

...language.
The line breaks in the sample exemplified above have been inserted in order to fit the lines onto the current printed page. The blank line between L1 and L2 has also been added for the present paper.


