Pernilla Danielsson and Daniel Ridings
Språkbanken
Institutionen för svenska språket
Göteborgs universitet
S-412 98 Göteborg SWEDEN
GU-ISS-97-2
ISSN 1401-5919
pedant@svenska.gu.se
We will be discussing the following aspects of aligning with Church and Gale's widespread implementation of the Dynamic Time Warping algorithm from 1993 [4]:
The relevant tools are found after each item above. All of them are readily accessible for academic users and many of them are included with operating systems.
In the Republic there is a very special problem that encoders should be aware of. All Greek editions of Plato print numbers in the margin with letters of the alphabet, usually "a" to "e", in between. These numbers are what we use to refer to sections of Plato's work, in much the same way as one refers to the Bible by book, chapter and verse. They are standardized and accepted as references by all who work with Plato.
If we take a closer look at the numbers and try to figure out what they really stand for, the answer is not immediately obvious. They do not occur at paragraph breaks, so they are not "paragraphs", nor do they occur at page breaks with any perceptible regularity. They are simply a remnant of publishing history. The first printed edition of Plato was by a French scholar, Stephanus. The numbering in modern editions of Plato reflects the page breaks in Stephanus' edition. These page breaks could and did occur in the middle of paragraphs, sentences and dialogues. They reflect the structure of the first printed book but not the structure of the text being printed.
Unfortunately few editors of critical editions have ever seen this editio princeps and fewer translators, if any, have. Why is this important? There is no guarantee that the numbers in editions can be placed more accurately than on the level of a line of printed text. The lines in different editions do not correspond to each other. Therefore, the "scope" of the numbers is imprecise. In addition, they do not reflect the structure of the text itself, since they do not necessarily fall on any particular linguistic boundary. It is well known that SGML has problems with non-hierarchic units. Therefore there will be a need for a consensus among the TELRI participants: someone will have to decide on exactly where the pages start and end.
As already mentioned, Church & Gale's program is highly dependent upon the units paragraph and sentence. Paragraphs are not found in the Greek manuscripts from which modern critical editions are derived, so the division into paragraphs found in the various editions, and thereby in the various translations, often reflects individual tastes and the typographical fashions current in the various countries.
The most expedient solution would probably be that the partners agree upon one text, in one language that everyone can understand, to serve as the text containing the canonical "structure."
Church & Gale's implementation works with two input files, one file for each language. The text in each file is divided up so that there is one word or marker per line. That is:
One
word
or
marker
per
line.
.EOS
.EOP
In the above, ".EOS" stands for "end-of-sentence" and ".EOP" stands for "end-of-paragraph." The actual markers used are arbitrary and are provided on the command line when the program is executed, i.e.:
align -D '.EOP' -d '.EOS' republic.en republic.de
This format can be created with simple programs in UNIX or with simple scripts written in Awk or Perl, both of which are available in the public domain for MS-DOS. If one is using UNIX, the following command (translate) will do the trick:
tr ' ' '\n' <republic.en.txt >republic.en
That replaces every space with an end-of-line character. A simple equivalent in AWK would be:
awk '{for (i=1; i <= NF; i++) print $i}' <republic.en.txt >republic.en
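For readers without tr or Awk at hand, the same conversion can be sketched in Python. This is only an illustration, not part of Church & Gale's distribution; the function name one_token_per_line is our own. Like the Awk version, it splits on any run of whitespace, so double spaces do not produce empty lines.

```python
def one_token_per_line(text):
    """Return the whitespace-separated tokens of text, one per line."""
    # str.split() with no argument splits on any run of whitespace
    # (spaces, tabs, newlines), so no empty tokens are produced.
    tokens = text.split()
    return "".join(token + "\n" for token in tokens)


# Small demonstration on a string instead of republic.en.txt.
print(one_token_per_line("The European  Council began"), end="")
```

To convert a whole file, the function could be applied to the contents of republic.en.txt and the result written to republic.en.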
A simple Perl script that reads through a normalized SGML file and
creates input files for the aligner can be found below.
The script converts all paragraphs within a <BODY> element to the
one word per line format, slices the input into sentences, adds the end
of paragraph code and strips away all SGML. It is not intended to be
perfect. At the workshop we were encouraged to demonstrate a simple,
"vanilla" segmenter and this one is as good a starting point as any.
#!/usr/bin/perl
# A KISS (Keep It Simple and Stupid) version of a
# sentence slicer. Another Pedantic antic brought to you
# by Pernilla Danielsson and Daniel Ridings.
# Do what you want with it.
while (<>) {
    # Skip headers and everything else and just segment
    # the <BODY> element.
    &GotBody if (/BODY/);
}
sub GotBody {
    # We're not _real_ interested in the BODY itself, but
    # rather the <P>'s in it.
    while (<>) {
        chomp;
        &GotP if (/<P>/);
        last if (/<\/BODY>/);
    }
}
sub GotP {
    # Start with the current line. The trailing space keeps its last
    # word from being glued to the first word of the next line.
    $OurP = $_ . " ";
    # Read in all the text and concatenate all the lines of
    # a paragraph to one single line.
    if (!/<\/P>/) {
        while (<>) {
            chomp;
            $OurP = $OurP . $_ . " ";
            last if (/<\/P>/);
        }
    }
    # Now we have the whole paragraph as one line. Look
    # for sentence ending punctuation marks and stick in
    # an .EOS after each one.
    $OurP =~ s/([.!?])([ "']+)([(A-ZÅÄÖ])/$1$2 .EOS $3/g;
    # Get rid of SGML
    $OurP =~ s/<[A-Z\/]+>//g;
    # Split up the paragraph into an array of words.
    @OurP = split(/ /, $OurP, 9999);
    # Print out one word per line. The length test is needed
    # in order to get rid of the double spaces resulting
    # from inserting .EOS
    $EOStest = 0;
    foreach $word (@OurP) {
        next unless (length($word));
        print "$word\n";
        # Remember whether the last word printed was an .EOS marker.
        $EOStest = ($word eq ".EOS");
    }
    # We're finished. Close the last sentence if it is still open.
    if ($EOStest) {
        print ".EOP\n";
    } else {
        print ".EOS\n.EOP\n";
    }
}
These files, with one word per line, need not necessarily be saved and
will not be needed again once the alignment has been performed.
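The core of the segmenter, applied to a single paragraph that has already been joined into one string, can be sketched in Python as follows. This is a hedged re-expression of the same heuristic (sentence-final punctuation, then spaces or quotes, then a capital letter or an opening parenthesis), not a better segmenter; the function name segment_paragraph is our own, and the markers .EOS and .EOP are the ones used in the examples here.

```python
import re

# Sentence-final punctuation, then spaces or quotes, then a capital
# letter or "(" : the same pattern as in the Perl script.
SENTENCE_END = re.compile(r"""([.!?])([ "']+)([(A-ZÅÄÖ])""")
# Upper-case SGML tags without attributes, e.g. <P> or </P>.
SGML_TAG = re.compile(r"<[A-Z/]+>")


def segment_paragraph(paragraph, eos=".EOS", eop=".EOP"):
    """Return the paragraph as a list of words and markers."""
    marked = SENTENCE_END.sub(rf"\1\2 {eos} \3", paragraph)
    stripped = SGML_TAG.sub("", marked)
    tokens = stripped.split()
    # Close the last sentence if it is still open, then the paragraph.
    if not tokens or tokens[-1] != eos:
        tokens.append(eos)
    tokens.append(eop)
    return tokens


print("\n".join(segment_paragraph("<P>It rained. We left.</P>")))
```

As with the Perl script, abbreviations followed by a capitalized word will be split wrongly; that is the price of keeping the heuristic this simple.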
Now follows a sample from the Intergovernmental Conference in Turin in order to exemplify the output. The output will be found in a file having the same name as the file for L1 but with the extension ".al":
*** Länk: 1-1 ***
The European Council began its proceedings by exchanging ideas with Mr Klaus Hänsch, President of the European Parliament, on the main subjects for discussion at this meeting. .EOS
Der Europäische Rat hat zunächst einen Gedankenaustausch mit dem Präsidenten des Europäischen Parlaments, Herrn Klaus Hänsch, über die wichtigsten auf dieser Tagung zur Erörterung anstehenden Themen geführt. .EOS
The output of the alignment process is first a line providing information about the relationship between the units: 1-1 means one sentence (chunk) was aligned to one sentence in the translation, 2-1 means that two sentences are found in the first language of the pair and they are represented by one sentence in the second language. All in all the possibilities are 1-1, 0-1, 1-0, 2-1, 1-2 and 2-2. It would have been possible to have alignments of 3-1 and 1-3, but that would require removing the possibility of 0-1 and 1-0, since the program would not have been able to decide whether a stretch of sentences was 3-1 or 2-1 followed by 1-0, etc. After the line with the alignment information there are two lines, one line for each language.
A sample showing what a 1-2 relationship can look like is given in this excerpt:
*** Länk: 1-2 ***
Finally, the European Council invites the Conference, which should finalize its work in about one year, to adopt a general and consistent vision throughout its work : its aim is to meet the needs and expectations of our citizens, while advancing the process of European construction and preparing the Union for its future enlargement. .EOS
Abschließend bittet der Europäische Rat die Konferenz, die ihre Arbeiten in etwa einem Jahr abschließen sollte, sich während ihrer gesamten Arbeiten von einer umfassenden und konsequenten Vision leiten zu lassen. Ihr Ziel ist es, den Bedürfnissen und Erwartungen unserer Bürger gerecht zu werden, dabei den Prozeß der europäischen Einigung voranzubringen und die Union auf ihre künftige Erweiterung vorzubereiten. .EOS
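The three-line layout described above (one line of alignment information followed by one line per language) is easy to read back into a program. The following Python sketch assumes exactly that layout and assumes that the information line contains the link type as two numbers joined by a hyphen, as in the samples; the function name parse_alignment is our own.

```python
import re


def parse_alignment(lines):
    """Group aligner output into (n_l1, n_l2, l1_text, l2_text) records."""
    lines = [line.rstrip("\n") for line in lines]
    records = []
    # Step through the output three lines at a time.
    for i in range(0, len(lines) - 2, 3):
        header, l1_text, l2_text = lines[i:i + 3]
        match = re.search(r"(\d+)-(\d+)", header)
        if match is None:
            raise ValueError(f"line {i + 1} is not an alignment header: {header!r}")
        records.append((int(match.group(1)), int(match.group(2)), l1_text, l2_text))
    return records


sample = [
    "*** Länk: 1-2 ***\n",
    "Finally, the Council invites the Conference. .EOS\n",
    "Abschließend bittet der Rat die Konferenz. Ihr Ziel ist klar. .EOS\n",
]
for record in parse_alignment(sample):
    print(record)
```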
There is a certain amount of proof-reading that is required after performing the alignment but it is fairly easy to isolate the problem areas.
The original program did not write out the alignment information, nor did it produce one output file, but rather two. Having two separate files to work with instead of only one makes it slightly more inconvenient to check the results and look for possible errors.
In our experience the program gets it right more than 95% of the time. When it does go wrong it is usually when it tries to find a 0-1 alignment (or a 1-0) that should really be a 3-1 or a 1-3, for example. Therefore, by combining the output in one file and by providing each alignment pair with "administrative" information, one can search for all instances of "0-" or "-0" and pinpoint the problem areas. This is a process that typically takes ten or fifteen minutes for a text of around 50 or 60 pages if one language is truly a translation of the other. Problems arise when both texts are translations of yet another language, and in particular when they are one or two generations removed from the original. This happens with documents coming from EU offices that do not have "official" status.
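The search for "0-" and "-0" can of course be done in any editor. As a sketch, the same check in Python might look like this; it assumes that the information lines begin with "***" as in the samples shown here, and the name find_suspect_links is our own.

```python
import re

# A link type with a zero on either side, e.g. 0-1 or 1-0.
SUSPECT = re.compile(r"\b(?:0-\d|\d-0)\b")


def find_suspect_links(lines):
    """Return the 1-based line numbers of headers with a 0-n or n-0 link."""
    hits = []
    for number, line in enumerate(lines, start=1):
        # Assumption: only alignment information lines start with "***".
        if line.startswith("***") and SUSPECT.search(line):
            hits.append(number)
    return hits


output = [
    "*** Länk: 1-1 ***", "First sentence. .EOS", "Erster Satz. .EOS",
    "*** Länk: 0-1 ***", "", "Zweiter Satz. .EOS",
]
print(find_suspect_links(output))
```

The line numbers returned are the places where a proof-reader should start looking.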
In a text like the Republic that contains only running text, paragraph after paragraph, with no headings or the like, one could easily match up the two full texts using ID attributes on s-units. If one has a series like 1-1, 1-1, 2-1, 1-1 then one knows that the first sentence in one text matches the first in the other (ID=EN1 is matched with ID=DE1), the same goes for the next pair (ID=EN2 and ID=DE2), the third and fourth sentences in English match the third in German, and so on. This is assuming that the alignment text has been converted from SGML with tools like those found in the NSL package from Edinburgh [5] or something similar. This could then result in something like the following, illustrating a 1-1 alignment.
<P ID=TURSE.1>
<SEG ID=SEEN1 DOC=pedant>
<SEG FROM='id (TURSE.1.1)' ID=Se1></SEG>
<SEG FROM='id (TUREN.1.1)' ID=En1></SEG>
</SEG>
</P>
A 2-1 alignment would look like this:
<P ID=MADSE.249>
<SEG ID=321 DOC=pedant>
<SEG FROM='id (MADSE.249.1)' TO='id (MADSE.249.2)' ID=Se321></SEG>
<SEG FROM='id (MADEN.249.1)' ID=En321></SEG>
</SEG>
</P>
Such IDs point back into the original SGML texts in the corpus and can be "resolved" into the relevant languages using the various tools that come with NSL.
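The bookkeeping behind matching sentence IDs from a series of link types (1-1, 1-1, 2-1, 1-1 and so on) amounts to keeping one running counter per language. A minimal Python sketch follows; the function name link_ids and the simple "EN1", "DE1" style of ID are illustrative assumptions and do not reproduce the ID scheme of the SGML examples.

```python
def link_ids(links, l1_prefix="EN", l2_prefix="DE"):
    """Turn a series of link types into pairs of sentence ID lists."""
    pairs = []
    next_l1, next_l2 = 1, 1  # number of the next unused sentence in each text
    for link in links:
        n1, n2 = (int(n) for n in link.split("-"))
        l1_ids = [f"{l1_prefix}{next_l1 + k}" for k in range(n1)]
        l2_ids = [f"{l2_prefix}{next_l2 + k}" for k in range(n2)]
        pairs.append((l1_ids, l2_ids))
        # A 0-1 or 1-0 link advances only one of the two counters.
        next_l1 += n1
        next_l2 += n2
    return pairs


for l1_ids, l2_ids in link_ids(["1-1", "1-1", "2-1", "1-1"]):
    print(l1_ids, "<->", l2_ids)
```

Because every later ID depends on the counters, a single wrong link type shifts all following pairs, which is one more reason to proof-read the 0-1 and 1-0 cases first.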
Another way of accessing information is to use yet another free and readily available environment, namely, the WWW. This is, in fact, one of the ways we make some of our own material accessible to others at the faculty and to others outside the university.
We use a combination of "FORMS" for WWW, Perl for cgi-scripts and a free (for academic purposes) "mini" SQL database called mSQL (MiniSQL, see http://Hughes.com.au). The process is fairly simple and automatic.
First we have our "givens": a file consisting of a series of three-line records (alignment info, L1 sample and L2 sample). We read through this file and save indexing information. Before we read the first three lines we know we are at position 0 (zero) of the file. We read the three lines and tokenize the input; we know we can find the lines again if we save the byte position (zero) of the file together with the tokenized input in database tables. We have one table for unique words that we use to search on. Besides some administrative information, this table contains the orthographical word and a key into another table consisting of one row for every occurrence of the orthographical word. This second table, containing a row for every occurrence, also contains the byte position in the text file. This byte position is the one we saved in the tokenization process. The tables appear as follows:
$ relshow pedant
Database = pedant
Table = Swedish_graph
+-----------------+----------+--------+----------+-----+
| Field           | Type     | Length | Not Null | Key |
+-----------------+----------+--------+----------+-----+
| id              | int      | 4      | N        | N   |
| sw_graph        | char     | 32     | Y        | Y   |
+-----------------+----------+--------+----------+-----+
Table = Swedish_English_occurrences
+-----------------+----------+--------+----------+-----+
| Field           | Type     | Length | Not Null | Key |
+-----------------+----------+--------+----------+-----+
| id              | int      | 4      | Y        | Y   |
| sw_g            | int      | 4      | N        | N   |
| sw_tp           | int      | 4      | N        | N   |
| en_tp           | int      | 4      | N        | N   |
| ainfo           | char     | 3      | N        | N   |
+-----------------+----------+--------+----------+-----+
If we get a "hit" in the table containing orthographical words we move on to the table recording the occurrences. There we pick up the byte pointer into the file. We move the file pointer to that position and read three lines. The first line will contain the alignment information. The second line will contain L1, which in the simplest case will also contain the word we were searching for, and the third line will contain L2, where a translation equivalent of the word we looked up will be found.
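The byte-position technique itself does not depend on mSQL. The following Python sketch swaps the two database tables for an in-memory dictionary, purely for illustration; the names build_index and read_record, and the way words are normalized, are our own assumptions. The file is handled in binary mode so that the positions really are byte positions.

```python
import io
from collections import defaultdict


def build_index(f):
    """Map each word of an L1 line to the byte offsets of its records."""
    index = defaultdict(list)
    while True:
        offset = f.tell()            # byte position of the info line
        info = f.readline()
        l1_line = f.readline()
        f.readline()                 # the L2 line is not indexed here
        if not info:                 # end of file
            break
        words = {w.strip('.,;:!?"()').lower()
                 for w in l1_line.decode("utf-8").split() if w != ".EOS"}
        for word in words:
            if word:
                index[word].append(offset)
    return index


def read_record(f, offset):
    """Move to a saved byte position and read the three lines found there."""
    f.seek(offset)
    return tuple(f.readline().decode("utf-8").rstrip("\n") for _ in range(3))


data = ("*** Länk: 1-1 ***\nDet regnar. .EOS\nIt is raining. .EOS\n"
        "*** Länk: 1-1 ***\nVi gick. .EOS\nWe left. .EOS\n").encode("utf-8")
text_file = io.BytesIO(data)         # stands in for the alignment file
index = build_index(text_file)
for position in index["gick"]:
    print(read_record(text_file, position))
```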
In practice we do this a little differently, for the sake of automation. Reading through the original alignment file and saving its byte positions would only work the first time; the second time around, and thereafter, the new positions must follow on from the texts already processed. We therefore read three lines, tokenize, move to the end of the text file in which all alignments are saved, take the byte position there, and then append the new alignment to the end of the previously processed texts. That way we can align, run a script to update the database and text file, and we are finished.
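The incremental variant can be sketched in the same spirit: each new record is written at the end of the accumulated text file, and the byte position at which it lands is what gets stored. Again this is only an illustration under our own assumptions, with a dictionary standing in for the mSQL tables and append_alignment as a hypothetical name.

```python
import io


def append_alignment(master, new_records, index):
    """Append records to the master file and index where each one starts."""
    master.seek(0, io.SEEK_END)      # move to the end of the text file
    for info, l1_line, l2_line in new_records:
        offset = master.tell()       # byte position of this new record
        master.write(f"{info}\n{l1_line}\n{l2_line}\n".encode("utf-8"))
        words = {w.strip('.,;:!?"()').lower()
                 for w in l1_line.split() if w != ".EOS"}
        for word in words:
            if word:
                index.setdefault(word, []).append(offset)
    return index


master = io.BytesIO(b"old\nrecord\nhere\n")   # previously processed text
index = append_alignment(
    master, [("*** Länk: 1-1 ***", "Vi gick. .EOS", "We left. .EOS")], {})
print(index)
```

Since the offsets are taken from the master file rather than from the new alignment file, the same update script can be run after every alignment.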
There is a Perl package for reading the mSQL database. Therefore we
use Perl for running our postprocessing tasks and for providing the
web page with cgi-scripts. An example can be seen below in
figure 1.
[Figure 1]