
Sentence Alignment

COP project 106 MULTEXT-East Deliverable D2.3 F-- Alignment

Each of the six translations of 1984 has been aligned with the English original at the sentence (S) level, and the alignments have been hand-validated. The alignment is not hierarchical: division- and paragraph-level alignments were used in the process of alignment but have not been retained. The S-level elements that have been aligned are the following:

The initial S alignment was performed automatically, with different aligners used for different languages:

With Vanilla (see http://svenska.gu.se/PEDANT/workshop/workshop.html ), the texts were first aligned at the paragraph level; these alignments were checked and, where necessary, corrected. Once the paragraph alignment was correct, the paragraph-level links were taken as 'hard' links and S-level alignment was performed. This was again hand-validated; in addition to alignment errors, the validation often exposed errors of sentence segmentation.

Automatic alignment can produce, in addition to 1-1 links, 2-1, 1-2, 2-2, 0-1, and 1-0 links. Manual verification uncovered a number of other links as well. First, where there was a sequence of 0-1 or 1-0 links, these were (typically) merged into 0-n or n-0 links; such links were due to translators not translating a portion of the text. Second, other link arities were discovered, e.g. 1-6 and 2-4 links. The table below summarizes all the link arities encountered in the six translation-original alignments of MULTEXT-East:

Link  BG-EN  CS-EN  ET-EN  HU-EN  RO-EN  SL-EN
0-1      16     21      2     19     10      3
0-2       -      1      1      3      2      -
0-3       -      -      -      -      2      -
0-4       -      -      -      1      -      -
1-0       -      2      1      1      -      2
1-1    6623   6439   6428   6477   6047   6572
1-2      36     78    100     47    259     53
1-3       -      2      1      -     14      -
1-4       -      -      -      -      -      -
1-5       -      -      -      1      1      1
1-6       -      -      -      1      -      -
2-1      22    110     58    108     85     48
2-2       2      -      3      -      2      -
2-3       -      -      -      -      3      -
2-4       -      -      -      -      1      -
3-1       -      2      2      7      3      -
3-3       -      -      -      -      -      1
4-1       -      1      -      1      -      -

Link arities in "1984" alignment ("-" marks an empty cell)
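The merging of consecutive 0-1 or 1-0 links described above, and the tally of link arities per alignment, can be illustrated with a minimal Python sketch. It is not the procedure or tooling actually used in the project. It assumes that a link is a pair (translation sentence ids, original sentence ids), so that a 0-1 link has an empty translation side; the sentence identifiers and the function names `arity` and `merge_empty_runs` are hypothetical.

```python
from collections import Counter
from itertools import groupby

# Assumption: a link is (translation_ids, original_ids), each a tuple of
# sentence identifiers. The identifiers below are invented for illustration.


def arity(link):
    """Return the (translation, original) arity of a link, e.g. (1, 2)."""
    left, right = link
    return (len(left), len(right))


def run_key(link):
    """Classify a link as a 0-n link, an n-0 link, or neither (None)."""
    left, right = link
    if not left and right:
        return "0-n"
    if left and not right:
        return "n-0"
    return None


def merge_empty_runs(links):
    """Merge each run of consecutive 0-n (or n-0) links into a single link."""
    merged = []
    for key, run in groupby(links, key=run_key):
        run = list(run)
        if key is None:
            # Ordinary links are kept unchanged.
            merged.extend(run)
        elif key == "0-n":
            merged.append(((), tuple(s for _, right in run for s in right)))
        else:
            merged.append((tuple(s for left, _ in run for s in left), ()))
    return merged


links = [
    (("t1",), ("o1",)),        # 1-1
    ((), ("o2",)),             # 0-1
    ((), ("o3",)),             # 0-1, directly follows another 0-1
    (("t2", "t3"), ("o4",)),   # 2-1
    (("t4",), ()),             # 1-0
]

merged = merge_empty_runs(links)
tally = Counter(arity(link) for link in merged)

for (n_trans, n_orig), count in sorted(tally.items()):
    print(f"{n_trans}-{n_orig}: {count}")
```

In this toy input the two adjacent 0-1 links become one 0-2 link, while the isolated 1-0 link is left as it is. Note that the source says such sequences were "typically" merged, so an unconditional merge like this one is a simplification.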

In the following section we give the details of the CES encoding of the alignment documents. As these documents do not contain the aligned sentences directly, an HTML version of the alignments was also prepared. It was produced with the NSL software from LTG; for details on this software see http://www.ltg.ed.ac.uk/software/ .


 