An Experiment in Automatic Bi-Lingual Lexicon Construction from a Parallel Corpus

7th TELRI seminar "Information in Corpora"

Dubrovnik 26-28 September 2002

Tomaž Erjavec
Department of Intelligent Systems
Jozef Stefan Institute
Jamova 39
SI-1000 Ljubljana
Slovenia

Abstract

The IJS-ELAN corpus (Erjavec 2002) contains 1 million words of annotated parallel Slovene-English texts. The corpus is sentence-aligned, and both languages are word-tagged with context-disambiguated morphosyntactic descriptions and lemmas. In the talk we discuss an experiment in automatic bi-lingual lexicon extraction from this corpus. Extracting such lexica is one of the prime uses of parallel corpora: manual construction is an extremely time-consuming process, yet the resource is invaluable for lexicographers, terminologists and translators, as well as for machine translation systems. For the experiment we used two statistics-based programs for the automatic extraction of bi-lingual lexicons from parallel corpora: the Twente software (Hiemstra, 1998) and the PWA system (Tiedemann, 1998). We compare the two programs in terms of availability, ease of use, and the type and quality of the results. We experimented with several different choices of input to the programs, using varying amounts of linguistic information. We compared extraction using the word-forms from the corpus with extraction using lemmas: this normalises the input and abstracts away from the rich inflections of Slovene. Following the lead of Tufis and Barbu (2001), we also restricted the translation lexicon to lexical items of the same part-of-speech, i.e. we made the assumption that a noun is always translated as a noun, a verb as a verb, etc. This again reduces the search space for the algorithms and could thus lead to superior results. Finally, we experimented with taking the whole corpus as input, as opposed to processing corpus components separately. The reasoning here is that different components are likely to contain distinct senses of polysemous words, which will be translated into different target words; for such words there would be no benefit in amalgamating different texts, while the final precision might in fact be lower. Preliminary results show that the precision of the extracted translation lexicon is much improved by utilising lemmas with an identical part-of-speech in the source and target languages; this argues in favour of linguistic pre-processing of the corpus. However, the recall of the system tends to be lower, as it misses out on conversion translations. In the conclusion we discuss this and other findings, as well as current results on extracting translation equivalents of collocations.


1. Introduction

1.1. Overview of the talk

  • The IJS-ELAN corpus
  • TWENTE and PWA
  • Results with word-forms
  • Results with lemmas
  • Comparing part/whole
  • Including PoS
  • Conclusions

1.2. The IJS-ELAN corpus

  • Slovene-English parallel corpus, 2 × 500,000 words
  • IJS contribution to the MLIS ELAN project
  • Contains 15 parallel texts
  • Tokenised, sentence segmented
  • Sentence aligned
  • V2 automatically MSD-tagged and lemmatised
  • Encoded in XML TEI P4
  • Freely available for downloading from http://nl.ijs.si/elan/

1.3. Example segments from the corpus

Slovene segment:

<seg id="anx2.sl.105" corresp="anx2.en.105">
<c ctag=":">-</c> 
<w ana="Aspnsn" lemma="sojin">Sojino</w> 
<w ana="Ncnsn" lemma="olje">olje</w>
<c ctag=",">,</c> 
<w ana="Ncnsn" lemma="olje">olje</w> 
<w ana="Spsg" lemma="iz">iz</w> 
<w ana="Ncmsg" lemma="kikiriki">kikirikija</w>
<c ctag=",">,</c> 
<w ana="Aspfsa" lemma="palmov">palmovo</w>
<c ctag=",">,</c> 
<w ana="Ncfpn">kopre</w>
<c ctag=",">,</c> 
<w ana="Aspnsg" lemma="palmov">palmovega</w> 
<w ana="Ncnsg" lemma="jedro">jedra</w>
<c ctag=",">,</c> 
<w ana="Ncnsl">babassu</w>
<c ctag=",">,</c> 
<w ana="Ncnsn">tungovo</w> 
<w ana="Ccs" lemma="in">in</w> 
<w ana="Ncfsn">oiticica</w> 
<w ana="Ncnsa" lemma="olje">olje</w>
...

1.4. Example segments from the corpus

English segment:

<seg id="anx2.en.105" corresp="anx2.sl.105">
<c ctag=":">-</c> 
<w ana="Ncns" ctag="?UH NP" lemma="soya">Soya</w>
<c ctag=",">,</c> 
<w ana="Ncns" ctag="NN NN" lemma="ground">ground</w> 
<w ana="Ncns" ctag="NN NN" lemma="nut">nut</w>
<c ctag=",">,</c> 
<w ana="Ncns" ctag="NN NN" lemma="palm">palm</w>
<c ctag=",">,</c> 
<w ana="Ncns" ctag="NN NN" lemma="copra">copra</w>
<c ctag=",">,</c> 
<w ana="Ncns" ctag="NN NN" lemma="palm">palm</w> 
<w ana="Ncns" ctag="NN NN" lemma="kernel">kernel</w>
<c ctag=",">,</c> 
<w ana="Ncns" ctag="?FW ???" lemma="babassu">babassu</w>
<c ctag=",">,</c> 
<w ana="Vmpp" ctag="NN NN" lemma="tung">tung</w> 
<w ana="Cc-n" ctag="CC CC" lemma="and">and</w> 
<w ana="Npns" ctag="NN NN" lemma="oiticica">oiticica</w> 
<w ana="Ncns" ctag="NN NN" lemma="oil">oil</w>
...
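
The annotation can be read off such segments with a few lines of code. Below is a minimal Python sketch, assuming the <w ana="..." lemma="..."> markup shown above; tokens() is a hypothetical helper, and the PoS is taken to be the first letter of the MSD in the ana attribute.

    import xml.etree.ElementTree as ET

    def tokens(seg_xml):
        """Yield (word-form, lemma, PoS) triples from one <seg> element."""
        seg = ET.fromstring(seg_xml)
        for w in seg.iter("w"):
            form = (w.text or "").strip()
            lemma = w.get("lemma", form)    # fall back to the form itself
            pos = w.get("ana", "?")[0]      # first MSD letter: N, V, A, S, C, ...
            yield form, lemma, pos

    seg = '<seg><w ana="Ncnsn" lemma="olje">olje</w> <w ana="Spsg" lemma="iz">iz</w></seg>'
    print(list(tokens(seg)))  # [('olje', 'olje', 'N'), ('iz', 'iz', 'S')]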

1.5. Pre-processing the corpus

  • Remove punctuation and numbers
  • All text converted to lower case
  • Remove XML tags
  • Special characters converted to ASCII (×), removed ($, @), or mapped to Latin-2 (čšž)
  • Convert to software-specific format
Example (TWENTE format):

chapter $
live animals $
all the animals of chapter used must be wholly obtained $
chapter $
meat and edible meat offal $
manufacture in which all the materials of chapters and used must be wholly obtained $
chapter $
fish and crustaceans molluscs and other aquatic invertebrates $
manufacture in which all the materials of chapter used must be wholly obtained $
ex chapter $
dairy produce $
birds ' eggs $
natural honey $
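
A minimal Python sketch of these pre-processing steps, under the assumption of one aligned segment per line; normalise() and to_twente() are hypothetical names, not part of the TWENTE distribution.

    import re

    def normalise(line):
        """Lowercase, drop numbers and punctuation, squeeze whitespace."""
        line = line.lower()
        line = re.sub(r"[0-9]+|[^\w\s']", " ", line)
        return " ".join(line.split())

    def to_twente(segments):
        """One '$'-terminated line per aligned segment (TWENTE format)."""
        return "\n".join(normalise(s) + " $" for s in segments)

    print(to_twente(["Chapter 1.", "Live animals!"]))
    # chapter $
    # live animals $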

2. The Software

2.1. TWENTE

  • Twente word alignment software
  • Automatically creates a bi-directional translation lexicon from a parallel corpus
  • Developed in 1998 by Djoerd Hiemstra at the University of Twente, in the scope of the "Twenty-one" Project
  • Comprises a set of programs written in C
  • Available under GNU GPL

2.2. TWENTE cont.

  • The algorithm is based on a symmetric translation model
  • Comprises three statistical algorithms:
    • Model A: Iterative Proportional Fitting algorithm
    • Model A: Monte Carlo Sampling algorithm
    • Model B: Monte Carlo Sampling algorithm
Example output:

celoti                       
---------------------        
wholly      0.58
be          0.16
all         0.12
obtained    0.08

celuloznih                   
---------------------        
cellulose   0.44
fibrous     0.20
cellulosic  0.19
material    0.17
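
The IPF and Monte Carlo models themselves are beyond the scope of a slide, but the kind of co-occurrence statistic such extractors start from can be sketched. The following is a simple Dice-coefficient baseline over sentence-aligned segments — an illustration, not TWENTE's actual algorithm:

    from collections import Counter
    from itertools import product

    def dice_lexicon(pairs, top=4):
        """pairs: aligned (source tokens, target tokens) segment pairs."""
        src, trg, both = Counter(), Counter(), Counter()
        for s, t in pairs:
            s, t = set(s), set(t)
            src.update(s); trg.update(t)
            both.update(product(s, t))
        lex = {}
        for (e, f), n in both.items():
            lex.setdefault(e, []).append((2 * n / (src[e] + trg[f]), f))
        return {e: sorted(c, reverse=True)[:top] for e, c in lex.items()}

    pairs = [(["natural", "honey"], ["naravni", "med"]), (["honey"], ["med"])]
    print(dice_lexicon(pairs)["honey"])  # [(1.0, 'med'), (0.666..., 'naravni')]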

2.3. PWA

  • PLUG Word Aligner
  • Automatically creates a translation lexicon and translation equivalents from a parallel corpus
  • Developed in 2000 by Joerg Tiedemann at Uppsala University in the scope of the PLUG project (Parallel Corpora in Linköping, Uppsala, and Göteborg)
  • Comprises a set of programs written in Perl and a Tcl/Tk wrapper
  • Available in binary for Linux/Windows under a research license
  • A Web interface also exists

2.4. PWA cont.

  • Comprises 2 word alignment systems + additions:
    • LWA: Linköping Word Aligner
    • UWA: Uppsala Word Aligner
    • automatic generation of monolingual word collocations (phrases)
    • PLUG Scorer: automated evaluation of alignment results
Example output of link types (lexicon):

accompanied     spremljano
acetic          ocetni
acid            kisline
acid            kislini
acid            kisla
acids           kisline
acids           kislin
added           sladkorjem
addition        adicijske
additives       aditivi

Example output of link tokens:

anx221  edible  užitni  1       11:6    11:6    4.46054036978434
anx221  products        izdelki 1       18:8    18:7    9.61970217554317
anx221  of      ki      1       27:2    44:2    9.18877848331661
anx221  animal  izvora  1       30:6    37:6    5.08354030378808
anx221  origin  živalskega      7       37:6    26:10   5.67694861380386
anx221  not elsewhere specified or included     mestu   1       44:3&48:9&58:9&68:2&71:8        84:5    16.7151474681847
anx222  milk    mleko   1       30:4    26:5    5.13118957986685
anx222  cream   smetana 7       39:5    35:7    4.98021351626141
anx222  and     drugo   2       60:3    59:5    3.92860959492475
anx222  or      ali     1       80:2    78:3    10.7331875453987
anx222  milk    mleko   1       93:4    90:5    5.13118957986685
anx222  cream   smetana 7       102:5   99:7    4.98021351626141
anx222  whether ali     1       108:7   121:3   10.1405570858157
anx222  or      ali     1       116:2   149:3   10.7331875453987
anx222  not     ne      1       119:3   125:2   7.87830244161833
anx222  concentrated    z       7       123:12  128:1   3.5854502568264
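
The link-token lines are easy to post-process. A sketch parser follows; the field semantics (text id, source unit, target unit, link type, source and target position:length, score) and the tab separation are my reading of the sample above, not documented fact.

    def parse_link_token(line):
        """Split one PWA link-token line into named fields (assumed layout)."""
        text_id, src, trg, ltype, spos, tpos, score = line.split("\t")
        return {"text": text_id, "source": src, "target": trg,
                "type": int(ltype), "src_pos": spos, "trg_pos": tpos,
                "score": float(score)}

    line = "anx222\tmilk\tmleko\t1\t30:4\t26:5\t5.13118957986685"
    print(parse_link_token(line)["score"])  # 5.13118957986685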

2.5. Comparing TWENTE and PWA

  • Different philosophy: PWA complete toolbox / TWENTE only lexicon extraction
  • TWENTE includes confidence value in output, PWA doesn't
  • TWENTE easier to use from command line
  • TWENTE available in source code / PWA only as binary

3. Extraction from plain text

3.1. First experiment

  • Experiments performed on one IJS-ELAN component: “Europe Agreement - Annex II”
    Extents    En      Sl
    Segment    2,382   2,382
    Token      11,928  11,526
    Type       2,090   2,909
    Lemma      1,823   2,111
  • Extraction first performed directly on word-forms
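
The extent figures above can be counted directly from the (form, lemma, PoS) stream of the earlier tokens() sketch; again a hypothetical helper:

    def extents(token_stream):
        """Count tokens, distinct word-forms and distinct lemmas."""
        forms, lemmas, n = set(), set(), 0
        for form, lemma, _pos in token_stream:
            n += 1
            forms.add(form.lower())
            lemmas.add(lemma.lower())
        return {"Token": n, "Type": len(forms), "Lemma": len(lemmas)}

    print(extents([("olje", "olje", "N"), ("Olje", "olje", "N"), ("iz", "iz", "S")]))
    # {'Token': 3, 'Type': 2, 'Lemma': 2}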

3.2. TWENTE

All three models used; bilingual lexica produced:

TWENTE    En(-Sl)  Sl(-En)
Corpus    2,090    2,909
A1        539      582
A2        518      536
B1        692      767
A1-100%   51       186
A2-100%   35       123
B1-100%   34       144
  • A1 = Model A Iterative Proportional Fitting algorithm
  • A2 = Model A Monte Carlo Sampling algorithm
  • B1 = Model B Monte Carlo Sampling algorithm

3.3. Examples of results of the three TWENTE models: U

  • [data]
  • Model A Iterative Proportional Fitting algorithm
    
    umbrellas       dežniki/0.29 sončniki/0.29 palice-dežnike/0.11 vrtne/0.11 
    uncoated        neprevlečene/1.00 
    unembroidered   nevezene/1.00 
    unexposed       občutljivi/0.33 svetlobo/0.33 neosvetljeni/0.33 
    unwrought       surovi/0.31 surov/0.21 neobdelanih/0.15 neobdelane/0.10 
    up              (null)/0.44 zgornje/0.09 okviru/0.08 meje/0.08 
    use             uporabo/0.67 (null)/0.10 ki/0.09 uporabni/0.04 
    used            se/0.27 (null)/0.14 uporabljeni/0.10 uporabljenih/0.09 
    uses            namene/1.00 
    
    
  • Model A Monte Carlo Sampling algorithm
    
    umbrellas       dežniki/0.27 sončniki/0.26 (null)/0.17 sončnike/0.08 
    uncoated        neprevlečene/1.00 
    unembroidered   nevezene/1.00 
    unexposed       svetlobo/0.39 občutljivi/0.34 neosvetljeni/0.27 
    unwrought       surovi/0.32 surov/0.21 neobdelanih/0.11 neobdelane/0.11 
    up              (null)/0.18 okviru/0.12 drobno/0.09 meje/0.08 
    use             uporabo/0.68 (null)/0.09 za/0.05 uporabni/0.04 
    used            se/0.26 uporabljenih/0.08 uporabljeni/0.07 kateri/0.06 
    uses            namene/0.94 terapevtske/0.04 profilaktične/0.03 
    
    
  • Model B Monte Carlo Sampling algorithm
    
    umbrellas       sončniki/0.28 dežniki/0.27 sončnike/0.12 vrtne/0.11 
    uncoated        neprevlečene/1.00 
    unembroidered   nevezene/1.00 
    unexposed       svetlobo/0.32 občutljivi/0.32 neosvetljeni/0.29 za/0.04 
    units           za/0.42 izolacijo/0.15 večzidni/0.15 napajalniki/0.14 
    unwrought       surovi/0.33 surov/0.22 neobdelanih/0.16 neobdelane/0.11 
    up              njega/0.13 meje/0.10 okviru/0.10 zgornje/0.09 
    use             uporabo/0.65 uporabni/0.04 motorje/0.04 valjarje/0.03 
    used            se/0.25 uporabljenih/0.08 uporabljeni/0.07 kateri/0.05 
    uses            namene/1.00 
    
    

3.4. Examples of results of the three TWENTE models: V

  • [data]
  • Model A Iterative Proportional Fitting algorithm
    
    value           vrednost/0.17 cene/0.16 franko/0.16 tovarna/0.15 
    vegetable       rastlinskega/0.36 rastlinski/0.24 rastlinskih/0.24 sokovi/0.06 
    vegetables      vrtnine/0.51 sušenih/0.16 stročnic/0.16 vrtnin/0.16 
    vehicles        vozila/0.62 (null)/0.21 tanki/0.03 oklepna/0.03 
    video           video/0.35 slike/0.26 videomonitorji/0.13 videoprojektorji/0.13 
    vinegar         alkoholi/0.17 kis/0.17 krompir/0.15 kisu/0.15 
    
    
  • Model A Monte Carlo Sampling algorithm
    
    value           vrednost/0.16 franko/0.16 cene/0.15 tovarna/0.15 
    vegetable       rastlinskega/0.35 rastlinskih/0.25 rastlinski/0.24 rastlinska/0.06 
    vegetables      vrtnine/0.49 stročnic/0.17 sušenih/0.16 vrtnin/0.15 
    vehicles        vozila/0.68 (null)/0.19 bojna/0.03 tista/0.03 
    video           video/0.30 slike/0.26 (null)/0.21 video-tuner/0.07 
    vinegar         alkoholi/0.20 kislini/0.17 kis/0.16 krompir/0.16 
    
    
  • Model B Monte Carlo Sampling algorithm
    
    value           tovarna/0.11 vrednost/0.10 presega/0.10 uporabljenih/0.10 
    vegetable       rastlinskega/0.30 rastlinski/0.21 rastlinskih/0.21 izvora/0.09 
    vegetables      vrtnine/0.49 stročnic/0.18 sušenih/0.17 vrtnin/0.16 
    vehicles        vozila/0.96 tanki/0.04 
    video           video/0.30 slike/0.30 videomonitorji/0.11 videoprojektorji/0.11 
    vinegar         kis/0.21 alkoholi/0.20 ocetni/0.15 kisu/0.15 
    
    

3.5. PWA

  • Experiment performed on same component: “Europe Agreement - Annex II”
  • Only LWA model used, couldn't get UWA to work...
PWA       En(-Sl)  Sl(-En)
Corpus    2,090    2,909
(A1)      539      582
LWA       903      1,068

3.6. Example of PWA results

  • [data]
  • English - Slovene:
    
    umbrellas       dežniki
    unbleached      nebeljene
    uncoated        neprevlečene
    uncut           nebrušenih
    unembroidered   nevezene
    unmanufactured  nepredelanega
    unprinted       pogojem
    unwrought       neobdelane, neobdelanih, surovi
    up              ne
    uppers          na
    use             uporabo
    used            drugo, vrednosti, uporabijo, morajo, derivativi
    used must already be originating        poreklom
    used must be wholly obtained            pridobljeni
    uses            namene
    value           vrednost
    value does not exceed of the ex-works price of the      vrednost ne presega cene
    value of the    uporabljenih
    vapour          paro
    vegetable       živalskega, rastlinski, modificirani, rastlinskih
    vegetables      vrtnine, sušenih
    vehicles        vozila, njihovi
    video           slike, video
    vinegar         kisu
    viscosity       viskoznosti
    volume          vol
    
    

3.7. Comparing TWENTE and PWA

  • PWA returns more translation equivalents than TWENTE (but threshold set to 2)
  • The quality of the returned lexical items is comparable
  • PWA also produces multi-word translation equivalents - but most are useless (especially those longer than 2 words)
  • TWENTE includes confidence value in output, PWA doesn't

4. Lemmas vs. Word-forms

4.1. Using Lemmas

  • Motivation is to decrease the number of lexical items, especially for Slovene, and thus obtain better statistics
  • In Annex II, decrease in the number of token types: for English by 13%, for Slovene by 27%
  • Results:

                    En(-Sl)  Sl(-En)
    Type            2,090    2,909
    Type Lemma      1,823    2,111
    A1              539      582
    A1-Lemma        530      566
    A1-100%         51       186
    A1-100%-Lemma   122      185
    LWA             903      1,068
    LWA-Lemma       928      946
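
Producing the lemmatised input is then only a matter of emitting the lemma instead of the word-form when writing out segments; a sketch in the same hypothetical pipeline as above:

    def segment_text(token_stream, use_lemmas=True):
        """Render one segment as plain text, from word-forms or lemmas."""
        return " ".join(lemma if use_lemmas else form
                        for form, lemma, _pos in token_stream)

    toks = [("Sojino", "sojin", "A"), ("olje", "olje", "N")]
    print(segment_text(toks))                     # sojin olje
    print(segment_text(toks, use_lemmas=False))   # Sojino olje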

4.2. TWENTE

  • [data]
  • Examples of TWENTE increased confidence:
    
    artificial      umeten/1.00 
    artificial      umetno/0.43 umetni/0.43 umetnih/0.14 
    
    alcohols        alkohol/1.00 
    alcohols        alkoholov/0.50 alkoholi/0.17 industrijski/0.17 maščobni/0.17 
    
    auxiliary       pomožen/1.00 
    auxiliary       pomožni/0.67 kolesa/0.10 mopede/0.06 pomožnim/0.06 
    
    

4.3. PWA

Examples of differences in PWA lexica:
  • [data]
  • Word-forms:
    
    umbrellas       dežniki
    unbleached      nebeljene
    uncoated        neprevlečene
    uncut           nebrušenih
    unembroidered   nevezene
    unmanufactured  nepredelanega
    unprinted       pogojem
    unwrought       neobdelane, neobdelanih, surovi
    up              ne
    uppers          na
    
    
  • Lemmas:
    
    umbrella        dežnik, sončnik
    unassembled     šablona
    unbleached      nebeljen
    uncoated        neprevlečene
    uncut           nebrušen
    unembroidered   uporabljen
    unembroidered fabric    nevezene
    unmanufactured  že
    unprinted       da
    unrecorded      fenomen
    unstuffed       nenapolnjeno
    unvulcanized    nevulkanizirane
    unwrought       neobdelan, surov
    up              ne, prodaja
    upper           na
    
    

4.4. Comparing word-form with lemma extraction

  • Large increase in confidence with the En-Sl lexicon
  • Otherwise less effect than might be expected
  • Some silly translations introduced with PWA

5. Part/Whole

5.1. Comparing part/whole extraction

  • For this experiment we took the whole IJS-ELAN corpus, lemmatised version
  • Experiment performed only with TWENTE; PWA crashes
  • TWENTE on HP/UX RISC takes 42 hours
              En(-Sl)   Sl(-En)
Token ANX2    11,928    11,526
Token ELAN    576,940   488,397
Lemma ANX2    1,823     2,111
Lemma ELAN    16,274    22,136
A1 ANX2       530       566
A1 ELAN       6,007     8,405
100% ANX2     122       185
100% ELAN     1,006     2,413

5.2. Examples

  • [data]
  • Annex II:
    
    umbrella        dežnik/0.43 sončnik/0.43 palice-dežnike/0.07 vrten/0.07 
    uncoated        neprevlečene/1.00 
    unembroidered   nevezene/1.00 
    unexposed       občutljiv/0.33 svetloba/0.33 neosvetljen/0.33 
    unwrought       surov/0.67 neobdelan/0.33 
    up              (null)/0.50 droben/0.22 predelava/0.07 pripraviti/0.06 
    use             uporabljen/0.45 uporabljati/0.16 (null)/0.08 uporaba/0.08 
    value           vrednost/0.45 presegati/0.14 cena/0.13 franko/0.13 
    vegetable       rastlinski/0.69 vrtnina/0.17 sušen/0.07 zelenjaven/0.06 
    vehicle         vozilo/0.41 (null)/0.21 nesamovozna/0.19 njihov/0.03 
    video           slika/0.55 video/0.14 videomonitorji/0.14 videoprojektorji/0.14 
    vinegar         kis/1.00 
    
    
  • ELAN:
    
    umbrella        dežnik/0.35 sončnik/0.35 kroven/0.12 centrala/0.07 
    uncoated        neprevlečene/1.00 
    unembroidered   nevezene/1.00 
    unwrought       surov/0.60 neobdelan/0.30 (null)/0.09 
    up              (null)/0.36 biti/0.15 se/0.09 leto/0.03 
    use             uporabljati/0.42 uporaba/0.22 uporabiti/0.15 uporabljen/0.07 
    value           vrednost/0.86 vrednota/0.12 (null)/0.02 
    vegetable       zelenjava/0.46 rastlinski/0.33 vrtnina/0.15 zelenjaven/0.03 
    vehicle         vozilo/0.84 voziti/0.04 (null)/0.04 motoren/0.03 
    video           grafičen/0.72 video/0.18 (null)/0.03 slika/0.02 
    vinegar         kis/1.00 
    
    

5.3. Comparing part/whole extraction

  • Better results if program is run on as large a text as possible
  • But processing time / robustness becomes a concern

6. PoS limitations

6.1. Constraining PoS

  • Method proposed in:
    • Tufis, D. and Barbu, A.M. (2001). Automatic Construction of Translation Lexicons. In Kluev, V., D'Attellis, C., Mastorakis, N. (eds.), Advances in Automation, Multimedia and Modern Computer Science, WSES Press, pp. 156-172.
  • We have added the PoS to word-forms, simply to check how often cross-categorial translations do in fact occur, e.g.
    
    coffee-N tea-N maté-N and-C spices-N $
    coffee-N whether-C or-C not-R roasted-V or-C decaffeinated-V $
    coffee-N husks-N and-C skins-N $
    
    
  • Test with TWENTE on Annex II with word-forms
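
The PoS-decorated input shown above can be produced by suffixing every token with the first letter of its MSD; a sketch, reusing the hypothetical token stream from before:

    def with_pos(token_stream, use_lemmas=False):
        """Suffix each token with its PoS letter, e.g. 'coffee' -> 'coffee-N'."""
        return " ".join(f"{lemma if use_lemmas else form}-{pos}"
                        for form, lemma, pos in token_stream)

    toks = [("coffee", "coffee", "N"), ("and", "and", "C"), ("spices", "spice", "N")]
    print(with_pos(toks))  # coffee-N and-C spices-N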

6.2. Results

Example output [data]:

umbrellas-N     dežniki-N/0.29 sončniki-N/0.29 palice-dežnike-N/0.11 vrtne-A/0.11 
uncoated-A      neprevlečene-A/1.00 
unembroidered-V nevezene-A/1.00 
unexposed-V     občutljivi-A/0.33 svetlobo-N/0.33 neosvetljeni-A/0.33 
unwrought-N     surovi-A/0.30 surov-A/0.30 bakrove-A/0.15 plemenitih-A/0.14 
unwrought-V     neobdelane-A/0.33 surovi-A/0.17 platiranih-A/0.17 plemenitmi-N/0.17 
up-S            (null)/0.53 predelavo-N/0.09 zgornje-A/0.07 okviru-N/0.07 
use-N           uporabo-N/0.82 in-C/0.07 št-N/0.04 industriji-N/0.04 
use-V           uporabo-N/0.37 ki-C/0.09 uporabni-A/0.07 cestne-A/0.05 
used-A          uporabljeno-A/0.40 vsi-P/0.22 uporabljeni-A/0.12 morajo-V/0.09 
used-V          se-P/0.35 uporabljenih-A/0.11 uporabljeni-A/0.10 uporabljajo-V/0.06 
uses-N          namene-N/1.00 
value-N         vrednost-N/0.17 cene-N/0.16 franko-N/0.16 tovarna-N/0.15 
vegetable-A     rastlinskega-A/0.36 rastlinski-A/0.24 rastlinskih-A/0.24 rastlinska-A/0.06 
vegetables-N    vrtnine-N/0.51 sušenih-A/0.17 stročnic-N/0.17 vrtnin-N/0.16 
vehicles-N      (null)/0.22 vozila-N/0.21 nesamovozna-A/0.20 tanki-A/0.05 
video-A         video-N/0.40 videomonitorji-N/0.14 videoprojektorji-N/0.14 vgrajen-A/0.08 
vinegar-N       alkoholi-N/0.19 kis-N/0.19 krompir-N/0.16 kisu-N/0.16 

6.3. Comparing PoS labeled results with plain ones

  • For most cases there is no difference
  • Damage can be done if the tagging is wrong
  • With good tagging, some translations can be corrected:
    foregoing-A     njihovi-P/0.67 navedenih-A/0.24 proizvodov-N/0.09
    
  • However, some useful translations are also lost due to PoS conversion:
    incorporating-V     vgrajenimi-A/0.34 snemanje-N/0.16 reprodukcijo-N/0.10 zvoka-N/0.09 
    where the context is
    sound reproducing apparatus, not incorporating a sound recording device
    aparati za reprodukcijo zvoka, ki nimajo vgrajene naprave za snemanje zvoka 

6.4. Another use

Given that there is not much difference in the results, we can use PoS equivalence to speed up the extraction (times in min:sec):
  • TWENTE on Annex II with lemmas: 2:31.5
  • TWENTE on Annex II with lemmas, A only: 0:18.5
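
One way to realise this speed-up is to partition the aligned segments by PoS and run the extractor once per category, so that only same-PoS pairs compete. A sketch of the partitioning step; extract(), standing in for a TWENTE run, is hypothetical:

    from collections import defaultdict

    def partition_by_pos(pairs):
        """pairs: aligned ([(form, pos), ...], [(form, pos), ...]) segments."""
        parts = defaultdict(list)
        for src, trg in pairs:
            for pos in {p for _, p in src} & {p for _, p in trg}:
                parts[pos].append(([w for w, p in src if p == pos],
                                   [w for w, p in trg if p == pos]))
        return parts  # then run extract(parts[pos]) separately per PoS

    pairs = [([("coffee", "N"), ("and", "C")], [("kava", "N"), ("in", "C")])]
    print(partition_by_pos(pairs)["N"])  # [(['coffee'], ['kava'])]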

7. Conclusions and future work

7.1. Conclusions I

Comparing TWENTE and PWA:
  • PWA returns more translation equivalents (but TWENTE threshold set to 2 occurrences)
  • PWA also produces multi-word translation equivalents - but those are, for the most part, not very good
  • PWA also produces a list of token translation equivalents
  • TWENTE includes confidence value
  • TWENTE more robust
  • TWENTE easier to use from command line
  • TWENTE available in source code / PWA only as binary

7.2. Conclusions II

Comparing word-form with lemma extraction:
  • Less effect than might be expected
  • Except for many more unambiguous high-confidence translations with TWENTE En-Sl
  • Some silly translations introduced with PWA
Comparing corpus/element extraction:
  • Better results if program is run on as large a text as possible
  • But processing time / robustness becomes a concern
Comparing PoS equivalent translations:
  • Again, less effect than might be expected
  • Attempt only if the tagging is good enough and similar across the two languages
  • However, this is a way to speed up processing

7.3. Further work

  • Extracting multi-word lexical items is more challenging
  • Various strategies are being pursued; some are discussed and tested in:
    • Špela Vintar. 2001. Using Parallel Corpora for Translation-Oriented Term Extraction. Babel Journal, 47(2):121-132.
  • And, of course, more and larger corpora with better annotations...