An Experiment in Automatic Bi-Lingual Lexicon Construction from a Parallel Corpus

7th TELRI seminar "Information in Corpora"

Dubrovnik 26-28 September 2002

Tomaž Erjavec
Department of Intelligent Systems
Jozef Stefan Institute
Jamova 39
SI-1000 Ljubljana
Slovenia

Abstract

The IJS-ELAN corpus (Erjavec 2002) contains 1 million words of annotated parallel Slovene-English texts. The corpus is sentence-aligned, and both languages are word-tagged with context-disambiguated morphosyntactic descriptions and lemmas. In the talk we discuss an experiment in automatic bi-lingual lexicon extraction from this corpus. Extracting such lexica is one of the prime uses of parallel corpora: manual construction is an extremely time-consuming process, yet the resource is invaluable for lexicographers, terminologists and translators, as well as for machine translation systems. For the experiment we used two statistics-based programs for the automatic extraction of bi-lingual lexicons from parallel corpora: the Twente software (Hiemstra, 1998) and the PWA system (Tiedemann, 1998). We compare the two programs in terms of availability, ease of use, and the type and quality of the results. We experimented with several different choices of input to the programs, using varying amounts of linguistic information. We compared extraction using the word-forms from the corpus with extraction using lemmas: this normalises the input and abstracts away from the rich inflections of Slovene. Following the lead of Tufis and Barbu (2001), we also restricted the translation lexicon to lexical items of the same part-of-speech, i.e. we made the assumption that a noun is always translated as a noun, a verb as a verb, etc. This again reduces the search space for the algorithms and could thus lead to superior results. Finally, we experimented with taking the whole corpus as input, as opposed to processing corpus components separately. The reasoning here is that different components are likely to contain distinct senses of polysemous words, which will be translated into different target words; for such words there would be no benefit in amalgamating different texts, while the final precision might in fact be lower. Preliminary results show that the precision of the extracted translation lexicon is much improved by utilising lemmas with an identical part-of-speech in the source and target languages; this argues in favour of linguistic pre-processing of the corpus. However, the recall of the system tends to be lower, as it misses out on conversion translations. In the conclusion we discuss this and other findings, as well as current results on extracting translation equivalents of collocations.


1. Introduction

1.1. Overview of the talk

  • The IJS-ELAN corpus
  • TWENTE and PWA
  • Results with word-forms
  • Results with lemmas
  • Comparing part/whole
  • Including PoS
  • Conclusions

1.2. The IJS-ELAN corpus

  • Slovene-English parallel corpus, 2 × 500,000 words
  • IJS contribution to the MLIS ELAN project
  • Contains 15 parallel texts
  • Tokenised, sentence segmented
  • Sentence aligned
  • V2 automatically MSD-tagged and lemmatised
  • Encoded in XML TEI P4
  • Freely available for downloading from http://nl.ijs.si/elan/

1.3. Example segments from the corpus

Slovene segment:

<seg id="anx2.sl.105" corresp="anx2.en.105">
<c ctag=":">-</c> 
<w ana="Aspnsn" lemma="sojin">Sojino</w> 
<w ana="Ncnsn" lemma="olje">olje</w>
<c ctag=",">,</c> 
<w ana="Ncnsn" lemma="olje">olje</w> 
<w ana="Spsg" lemma="iz">iz</w> 
<w ana="Ncmsg" lemma="kikiriki">kikirikija</w>
<c ctag=",">,</c> 
<w ana="Aspfsa" lemma="palmov">palmovo</w>
<c ctag=",">,</c> 
<w ana="Ncfpn">kopre</w>
<c ctag=",">,</c> 
<w ana="Aspnsg" lemma="palmov">palmovega</w> 
<w ana="Ncnsg" lemma="jedro">jedra</w>
<c ctag=",">,</c> 
<w ana="Ncnsl">babassu</w>
<c ctag=",">,</c> 
<w ana="Ncnsn">tungovo</w> 
<w ana="Ccs" lemma="in">in</w> 
<w ana="Ncfsn">oiticica</w> 
<w ana="Ncnsa" lemma="olje">olje</w>
...

1.4. Example segments from the corpus

English segment:

<seg id="anx2.en.105" corresp="anx2.sl.105">
<c ctag=":">-</c> 
<w ana="Ncns" ctag="?UH NP" lemma="soya">Soya</w>
<c ctag=",">,</c> 
<w ana="Ncns" ctag="NN NN" lemma="ground">ground</w> 
<w ana="Ncns" ctag="NN NN" lemma="nut">nut</w>
<c ctag=",">,</c> 
<w ana="Ncns" ctag="NN NN" lemma="palm">palm</w>
<c ctag=",">,</c> 
<w ana="Ncns" ctag="NN NN" lemma="copra">copra</w>
<c ctag=",">,</c> 
<w ana="Ncns" ctag="NN NN" lemma="palm">palm</w> 
<w ana="Ncns" ctag="NN NN" lemma="kernel">kernel</w>
<c ctag=",">,</c> 
<w ana="Ncns" ctag="?FW ???" lemma="babassu">babassu</w>
<c ctag=",">,</c> 
<w ana="Vmpp" ctag="NN NN" lemma="tung">tung</w> 
<w ana="Cc-n" ctag="CC CC" lemma="and">and</w> 
<w ana="Npns" ctag="NN NN" lemma="oiticica">oiticica</w> 
<w ana="Ncns" ctag="NN NN" lemma="oil">oil</w>
...
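
The annotation can be read off such segments with a few lines of code. Below is a minimal Python sketch, assuming the <w ana="..." lemma="..."> markup shown above; tokens() is a hypothetical helper, and the PoS is taken to be the first letter of the MSD in the ana attribute.

    import xml.etree.ElementTree as ET

    def tokens(seg_xml):
        """Yield (word-form, lemma, PoS) triples from one <seg> element."""
        seg = ET.fromstring(seg_xml)
        for w in seg.iter("w"):
            form = (w.text or "").strip()
            lemma = w.get("lemma", form)    # fall back to the form itself
            pos = w.get("ana", "?")[0]      # first MSD letter: N, V, A, S, C, ...
            yield form, lemma, pos

    seg = '<seg><w ana="Ncnsn" lemma="olje">olje</w> <w ana="Spsg" lemma="iz">iz</w></seg>'
    print(list(tokens(seg)))  # [('olje', 'olje', 'N'), ('iz', 'iz', 'S')]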

1.5. Pre-processing the corpus

  • Remove punctuation and numbers
  • All text converted to lower case
  • Remove XML tags
  • Special characters converted to ASCII (×), removed ($, @), or mapped to Latin-2 (čšž)
  • Convert to software-specific format
Example (TWENTE format):

chapter $
live animals $
all the animals of chapter used must be wholly obtained $
chapter $
meat and edible meat offal $
manufacture in which all the materials of chapters and used must be wholly obtained $
chapter $
fish and crustaceans molluscs and other aquatic invertebrates $
manufacture in which all the materials of chapter used must be wholly obtained $
ex chapter $
dairy produce $
birds ' eggs $
natural honey $
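
A minimal Python sketch of these pre-processing steps, under the assumption of one aligned segment per line; normalise() and to_twente() are hypothetical names, not part of the TWENTE distribution.

    import re

    def normalise(line):
        """Lowercase, drop numbers and punctuation, squeeze whitespace."""
        line = line.lower()
        line = re.sub(r"[0-9]+|[^\w\s']", " ", line)
        return " ".join(line.split())

    def to_twente(segments):
        """One '$'-terminated line per aligned segment (TWENTE format)."""
        return "\n".join(normalise(s) + " $" for s in segments)

    print(to_twente(["Chapter 1.", "Live animals!"]))
    # chapter $
    # live animals $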

2. The Software

2.1. TWENTE

  • Twente word alignment software
  • Automatically creates a bi-directional translation lexicon from a parallel corpus
  • Developed in 1998 by Djoerd Hiemstra at the University of Twente, in the scope of the "Twenty-one" Project
  • Comprises a set of programs written in C
  • Available under GNU GPL

2.2. TWENTE cont.

  • The algorithm is based on a symmetric translation model
  • Comprises three statistical algorithms:
    • Model A: Iterative Proportional Fitting algorithm
    • Model A: Monte Carlo Sampling algorithm
    • Model B: Monte Carlo Sampling algorithm
Example output:

celoti                       
---------------------        
wholly      0.58
be          0.16
all         0.12
obtained    0.08

celuloznih                   
---------------------        
cellulose   0.44
fibrous     0.20
cellulosic  0.19
material    0.17
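
The IPF and Monte Carlo models themselves are beyond the scope of a slide, but the kind of co-occurrence statistic such extractors start from can be sketched. The following is a simple Dice-coefficient baseline over sentence-aligned segments — an illustration, not TWENTE's actual algorithm:

    from collections import Counter
    from itertools import product

    def dice_lexicon(pairs, top=4):
        """pairs: aligned (source tokens, target tokens) segment pairs."""
        src, trg, both = Counter(), Counter(), Counter()
        for s, t in pairs:
            s, t = set(s), set(t)
            src.update(s); trg.update(t)
            both.update(product(s, t))
        lex = {}
        for (e, f), n in both.items():
            lex.setdefault(e, []).append((2 * n / (src[e] + trg[f]), f))
        return {e: sorted(c, reverse=True)[:top] for e, c in lex.items()}

    pairs = [(["natural", "honey"], ["naravni", "med"]), (["honey"], ["med"])]
    print(dice_lexicon(pairs)["honey"])  # [(1.0, 'med'), (0.666..., 'naravni')]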

2.3. PWA

  • PLUG Word Aligner
  • Automatically creates a translation lexicon and translation equivalents from a parallel corpus
  • Developed in 2000 by Joerg Tiedemann at Uppsala University in the scope of the PLUG project (Parallel Corpora in Linköping, Uppsala, and Göteborg)
  • Comprises a set of programs written in Perl and a Tcl/Tk wrapper
  • Available in binary for Linux/Windows under a research license
  • A Web interface also exists

2.4. PWA cont.

  • Comprises 2 word alignment systems + additions:
    • LWA: Linköping Word Aligner
    • UWA: Uppsala Word Aligner
    • automatic generation of monolingual word collocations (phrases)
    • PLUG Scorer: automated evaluation of alignment results
Example output of link types (lexicon):

accompanied     spremljano
acetic          ocetni
acid            kisline
acid            kislini
acid            kisla
acids           kisline
acids           kislin
added           sladkorjem
addition        adicijske
additives       aditivi

Example output of link tokens:

anx221  edible  užitni  1       11:6    11:6    4.46054036978434
anx221  products        izdelki 1       18:8    18:7    9.61970217554317
anx221  of      ki      1       27:2    44:2    9.18877848331661
anx221  animal  izvora  1       30:6    37:6    5.08354030378808
anx221  origin  živalskega      7       37:6    26:10   5.67694861380386
anx221  not elsewhere specified or included     mestu   1       44:3&48:9&58:9&68:2&71:8        84:5    16.7151474681847
anx222  milk    mleko   1       30:4    26:5    5.13118957986685
anx222  cream   smetana 7       39:5    35:7    4.98021351626141
anx222  and     drugo   2       60:3    59:5    3.92860959492475
anx222  or      ali     1       80:2    78:3    10.7331875453987
anx222  milk    mleko   1       93:4    90:5    5.13118957986685
anx222  cream   smetana 7       102:5   99:7    4.98021351626141
anx222  whether ali     1       108:7   121:3   10.1405570858157
anx222  or      ali     1       116:2   149:3   10.7331875453987
anx222  not     ne      1       119:3   125:2   7.87830244161833
anx222  concentrated    z       7       123:12  128:1   3.5854502568264
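
The link-token lines are easy to post-process. A sketch parser follows; the field semantics (text id, source unit, target unit, link type, source and target position:length, score) and the tab separation are my reading of the sample above, not documented fact.

    def parse_link_token(line):
        """Split one PWA link-token line into named fields (assumed layout)."""
        text_id, src, trg, ltype, spos, tpos, score = line.split("\t")
        return {"text": text_id, "source": src, "target": trg,
                "type": int(ltype), "src_pos": spos, "trg_pos": tpos,
                "score": float(score)}

    line = "anx222\tmilk\tmleko\t1\t30:4\t26:5\t5.13118957986685"
    print(parse_link_token(line)["score"])  # 5.13118957986685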

2.5. Comparing TWENTE and PWA

  • Different philosophy: PWA complete toolbox / TWENTE only lexicon extraction
  • TWENTE includes confidence value in output, PWA doesn't
  • TWENTE easier to use from command line
  • TWENTE available in source code / PWA only as binary

3. Extraction from plain text

3.1. First experiment

  • Experiments performed on one IJS-ELAN component: “Europe Agreement - Annex II”
    Extents    En      Sl
    Segment    2,382   2,382
    Token      11,928  11,526
    Type       2,090   2,909
    Lemma      1,823   2,111
  • Extraction first performed directly on word-forms
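
The extent figures above can be counted directly from the (form, lemma, PoS) stream of the earlier tokens() sketch; again a hypothetical helper:

    def extents(token_stream):
        """Count tokens, distinct word-forms and distinct lemmas."""
        forms, lemmas, n = set(), set(), 0
        for form, lemma, _pos in token_stream:
            n += 1
            forms.add(form.lower())
            lemmas.add(lemma.lower())
        return {"Token": n, "Type": len(forms), "Lemma": len(lemmas)}

    print(extents([("olje", "olje", "N"), ("Olje", "olje", "N"), ("iz", "iz", "S")]))
    # {'Token': 3, 'Type': 2, 'Lemma': 2}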

3.2. TWENTE

All three models used; bilingual lexica produced:

TWENTE    En(-Sl)  Sl(-En)
Corpus    2,090    2,909
A1        539      582
A2        518      536
B1        692      767
A1-100%   51       186
A2-100%   35       123
B1-100%   34       144
  • A1 = Model A Iterative Proportional Fitting algorithm
  • A2 = Model A Monte Carlo Sampling algorithm
  • B1 = Model B Monte Carlo Sampling algorithm

3.3. Examples of results of the three TWENTE models: U

  • [data]
  • Model A Iterative Proportional Fitting algorithm
    
    umbrellas       dežniki/0.29 sončniki/0.29 palice-dežnike/0.11 vrtne/0.11 
    uncoated        neprevlečene/1.00 
    unembroidered   nevezene/1.00 
    unexposed       občutljivi/0.33 svetlobo/0.33 neosvetljeni/0.33 
    unwrought       surovi/0.31 surov/0.21 neobdelanih/0.15 neobdelane/0.10 
    up              (null)/0.44 zgornje/0.09 okviru/0.08 meje/0.08 
    use             uporabo/0.67 (null)/0.10 ki/0.09 uporabni/0.04 
    used            se/0.27 (null)/0.14 uporabljeni/0.10 uporabljenih/0.09 
    uses            namene/1.00 
    
    
  • Model A Monte Carlo Sampling algorithm
    
    umbrellas       dežniki/0.27 sončniki/0.26 (null)/0.17 sončnike/0.08 
    uncoated        neprevlečene/1.00 
    unembroidered   nevezene/1.00 
    unexposed       svetlobo/0.39 občutljivi/0.34 neosvetljeni/0.27 
    unwrought       surovi/0.32 surov/0.21 neobdelanih/0.11 neobdelane/0.11 
    up              (null)/0.18 okviru/0.12 drobno/0.09 meje/0.08 
    use             uporabo/0.68 (null)/0.09 za/0.05 uporabni/0.04 
    used            se/0.26 uporabljenih/0.08 uporabljeni/0.07 kateri/0.06 
    uses            namene/0.94 terapevtske/0.04 profilaktične/0.03 
    
    
  • Model B Monte Carlo Sampling algorithm
    
    umbrellas       sončniki/0.28 dežniki/0.27 sončnike/0.12 vrtne/0.11 
    uncoated        neprevlečene/1.00 
    unembroidered   nevezene/1.00 
    unexposed       svetlobo/0.32 občutljivi/0.32 neosvetljeni/0.29 za/0.04 
    units           za/0.42 izolacijo/0.15 večzidni/0.15 napajalniki/0.14 
    unwrought       surovi/0.33 surov/0.22 neobdelanih/0.16 neobdelane/0.11 
    up              njega/0.13 meje/0.10 okviru/0.10 zgornje/0.09 
    use             uporabo/0.65 uporabni/0.04 motorje/0.04 valjarje/0.03 
    used            se/0.25 uporabljenih/0.08 uporabljeni/0.07 kateri/0.05 
    uses            namene/1.00 
    
    

3.4. Examples of results of the three TWENTE models: V

  • [data]
  • Model A Iterative Proportional Fitting algorithm
    
    value           vrednost/0.17 cene/0.16 franko/0.16 tovarna/0.15 
    vegetable       rastlinskega/0.36 rastlinski/0.24 rastlinskih/0.24 sokovi/0.06 
    vegetables      vrtnine/0.51 sušenih/0.16 stročnic/0.16 vrtnin/0.16 
    vehicles        vozila/0.62 (null)/0.21 tanki/0.03 oklepna/0.03 
    video           video/0.35 slike/0.26 videomonitorji/0.13 videoprojektorji/0.13 
    vinegar         alkoholi/0.17 kis/0.17 krompir/0.15 kisu/0.15 
    
    
  • Model A Monte Carlo Sampling algorithm
    
    value           vrednost/0.16 franko/0.16 cene/0.15 tovarna/0.15 
    vegetable       rastlinskega/0.35 rastlinskih/0.25 rastlinski/0.24 rastlinska/0.06 
    vegetables      vrtnine/0.49 stročnic/0.17 sušenih/0.16 vrtnin/0.15 
    vehicles        vozila/0.68 (null)/0.19 bojna/0.03 tista/0.03 
    video           video/0.30 slike/0.26 (null)/0.21 video-tuner/0.07 
    vinegar         alkoholi/0.20 kislini/0.17 kis/0.16 krompir/0.16 
    
    
  • Model B Monte Carlo Sampling algorithm
    
    value           tovarna/0.11 vrednost/0.10 presega/0.10 uporabljenih/0.10 
    vegetable       rastlinskega/0.30 rastlinski/0.21 rastlinskih/0.21 izvora/0.09 
    vegetables      vrtnine/0.49 stročnic/0.18 sušenih/0.17 vrtnin/0.16 
    vehicles        vozila/0.96 tanki/0.04 
    video           video/0.30 slike/0.30 videomonitorji/0.11 videoprojektorji/0.11 
    vinegar         kis/0.21 alkoholi/0.20 ocetni/0.15 kisu/0.15 
    
    

3.5. PWA

  • Experiment performed on same component: “Europe Agreement - Annex II”
  • Only LWA model used, couldn't get UWA to work...
PWA       En(-Sl)  Sl(-En)
Corpus    2,090    2,909
(A1)      539      582
LWA       903      1,068

3.6. Example of PWA results

  • [data]
  • English - Slovene:
    
    umbrellas       dežniki
    unbleached      nebeljene
    uncoated        neprevlečene
    uncut           nebrušenih
    unembroidered   nevezene
    unmanufactured  nepredelanega
    unprinted       pogojem
    unwrought       neobdelane, neobdelanih, surovi
    up              ne
    uppers          na
    use             uporabo
    used            drugo, vrednosti, uporabijo, morajo, derivativi
    used must already be originating        poreklom
    used must be wholly obtained            pridobljeni
    uses            namene
    value           vrednost
    value does not exceed of the ex-works price of the      vrednost ne presega cene
    value of the    uporabljenih
    vapour          paro
    vegetable       živalskega, rastlinski, modificirani, rastlinskih
    vegetables      vrtnine, sušenih
    vehicles        vozila, njihovi
    video           slike, video
    vinegar         kisu
    viscosity       viskoznosti
    volume          vol
    
    

3.7. Comparing TWENTE and PWA

  • PWA returns more translation equivalents than TWENTE (but threshold set to 2)
  • The quality of the returned lexical items is comparable
  • PWA also produces multi-word translation equivalents - but most are useless (especially those longer than 2 words)
  • TWENTE includes confidence value in output, PWA doesn't

4. Lemmas vs. Word-forms

4.1. Using Lemmas

  • Motivation is to decrease the number of lexical items, especially for Slovene, and thus obtain better statistics
  • In Annex II, decrease in the number of token types: for English by 13%, for Slovene by 27%
  • Results:

                    En(-Sl)  Sl(-En)
    Type            2,090    2,909
    Type Lemma      1,823    2,111
    A1              539      582
    A1-Lemma        530      566
    A1-100%         51       186
    A1-100%-Lemma   122      185
    LWA             903      1,068
    LWA-Lemma       928      946
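
Producing the lemmatised input is then only a matter of emitting the lemma instead of the word-form when writing out segments; a sketch in the same hypothetical pipeline as above:

    def segment_text(token_stream, use_lemmas=True):
        """Render one segment as plain text, from word-forms or lemmas."""
        return " ".join(lemma if use_lemmas else form
                        for form, lemma, _pos in token_stream)

    toks = [("Sojino", "sojin", "A"), ("olje", "olje", "N")]
    print(segment_text(toks))                     # sojin olje
    print(segment_text(toks, use_lemmas=False))   # Sojino olje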

4.2. TWENTE

  • [data]
  • Examples of TWENTE increased confidence:
    
    artificial      umeten/1.00 
    artificial      umetno/0.43 umetni/0.43 umetnih/0.14 
    
    alcohols        alkohol/1.00 
    alcohols        alkoholov/0.50 alkoholi/0.17 industrijski/0.17 maščobni/0.17 
    
    auxiliary       pomožen/1.00 
    auxiliary       pomožni/0.67 kolesa/0.10 mopede/0.06 pomožnim/0.06 
    
    

4.3. PWA

Examples of differences in PWA lexica:
  • [data]
  • Word-forms:
    
    umbrellas       dežniki
    unbleached      nebeljene
    uncoated        neprevlečene
    uncut           nebrušenih
    unembroidered   nevezene
    unmanufactured  nepredelanega
    unprinted       pogojem
    unwrought       neobdelane, neobdelanih, surovi
    up              ne
    uppers          na
    
    
  • Lemmas:
    
    umbrella        dežnik, sončnik
    unassembled     šablona
    unbleached      nebeljen
    uncoated        neprevlečene
    uncut           nebrušen
    unembroidered   uporabljen
    unembroidered fabric    nevezene
    unmanufactured  že
    unprinted       da
    unrecorded      fenomen
    unstuffed       nenapolnjeno
    unvulcanized    nevulkanizirane
    unwrought       neobdelan, surov
    up              ne, prodaja
    upper           na
    
    

4.4. Comparing word-form with lemma extraction

  • Large increase in confidence with the En-Sl lexicon
  • Otherwise less effect than might be expected
  • Some silly translations introduced with PWA

5. Part/Whole

5.1. Comparing part/whole extraction

  • For this experiment we took the whole IJS-ELAN corpus, lemmatised version
  • Experiment performed only with TWENTE; PWA crashes
  • TWENTE on HP/UX RISC takes 42 hours
              En(-Sl)   Sl(-En)
Token ANX2    11,928    11,526
Token ELAN    576,940   488,397
Lemma ANX2    1,823     2,111
Lemma ELAN    16,274    22,136
A1 ANX2       530       566
A1 ELAN       6,007     8,405
100% ANX2     122       185
100% ELAN     1,006     2,413

5.2. Examples

  • [data]
  • Annex II:
    
    umbrella        dežnik/0.43 sončnik/0.43 palice-dežnike/0.07 vrten/0.07 
    uncoated        neprevlečene/1.00 
    unembroidered   nevezene/1.00 
    unexposed       občutljiv/0.33 svetloba/0.33 neosvetljen/0.33 
    unwrought       surov/0.67 neobdelan/0.33 
    up              (null)/0.50 droben/0.22 predelava/0.07 pripraviti/0.06 
    use             uporabljen/0.45 uporabljati/0.16 (null)/0.08 uporaba/0.08 
    value           vrednost/0.45 presegati/0.14 cena/0.13 franko/0.13 
    vegetable       rastlinski/0.69 vrtnina/0.17 sušen/0.07 zelenjaven/0.06 
    vehicle         vozilo/0.41 (null)/0.21 nesamovozna/0.19 njihov/0.03 
    video           slika/0.55 video/0.14 videomonitorji/0.14 videoprojektorji/0.14 
    vinegar         kis/1.00 
    
    
  • ELAN:
    
    umbrella        dežnik/0.35 sončnik/0.35 kroven/0.12 centrala/0.07 
    uncoated        neprevlečene/1.00 
    unembroidered   nevezene/1.00 
    unwrought       surov/0.60 neobdelan/0.30 (null)/0.09 
    up              (null)/0.36 biti/0.15 se/0.09 leto/0.03 
    use             uporabljati/0.42 uporaba/0.22 uporabiti/0.15 uporabljen/0.07 
    value           vrednost/0.86 vrednota/0.12 (null)/0.02 
    vegetable       zelenjava/0.46 rastlinski/0.33 vrtnina/0.15 zelenjaven/0.03 
    vehicle         vozilo/0.84 voziti/0.04 (null)/0.04 motoren/0.03 
    video           grafičen/0.72 video/0.18 (null)/0.03 slika/0.02 
    vinegar         kis/1.00 
    
    

5.3. Comparing part/whole extraction

  • Better results if program is run on as large a text as possible
  • But processing time / robustness becomes a concern

6. PoS limitations

6.1. Constraining PoS

  • Method proposed in:
    • Tufis, D. and Barbu, A.M. (2001). Automatic Construction of Translation Lexicons. In Kluev, V., D'Attellis, C., Mastorakis, N. (eds.), Advances in Automation, Multimedia and Modern Computer Science, WSES Press, pp. 156-172.
  • We have added the PoS to word-forms, simply to check how often cross-categorial translations do in fact occur, e.g.
    
    coffee-N tea-N maté-N and-C spices-N $
    coffee-N whether-C or-C not-R roasted-V or-C decaffeinated-V $
    coffee-N husks-N and-C skins-N $
    
    
  • Test with TWENTE on Annex II with word-forms
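
The PoS-decorated input shown above can be produced by suffixing every token with the first letter of its MSD; a sketch, reusing the hypothetical token stream from before:

    def with_pos(token_stream, use_lemmas=False):
        """Suffix each token with its PoS letter, e.g. 'coffee' -> 'coffee-N'."""
        return " ".join(f"{lemma if use_lemmas else form}-{pos}"
                        for form, lemma, pos in token_stream)

    toks = [("coffee", "coffee", "N"), ("and", "and", "C"), ("spices", "spice", "N")]
    print(with_pos(toks))  # coffee-N and-C spices-N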

6.2. Results

Example output [data]:

umbrellas-N     dežniki-N/0.29 sončniki-N/0.29 palice-dežnike-N/0.11 vrtne-A/0.11 
uncoated-A      neprevlečene-A/1.00 
unembroidered-V nevezene-A/1.00 
unexposed-V     občutljivi-A/0.33 svetlobo-N/0.33 neosvetljeni-A/0.33 
unwrought-N     surovi-A/0.30 surov-A/0.30 bakrove-A/0.15 plemenitih-A/0.14 
unwrought-V     neobdelane-A/0.33 surovi-A/0.17 platiranih-A/0.17 plemenitmi-N/0.17 
up-S            (null)/0.53 predelavo-N/0.09 zgornje-A/0.07 okviru-N/0.07 
use-N           uporabo-N/0.82 in-C/0.07 št-N/0.04 industriji-N/0.04 
use-V           uporabo-N/0.37 ki-C/0.09 uporabni-A/0.07 cestne-A/0.05 
used-A          uporabljeno-A/0.40 vsi-P/0.22 uporabljeni-A/0.12 morajo-V/0.09 
used-V          se-P/0.35 uporabljenih-A/0.11 uporabljeni-A/0.10 uporabljajo-V/0.06 
uses-N          namene-N/1.00 
value-N         vrednost-N/0.17 cene-N/0.16 franko-N/0.16 tovarna-N/0.15 
vegetable-A     rastlinskega-A/0.36 rastlinski-A/0.24 rastlinskih-A/0.24 rastlinska-A/0.06 
vegetables-N    vrtnine-N/0.51 sušenih-A/0.17 stročnic-N/0.17 vrtnin-N/0.16 
vehicles-N      (null)/0.22 vozila-N/0.21 nesamovozna-A/0.20 tanki-A/0.05 
video-A         video-N/0.40 videomonitorji-N/0.14 videoprojektorji-N/0.14 vgrajen-A/0.08 
vinegar-N       alkoholi-N/0.19 kis-N/0.19 krompir-N/0.16 kisu-N/0.16 

6.3. Comparing PoS labeled results with plain ones

  • For most cases there is no difference
  • Damage can be done if the tagging is wrong
  • With good tagging, some translations can be corrected:
    foregoing-A     njihovi-P/0.67 navedenih-A/0.24 proizvodov-N/0.09
    
  • However, some useful translations are also lost due to PoS conversion:
    incorporating-V     vgrajenimi-A/0.34 snemanje-N/0.16 reprodukcijo-N/0.10 zvoka-N/0.09 
    where the context is
    sound reproducing apparatus, not incorporating a sound recording device
    aparati za reprodukcijo zvoka, ki nimajo vgrajene naprave za snemanje zvoka 

6.4. Another use

Given that there is not much difference in the results, we can use PoS equivalence to speed up the extraction (times in min:sec):
  • TWENTE on Annex II with lemmas: 2:31.5
  • TWENTE on Annex II with lemmas, A only: 0:18.5
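
One way to realise this speed-up is to partition the aligned segments by PoS and run the extractor once per category, so that only same-PoS pairs compete. A sketch of the partitioning step; extract(), standing in for a TWENTE run, is hypothetical:

    from collections import defaultdict

    def partition_by_pos(pairs):
        """pairs: aligned ([(form, pos), ...], [(form, pos), ...]) segments."""
        parts = defaultdict(list)
        for src, trg in pairs:
            for pos in {p for _, p in src} & {p for _, p in trg}:
                parts[pos].append(([w for w, p in src if p == pos],
                                   [w for w, p in trg if p == pos]))
        return parts  # then run extract(parts[pos]) separately per PoS

    pairs = [([("coffee", "N"), ("and", "C")], [("kava", "N"), ("in", "C")])]
    print(partition_by_pos(pairs)["N"])  # [(['coffee'], ['kava'])]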

7. Conclusions and future work

7.1. Conclusions I

Comparing TWENTE and PWA:
  • PWA returns more translation equivalents (but TWENTE threshold set to 2 occurrences)
  • PWA also produces multi-word translation equivalents - but those are, for the most part, not very good
  • PWA also produces a list of token translation equivalents
  • TWENTE includes confidence value
  • TWENTE more robust
  • TWENTE easier to use from command line
  • TWENTE available in source code / PWA only as binary

7.2. Conclusions II

Comparing word-form with lemma extraction:
  • Less effect than might be expected
  • Except for many more unambiguous high-confidence translations with TWENTE En-Sl
  • Some silly translations introduced with PWA
Comparing corpus/element extraction:
  • Better results if program is run on as large a text as possible
  • But processing time / robustness becomes a concern
Comparing PoS equivalent translations:
  • Again, less effect than might be expected
  • Attempt only if the tagging is good enough and similar across the two languages
  • However, this is a way to speed up processing

7.3. Further work

  • Extracting multi-word lexical items is more challenging
  • Various strategies are being pursued; some are discussed and tested in:
    • Špela Vintar. 2001. Using Parallel Corpora for Translation-Oriented Term Extraction. Babel Journal, 47(2):121-132.
  • And, of course, more and larger corpora with better annotations...