Tomaz Erjavec
Department for Intelligent Systems
Jozef Stefan Institute
Ljubljana, Slovenia
Talk given at SFB441, Tübingen, December 15th, 1999
Revised and made available at http://nl.ijs.si/et/talks/SFB441/
Ljubljana, December 19th, 1999
Overview of the talk:
Topic of the Talk
Tagging Slavic corpora =
annotating Serbo-Croatian words in context with
morphosyntactic information using a trainable PoS tagger
Above, in the interest of terminological simplicity, Serbo-Croatian is taken to mean the union of Bosnian, Croatian and Serbian.
Needed:
Another way:
Tagging SC is similar to tagging other languages, but here and for Slavic in general:
Example
Text:
Result:
<seg id="kmet.sl.87" corresp="kmet.en.87">
<c>-</c>
<w msd="Q" lemma="že">že</w>
<w msd="Vmp--pmp" lemma="omeniti">omenjeni</w>
<w msd="Ncmsa--n" lemma="zavod">zavod</w>
<w msd="Spsa" lemma="za">za</w>
<w msd="Ncfsa" lemma="živinoreja">živinorejo</w>
<c>,</c>
<w msd="Css" lemma="ki">ki</w>
<w msd="Vcif3s" lemma="biti">bo</w>
<w msd="Q" lemma="vsaj">vsaj</w>
<w msd="Spsl" lemma="v">v</w>
<w msd="Afpnsl" lemma="prehoden">prehodnem</w>
<w msd="Ncnsl" lemma="obdobje">obdobju</w>
<w msd="Vmps-sma" lemma="nadaljevati">nadaljeval</w>
<w msd="Ncfpa" lemma="naloga">naloge</w>
<w msd="Spsl" lemma="na">na</w>
<w msd="Ncnsl" lemma="področje">področju</w>
<w msd="Ncfsg">zootehnike</w>
<c>;</c>
</seg>
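The annotated segment above can be read back into (word form, lemma, MSD) triples with a few lines of Python. This is only a minimal sketch: it assumes the segment is well-formed XML, uses a shortened copy of the example, and the function name `read_segment` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the segment shown above (a few tokens only).
SEG = """<seg id="kmet.sl.87" corresp="kmet.en.87">
<c>-</c>
<w msd="Q" lemma="že">že</w>
<w msd="Vmp--pmp" lemma="omeniti">omenjeni</w>
<w msd="Ncmsa--n" lemma="zavod">zavod</w>
<w msd="Ncfsg">zootehnike</w>
<c>;</c>
</seg>"""


def read_segment(xml_text):
    """Return (form, lemma, msd) triples for one <seg> element.

    Punctuation (<c>) gets lemma and msd set to None; a <w> without
    a lemma attribute (as in the last word above) gets lemma None.
    """
    seg = ET.fromstring(xml_text)
    tokens = []
    for el in seg:
        if el.tag == "w":
            tokens.append((el.text, el.get("lemma"), el.get("msd")))
        elif el.tag == "c":
            tokens.append((el.text, None, None))
    return tokens


tokens = read_segment(SEG)
for form, lemma, msd in tokens:
    print(form, lemma, msd, sep="\t")
```

Keeping punctuation in the token list preserves the original token order, which matters when the output is later aligned with the tagger's one-token-per-line format.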
Steps in corpus preparation
Tool chain:
Resources needed
Tools can be off-the-shelf, but resources are dependent on language:
SC: research and resources
Some related languages:
IJS-ELAN
15 component texts:
Title                                            | kB   | kW  |
Constitution of the Republic of Slovenia         | 364  | 20  |
Speeches by the President of Slovenia, M. Kucan  | 1102 | 69  |
Functioning of the National Assembly             | 325  | 20  |
Slovenian Economic Mirror; 13 issues, 98/99      | 4056 | 239 |
National Environmental Protection Programme      | 1222 | 70  |
Europe Agreement                                 | 589  | 34  |
Europe Agreement - Annex II                      | 483  | 25  |
Strategy for Integration into EU                 | 1511 | 89  |
Programme for accession to EU - agriculture      | 543  | 29  |
Programme for accession to EU - economy          | 394  | 23  |
Vademecum by Lek                                 | 471  | 24  |
EC Council Regulation No 3290/94 - agriculture   | 1182 | 69  |
Linux Installation and Getting Started           | 3044 | 173 |
GNU PO localisation                              | 353  | 13  |
G. Orwell: Nineteen Eighty-Four                  | 6698 | 195 |
Tagging Slovene ELAN
Input for Tagger
It is expected that by 2008 a significant number of traditional air pollution problems will be eliminated...
<seg lang="sl"> <w lemma="do">Do</w> <w lemma="let letati leto">leta</w> <w type=dig>2008</w> <w lemma="se">se</w> <w lemma="pričakovati">pričakuje</w><c>,</c> <w lemma="da dati">da</w> <w lemma="biti">bo</w> <w lemma="odpravljen odpraviti">odpravljen</w> <w lemma="bistven">bistveni</w> <w lemma="del delo deti">del</w> <w lemma="tradicionalen">tradicionalnih</w> <w lemma="problem">problemov</w> <w lemma="onesnaženost">onesnaženosti</w> ...
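In the tagger input above, the `lemma` attribute appears to hold all candidate lemmas of a word form, separated by spaces (for example `let letati leto` for `leta`). A minimal sketch of how the ambiguous tokens could be listed follows. It uses regular expressions rather than an XML parser because the unquoted `type=dig` attribute is not well-formed XML; the helper name `candidate_lemmas` and the space-separated reading of the attribute are assumptions made for illustration.

```python
import re

# First few tokens of the example segment above.
LINE = ('<w lemma="do">Do</w> <w lemma="let letati leto">leta</w> '
        '<w type=dig>2008</w> <w lemma="se">se</w> '
        '<w lemma="pričakovati">pričakuje</w><c>,</c> '
        '<w lemma="da dati">da</w>')

W_RE = re.compile(r'<w(?P<attrs>[^>]*)>(?P<form>[^<]*)</w>')
LEMMA_RE = re.compile(r'lemma="([^"]*)"')


def candidate_lemmas(line):
    """Return (form, [candidate lemmas]) for every <w> element in the line."""
    result = []
    for match in W_RE.finditer(line):
        lemma = LEMMA_RE.search(match.group("attrs"))
        # Assumption: several lemmas in one attribute are space-separated.
        lemmas = lemma.group(1).split() if lemma else []
        result.append((match.group("form"), lemmas))
    return result


cands = candidate_lemmas(LINE)
ambiguous = [form for form, lemmas in cands if len(lemmas) > 1]
print(ambiguous)
```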
The MULTEXT-East MSDs
Morphosyntactic specifications, HR by Marko Tadic.
Noun (N)
11 Positions
**** **** **** **** **** ---- ---- ---- ---- ---- ----
PoS Type Gend Numb Case Def Cltc Anim OwnN OwnP OwdN
**** **** **** **** **** ---- ---- ---- ---- ---- ----
= ============== ============== = EN RO SL CS BG ET HU HR
P ATT VAL C x x x x x x x x
= ============== ============== =
1 Type common c x x x x x x x x
proper p x x x x x x x x
- -------------- -------------- -
2 Gender masculine m x x x x x x
feminine f x x x x x x
neuter n x x x x x x
- -------------- -------------- -
3 Number singular s x x x x x x x x
plural p x x x x x x x x
dual d x x
l.s. count t x
- -------------- -------------- -
4 Case nominative n x x x x x x
genitive g x x x x x
dative d x x x x
accusative a x x x x
vocative v x x x x
locative l x x x
instrumental i x x x x
l.s. direct r x
l.s. oblique o x
l.s. partitive 1 x
illative x x x
...
% mtems-expand -brief Ncfsg
Ncfsg: Noun common feminine singular genitive
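The positional encoding can be illustrated with a small decoder for noun MSDs. This is a sketch in the spirit of the `mtems-expand` output above, not a reimplementation of that tool: it covers only the first four noun positions and the values shared by several languages in the table, and the names `NOUN_POSITIONS` and `expand_noun_msd` are invented for this example.

```python
# Attribute tables for noun positions 1-4, copied from the table above.
# Language-specific values (count, direct, oblique, partitive, illative)
# are deliberately left out of this sketch.
NOUN_POSITIONS = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural", "d": "dual"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "v": "vocative", "l": "locative",
              "i": "instrumental"}),
]


def expand_noun_msd(msd):
    """Expand a noun MSD such as 'Ncfsg' into attribute values.

    Positions beyond the fourth are ignored; '-' marks an attribute
    that is not applicable and is skipped. Codes missing from the
    tables above raise KeyError.
    """
    if not msd.startswith("N"):
        raise ValueError("this sketch only handles noun MSDs: " + msd)
    values = ["Noun"]
    for code, (_attribute, table) in zip(msd[1:], NOUN_POSITIONS):
        if code == "-":
            continue
        values.append(table[code])
    return " ".join(values)


print("Ncfsg:", expand_noun_msd("Ncfsg"))
```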
The Slovene Lexicon
avenij avenija Ncfdg
avenij avenija Ncfpg
avenija = Ncfsn
avenijah avenija Ncfdl
avenijah avenija Ncfpl
avenijam avenija Ncfpd
avenijama avenija Ncfdd
avenijama avenija Ncfdi
avenijami avenija Ncfpi
avenije avenija Ncfpa
avenije avenija Ncfpn
avenije avenija Ncfsg
aveniji avenija Ncfda
aveniji avenija Ncfdn
aveniji avenija Ncfsd
aveniji avenija Ncfsl
avenijo avenija Ncfsa
avenijo avenija Ncfsi
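Each lexicon line above is a triple of word form, lemma and MSD, with `=` standing for a lemma identical to the word form. A minimal sketch of loading such lines into a lookup table, which is the kind of ambiguity-class information a tagger needs, could look as follows; the function name `load_lexicon` is hypothetical and only a few of the entries are repeated here.

```python
from collections import defaultdict

# A few of the entries listed above.
LEXICON = """\
avenij avenija Ncfdg
avenij avenija Ncfpg
avenija = Ncfsn
avenije avenija Ncfpa
avenije avenija Ncfpn
avenije avenija Ncfsg
"""


def load_lexicon(text):
    """Map each word form to its list of (lemma, MSD) readings."""
    readings = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        form, lemma, msd = line.split()
        if lemma == "=":  # lemma is the same as the word form
            lemma = form
        readings[form].append((lemma, msd))
    return readings


lexicon = load_lexicon(LEXICON)
print(lexicon["avenije"])
```

The word form `avenije` alone has three readings, which shows why the number of tags per token is high for an inflecting language.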
Cat          | entries | wforms  | lemmas | MSDs  |
Noun         | 127.811 | 61.525  | 7.465  | 99    |
Verb         | 110.949 | 78.001  | 3.699  | 128   |
Adjective    | 310.754 | 64.604  | 4.621  | 279   |
Pronoun      | 3.654   | 732     | 105    | 1.335 |
Adverb       | 7.415   | 7.395   | 442    | 3     |
Preposition  | 123     | 109     | 79     | 6     |
Conjunction  | 39      | 38      | 39     | 3     |
Numeral      | 4.401   | 832     | 181    | 226   |
Exclamation  | 10      | 10      | 10     | 1     |
Abbreviation | 48      | 48      | 48     | 1     |
Particle     | 76      | 76      | 76     | 1     |
Total        | 565.281 | 201.011 | 16.766 | 2.083 |
Intex (DELAS/DELAF) YU lexicon
gdekakav,ProA03.01 gdekakav,.ProA03.01:msn*:msa-
gdeko,ProN12 gdekakva,gdekakav.ProA03.01:msg*:nsg*:fsn*:npn*:npa*
gdekoji,ProA07 gdekakve,gdekakav.ProA03.01:fsg*:mpa*:fpn*:fpa*
gdetko,ProN12 gdekakvi,gdekakav.ProA03.01:mpn*
gdešto,AdvE,ProN13 gdekakvih,gdekakav.ProA03.01:mpg*:npg*:fpg*
gdjekakav,ProA03.01 gdekakvim,gdekakav.ProA03.01:msi*:nsi*:*pd*:*pi*:*pl*
... gdekakvima,gdekakav.ProA03.01:*pd*:*pi*:*pl*
ičiji,ProA05 gdekakvime,gdekakav.ProA03.01:msi*:nsi*
ja,ProN01 gdekakvo,gdekakav.ProA03.01:nsn*:nsa*:
kakav,ProA03.01 gdekakvog,gdekakav.ProA03.01:msg*:nsg*:msa+
kakavgod,ProA03.01 gdekakvoga,gdekakav.ProA03.01:msg*:nsg*:msa+
... gdekakvoj,gdekakav.ProA03.01:fsd*:fsl*
tvoj,ProA06 gdekakvom,gdekakav.ProA03.01:msd*:msl*:nsd*:nsl*:fsi*
vaš,ProA04 gdekakvome,gdekakav.ProA03.01:msd*:msl*:nsd*:nsl*
vi,ProN04 gdekakvomu,gdekakav.ProA03.01:msd*:msl*:nsd*:nsl*
šta,Adv*,Par*,ProN13 gdekakvu,gdekakav.ProA03.01:msd*:msl*:nsd*:nsl*:fsa*
štagod,ProN13
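The inflected-form entries in the right-hand column above seem to follow the pattern `form,lemma.class:features:features...`, with an empty lemma when the form is itself the lemma. The sketch below splits one such entry into its parts. The field interpretation is read off the listing rather than taken from the Intex documentation, so it should be treated as an assumption, as should the name `parse_delaf`.

```python
def parse_delaf(entry):
    """Split an entry such as 'gdekakva,gdekakav.ProA03.01:msg*:nsg*'.

    Assumed layout: form ',' lemma '.' inflection-class (':' features)*
    The class code may itself contain a dot (ProA03.01), so the lemma is
    cut at the first dot only. An empty lemma means lemma == form.
    """
    form, rest = entry.split(",", 1)
    lemma, rest = rest.split(".", 1)
    code, *features = rest.split(":")
    return {
        "form": form,
        "lemma": lemma or form,
        "code": code,
        # A trailing ':' in the listing yields an empty string; drop it.
        "features": [f for f in features if f],
    }


print(parse_delaf("gdekakvo,gdekakav.ProA03.01:nsn*:nsa*:"))
print(parse_delaf("gdekakav,.ProA03.01:msn*:msa-"))
```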
PoS        | DELAS KByte | DELAS entries | DELAF KByte | DELAF entries |
nouns      | 50K         | 2720          | 627K        | 18241         |
adjectives | 11K         | 630           | 565K        | 10956         |
verbs      | 42K         | 1884          | 1693K       | 49076         |
other      | 21K         | 1378          | 133K        | 3849          |
total      | 124K        | 6569          | 3009K       | 81152         |
The Slovene pre-tagged corpus
The tagger
Qualities:
Evaluation on '1984':
Words   | TnT    | MBT    |
Known   | 93.55% | 93.58% |
Unknown | 60.77% | 44.45% |
TnT, 'Trigrams'n'Tags':
TnT parameters
%% Statistically tagged file, Sun Dec 5 16:32:44 1999
%% lexicon : mte.lex
%% ngrams : mte.123
%% corpus : elan-sl.t
%% model : trigrams
%% sparse data : linear interpolation
%% lambda1 = 1.292668e-01 lambda2 = 3.310223e-01 lambda3 = 5.397110e-01
%% unknown mode: lexicon entry @UNKNOWN
%% case of characters is significant
%% using suffix trie up to length 10
%% unknown words are marked with an asterisk (*)
%% Thorsten Brants, thorsten@coli.uni-sb.de
%% 177776 (30.04%) unknown tokens
%% 7141 recognized as cardinals/ordinals
%% 102761 tokens taken from the backup lexicon
%% avg. 10.87 tags/token, 1.71 tags/known token
izhajajoč Afpmsnn *
iz Spsg
Temeljne Afpfsg *
ustavne Afpfsg *
listine Ncfsg *
o Spsl
samostojnosti Ncfsl *
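The tagger output above has a simple line format: header lines start with `%%`, and every other line holds a token, its tag and an optional asterisk marking a word that was not in the lexicon. A minimal sketch of reading this format and counting unknown tokens follows; it assumes whitespace-separated fields, and `read_tnt_output` is a made-up helper name.

```python
# One header line and the token lines from the listing above.
OUTPUT = """\
%% 177776 (30.04%) unknown tokens
izhajajoč Afpmsnn *
iz Spsg
Temeljne Afpfsg *
ustavne Afpfsg *
listine Ncfsg *
o Spsl
samostojnosti Ncfsl *
"""


def read_tnt_output(text):
    """Return (token, tag, is_unknown) triples, skipping '%%' header lines."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("%%"):
            continue
        fields = line.split()
        token, tag = fields[0], fields[1]
        is_unknown = len(fields) > 2 and fields[2] == "*"
        tokens.append((token, tag, is_unknown))
    return tokens


tagged = read_tnt_output(OUTPUT)
unknown = sum(1 for _, _, is_unknown in tagged if is_unknown)
print(f"{unknown} of {len(tagged)} tokens unknown")
```

In this short excerpt most tokens are unknown, which is far above the 30.04% reported in the header for the whole corpus; a sample of seven tokens says little about the overall rate.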
Conclusions
The translation was initiated by Tomaz Erjavec on 12/19/1999