

Tagging Slavic Corpora

Tomaz Erjavec
Department for Intelligent Systems
Jozef Stefan Institute
Ljubljana, Slovenia

Talk given at
SFB441
Tübingen,
December 15th, 1999,
Revised and made available on
http://nl.ijs.si/et/talks/SFB441/
Ljubljana
Dec 19th 1999




Overview of the talk:

Topic of the Talk


Tagging Slavic corpora =
annotating Serbo-Croatian words in context with morphosyntactic information using a trainable PoS tagger

Above, in the interest of terminological simplicity, Serbo-Croatian (SC) is taken to mean the union of Bosnian, Croatian and Serbian.

Needed:

Another way:

Tagging SC is similar to tagging other languages, but here and for Slavic in general:

Example


Text:

-
že omenjeni zavod za živinorejo, ki bo vsaj v prehodnem obdobju nadaljeval naloge na področju zootehnike;

-
the above-mentioned Livestock-Breeding Centre, which will continue the tasks in zootechnics, at least for the transitional period;

Result:

<seg id="kmet.sl.87" corresp="kmet.en.87">
<c>-</c> 
<w msd="Q" lemma="že">že</w> 
<w msd="Vmp--pmp" lemma="omeniti">omenjeni</w> 
<w msd="Ncmsa--n" lemma="zavod">zavod</w> 
<w msd="Spsa" lemma="za">za</w> 
<w msd="Ncfsa" lemma="živinoreja">živinorejo</w>
<c>,</c> 
<w msd="Css" lemma="ki">ki</w> 
<w msd="Vcif3s" lemma="biti">bo</w> 
<w msd="Q" lemma="vsaj">vsaj</w> 
<w msd="Spsl" lemma="v">v</w> 
<w msd="Afpnsl" lemma="prehoden">prehodnem</w> 
<w msd="Ncnsl" lemma="obdobje">obdobju</w> 
<w msd="Vmps-sma" lemma="nadaljevati">nadaljeval</w> 
<w msd="Ncfpa" lemma="naloga">naloge</w> 
<w msd="Spsl" lemma="na">na</w> 
<w msd="Ncnsl" lemma="področje">področju</w> 
<w msd="Ncfsg">zootehnike</w>
<c>;</c>
</seg>
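
A minimal sketch of how such a result could be read back with the Python standard library. The sample is a shortened copy of the segment above; the function name `read_segment` is hypothetical and not part of any corpus toolkit.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the annotated segment shown above.
SEG = """<seg id="kmet.sl.87" corresp="kmet.en.87">
<c>-</c>
<w msd="Q" lemma="že">že</w>
<w msd="Vmp--pmp" lemma="omeniti">omenjeni</w>
<w msd="Ncmsa--n" lemma="zavod">zavod</w>
<c>,</c>
</seg>"""


def read_segment(xml_text):
    """Return the segment id and a list of (form, lemma, msd) for each <w>."""
    seg = ET.fromstring(xml_text)
    # <c> elements (punctuation) are skipped; only word tokens carry msd/lemma.
    words = [(w.text, w.get("lemma"), w.get("msd")) for w in seg.iter("w")]
    return seg.get("id"), words


seg_id, words = read_segment(SEG)
for form, lemma, msd in words:
    print(f"{form}\t{lemma}\t{msd}")
```

A word without a `lemma` attribute, such as the last `<w>` in the full segment, simply yields `None` in the lemma slot.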

Steps in corpus preparation


Tool chain:

1.
acquisition of digital source
2.
conversion to standard format:
character sets, structure, emphasis, quotes, headers

3.
segmentation:
sentences, tokens, capitalisation, abbreviations, compounds, numerals

4.
morphological analysis:
lexicon, (rules), tagset

5.
tagging (parsing)

6.
iterative improvement:
interactive environment

7.
...exploitation (e.g. concordancing)
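
Step 3 can be illustrated with a deliberately naive tokeniser. This is only a sketch under simple assumptions (a token is a run of word characters, optionally hyphenated, or a single punctuation mark); it ignores abbreviations, numerals and compounds, and the names `tokenise` and `mark_up` are hypothetical.

```python
import re
from xml.sax.saxutils import escape

# A word is a run of word characters, optionally joined by hyphens;
# anything else that is not whitespace counts as one punctuation token.
TOKEN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenise(text):
    """Split raw text into word and punctuation tokens."""
    return TOKEN.findall(text)


def mark_up(tokens):
    """Wrap words in <w> and punctuation in <c>, one token per line."""
    lines = []
    for tok in tokens:
        tag = "w" if re.match(r"\w", tok) else "c"
        lines.append(f"<{tag}>{escape(tok)}</{tag}>")
    return "\n".join(lines)


print(mark_up(tokenise("- že omenjeni zavod za živinorejo, ki bo;")))
```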

Resources needed


Tools can be off-the-shelf, but resources are dependent on language:

1.
Tagset:
pragmatic: made to fit the task and tools
canonic: reuse of existing resources, comparable

2.
Lexicon:
core: closed class words, irregular words
reference: general lexical stock
dynamic: unknown words

3.
Hand-tagged corpus:
ML approaches
size/quality

SC: research and resources


Some related languages:


IJS-ELAN


15 component texts:

Title                                               kB   kW
Constitution of the Republic of Slovenia           364   20
Speeches by the President of Slovenia, M. Kucan   1102   69
Functioning of the National Assembly               325   20
Slovenian Economic Mirror; 13 issues, 98/99       4056  239
National Environmental Protection Programme       1222   70
Europe Agreement                                   589   34
Europe Agreement - Annex II                        483   25
Strategy for Integration into EU                  1511   89
Programme for accession to EU - agriculture        543   29
Programme for accession to EU - economy            394   23
Vademecum by Lek                                   471   24
EC Council Regulation No 3290/94 - agriculture    1182   69
Linux Installation and Getting Started            3044  173
GNU PO localisation                                353   13
G. Orwell: Nineteen Eighty-Four                   6698  195

Tagging Slovene ELAN


1.
Amebis lemmatisation
2.
MULTEXT-East Slovene tagset, lexicon & tagged corpus:
  • Multilingual Text Tools and Corpora for Central and Eastern European Languages
  • Copernicus 106; '95 - '97
  • MULTEXT, EAGLES
  • http://nl.ijs.si/ME/ , TELRI-CD
  • (English), Bulgarian, Czech, Estonian, Hungarian, Romanian, Slovene
    (Lithuanian, Latvian, Russian, Serbian)
3.
Tagging with TnT

Input for Tagger


It is expected that by 2008 a significant number of traditional air pollution problems will be eliminated...

<seg lang="sl">
<w lemma="do">Do</w>
<w lemma="let letati leto">leta</w>
<w type=dig>2008</w>
<w lemma="se">se</w>
<w lemma="pričakovati">pričakuje</w><c>,</c>
<w lemma="da dati">da</w>
<w lemma="biti">bo</w>
<w lemma="odpravljen odpraviti">odpravljen</w>
<w lemma="bistven">bistveni</w>
<w lemma="del delo deti">del</w>
<w lemma="tradicionalen">tradicionalnih</w>
<w lemma="problem">problemov</w>
<w lemma="onesnaženost">onesnaženosti</w>
...
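
In this input several words carry more than one lemma. Assuming that a space-separated `lemma` value lists alternative candidates (a reading inferred from the sample, not a documented format), the ambiguous tokens could be picked out as sketched below. A regular expression is used rather than an XML parser because the fragment is truncated; `lemma_candidates` is a hypothetical helper.

```python
import re

# A few lines copied from the tagger input above.
SAMPLE = '''<w lemma="do">Do</w>
<w lemma="let letati leto">leta</w>
<w lemma="da dati">da</w>
<w lemma="del delo deti">del</w>'''

# Matches one <w> element with a lemma attribute and plain text content.
W = re.compile(r'<w lemma="([^"]*)">([^<]*)</w>')


def lemma_candidates(text):
    """Return (form, [candidate lemmas]) for every <w lemma="..."> element."""
    return [(form, lemmas.split()) for lemmas, form in W.findall(text)]


ambiguous = [(f, ls) for f, ls in lemma_candidates(SAMPLE) if len(ls) > 1]
for form, lemmas in ambiguous:
    print(form, "->", ", ".join(lemmas))
```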

The MULTEXT-East MSDs


Morphosyntactic specifications, HR by Marko Tadic.

Noun (N)

11 Positions

**** **** **** **** **** ---- ---- ---- ---- ---- ----
PoS  Type Gend Numb Case Def  Cltc Anim OwnN OwnP OwdN
**** **** **** **** **** ---- ---- ---- ---- ---- ----

= ============== ============== =  EN  RO  SL  CS  BG  ET  HU  HR
P ATT            VAL            C  x   x   x   x   x   x   x   x
= ============== ============== = 
1 Type           common         c  x   x   x   x   x   x   x   x
                 proper         p  x   x   x   x   x   x   x   x
- -------------- -------------- - 
2 Gender         masculine      m  x   x   x   x   x           x
                 feminine       f  x   x   x   x   x           x
                 neuter         n  x   x   x   x   x           x
- -------------- -------------- -
3 Number         singular       s  x   x   x   x   x   x   x   x
                 plural         p  x   x   x   x   x   x   x   x
                 dual           d          x   x
          l.s.   count          t                  x
- -------------- -------------- -
4 Case           nominative     n          x   x   x   x   x   x
                 genitive       g          x   x       x   x   x
                 dative         d          x   x           x   x
                 accusative     a          x   x           x   x
                 vocative       v      x       x   x           x
                 locative       l          x   x               x
                 instrumental   i          x   x           x   x
           l.s.  direct         r      x
           l.s.  oblique        o      x
           l.s.  partitive      1                      x
                 illative       x                      x   x
...

% mtems-expand -brief Ncfsg
Ncfsg: Noun common feminine singular genitive
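
The positional encoding can be sketched in a few lines. This is not the `mtems-expand` implementation, only an illustration restricted to the first four noun attributes from the table above; `expand_noun_msd` is a hypothetical name, and positions after Case are ignored.

```python
# Position -> (attribute, {code: value}), copied from the noun table above.
NOUN_POSITIONS = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural", "d": "dual", "t": "count"}),
    ("Case", {
        "n": "nominative", "g": "genitive", "d": "dative",
        "a": "accusative", "v": "vocative", "l": "locative",
        "i": "instrumental", "r": "direct", "o": "oblique",
        "1": "partitive", "x": "illative",
    }),
]


def expand_noun_msd(msd):
    """Expand a noun MSD such as 'Ncfsg' into a readable description."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun MSD: {msd!r}")
    parts = ["Noun"]
    # zip() stops after the four positions covered here, so any later
    # positions (Def, Cltc, Anim, ...) are silently skipped in this sketch.
    for (_attribute, values), code in zip(NOUN_POSITIONS, msd[1:]):
        if code == "-":  # hyphen: attribute not applicable or not specified
            continue
        parts.append(values[code])
    return " ".join(parts)


print("Ncfsg:", expand_noun_msd("Ncfsg"))
```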

The Slovene Lexicon


avenij          avenija         Ncfdg
avenij          avenija         Ncfpg
avenija         =               Ncfsn
avenijah        avenija         Ncfdl
avenijah        avenija         Ncfpl
avenijam        avenija         Ncfpd
avenijama       avenija         Ncfdd
avenijama       avenija         Ncfdi
avenijami       avenija         Ncfpi
avenije         avenija         Ncfpa
avenije         avenija         Ncfpn
avenije         avenija         Ncfsg
aveniji         avenija         Ncfda
aveniji         avenija         Ncfdn
aveniji         avenija         Ncfsd
aveniji         avenija         Ncfsl
avenijo         avenija         Ncfsa
avenijo         avenija         Ncfsi
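
A sketch of how such a lexicon could be loaded to obtain the ambiguity class of a word form. It assumes three whitespace-separated columns (word form, lemma, MSD) and that `=` stands for a lemma identical to the word form, as the listing suggests; `load_lexicon` is a hypothetical helper.

```python
from collections import defaultdict

# A few lines copied from the lexicon listing above.
LEXICON_LINES = """\
avenij          avenija         Ncfdg
avenij          avenija         Ncfpg
avenija         =               Ncfsn
avenije         avenija         Ncfpa
avenije         avenija         Ncfpn
avenije         avenija         Ncfsg
"""


def load_lexicon(lines):
    """Map each word form to its list of (lemma, MSD) readings."""
    lexicon = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        form, lemma, msd = fields
        if lemma == "=":  # assumed shorthand: lemma equals the word form
            lemma = form
        lexicon[form].append((lemma, msd))
    return lexicon


lexicon = load_lexicon(LEXICON_LINES.splitlines())
print(lexicon["avenije"])
```

The list returned for `avenije` shows why context is needed: one form, three possible MSDs.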

Cat            entries   wforms   lemmas   MSDs
Noun           127.811   61.525    7.465     99
Verb           110.949   78.001    3.699    128
Adjective      310.754   64.604    4.621    279
Pronoun          3.654      732      105  1.335
Adverb           7.415    7.395      442      3
Preposition        123      109       79      6
Conjunction         39       38       39      3
Numeral          4.401      832      181    226
Exclamation         10       10       10      1
Abbreviation        48       48       48      1
Particle            76       76       76      1
Total          565.281  201.011   16.766  2.083

Intex (DELAS/DELAF) YU lexicon


gdekakav,ProA03.01    gdekakav,.ProA03.01:msn*:msa- 
gdeko,ProN12          gdekakva,gdekakav.ProA03.01:msg*:nsg*:fsn*:npn*:npa*   
gdekoji,ProA07        gdekakve,gdekakav.ProA03.01:fsg*:mpa*:fpn*:fpa*        
gdetko,ProN12         gdekakvi,gdekakav.ProA03.01:mpn*                       
gdešto,AdvE,ProN13    gdekakvih,gdekakav.ProA03.01:mpg*:npg*:fpg*            
gdjekakav,ProA03.01   gdekakvim,gdekakav.ProA03.01:msi*:nsi*:*pd*:*pi*:*pl*  
...                   gdekakvima,gdekakav.ProA03.01:*pd*:*pi*:*pl*           
ičiji,ProA05          gdekakvime,gdekakav.ProA03.01:msi*:nsi*                
ja,ProN01             gdekakvo,gdekakav.ProA03.01:nsn*:nsa*:                 
kakav,ProA03.01       gdekakvog,gdekakav.ProA03.01:msg*:nsg*:msa+            
kakavgod,ProA03.01    gdekakvoga,gdekakav.ProA03.01:msg*:nsg*:msa+           
...                   gdekakvoj,gdekakav.ProA03.01:fsd*:fsl*                 
tvoj,ProA06           gdekakvom,gdekakav.ProA03.01:msd*:msl*:nsd*:nsl*:fsi*  
vaš,ProA04            gdekakvome,gdekakav.ProA03.01:msd*:msl*:nsd*:nsl*      
vi,ProN04             gdekakvomu,gdekakav.ProA03.01:msd*:msl*:nsd*:nsl*      
šta,Adv*,Par*,ProN13  gdekakvu,gdekakav.ProA03.01:msd*:msl*:nsd*:nsl*:fsa*   
štagod,ProN13
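
The inflected-form entries in the right-hand column can be split mechanically. The sketch below assumes the layout visible in the listing (`form,lemma.code:features:...`, with an empty lemma meaning "same as the form") and leaves the feature strings uninterpreted; `parse_delaf` is a hypothetical helper, not part of Intex.

```python
def parse_delaf(entry):
    """Split one inflected-form entry into form, lemma, code and features."""
    form, rest = entry.strip().split(",", 1)
    # The inflection code itself contains a dot (e.g. ProA03.01),
    # so only the first dot separates lemma from code.
    lemma, rest = rest.split(".", 1)
    code, *features = rest.split(":")
    return {
        "form": form,
        "lemma": lemma or form,  # empty lemma: identical to the form
        "code": code,
        "features": [f for f in features if f],  # drop empty trailing field
    }


for line in (
    "gdekakav,.ProA03.01:msn*:msa-",
    "gdekakvoj,gdekakav.ProA03.01:fsd*:fsl*",
):
    print(parse_delaf(line))
```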

PoS               DELAS              DELAF
              KByte  entries     KByte  entries
nouns           50K     2720      627K    18241
adjectives      11K      630      565K    10956
verbs           42K     1884     1693K    49076
other           21K     1378      133K     3849
total          124K     6569     3009K    81152

The Slovene pre-tagged corpus


The tagger


Qualities:

Evaluation on '1984':

TnT, 'Trigrams'n'Tags':

TnT parameters


%% Statistically tagged file, Sun Dec  5 16:32:44 1999
%% lexicon     : mte.lex
%% ngrams      : mte.123
%% corpus      : elan-sl.t
%% model       : trigrams
%% sparse data : linear interpolation
%%   lambda1 = 1.292668e-01   lambda2 = 3.310223e-01   lambda3 = 5.397110e-01
%% unknown mode: lexicon entry @UNKNOWN
%% case of characters is significant
%% using suffix trie up to length 10
%% unknown words are marked with an asterisk (*)
%% Thorsten Brants, thorsten@coli.uni-sb.de

%% 177776 (30.04%) unknown tokens
%%      7141 recognized as cardinals/ordinals
%% 102761 tokens taken from the backup lexicon
%% avg. 10.87 tags/token, 1.71 tags/known token

izhajajoč               Afpmsnn *
iz                      Spsg
Temeljne                Afpfsg  *
ustavne                 Afpfsg  *
listine                 Ncfsg   *
o                       Spsl
samostojnosti           Ncfsl   *
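
The "linear interpolation" line in the header refers to smoothing of the trigram model. Assuming the usual reading of the three weights (lambda1 for unigrams, lambda2 for bigrams, lambda3 for trigrams), the smoothed estimate is P(t3|t1,t2) = lambda1 P(t3) + lambda2 P(t3|t2) + lambda3 P(t3|t1,t2). A minimal sketch with the values reported above; the function name `interpolated` and the example probabilities are made up for illustration.

```python
# Weights copied from the tagger header above.
LAMBDA1 = 1.292668e-01  # unigram weight (assumed reading)
LAMBDA2 = 3.310223e-01  # bigram weight (assumed reading)
LAMBDA3 = 5.397110e-01  # trigram weight (assumed reading)


def interpolated(p_uni, p_bi, p_tri):
    """Combine unigram, bigram and trigram estimates for one tag."""
    return LAMBDA1 * p_uni + LAMBDA2 * p_bi + LAMBDA3 * p_tri


# The weights sum to 1 (up to rounding), so the result stays a probability.
print(round(LAMBDA1 + LAMBDA2 + LAMBDA3, 6))

# An unseen trigram (p_tri = 0) still gets mass from the lower orders;
# the three input probabilities here are invented.
print(interpolated(p_uni=0.05, p_bi=0.20, p_tri=0.0))
```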

Conclusions

