Copernicus Project

Multext-East - Deliverable D1.2. Language-specific resources/Appendix 1 - May 96.

Appendix 1

Script for counting the forms and lemmas in the Estonian lexicon.

# 1) replace every tab with '@'
# 2) grep all the lines for that word class
#    (V in "@[^@]*@V" - verb, N in "@[^@]*@N" - noun etc.)
#    ("@[^@]*@[VNAMPSCIR]" for all the lemmas)
# 3) where the lemma is written out, delete the wordform and the
#    morphological info and keep the lemma; lines with '=' in the
#    lemma field are left unchanged
# 4) on the '=' lines (wordform equal to lemma), delete "=" and the
#    morphological info, keeping the wordform
# 5) sort
# 6) delete duplicate lines
# 7) count the lemmas


tr '\011' '@' < tbl.wordform.et | \
grep "@[^@]*@V" | \
sed 's/\(^[^@]*@\)\([^@=]*\)\(@.*$\)/\2/g' | \
sed 's/@=@.*$//g' | \
sort | \
uniq | \
wc -l
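
To see what the pipeline yields before the final count, it can be run on a few sample lines. The sketch below assumes the three-column layout implied by the comments (wordform, lemma or '=', morphological code, separated by tabs). The entries and the abbreviated codes are made up for illustration and are not taken from tbl.wordform.et; the variable names are likewise hypothetical.

```shell
# Made-up sample entries: wordform <TAB> lemma (or '=') <TAB> morphological code
sample=$(mktemp)
printf 'loen\tlugema\tVm\n'   > "$sample"
printf 'lugema\t=\tVm\n'     >> "$sample"
printf 'teen\ttegema\tVm\n'  >> "$sample"
printf 'raamat\t=\tNc\n'     >> "$sample"

# Same steps as the script above, stopping before the count (steps 1-6)
verb_lemmas=$(tr '\011' '@' < "$sample" |
    grep "@[^@]*@V" |
    sed 's/\(^[^@]*@\)\([^@=]*\)\(@.*$\)/\2/g' |
    sed 's/@=@.*$//g' |
    sort |
    uniq)

echo "$verb_lemmas"
rm -f "$sample"
```

With this sample the list contains lugema and tegema. The entry for lugema passes through the first sed untouched because its lemma field is '=', and the second sed then strips everything from '@=@' onward, leaving the wordform.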

Notes:

1. Lemmas are counted in this roundabout way because of how the lexicon has been built: not every lemma appears as a wordform in the lexicon. In fact, 20861 lemmas do appear as wordforms in the lexicon.

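2. The sum of the lemma counts over the individual word classes is 47067, but the count for all classes "together" is 42803. The difference, 4264, reflects homonymous lemmas, which are counted once in each word class they belong to.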
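
Both notes can be illustrated on a small sample. This is only a sketch under stated assumptions: the function name count_lemmas, the variable names and the sample entries are hypothetical, the file layout is the tab-separated one implied by the script comments, and counting '=' entries is just one possible way to find lemmas that appear as wordforms, not necessarily how the figure in note 1 was obtained.

```shell
# Made-up sample entries: wordform <TAB> lemma (or '=') <TAB> morphological code
sample=$(mktemp)
printf 'haige\t=\tA\n'        > "$sample"
printf 'haige\t=\tNc\n'      >> "$sample"
printf 'haiged\thaige\tNc\n' >> "$sample"
printf 'loen\tlugema\tVm\n'  >> "$sample"
printf 'lugema\t=\tVm\n'     >> "$sample"
printf 'teen\ttegema\tVm\n'  >> "$sample"

# count_lemmas CLASSES FILE: distinct lemmas whose code starts with one of CLASSES
count_lemmas() {
    tr '\011' '@' < "$2" |
    { grep "@[^@]*@[$1]" || true; } |   # a class without entries is not an error
    sed 's/\(^[^@]*@\)\([^@=]*\)\(@.*$\)/\2/g' |
    sed 's/@=@.*$//g' |
    sort | uniq | wc -l | tr -d ' '
}

# Note 1: lemmas listed as their own wordform (lemma field is '=')
attested=$(tr '\011' '@' < "$sample" |
    grep '^[^@]*@=@' |
    sed 's/@.*$//' |
    sort | uniq | wc -l | tr -d ' ')

# Note 2: sum of the per-class counts versus one count over all classes
sum=0
for class in V N A M P S C I R; do
    n=$(count_lemmas "$class" "$sample")
    sum=$((sum + n))
done
together=$(count_lemmas VNAMPSCIR "$sample")

echo "attested=$attested sum=$sum together=$together difference=$((sum - together))"
rm -f "$sample"
```

In this sample tegema occurs only as the lemma of teen, so a count based on '=' entries alone would miss it, and haige is counted once as an adjective and once as a noun, which makes the per-class sum exceed the combined count by one.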