# 1) tab -> @
# 2) grep all the lines for that word class
#    (V in "@[^@]*@V" - verb, N in "@[^@]*@N" - noun etc)
#    ("@[^@]*@[VNAMPSCIR]" for all the lemmas)
# 3) delete the wordform and the morphological info; leave the lemma or '='
# 4) delete "=" and the morphol. info for wordforms equal to lemmas
# 5) sort
# 6) delete duplicate lines
# 7) count the lemmas

tr '\011' '@' < tbl.wordform.et | \
grep "@[^@]*@V" | \
sed 's/\(^[^@]*@\)\([^@=]*\)\(@.*$\)/\2/g' | \
sed 's/@=@.*$//g' | \
sort | \
uniq | \
wc -l

Notes:
1. Lemmas are counted in this roundabout way because of how the lexicon has been built: not all the lemmas appear as wordforms in the lexicon. In fact, 20861 lemmas do appear as wordforms in the lexicon.
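2. The sum of lemmas is 47067, but "together" is 42803. The difference (4264) shows the amount of homonymous lemmas: a lemma that occurs in more than one word class is counted once per class in the sum, but only once in the "together" count.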
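
The per-class and "together" counts in note 2 can be reproduced with a small wrapper around the pipeline above. This is only a sketch. The function name count_lemmas and the sample data are made up for illustration. It assumes each line holds three tab-separated fields: the wordform, the lemma (or '=' when the lemma equals the wordform), and morphological info that starts with the word-class letter.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script).
# $1 = word-class pattern, e.g. V, N or [VNAMPSCIR]
# $2 = lexicon file; assumed layout: wordform TAB lemma-or-'=' TAB class+morph info
count_lemmas() {
    tr '\011' '@' < "$2" |
    grep "@[^@]*@$1" |
    sed 's/\(^[^@]*@\)\([^@=]*\)\(@.*$\)/\2/g' |
    sed 's/@=@.*$//g' |
    sort |
    uniq |
    wc -l |
    tr -d ' '    # some wc implementations pad the number with spaces
}

# Made-up sample, not taken from tbl.wordform.et.
# "walk" is a lemma in two classes; "dog" never appears as a wordform.
sample=$(mktemp)
printf 'walks\twalk\tV\nwalk\t=\tV\nwalk\t=\tN\ndogs\tdog\tN\n' > "$sample"

v=$(count_lemmas V "$sample")                   # verbs: walk
n=$(count_lemmas N "$sample")                   # nouns: dog, walk
together=$(count_lemmas '[VNAMPSCIR]' "$sample") # all classes: dog, walk

echo "sum: $((v + n))  together: $together  difference: $((v + n - together))"
```

On this sample the sum is 3 and "together" is 2, so the difference of 1 corresponds to the one lemma ("walk") that is listed under two word classes.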