Multext-East - Deliverable D1.2. Language-specific resources/Appendix 2 - May 96.




Appendix 2

Number of Hungarian Word Forms.

La'szlo' Tihanyi, 04-30-1996.





Introduction

Although working morphological analyzers exist for Hungarian, the number of possible word forms remains an estimate. The main reason is that, unlike in many other languages, a large number of Hungarian word forms lie on the border between acceptable and unacceptable, as exemplified below. Morphological systems handle this problem by overgeneration. In the following we try to give a more accurate estimate of the number of possible word forms.

First we estimate the total number, and then work through one example word with its concrete figures.

1. Total number of word forms in Hungarian.

The calculation starts from the word counts found in the dictionary of the HUMOR morphological analyzer.

A verbs:                              9.400 (in dictionary)
B verbal prefix:                         50 (estimated average)
C prefixed verbs:                   147.000 (B*A)
D v to v derivations:                    14 (estimated average)
E v to n derivations:                    29 (estimated average)
F verbs derived from verbs:       2.205.000 (C+D*C)
G nouns derived from verbs:       4.263.000 (C*E)

H nouns:                              50.000 (in dictionary)
I adjectives:                         10.800 (in dictionary)
J total nominals                      60.800 (H+I, numerals not counted)
K n to n derivations:                     91 (estimated average)
L n to v derivations:                     72 (estimated average)
M verbs derived from noms:         4.337.600 (L*J)
N nouns derived from noms:         5.532.800 (K*J)

O others (adverbs etc.):               2.300 (in dictionary)

P verbs total                      6.552.000 (A+F+M)
Q nominals total:                  9.845.800 (H+G+N)

W verbal inflections                      59 (in dictionary)
Z nominal inflections                    924 (in dictionary)

R verbal inflection combinations:386.568.000 (W*P)
S nominal infl. combinations:  9.096.780.000 (Z*Q)

T grand total:                 9.483.348.000 (R+S+O)
U grand total with clitics    18.966.696.000 (2*T)


That is about 20 billion word forms.
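
The chain of calculations in the table can be retraced with a short script. The following is a minimal Python sketch, not part of the original estimate: the variable names are ours, the inputs are the figures printed above, and C is taken as printed (147.000) rather than recomputed. Because the sketch carries every product through unrounded, some intermediate values may differ slightly from the printed ones, but the result stays at roughly 19 billion, that is, about 20 billion forms.

```python
# Inputs as printed in the table (dots there are thousands separators).
A = 9_400      # verbs in the dictionary
C = 147_000    # prefixed verbs, taken as printed (assumption: not recomputed)
D = 14         # v to v derivations (estimated average)
E = 29         # v to n derivations (estimated average)
H = 50_000     # nouns in the dictionary
I = 10_800     # adjectives in the dictionary
K = 91         # n to n derivations (estimated average)
L = 72         # n to v derivations (estimated average)
O = 2_300      # others (adverbs etc.)
W = 59         # verbal inflections
Z = 924        # nominal inflections

F = C + D * C  # verbs derived from verbs
G = C * E      # nouns derived from verbs
J = H + I      # total nominals
M = L * J      # verbs derived from nominals
N = K * J      # nouns derived from nominals
P = A + F + M  # verbs total
Q = H + G + N  # nominals total
R = W * P      # verbal inflection combinations
S = Z * Q      # nominal inflection combinations
T = R + S + O  # grand total
U = 2 * T      # grand total with the clitic

print(f"grand total with clitics: {U:,}")
```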

2. Estimation by an example

As we can see, B, D, E, K and L are averages; the concrete numbers for a given word can differ. As an example we take the verb 'ver', which means 'beat', 'hit', 'defeat', 'whack' or 'lap'.

2.1. Prefixes

First we estimate the number of prefixed forms of the example verb. Of the 81 possible prefixed forms, we find 53 that are meaningful. The main problem is that it is impossible to say exactly which morphological constructions are acceptable and which are ungrammatical: the border is fuzzy. Different speakers would very likely make different judgements on the good and bad forms in the following list.

Good forms are glossed with an English explanation; bad ones are marked with *.

agyonver              slay (by hitting)
ala'ver               hit_under
ala'bbver             hit_lower 
be-bever              hit_in (several times)
bele-belever          hit_in (several times)
belever               hit_in
bele'ver              hit_in
bever                 hit_in 
el-elver              whack (several times)
elver                 whack
fel-felver            awake (several times)
felver                awake
fe'lrever             ring_a_bell
fo:lver               awake
fo:le'ver             hit_above
hazaver               chase_home
helyrever             mend_by_hitting
hozza'ver             hit_against
ha'traver             hit_back
idever                hit_here
keresztbever          hit_across
keresztu:lver         beat_through
kette'ver             hit_to_fall_apart
ki-kiver              whack (several times)
kiver                 whack (a child's bottom)
ko:rbever             hit_around
ko:ru:lver            hit_around
ko:zbever             hit_in_between
ko:ze'ver             hit_in_between
le-lever              beat (several times)
lever                 beat (an army)
meg-megver            defeat (several times)
megver                defeat (in sport)
melle'ver             hit_close_to (a spike)
mo:ge'ver             hit_behind
nekiver               hit_against
odaver                hit_against
rea'ver               hit_on
ra'ver                hit_on
sze'jjelver           defeat (an army)
sze'tver              defeat (an army)
tova'bbver            continues_the_hitting
to:nkrever            beat (with big difference)
tu'lver               hit_more_than_needed
uta'naver             hit_again_to_fix
vissza-visszaver      repel (several times)
visszaver             repel (an)
ve'gigver             hit_along
o:sszever             beat (in fight)
a'tver                mislead
u'jraver              hit_again
telever               nail/spike_full (the surface/area)
teliver               nail/spike_full (the surface/area)

*abbaver, *alulver, *bennver, *egybever, *egyu:ttver, *ele'ver
*elo"rever, *elo"ver, *felu:lver, *fennver, *fe'lbever, *fo:lu:lver     
*fo:l-lever, *fo:nnver, *ki-bever, *ku:lo:nver, *ko:zrever, *rajtaver
*szembever, *szertever, *tovaver, *utolver, *viszontver, *ve'gbever
*ve'ghezver, *ve'grever, *a'ltalver, *u'jja'ver

2.2. Derivations

The number of verbal derivations also varies from word to word. The system again allows all combinations (89), but we found only 50 'good' ones for the verb 'ver'.

Here again it is not possible to rule out ill-formed derivations sharply, since the border is fuzzy: judgements on a given form range over good, possible, unlikely and bad.

Here we have 9 'verb to verb' derivations (marked with V in front of the English explanation) and 40 'verb to noun' (marked with N or A).

vereget                V hit (several times)
veregethet             V is allowed to hit several times
veregetheto"           A can be hit several times
veregetheto"bb         A can be beaten more than others several times
veregetheto"se'g       N the possibility of hitting several times
veregete's             N the action of hitting (several times)
veregete'si            A have connection with the hitting (several times)
veregeto"              N somebody who hits (several times)
verendo"               A somebody who should be beaten
verendo"bb             A somebody who should be beaten rather than somebody else
legverendo"bb          A somebody who should be beaten most
veret                  V make somebody beaten
verethet               V allow somebody to make somebody else be beaten
veretheto"             A allowed to be beaten
veretheto"se'g         N the state of being allowed to be beaten
veretlen               A has not beaten so far
veretlenebb            A having less defeat than somebody else
legveretlenebb         A having the least defeat
verete's               N the action of beat (on a ...)
vereto"                A the man who made somebody else to beat others
verhet                 V allowed to beat
verhetetlen            A unbeatable
verhetetlenebb         A more unbeatable than others
legverhetetlenebb      A most unbeatable
verhetetlense'g        N the state of being unbeatable
verheto"               A can be beaten
verheto"bb             A can be beaten more easily than others
verheto"se'g           N the state of being beatable
verheto"se'gi          A have connection with the state of being beatable
vernivalo'             A a something that is to beat
vert                   A somebody who is beaten 
vertebb                A somebody who is beaten more than others
vertse'g               N the state of being beaten
vere's                 N the action of beating 
vere'ses               A have beats
vere'si                A have connection with beat
vere'snyi              A measure of one beat
vere'su"               A have some kind of stamp (coin)
vero"                  A somebody who beats
vero"dik               V laps against something ()
vero"dget              V laps several times 
vero"dgethet           V allowed to lap against several times
vero"dgete's           N the action of lapping several times
vero"dgeto"            N something that laps several times
vero"dhet              V is allowed to lap several times
vero"de's              N the action of lapping against
vero"de'si             A is in connection with lapping
vero"do"               A lapping 
vero"do:tt             A has been lapped

*veregethete's, *veregethete'si, *veregetheto"i, *veregeto"i
*verendo"se'g, *verethete's, *verethete'si, *veretheto"i
*verete'si, *vereto"i, *verhetetlense'geskede's, *verhete'si, *verheto"i
*vere'sesed, *vere'seskedhetne'k, *vere'sesse'g, *vere'sesi't, *vere'sesi'te's
*vere'sesi'to", *vero"dgethete's, *vero"dgethete'si, *vero"dgetheto", *vero"dgetheto"i
*vero"dgete'si, *vero"dgeto"i, *vero"dhete's, *vero"dhete'si, *vero"dheto"
*vero"do"i, *vero"leges, *vero"legesse'g, *vero"se'g, *vero"s
*vero"i, *vero"ibb, *veretlenedik, *verhete's, *vere'sesebb, *vero"bb

2.3 Inflections

2.3.1 Verbal inflections

A Hungarian verb may have 59 inflected forms.

3(Person)*2(Number)*4(Present indicative, Present imperative, Present conditional, Past indicative)*2(Transitivity)+4(1s2s)+7(Infinitive)=59. This number is constant for every verb.

The list of inflected forms of the example verb 'ver':

verek,versz,ver,veru:nk,vertek,vernek
verem,vered,veri,verju:k,veritek,verik
vertem,verte'l,vert,vertu:nk,vertetek,vertek
vertem,verted,verte,vertu:k,verte'tek,verte'k
verne'k,verne'l,verne,verne'nk,verne'tek,verne'nek
verne'm,verne'd,verne',verne'nk,verne'tek,verne'k
verjek,verje'l,verjen,verju:nk,verjetek,verjenek
verjem,verjed,verje,verju:k,verje'tek,verje'k
verlek,vertelek,verne'lek,verjelek
verni,vernem,verned,vernie,vernu:nk,vernetek,verniu:k
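
The count can be checked mechanically. The following is a minimal Python sketch (ours, not part of the original), reading the formula as 3*2*4*2+4+7 and counting the forms listed above; the forms are copied as printed, in the ASCII accent notation of this document.

```python
# Slot count from the formula: person * number * tense/mood * transitivity,
# plus the four "1s2s" forms and the seven infinitive forms.
slots = 3 * 2 * 4 * 2 + 4 + 7

# The listed forms of 'ver', one paradigm row per line.
rows = """
verek,versz,ver,veru:nk,vertek,vernek
verem,vered,veri,verju:k,veritek,verik
vertem,verte'l,vert,vertu:nk,vertetek,vertek
vertem,verted,verte,vertu:k,verte'tek,verte'k
verne'k,verne'l,verne,verne'nk,verne'tek,verne'nek
verne'm,verne'd,verne',verne'nk,verne'tek,verne'k
verjek,verje'l,verjen,verju:nk,verjetek,verjenek
verjem,verjed,verje,verju:k,verje'tek,verje'k
verlek,vertelek,verne'lek,verjelek
verni,vernem,verned,vernie,vernu:nk,vernetek,verniu:k
""".split()

# Flatten the rows into one list of form tokens.
forms = [form for row in rows for form in row.split(",")]

print(slots, len(forms))
```

Note that the count is over paradigm slots, not distinct strings: some forms, such as 'vertem', fill more than one slot.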

Six further adverbial forms are derived from verbs, but we do not count them here since they are infrequent.

vertedben, vertemben, vertetekben, vertu:kben, vertu:nkben, verte'ben

2.3.2. Nominal inflections

A Hungarian noun may have 924 inflected forms.

2(Number)*7(OwnerPerson,OwnerNumber)*3(OwnedNumber)*22(Case)=924
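
The product can be read as the size of a feature grid. The following is a minimal Python sketch (ours, not part of the original) that enumerates the grid; since the text gives only the size of each dimension, the values are placeholder indices rather than real feature names.

```python
from itertools import product

# Feature dimensions from the formula above. Only the sizes come from the
# text; the index values themselves are placeholders (assumption).
number = range(2)
owner = range(7)          # owner person/number combinations
owned_number = range(3)
case = range(22)

# Every combination of the four features is one inflectional slot.
slots = list(product(number, owner, owned_number, case))

print(len(slots))
```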

This set already contains elements that may seem strange, to say the least, but all of them are grammatical.

See Appendix 3 for an example based on the derived form 'vere's' (N beat).

2.4. Clitics

Hungarian has only one clitic, the question particle '-e', which may follow any Hungarian word. The final number should therefore be multiplied by two.

2.5. Total number for the example word

Now we can calculate the actual numbers for the verb 'ver'.

For this we assume that every prefixed form can take every derivation.

C prefixed forms:                      53   see 2.1.
D v to v derivations:                   9   see 2.2.
E v to n derivations:                  40   see 2.2.
F verbs derived from verbs:           540 (C+1)*(D+1)
G nouns derived from verbs:          2160 (C+1)*E

W verbal inflections                   59 see 2.3.1
Z nominal inflections                 924 see 2.3.2

R verbal inflection combinations:  31.860 (W*F)
S nominal infl. combinations:   1.995.840 (Z*G)

T Total1                        2.027.700 (R+S)
U Total forms with clitics     4.055.400 (2*T)

So we have more than 4 million forms for a single verb. This is why we cannot supply any kind of list-type dictionary: the list for a single word would already be bigger than what can be printed, or handled by computer programs that expect a word form list of a language. (With a clitic preprocessor, as in the MULTEXT project, the number is still about 2 million forms per verb.)
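
The figures for 'ver' can be reproduced with a few lines. The following is a minimal Python sketch with our own variable names; the +1 terms stand for the unprefixed verb and for the underived verb, and G is taken as (C+1)*E, the reading that yields the printed 2160.

```python
C = 53    # meaningful prefixed forms of 'ver' (section 2.1)
D = 9     # v to v derivations (section 2.2)
E = 40    # v to n derivations (section 2.2)
W = 59    # verbal inflections (section 2.3.1)
Z = 924   # nominal inflections (section 2.3.2)

F = (C + 1) * (D + 1)  # verbs: prefixed or bare, derived or underived
G = (C + 1) * E        # nominals derived from the prefixed or bare verb
R = W * F              # verbal inflection combinations
S = Z * G              # nominal inflection combinations
T = R + S              # total without the clitic
U = 2 * T              # total with the clitic '-e'

print(f"without clitic: {T:,}  with clitic: {U:,}")
```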

2.6. Compounding

The actual number is much bigger than the roughly 2 million calculated above, because Hungarian also has compounding, whose contribution we cannot even estimate.

These are the compound forms ending in 'vere's' listed in the Explanatory Dictionary of Hungarian:

csapravere's,csipkevere's,csordakivere's,dio'vere's,hulla'mvere's
hi'dvere's,hi'rvere's,istenvere's,je'gvere's,ko:te'lvere's,ka'rtyakevere's
pe'nzvere's,szi'vvere's,sa'torvere's,e'rvere's,a'rvere's

but since compounding is fairly productive, an indefinite number of new forms can be generated idiosyncratically.


Copyright © Centre National de la Recherche Scientifique, 1996.