In our setting, examples are pairs of lemmas and word-forms that have the same morphosyntactic description. A particular morphosyntactic description can thus be considered a concept. E.g., the fact ncmsg([g,o,l,o,b],[g,o,l,o,b,a]) is a positive example for the concept Ncmsg (cf. Table 1), represented by the predicate ncmsg. Note that an orthographic representation is used.
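To make the representation concrete, the following minimal sketch renders a lemma/word-form pair as a fact over letter lists. The helper name `to_fact` is hypothetical and introduced only for illustration; it assumes a plain orthographic string for each word.

```python
def to_fact(predicate: str, lemma: str, word_form: str) -> str:
    """Render a (lemma, word-form) pair as a Prolog-style fact over letter lists."""
    def letters(word: str) -> str:
        # Each orthographic character becomes one list element.
        return "[" + ",".join(word) + "]"
    return f"{predicate}({letters(lemma)},{letters(word_form)})"


# The positive example for the concept Ncmsg quoted in the text.
print(to_fact("ncmsg", "golob", "goloba"))
```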
For our experiments, three 'concepts' were selected from the lexicon of the Slovene language [1], namely the singular genitive forms of (proper and common) masculine, feminine and neuter nouns, i.e., N*msg (2893 cases), N*fsg (2755 cases) and N*nsg (1323 cases). These are represented by the predicates nxmsg, nxfsg and nxnsg.
A simple way to get from the base form to an oblique form is to subtract a suffix from the base form and then append a suffix to the resulting 'stem', giving rules of the form -suff/+suff, e.g., golob -/+a = goloba. To cover the singular genitive data set we need 24 such rules for the most complicated case, the masculine, 10 for the feminine, and 11 for the neuter. To illustrate the morphological processes involved, Table 2 gives the complete set of rules for the N*fsg case, together with their coverage.
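The -suff/+suff rules described above can be sketched as follows. The function names `apply_rule` and `derive_rule` are hypothetical, and `derive_rule` assumes that the 'stem' is simply the longest common prefix of the base form and the oblique form, which need not coincide with the linguistically motivated stem.

```python
from typing import Optional, Tuple


def apply_rule(base: str, strip: str, add: str) -> Optional[str]:
    """Apply a -strip/+add rule; return None if the base does not end in `strip`."""
    if not base.endswith(strip):
        return None
    stem = base[: len(base) - len(strip)]  # also correct when strip is empty
    return stem + add


def derive_rule(base: str, form: str) -> Tuple[str, str]:
    """Derive (strip, add) from a pair, taking the longest common prefix as the stem."""
    i = 0
    while i < min(len(base), len(form)) and base[i] == form[i]:
        i += 1
    return base[i:], form[i:]


# golob -/+a = goloba, as in the text.
print(derive_rule("golob", "goloba"))
print(apply_rule("golob", "", "a"))
```

Counting the distinct pairs returned by `derive_rule` over a data set would give one possible rule inventory, although the counts reported here were not necessarily obtained in this way.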
Table 2: Rules for feminine singular genitive form
The two rules under a) and b) belong, respectively, to the nouns of the canonical first and second feminine declensions, e.g., miz-a/miz-e; perut-0/perut-i ('table'; 'wing'). Case c) belongs to a relatively common alternation affecting all feminine nouns of the first declension ending in -ev, e.g., kletv-0/kletv-e ('curse'). Cases d), f), h), and i) all exhibit a common phonological alternation in Slovene, whereby a schwa in the last syllable is deleted in word-forms with a non-null ending, e.g., ljubezn/ljubezn-i ('love'). The e) case is an idiosyncratic alternation affecting only two nouns of Slovene, namely hc-i/hc-ere; mat-i/mat-ere ('daughter'; 'mother'). Finally, g) and j) are, again, forms of idiosyncratic nouns: kr-0/kr-i; ravn-0/ravn-i ('blood'; 'plain').
As regards Slovene morphology, it should be noted that the morphological stem, especially in its orthographic form, does not in all cases contain enough information to predict the correct forms that it can give rise to. In a stem lexicon we would typically need to accompany the stem with morphological information (i.e., declension), morphosyntactic information (e.g., animacy), and phonological information (e.g., that the -e- in the final syllable is indeed a schwa and not a stressed e). However, in a lexicon such as the one for the MULTEXT-East project, the problem is alleviated to a certain extent, as it contains not stems but the base forms of the words.
For example, the stem of 'table' is miz-, while the base form
is miza, and, as can be seen above, the nominative singular
ending -a serves to uniquely identify the word as belonging
to the first feminine declension and thus to correctly determine the
genitive ending. Still, rules of the -suff/+suff type will in general not have enough discriminatory power to apply only to the correct base forms. The simplest extension to such rules is to take into account more orthographic material from the base, and this is where FOIDL induction comes into play.
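The idea of conditioning a rule on more orthographic material from the base can be illustrated with a small ordered rule list, in which the first matching rule applies. This is a hand-written sketch and not the output of FOIDL; the rule list `RULES`, the function `genitive`, and the choice of contexts are assumptions made purely for illustration, using only the forms miza/mize, perut/peruti and mati/matere mentioned above.

```python
from typing import List, Optional, Tuple

# (required base ending, suffix to strip, suffix to add); hypothetical rules.
Rule = Tuple[str, str, str]

RULES: List[Rule] = [
    ("mati", "i", "ere"),  # idiosyncratic: context is longer than the stripped suffix
    ("a", "a", "e"),       # bases ending in -a: miza -> mize
    ("", "", "i"),         # default: perut -> peruti
]


def genitive(base: str, rules: List[Rule] = RULES) -> Optional[str]:
    """Return the form given by the first rule whose context matches the base."""
    for context, strip, add in rules:
        if base.endswith(context) and base.endswith(strip):
            return base[: len(base) - len(strip)] + add
    return None


for word in ("mati", "miza", "perut"):
    print(word, genitive(word))
```

Without the extra context, a bare -i/+ere rule would apply to every base form ending in -i; ordering the more specific rules before the general ones is what keeps the exception from overgenerating.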