In our setting, examples are pairs of lemmas and word-forms that have the same morphosyntactic description. A particular morphosyntactic description can thus be considered a concept. E.g., the fact ncmsg([g,o,l,o,b],[g,o,l,o,b,a]) is a positive example for the concept Ncmsg (cf. Table 1), represented by the predicate ncmsg. Note that an orthographic representation is used.
For our experiments, three 'concepts' were selected from the lexicon of the Slovene language [1], namely the singular genitive forms of (proper and common) masculine, feminine and neuter nouns, i.e., N*msg (2893 cases), N*fsg (2755 cases) and N*nsg (1323 cases). These are represented by the predicates nxmsg, nxfsg and nxnsg.
A simple way to derive an oblique form from the base form is to subtract a suffix from the base form and then append a suffix to the resulting 'stem', which gives rules of the form -suff/+suff, e.g., golob -/+a = goloba. To cover the singular genitive data set we need 24 such rules for the most complicated case, the masculine, 10 for the feminine, and 11 for the neuter. To illustrate the morphological processes involved, we give in Table 2 the complete set of rules for the N*fsg case, together with their coverage.
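The -suff/+suff notation can be made concrete with a minimal sketch. The function name `apply_rule` and its parameters are illustrative assumptions, not part of the lexicon or of any system described here; the three word pairs are the ones cited in the text.

```python
def apply_rule(base: str, remove: str, add: str) -> str:
    """Apply a -remove/+add rule to a base form (hypothetical helper).

    The suffix `remove` is subtracted from the base form and `add`
    is appended to the resulting 'stem'.
    """
    if not base.endswith(remove):
        raise ValueError(f"rule -{remove}/+{add} does not apply to {base!r}")
    # Slice by length so that an empty suffix leaves the base form intact.
    stem = base[: len(base) - len(remove)]
    return stem + add


# Examples taken from the text: golob -/+a, miza -a/+e, perut -/+i.
print(apply_rule("golob", "", "a"))
print(apply_rule("miza", "a", "e"))
print(apply_rule("perut", "", "i"))
```

A rule with an empty subtracted suffix, such as -/+a, simply appends to the base form.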
Table 2: Rules for feminine singular genitive form
The two rules under a) and b) cover, respectively, the nouns of the canonical first and second feminine declensions, e.g., miz-a/miz-e; perut-0/perut-i ('table'; 'wing'). Case c) is a relatively common alternation affecting all feminine nouns of the first declension ending in -ev, e.g., kletev-0/kletv-e ('curse'). Cases d), f), h), and i) all exhibit a common phonological alternation in Slovene, whereby a schwa in the last syllable is deleted in word-forms with a non-null ending, e.g., ljubezen-0/ljubezn-i ('love'). Case e) is an idiosyncratic alternation affecting only two nouns of Slovene, namely hc-i/hc-ere; mat-i/mat-ere ('daughter'; 'mother'). Finally, g) and j) are, again, forms of idiosyncratic nouns: kr-0/kr-i; ravn-0/ravn-i ('blood'; 'plain').
As regards Slovene morphology, it should be noted that the morphological stem, especially in its orthographic form, does not in all cases contain enough information to predict the forms that it can give rise to. In a stem lexicon we would typically need to accompany the stem with morphological information (e.g., declension), morphosyntactic information (e.g., animacy), and phonological information (e.g., that the -e- in the final syllable is indeed a schwa and not a stressed e). However, in a lexicon such as the one for the MULTEXT-East project, the problem is alleviated to a certain extent, as it contains not stems but the base forms of the words. For example, the stem of 'table' is miz-, while the base form is miza, and, as can be seen above, the nominative singular ending -a serves to uniquely identify the word as belonging to the first feminine declension and thus to correctly determine the genitive ending. Still, the -suff/+suff type rules will in general not have enough discriminatory power to apply only to the correct base forms. The simplest extension to such rules is to take into account more orthographic material from the base, and this is where FOIDL induction comes into play.
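The lack of discriminatory power can be illustrated with a small sketch. It assumes only two of the feminine rules mentioned above (-a/+e and -/+i); the names `RULES` and `candidates` are hypothetical, and the code is not FOIDL, only a demonstration of why suffix matching alone overgenerates.

```python
# Two of the feminine singular genitive rules cited in the text,
# written as (suffix to subtract, suffix to append).
RULES = [("a", "e"), ("", "i")]


def candidates(base: str, rules=RULES) -> list[str]:
    """Return every form produced by a rule whose subtracted suffix matches."""
    forms = []
    for remove, add in rules:
        # An empty `remove` matches every base form, so such a rule
        # is applicable to every word.
        if base.endswith(remove):
            forms.append(base[: len(base) - len(remove)] + add)
    return forms


print(candidates("miza"))   # both rules match; only 'mize' is correct
print(candidates("perut"))  # only the -/+i rule matches
```

For miza, the rule -/+i is applicable on suffix grounds alone and yields the incorrect form mizai alongside the correct mize, so some further condition on the base form is needed to select the right rule.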