Dan Tufis & al * Corpora and Corpus-Based Morpho-Lexical Processing


Eliminating the neuter gender from the lexicon encoding calls for some explanation. Traditionally, grammar books distinguish three genders in Romanian: masculine, feminine and neuter. However, there are few reasons, if any, not to drop the neuter value and adopt a simpler two-gender system. From the inflectional point of view, neuter nouns/adjectives behave in the singular like masculine nouns/adjectives and in the plural like feminine ones. Since there is no intrinsic semantic feature specific to neuter nouns (inanimacy is by no means specific to them, as plenty of feminine and masculine nouns denote inanimate things), preserving the masculine/feminine/neuter distinction creates more problems than it solves [13]. Due to the agreement rules, adjectives can take masculine, feminine and neuter gender. At the lexicon level, keeping the neuter would require adding about 33% more entries for adjectives. At the lookup level, considering only gender, any adjective would be two-way ambiguous (masculine/neuter in the singular and feminine/neuter in the plural). It is worth mentioning, however, that neuter nouns or adjectives can easily be identified if needed: those nouns/adjectives that are tagged with masculine gender in the singular and with feminine gender in the plural are what traditional Romanian linguistics calls neuter nouns/adjectives.
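The identification rule stated at the end of the paragraph above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_neuter_lemmas`, the tuple layout (word form, lemma, gender code, number code) and the three sample lemmas are hypothetical and are not taken from the actual lexicon.

```python
from collections import defaultdict

# Hypothetical lexicon entries: (word form, lemma, gender code, number code).
# Gender codes are 'm'/'f' and number codes 's'/'p', as in the dual-gender encoding.
EXAMPLE_ENTRIES = [
    ("băiat", "băiat", "m", "s"),
    ("băieți", "băiat", "m", "p"),
    ("casă", "casă", "f", "s"),
    ("case", "casă", "f", "p"),
    ("scaun", "scaun", "m", "s"),
    ("scaune", "scaun", "f", "p"),
]


def find_neuter_lemmas(entries):
    """Return the lemmas that are masculine in the singular and feminine in the plural."""
    genders = defaultdict(lambda: {"s": set(), "p": set()})
    for _form, lemma, gender, number in entries:
        genders[lemma][number].add(gender)
    return sorted(
        lemma
        for lemma, by_number in genders.items()
        if by_number["s"] == {"m"} and by_number["p"] == {"f"}
    )


print(find_neuter_lemmas(EXAMPLE_ENTRIES))
```

With the sample entries, only "scaun" satisfies both conditions, so it is the only lemma reported as a traditional neuter.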

This position has recently found adherents among theoretical linguists as well. For instance, in [14] neuter nouns are considered to be underspecified for gender in their lexical entries, with default rules assigning masculine gender to occurrences in the singular and feminine gender to occurrences in the plural.
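Read procedurally, the default rules described above amount to a small lookup. The sketch below is only an illustration of that reading, not the formalism used in [14]; the function name `surface_gender` and the use of `None` to mark an underspecified lexical entry are assumptions.

```python
def surface_gender(lexical_gender, number):
    """Resolve the gender of an occurrence from its lexical gender and its number.

    lexical_gender: 'm', 'f', or None for an entry underspecified for gender
                    (a traditional neuter noun).
    number:         's' (singular) or 'p' (plural).
    """
    if number not in ("s", "p"):
        raise ValueError(f"unknown number code: {number!r}")
    if lexical_gender is not None:
        return lexical_gender          # specified entries keep their gender
    return "m" if number == "s" else "f"  # default rules for underspecified entries


print(surface_gender(None, "s"), surface_gender(None, "p"))
```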

The tables on the next pages present the MSD encoding for Romanian in the following format:

Category (category code)

Attribute Position  Attribute     Value          Example

The category code is an uppercase letter (N, V, A, P, D, T, R, S, C, M, Q, Y, I, X) identifying one of the 14 parts of speech considered in the MULTEXT-EAST encoding schema. Any MSD code will begin with one of these 14 letters. The Attribute Position column specifies the position in the linear MSD encoding of the attribute, the name of which is given in the Attribute column. The Value column contains the allowable values of the current attribute. Their codes, given between parentheses, may appear in an MSD headed by the appropriate category code, at the position specified by the Attribute Position column. This linear encoding is a relatively efficient and compact way to represent the flat attribute-value matrices.
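To make the positional layout concrete, the following minimal sketch splits an MSD string into its category code and a mapping from attribute position to value code. The function name `split_msd` and the sample string are hypothetical; only the 14 category letters come from the text above.

```python
# The 14 category codes listed in the text.
CATEGORY_CODES = frozenset("NVAPDTRSCMQYIX")


def split_msd(msd):
    """Split a linear MSD into (category code, {attribute position: value code})."""
    if not msd or msd[0] not in CATEGORY_CODES:
        raise ValueError(f"not a valid MSD: {msd!r}")
    # Position 1 is the first character after the category code.
    values = {position: code for position, code in enumerate(msd[1:], start=1)}
    return msd[0], values


category, values = split_msd("Ncmsry")
print(category, values)
```

Interpreting each value code then requires the table of the category in question, since the same letter can stand for different values at different positions.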

1. Noun (N)

Attribute Position  Attribute     Value          Example
1                   Type          common (c)     carte
                                  proper (p)     Ion
2                   Gender        masculine (m)  băiatul
                                  feminine (f)   casa
3                   Number        singular (s)   fată
                                  plural (p)     fete
4                   Case          direct (r)     omul
                                  oblique (o)    omului
                                  vocative (v)   omule
5                   Definiteness  yes (y)        omul
                                  no (n)         om
6                   Clitic        no (n)         sora
                                  yes (y)        soru-mea
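As an illustration of how the noun table is read, the sketch below decodes a noun MSD into attribute-value pairs. The attribute names and value codes are copied from the table; the function name `decode_noun_msd`, the sample code `Ncmsryn`, and the choice to accept codes with fewer than six positions are assumptions made for the example.

```python
# Attribute positions 1..6 for nouns, with their value codes, as given in the table.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {"r": "direct", "o": "oblique", "v": "vocative"}),
    ("Definiteness", {"y": "yes", "n": "no"}),
    ("Clitic", {"n": "no", "y": "yes"}),
]


def decode_noun_msd(msd):
    """Decode a noun MSD such as 'Ncmsryn' into an attribute -> value dictionary."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun MSD: {msd!r}")
    codes = msd[1:]
    if len(codes) > len(NOUN_ATTRIBUTES):
        raise ValueError(f"too many positions in {msd!r}")
    decoded = {}
    # Assumption: shorter codes are accepted and simply leave later attributes out.
    for position, ((name, values), code) in enumerate(zip(NOUN_ATTRIBUTES, codes), start=1):
        if code not in values:
            raise ValueError(f"invalid code {code!r} at position {position} ({name})")
        decoded[name] = values[code]
    return decoded


print(decode_noun_msd("Ncmsryn"))
```

Under this reading, `Ncmsryn` describes a common, masculine, singular noun in the direct case, definite and without a clitic, which is the kind of analysis one would expect for a form like "omul".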


