\documentclass[a4paper,11pt]{report}
\setlength{\parskip}{1ex}
\setlength{\parindent}{0in}
\setlength{\textwidth}{6.3in}
\setlength{\textheight}{8.8in}
\setlength{\topmargin}{-0.25in}
\setlength{\headheight}{.25in}
\setlength{\headsep}{.25in}
\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
%\pagestyle{multipage}
\pagestyle{myheadings}
%\pageheader{{\em COP Project 106 \mte}}{M D1.1/2}{{\em Lexicon specifications and notation}}
\newcommand{\mte}{{\sc multext}-East}
\newcommand{\multext}{{\sc multext}}
\newcommand{\eagles}{{\sc eagles}}

\begin{document}
\thispagestyle{empty}
\begin{center}
\vspace{1.5in}

{\Huge\bf Specifications and Notation for Lexicon Encoding}

\vspace{0.5in}

{\Large\bf
COP Project 106

MULTEXT-East

Work Package WP1 -- Task 1.1

Deliverable D1.1, Version 3.0

Final Version
}

\vspace{0.5in}

{\Large\bf
Intermediate Report

21 August 1997
%\today
%Work in Progress
}

\vspace{1in}

{\large
Workpackage Coordinators:\\
CNRS - Nancy Ide and Jean Veronis

Task Leader:\\
PISA - Monica Monachini

LJUBLJANA - Toma\v{z} Erjavec\\
{\sf tomaz.erjavec@ijs.si}\\
\vspace{0.5in}
Contributors:\\
Bulgarian: R.Pavlov, L.Dimitrova, L.Sinapova and K.Simov \\
Czech: V.Petkevi\v{c}\\
Estonian: H.J.Kaalep\\
Hungarian: L.Tihanyi\\
Romanian: D.Tufi\c{s}, A.M.Barbu\\
Slovene: T.Erjavec, P.Holozan
}
\vspace{0.5in}
\end{center}

\renewcommand{\thepage}{\roman{page}}
\markright{COP project 106 \mte {\hfill}Deliverable D1.1 Lexicon specifications and notation{\hfill}}
\tableofcontents

\chapter{Introduction}
\renewcommand{\thepage}{\arabic{page}}
\setcounter{page}{1}
\markright{COP project 106 \mte {\hfill}Deliverable D1.1 M --- Introduction{\hfill}}

The present document constitutes the Milestone M version of the Deliverable D1.1 carried out within the framework of the \mte\ project. It is a significantly revised version of the Deliverable D1.1 for the Intermediate Milestone (IM1).
Its purpose is to (i) provide harmonized lexical specifications for the six languages involved in the project --- Bulgarian, Czech, Estonian, Hungarian, Romanian, Slovene --- and (ii) formulate the relevant notation to be used in the lexicons contributed by each language group as resources for the tools, which will perform the automatic corpus tagging. The two proposals of lexicon specifications presented in the \multext\ D1-6-1B Deliverable of the \multext\ Project (Bel, Calzolari and Monachini eds. 1995) and in the \eagles\ document of the Lexicon sub-group on Morphosyntactic annotation (Monachini and Calzolari 1995) --- which is the basis of the previous one --- constitute together the starting point and the model of the work and results presented here. The partners have evaluated the two proposals from the point of view of the coverage with respect to their languages, have added the specifications needed to encode the peculiarities of their languages and have produced concrete applications of the proposed specifications. The work has been done through a cyclical process of adjustments and re-application, giving rise to continuous exchanges between the task coordinators and the partners. This cycle has led to the formulation of the common proposal for lexicon specifications of the Central \& Eastern European languages contained in the present deliverable (Chapter \ref{chp:common}). On the basis of the \eagles\ and \multext\ models, they are presented as sets of attribute-values --- displayed in tabular format --- and the notation proposed follows the ``string of characters in fixed positions'' strategy. 
The applications of the proposal to the six languages are given in Chapter \ref{chp:LSA}. They have been contributed respectively by:
\begin{description}
\item[Bulgarian:] R.Pavlov, L.Dimitrova, L.Sinapova and K.Simov;
\item[Czech:] V.Petkevi\v{c};
\item[Estonian:] H.J.Kaalep;
\item[Hungarian:] L.Tihanyi;
\item[Romanian:] D.Tufi\c{s} and A.M.Barbu;
\item[Slovene:] T.Erjavec and P.Holozan.
\end{description}

The language application parts present the lexicon specifications category by category and are structured as follows:
\begin{enumerate}
\item[(a)] a first section which describes the features and values pertinent to a given category, in the form of tables with exemplifications from the language under analysis;
\item[(b)] a second section which provides the combinations of values, i.e.\ displays the way in which the different values combine to give all the possible lexicon specifications for the items belonging to that category.
\end{enumerate}

Pisa, October 1995\\
Ljubljana, August 1997

\newpage

%\section{Introduction}

%The aim of the present document is to formulate harmonized
%specifications and a common notation for the lexicons of the six
%languages involved in the \mte\ Project, i.e.\ Bulgarian, Czech,
%Estonian, Hungarian, Romanian, Slovene. These lexicons have been
%contributed in the form of word-form lists associated to their
%morphosyntactic information by the six partners.

%The work carried out within this Task 1.1 ``Lexicon specification and
%resource building" can, hence, be seen as parallel to that developed
%within the \multext\ Task 1.6 (\multext\ 1993), in which harmonized
%lexicon specifications with the respective notation for the six
%project languages have been formulated and used to encode the lexical
%lists contributed within the Task 5.4.

%This will allow us to obtain lexical material with harmonized information
%encoded in an agreed way.
%These resources will run under the same tools
%and make it possible to perform comparisons across languages.

%One of the most important goals of \mte\ is to make its
%standards, software and data freely available for research in both
%academia and industry. The consortium adopts a policy of maximum access
%to all project resources. A primary goal is to provide the greatest
%possible opportunity for feedback from the potential community of users.
%Therefore, the project will provide free access to early versions of
%software and data for use within the NLP and MT community.

%In a larger perspective, the \mte\ work fits into the framework
%of the standardization efforts in the academic and industrial
%environment, provided by \eagles\ and \multext. The Eastern work,
%indeed, serves as a refinement and validation of \multext\ itself. With
%respect to the \eagles\ initiative, it is an example of the large impact
%and dissemination of the proposal within the scientific community and
%constitutes a further contribution to the testing and application of
%the specifications, obviously increasing their coverage from a
%multilingual point of view.

\section{Background work}

The \multext\ D1-6-1B Deliverable (Bel, Calzolari and Monachini eds. 1995) and the \eagles\ proposal of specifications to be encoded in lexica (Monachini and Calzolari 1995) form the basis of the results presented in what follows. The two documents together collect the outcomes of the main European efforts towards standardization in large linguistic resources and, among other fields, in linguistic annotation. The relationships and interdependencies between the two actions are reported in the introduction of the \eagles\ document, where the \multext\ contribution to the \eagles\ effort is described in detail, and are also commented on in the introduction of the \multext\ Deliverable.
Summing up, the two actions have proceeded with the same bottom-up approach, which is also followed here, and have produced similar sets of specifications. The \eagles\ set, aiming at a general description of the languages, is intended to offer a more general proposal and to cover a vast range of purposes, while the \multext\ set has to be considered in the context of the project itself, whose aim was to produce resources for the specific application of automatic corpus tagging. The \multext\ set of lexicon specifications is, hence, task-oriented, being designed with this final purpose in view.

\section{\mte\ work and approach}

The \mte\ partners have analysed and evaluated the set of \multext\ specifications with respect to their languages; where the scheme was not sufficient, the \eagles\ inventory was consulted. If necessary, new distinctions for language-specific information were proposed. The work proceeded in a cyclical process: first, the partners circulated the results of their evaluation work in the form of applications, which were in turn evaluated by Pisa, the coordinator. These applications formed the basis for the proposal of a set of Central \& Eastern specifications, which was designed with a bottom-up procedure. The nucleus of common features, already isolated within \multext, proved to be readily applicable to the Central \& Eastern languages, whereas formulating the distinctions needed to encode the morphosyntactic information peculiar to them required much harmonization work. Some of the features proposed by the partners in their first applications have been dropped, as they pertained to a level that is not purely morphosyntactic (e.g.\ Transitivity); other distinctions have been left out of the set because they were considered too fine.
The resulting set was circulated again among the partners for new cycles of revision and re-application, until the specifications were considered acceptable for all the language groups and stable enough for the first version (IM1) of this deliverable.

However, there were still significant problems with this first version of the deliverable, in particular:
\begin{itemize}
\item the three Slavic languages sometimes described the same phenomena in different ways;
\item language-independent aspects of the specifications (e.g.\ the `form' of numerals: digit, roman, etc.) were treated differently for different languages;
\item different attributes or values were used to describe the same phenomena in different categories or languages (e.g.\ \texttt{1st} and \texttt{1} for first person);
\item the ordering of attributes in some categories was sub-optimal, necessitating long morphosyntactic descriptions, i.e.\ descriptions with long strings of `\texttt{-}';
\item the attributes and values used different `punctuation' (e.g.\ \texttt{full art}, \texttt{Modific.Type}, \texttt{Pron-Form}, \texttt{SubType});
\item the common tables were formatted differently for different categories.
\end{itemize}

Therefore it was decided to produce, for milestone M, version 2 of the deliverable; this effort was led by the Ljubljana site. This harmonisation led to better motivated and, on average, more compact morphosyntactic descriptions for the \mte\ languages, while the formalisation of the tables and descriptions had an added benefit: a simple Perl program (\texttt{mtems-expand}) was written which, working directly with the common tables of the morphosyntactic descriptions of this report, can either expand or validate lexical morphosyntactic descriptions. This program was used to validate the word-form lexica of the project, thus ensuring that all the morphosyntactic descriptions in the lexica of the particular languages are well-formed.
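
The kind of well-formedness check described above can be illustrated with a minimal sketch. It is written in Python rather than Perl and is not the project's \texttt{mtems-expand} program; the names \texttt{NOUN\_POSITIONS}, \texttt{TABLES} and \texttt{validate\_msd} are hypothetical, and the table is only an excerpt of the Noun attributes (Type, Gender, Number and the first seven Case values) taken from the common tables of this report. Accepting descriptions shorter than the table is an assumption of the sketch.

```python
# Hypothetical excerpt of the common Noun table: for each position after
# the part-of-speech character (position 0), the attribute name and the
# set of admissible one-character codes.
NOUN_POSITIONS = [
    ("Type", {"c", "p"}),
    ("Gender", {"m", "f", "n"}),
    ("Number", {"s", "p", "d", "t"}),
    ("Case", {"n", "g", "d", "a", "v", "l", "i"}),
]

# Only the Noun category is covered in this sketch.
TABLES = {"N": NOUN_POSITIONS}
NOT_APPLICABLE = "-"


def validate_msd(msd):
    """Return a list of problems found in msd (an empty list means well-formed)."""
    if not msd:
        return ["empty description"]
    pos, values = msd[0], msd[1:]
    table = TABLES.get(pos)
    if table is None:
        return [f"unknown part-of-speech code {pos!r}"]
    problems = []
    if len(values) > len(table):
        problems.append(f"too many positions: {len(values)} > {len(table)}")
    # Assumption: a description may be shorter than the table; only the
    # positions that are present are checked.
    for (attribute, codes), code in zip(table, values):
        if code != NOT_APPLICABLE and code not in codes:
            problems.append(f"invalid code {code!r} for {attribute}")
    return problems
```

With this excerpt, \texttt{validate\_msd("Ncms-")} reports no problems, while a description with an unknown code in the Gender position is reported as invalid.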
Another Perl program, \texttt{mtems-split}, was also written, which, again working on the common tables, produces language-specific tables. These tables were circulated to the partners, thus ensuring that the language-specific sections do in fact reflect the common tables.

\subsection{Criteria for the inclusion/exclusion of new distinctions}

As regards the types of information to be included in the lexicons and the ``fine-grainedness'' of distinctions, the \multext\ approach can be described simply as follows:
\begin{itemize}
\item in principle, all the distinctions that seemed to complicate the process of disambiguation by means of an automatic tagger have been discarded;
\item a specification for the encoding of a distinction has been included if it affects the word-form level, i.e.\ if the feature is reflected in a different graphical word-form.
\end{itemize}
The motivation for this is to provide the information necessary for our final purpose, i.e.\ automatic corpus tagging.

\section{Description of the proposal}

The methodology adopted to design the set of Central \& Eastern specifications is, as already mentioned, the one tried out in \multext. The proposal has been prepared in the usual \multext\ table format, which displays the specifications (as sets of attribute-values; see below for further details about the notation) together with the respective codes used to mark them in the lexicons:
\begin{itemize}
\item[(i)] the {\em minimal core\/} features, i.e.\ those shared by most of the languages, have been highlighted in the tables. We tried to keep this set common to {\em all\/} the \multext\ and \mte\ languages. In this way, the comparability of the information encoded in the lexical lists of the Central \& Eastern and of the six original \multext\ languages is ensured to a certain extent. The features of the common core are highlighted with asterisks (*) to distinguish them from the
\item[(ii)] purely {\em language-specific values}.
The formulation of this set has been, as already mentioned, highly delicate, partly because many language-specific values were presented in the applications and, sometimes, the same morphosyntactic phenomenon was referred to with two different attribute or value names. The phase of recognizing and harmonizing the semantics of some attributes, values and naming conventions has, hence, required much effort.
\end{itemize}

If a feature has value(s) that are used by only one \mte\ language (i.e.\ if a value is {\em language-specific}), these have been marked with {\tt l.s.}. This marking is used only when a subset of the feature values is language-specific; when a whole attribute, along with all its values, is language-specific, the mark {\tt l.s.} is not used.

Experience shows that the representation described above, together with the concrete applications which display and exemplify the attributes and values and provide their internal constraints and relationships, makes the proposal self-explanatory. Other groups can easily test the specifications on their language, simply by following the method of the applications. The possibility of incorporating idiosyncratic classes and distinctions after the common core features makes the proposal adaptable and flexible enough without compromising compatibility.

\section{Lexical lists}

The proposed specifications, with their respective codes, will be used to encode the word-form lists that constitute the resources for the tool which will perform the automatic tagging of the corpus. Note that the lexicons supplied by each \mte\ partner will have the following form:
\begin{center}
Word-form, lemma, morphosyntactic information, TAG
\end{center}
The TAG part will be provided in a second phase of the Project. Our experience with the six original \multext\ languages demonstrated that it is not possible to specify identical tagsets across languages, even those within the same language family.
The need for idiosyncratic tagsets for each language has also been confirmed within \textsc{parole-mlap}. The comparability and harmonization of the linguistic properties represented in different tagsets can be obtained only by defining them according to the specifications contained in the lexicons, i.e.\ by relating the tagset to the lexicon. In this way, these specifications, agreed and harmonized across languages, make different physical tagsets compatible and mappable onto each other. In other words, lexical specifications are used as a common platform across languages, a sort of ``interface'' which permits different tagsets ``to speak''. This is the philosophy which guided, within \eagles, the tagset mapping exercise (Teufel 1995), where two different tagsets are mapped via the lexicon: lexicon specifications are modelled in a typed hierarchy, the semantics of each physical tag of the two tagsets is defined according to the specifications themselves by means of Prolog rules, and the mapping is performed automatically by a powerful tool (the same is being done for the mapping of two different Italian tagsets).

\section{Notation}

In \multext, the notation was chosen following current practice in NLP, where information is represented in attribute-value formalisms, and following the idea that it should also be self-informative for human readers. The desirability of these descriptions being able to convey language-specific characteristics has also been taken into account. To sum up, the suggested notation format has the following main characteristics:
\begin{itemize}
\item attributes are marked by positions;
\item values are represented by a single character;
\item a special marker reflects the non-applicability of a given attribute.
\end{itemize}
These characteristics make the proposed lexical notation equivalent to the attribute-value pairs used in current unification formalisms (see the D1-6-1B Deliverable for further details).

The linear strings of characters representing the morphosyntactic information to be associated with word-forms are constructed following the philosophy of the Intermediate Format proposed in the \eagles\ Corpus proposal (Leech and Wilson, 1994), i.e.\ of having agreed symbols in predefined and fixed positions. The positions of a string of characters are numbered 0, 1, 2, etc.\ in the following way:
\begin{itemize}
\item[a.] the agreed character at position 0 encodes part-of-speech;
\item[b.] each character at position 1, 2, \ldots, n encodes the value of one attribute (person, gender, number, etc.);
\item[c.] if an attribute does not apply, the corresponding position in the string contains a special marker, in our case `\texttt{-}' (hyphen).
\end{itemize}
\begin{verbatim}
Example: Ncms- (noun, common, masculine, singular, nocase)
\end{verbatim}
This notation adopts the \eagles\ Intermediate Format with a small revision: the Intermediate Format encodes information by means of digits, while in \multext\ characters of a mnemonic nature are preferred. It is worth noting here that this representation is proposed for word-form lists which will be used for a specific application, i.e.\ corpus annotation.

\subsection{The use of '\texttt{-}' ('not-applicable')}

We call this marker `not-applicable' and, as stated above, its function is just to keep each value in the position of its attribute. It may be used in the following cases:
\begin{itemize}
\item[(a)] {\em not applicable given a particular combination of attributes-values}, i.e.\ although the attribute applies to the category in a given language, it does not apply to a particular subclass of the category (e.g.\ Person applies to Pronouns, but not to the Type {\tt demonstrative}).
\item[(b)] {\em not applicable to a particular lexical item, although the attribute applies to the rest of its paradigm} (e.g.\ Gender in the paradigm of Personal Pronouns applies only to the 3rd person: {\sl I, you} vs.\ {\sl she, he}).
\item[(c)] {\em not relevant} to a given language (e.g.\ Gender in Estonian). This appears to be the simplest case to encode with a different character and, for the second phase of the Project, we suggest finding a strategy to distinguish this kind of non-applicability from those at points (a) and (b).
\end{itemize}

\section{Organization of the language-specific sections}

Following the procedure of the \multext\ Deliverable, the language-specific sections consist of two distinct parts:
\begin{itemize}
\item a descriptive section, where the features and values relevant to the lexicon of the language are displayed in tabular form, with examples taken from the language;
\item the combination section, where the way in which the different values combine is shown.
\end{itemize}
Two different strategies for displaying combinations have been followed:
\begin{itemize}
\item[(i)] all the {\em admitted\/} combinations are provided together with an example. This has the disadvantage of producing long lists, given the high number of features and values to be combined, but the advantage of providing only the legal combinations, with the relevant constraints on the application of some features or values in the presence of other features, values or combinations of them (see e.g.\ Gender in Personal Pronouns).
\item[(ii)] a mathematical expression describing the combinations is provided, which can subsequently be used to generate {\em all\/} combinations, including those which are not valid.
\end{itemize}

\chapter{Morphosyntactic Specifications for Central \& Eastern Lexicons}
\label{chp:common}
\markright{COP project 106 \mte {\hfill}Deliverable D1.1 M --- Common Tables{\hfill}}

The categories listed below, with the relevant attributes and values, are based on the \multext\ and \eagles\ documents. The specifications constitute a ``harmonized'' set of features which properly describe lexical items of the different languages. In keeping with the general aim of the project, these harmonized specifications, and the related resources, will contribute to the standardization of corpus annotation work. This common set of features, besides the advantages described in the Introduction, will also provide a common ground for comparing the results of different annotation tools, since the existence of many lexical description systems currently makes such comparisons difficult. The categories and features listed below are therefore the common reference for the work done by the different groups.

\newpage
\section{Table of categories}

\begin{verbatim}
===============  ====
Part-of-Speech   Code
===============  ====
Noun             N
Verb             V
Adjective        A
Pronoun          P
Determiner       D
Article          T
Adverb           R
Adposition       S
Conjunction      C
Numeral          M
Interjection     I
Residual         X
Abbreviation     Y
Particle         Q
===============  ====
\end{verbatim}

The attributes and the values pertinent to these categories are presented below. At the beginning of each category, the string summarizing its pertinent attributes, together with their respective positions, to be used for encoding the items belonging to that category in the lexicons, is displayed. The features constituting the minimal core, in common with \multext\ (and \eagles), are graphically enclosed between strings of stars (*). This will help the partners during the concrete work of preparing the lexical lists and will ensure that all use the same string and put the values in the correct position.
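
To make the role of the positions concrete, the following minimal Python sketch expands a Noun description into named attribute-value pairs. It is only an illustration under stated assumptions: the names \texttt{CATEGORIES}, \texttt{NOUN\_ATTRIBUTES} and \texttt{expand\_noun\_msd} are hypothetical, the category codes are those of the table of categories above, and only the first four Noun attributes (with a subset of their values) from the Noun table of this chapter are included.

```python
# Part-of-speech codes, copied from the table of categories.
CATEGORIES = {
    "N": "Noun", "V": "Verb", "A": "Adjective", "P": "Pronoun",
    "D": "Determiner", "T": "Article", "R": "Adverb", "S": "Adposition",
    "C": "Conjunction", "M": "Numeral", "I": "Interjection",
    "X": "Residual", "Y": "Abbreviation", "Q": "Particle",
}

# Hypothetical excerpt of the Noun table: positions 1 to 4 only, and only
# some of the values listed for Number and Case.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural", "d": "dual"}),
    ("Case", {"n": "nominative", "g": "genitive",
              "d": "dative", "a": "accusative"}),
]


def expand_noun_msd(msd):
    """Map a Noun description such as 'Ncms-' to named attribute values."""
    if not msd.startswith("N"):
        raise ValueError("this sketch only covers the Noun category")
    features = {"PoS": CATEGORIES[msd[0]]}
    # Position i (i >= 1) of the string holds the code of the i-th attribute.
    for (attribute, codes), code in zip(NOUN_ATTRIBUTES, msd[1:]):
        if code == "-":
            continue  # attribute not applicable: leave it out
        features[attribute] = codes[code]
    return features
```

Applied to the example \texttt{Ncms-} from the Notation section, the function yields the part of speech Noun with Type common, Gender masculine and Number singular, and omits Case because its position holds the not-applicable marker.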
Again category by category, on the left side of the table, further useful information is given about ``who uses what?'': the languages are listed at the top left row of the category and an 'x' is put in the row corresponding to a value, if that language marks that value. This expedient permits us to have a clear picture of the level of consensus reached with respect to harmonization when elaborating lexical lists. After the section devoted to the tables, another section presents all the used values listed in alphabetical order of their name and provides the respective code. This will also ensure consistency in the use of values across languages, in order not to compromise the comparison. A further section contains the list of the attributes, providing the information of the category/categories in which they are used, and for some of them, a very synthetic definition of their semantics. \newpage \section{Tables of attribute-values} \subsection{Noun (N)} {\small \begin{verbatim} 11 Positions **** **** **** **** **** ---- ---- ---- ---- ---- ---- PoS Type Gend Numb Case Def Cltc Anim OwnN OwnP OwdN **** **** **** **** **** ---- ---- ---- ---- ---- ---- = ============== ============== = RO SL CS BG ET HU P ATT VAL C x x x x x x = ============== ============== = 1 Type common c x x x x x x proper p x x x x x x - -------------- -------------- - 2 Gender masculine m x x x x feminine f x x x x neuter n x x x x - -------------- -------------- - 3 Number singular s x x x x x x plural p x x x x x x dual d x x l.s. count t x - -------------- -------------- - 4 Case nominative n x x x x x genitive g x x x x dative d x x x accusative a x x x vocative v x x x locative l x x instrumental i x x x l.s. direct r x l.s. oblique o x l.s. partitive 1 x illative x x x inessive 2 x x elative e x x allative t x x adessive 3 x x ablative b x x l.s. translative 4 x terminative 9 x x essive w x x l.s. abessive 5 x l.s. komitative k x l.s. aditive 7 x l.s. temporalis m x l.s. causalis c x l.s. 
sublative s x l.s. delative h x l.s. sociative q x l.s. factive y x l.s. superessive p x l.s. distributive u x * ***************************** * 5 Definiteness no n x x yes y x x l.s. short_art s x l.s. full_art f x - -------------- -------------- - 6 Clitic no n x yes y x - -------------- -------------- - 7 Animate no n x x yes y x x - -------------- -------------- - 8 Owner_Number singular s x plural p x - -------------- -------------- - 9 Owner_Person first 1 x second 2 x third 3 x ---------------- -------------- - 10Owned_Number singular s x plural p x ================================= \end{verbatim} } \newpage \subsection{Verb (V)} {\small \begin{verbatim} 14 Positions **** **** **** **** **** **** **** ---- ---- ---- ---- ---- ---- ---- PoS Type VFrm Tens Pers Numb Gend Voic Neg Def Cltc Case Anim Clt2 **** **** **** **** **** **** **** ---- ---- ---- ---- ---- ---- ---- = ============== ============== = RO SL CS BG ET HU P ATT VAL C x x x x x x = ============== ============== = 1 Type main m x x x x x x auxiliary a x x x x x x modal o x x x x copula c x x x - -------------- -------------- - 2 VForm indicative i x x x x x x subjunctive s x imperative m x x x x x x conditional c x x x x infinitive n x x x x x participle p x x x x x gerund g x x x supine u x x l.s. transgressive t x l.s. quotative q x - -------------- -------------- - 3 Tense present p x x x x x x imperfect i x x x future f x x past s x x x x x x l.s. pluperfect l x l.s. aorist a x - -------------- -------------- - 4 Person first 1 x x x x x x second 2 x x x x x x third 3 x x x x x x - -------------- -------------- - 5 Number singular s x x x x x x plural p x x x x x x l.s. 
dual d x - -------------- -------------- - 6 Gender masculine m x x x x feminine f x x x x neuter n x x x x ********************************* 7 Voice active a x x x x passive p x x x x - -------------- -------------- - 8 Negative no n x x x yes y x x x - -------------- -------------- - 9 Definiteness no n x x yes y x x short_art s x full_art f x l.s. 1s2s 2 x - -------------- -------------- - 10Clitic no n x yes y x - -------------- -------------- - 11Case nominative n genitive g dative d accusative a locative l instrumental i illative x x inessive 2 x elative e x translative 4 x abessive 5 x - -------------- -------------- - 12Animate no n x yes y x - -------------- -------------- - 13Clitic_s no n x yes y x ================================= \end{verbatim} } \newpage \subsection{Adjective (A)} {\small \begin{verbatim} 13 Positions **** **** **** **** **** **** ---- ---- ---- ---- ---- ---- ---- PoS Type Degr Gend Numb Case Def Cltc Anim Form OwnN OwnP OwdN **** **** **** **** **** **** ---- ---- ---- ---- ---- ---- ---- = ============== ============== = RO SL CS BG ET HU P ATT VAL C x x x x x x = ============== ============== = 1 Type qualificative f x x x x indefinite i possessive s x x l.s. ordinal o x - -------------- -------------- - 2 Degree positive p x x x x x comparative c x x x x x superlative s x x x x x - -------------- -------------- - 3 Gender masculine m x x x x feminine f x x x x neuter n x x x x - -------------- -------------- - 4 Number singular s x x x x x x plural p x x x x x x dual d x x - -------------- -------------- - 5 Case nominative n x x x x genitive g x x x x dative d x x x accusative a x x x vocative v x x locative l x x instrumental i x x x l.s. direct r x l.s. oblique o x l.s. partitive 1 x illative x x x inessive 2 x x elative e x x allative t x x adessive 3 x x ablative b x x l.s. translative 4 x terminative 9 x x essive w x x l.s. abessive 5 x l.s. komitative k x l.s. aditive 7 x l.s. temporalis m x l.s. causalis c x l.s. 
sublative s x l.s. delative h x l.s. sociative q x l.s. factive y x l.s. superessive p x l.s. distributive u x l.s. essive_formal f x * ***************************** * 6 Definiteness no n x x yes y x x l.s. short_art s x l.s. full_art f x - -------------- -------------- - 7 Clitic no n x yes y x - -------------- -------------- - 8 Animate no n x x yes y x x - -------------- -------------- - 9 Formation nominal n x compound c x - -------------- -------------- - 10Owner_Number singular s x plural p x - -------------- -------------- - 11Owner_Person first 1 x second 2 x third 3 x ---------------- -------------- - 12Owned_Number singular s x plural p x ================================= \end{verbatim} } \newpage \subsection{Pronoun (P)} {\small \begin{verbatim} 17 Positions **** **** **** **** **** **** **** **** ---- ---- ---- ---- ---- ---- ... PoS Type Pers Gend Numb Case OwnN OwnG Cltc RefT SynT Def Anim Clt2 **** **** **** **** **** **** **** **** ---- ---- ---- ---- ---- ---- ... ... ---- ---- ---- PrFr OwnP OwdN ... ---- ---- ---- = ============== ============== = RO SL CS BG ET HU P ATT VAL C x x x x x x = ============== ============== = 1 Type personal p x x x x x x demonstrative d x x x x x x indefinite i x x x x x x possessive s x x x x x x interrogative q x x x x x relative r x x x x x l.s. exclamative e reflexive x x x x x x x reciprocal y x x l.s. int_rel w x negative z x x x x l.s emphatic h x l.s. determinal m x general g x x x - -------------- -------------- - 2 Person first 1 x x x x x x second 2 x x x x x x third 3 x x x x x x - -------------- -------------- - 3 Gender masculine m x x x x feminine f x x x x neuter n x x x x - -------------- -------------- - 4 Number singular s x x x x x x plural p x x x x x x dual d x x - -------------- -------------- - 5 Case nominative * n x1 x x x x x genitive * g x2 x x x x dative * d x2 x x x x accusative * a x1 x x x x vocative v x locative l x x instrumental i x x x l.s. direct * r xR1 l.s. oblique * o xR2 l.s. 
partitive 1 x illative x x x inessive 2 x x elative e x x allative t x x adessive 3 x x ablative b x x l.s. translative 4 x terminative 9 x x essive w x x l.s. abessive 5 x l.s. komitative k x l.s. aditive 7 x l.s. temporalis m x l.s. causalis c x l.s. sublative s x l.s. delative h x l.s. sociative q x l.s. factive y x l.s. superessive p x l.s. distributive u x l.s. essive_formal f x - -------------- -------------- - 6 Owner_Number singular s x x x x plural p x x x x l.s. dual d x - -------------- -------------- - 7 Owner_Gender masculine m x x feminine f x x neuter n x x ********************************* 8 Clitic no n x x x x yes y x x x x - -------------- -------------- - 9 Referent_Type personal p x x x possessive s x x x attributive a x quantitative q x - -------------- -------------- - 10Syntactic_Type nominal n x x adjectival a x x - -------------- -------------- - 11Definiteness no n x yes y x short_art s x full_art f x - -------------- -------------- - 12Animate no n x yes y x - -------------- -------------- - 13Clitic_s yes y x no n x - -------------- -------------- - 14Pronoun_Form strong s x weak w x - -------------- -------------- - 15Owner_Person first 1 x second 2 x third 3 x - -------------- -------------- - 16Owned_Number singular s x plural p x ================================= \end{verbatim} } {\bf Notes} * In the Romanian case system the value 'direct' conflates 'nominative' and 'accusative', while the value 'oblique' conflates 'genitive' and 'dative'. If we want to list these values as belonging to the same Case universe, i.e.\ if we want either 'direct', 'nominative' and 'accusative' or 'oblique', 'genitive' and 'dative' appear all together in the same system, their semantics should be explained and their internal relationships represented. 
We suggest following the strategy adopted in the \eagles\ validation exercise, where different kinds of relationships between values are formally represented by adding indices to the values of interest. In this example, the relation that 'direct' and 'oblique' bear to the other values of the system is one of 'replacement'. We mark this by labelling 'direct' and 'oblique' with the index 'R' and adding the numerical index 1 to 'direct' and 2 to 'oblique'; these numbers point to the values, marked with the same index in the table, that are respectively replaced.

\newpage \subsection{Determiner (D)} {\small \begin{verbatim} 10 Positions **** **** **** **** **** **** **** **** ---- ---- PoS Type Pers Gend Numb Case OwnN OwnG Cltc Mod **** **** **** **** **** **** **** **** ---- ---- = ============== ============== = RO SL CS BG ET HU P ATT VAL C x - - - - - = ============== ============== = 1 Type demonstrative d x indefinite i x possessive s x interrogative q relative r exclamative e article a l.s. int_rel w x l.s. negative z x l.s. emphatic h x - -------------- -------------- - 2 Person first 1 x second 2 x third 3 x - -------------- -------------- - 3 Gender masculine m x feminine f x neuter n x - -------------- -------------- - 4 Number singular s x plural p x - -------------- -------------- - 5 Case direct r x oblique o x - -------------- -------------- - 6 Owner_Number singular s x plural p x - -------------- -------------- - 7 Owner_Gender masculine m feminine f neuter n ********************************* 8 Clitic no n x yes y x - -------------- -------------- - 9 Modific_Type prenomin e x postnomin o x ================================= \end{verbatim} } \newpage \subsection{Article (T)} {\small \begin{verbatim} 6 Positions **** **** **** **** **** ---- PoS Type Gend Numb Case Cltc **** **** **** **** **** ---- = ============== ============== = RO SL CS BG ET HU P ATT VAL C x - - - - x = ============== ============== = 1 Type definite f x x indefinite i x x l.s. possessive s x l.s.
demonstrative d x - -------------- -------------- - 2 Gender masculine m x feminine f x neuter n x - -------------- -------------- - 3 Number singular s x plural p x - -------------- -------------- - 4 Case l.s. direct r x l.s. oblique o x ********************************* 5 Clitic no n x yes y x ================================= \end{verbatim} } \newpage \subsection{Adverb (R)} {\small \begin{verbatim} 6 Positions **** **** **** ---- ---- ---- PoS Type Degr Cltc Numb Pers **** **** **** ---- ---- ---- = ============== ============== = RO SL CS BG ET HU P ATT VAL C x x x x x x = ============== ============== = 1 Type general g x x x x x particle p x x l.s. causal o x l.s. negative z x l.s. adjectival a x l.s. verbal v x modifier m x x l.s. int_rel w x l.s. interrogative q x - -------------- -------------- - 2 Degree positive p x x x comparative c x x x superlative s x x x ********************************* 3 Clitic no n x x yes y x x - -------------- -------------- - 4 Number singular s x plural p x - -------------- -------------- - 5 Person first 1 x second 2 x third 3 x ================================= \end{verbatim} } \newpage \subsection{Adposition (S)} {\small \begin{verbatim} 5 Positions **** **** **** ---- ---- PoS Type Form Case Cltc **** **** **** ---- ---- = ============== ============== = RO SL CS BG ET HU P ATT VAL C x x x x x x = ============== ============== = 1 Type preposition p x x x x x postposition t x x - -------------- -------------- - 2 Formation simple s x x x compound c x x x ********************************* 3 Case nominative n (req.by prep.) 
genitive g x x x dative d x x x accusative a x x x locative l x x instrumental i x x - -------------- -------------- - 4 Clitic no n x yes y x ================================= \end{verbatim} } \newpage \subsection{Conjunction (C)} {\small \begin{verbatim} 8 Positions **** **** ---- ---- ---- ---- ---- ---- PoS Type Form CTyp SubT Cltc Numb Pers **** **** ---- ---- ---- ---- ---- ---- = ============== ============== = RO SL CS BG ET HU P ATT VAL C x x x x x x = ============== ============== = 1 Type coordinating c x x x x x x subordinating s x x x x x x ********************************* 2 Formation simple s x x x x compound c x x x x - -------------- -------------- - 3 Coord_Type simple s x repetit r x correlat c x sentence p x words w x - -------------- -------------- - 4 Sub_Type negative z x positive p x - -------------- -------------- - 5 Clitic no n x yes y x - -------------- -------------- - 6 Number singular s x plural p x - -------------- -------------- - 7 Person first 1 x second 2 x third 3 x ================================= \end{verbatim} } \newpage \subsection{Numeral (M)} {\small \begin{verbatim} 13 Positions **** **** **** **** **** ---- ---- ---- ---- ---- ---- ---- ---- PoS Type Gend Numb Case Frm Def Cltc Clas Anim OwnN OwnP OwdN **** **** **** **** **** ---- ---- ---- ---- ---- ---- ---- ---- = ============== ============== = RO SL CS BG ET HU P ATT VAL C x x x x x x = ============== ============== = 1 Type cardinal c x x x x x x ordinal o x x x x x x fractal f x x multiple m x x x collect l x x special s x x - -------------- -------------- - 2 Gender masculine m x x x x feminine f x x x x neuter n x x x x - -------------- -------------- - 3 Number singular s x x x x x x plural p x x x x x x dual d x x - -------------- -------------- - 4 Case nominative n x x x x genitive g x x x x dative d x x x accusative a x x x locative l x x instrumental i x x x l.s. direct r x l.s. oblique o x l.s. 
partitive 1 x illative x x x inessive 2 x x elative e x x allative t x x adessive 3 x x ablative b x x l.s. translative 4 x terminative 9 x x essive w x x l.s. abessive 5 x l.s. komitative k x l.s. aditive 7 x l.s. temporalis m x l.s. causalis c x l.s. sublative s x l.s. delative h x l.s. sociative q x l.s. factive y x l.s. superessive p x l.s. distributive u x l.s. essive_formal f x l.s. multiplicative 6 x ********************************* 5 Form digit d x x x x x x roman r x x x x x x letter l x x x x x x l.s. both b x l.s. m-form m x l.s. approx a x - -------------- -------------- - 6 Definiteness no n x x yes y x x short_art s x full_art f x - -------------- -------------- - 7 Clitic no n x yes y x - -------------- -------------- - 8 Class definite1 1 x definite2 2 x definite34 3 x definite f x demonstrative d x indefinite i x interrogative q x relative r x - -------------- -------------- - 9 Animate no n x yes y x - -------------- -------------- - 10Owner_Number singular s x plural p x - -------------- -------------- - 11Owner_Person first 1 x second 2 x third 3 x - -------------- -------------- - 12Owned_Number singular s x plural p x ================================= \end{verbatim} } \newpage \subsection{Interjection (I)} {\small \begin{verbatim} 3 Positions **** ---- ---- PoS Type Form **** ---- ---- = ============== ============== = RO SL CS BG ET HU P ATT VAL C x x x x x x = ============== ============== = 1 Type mood m x other o x - -------------- -------------- - 2 Formation simple s x compound c x ================================= \end{verbatim} } \newpage \subsection{Residual (X)} {\small \begin{verbatim} 1 Position **** PoS **** = ============== ============== = RO SL CS BG ET HU P ATT VAL C x x x x x x = ============== ============== = ================================= \end{verbatim} } \newpage \subsection{Abbreviation (Y)} {\small \begin{verbatim} 6 Positions **** ---- ---- ---- ---- ---- PoS SynT Gend Numb Case Def **** ---- ---- ---- ---- ---- = 
============== ============== = RO SL CS BG ET HU P ATT VAL C x x x x x x = ============== ============== = 1 Syntactic_Type nominal n x x verbal v x x adjectival a x x adverbial r x x - ------------------------------- 2 Gender masculine m x feminine f x neuter n x - -------------- -------------- - 3 Number singular s x x plural p x x - -------------- -------------- - 4 Case l.s. nominative n x l.s. genitive g x l.s. direct r x l.s. oblique o x l.s. partitive 1 x l.s. illative x x l.s. inessive 2 x l.s. elative e x l.s. allative t x l.s. adessive 3 x l.s. ablative b x l.s. translative 4 x l.s. terminative 9 x l.s. essive w x l.s. abessive 5 x l.s. komitative k x l.s. aditive 7 x - -------------- -------------- - 5 Definiteness yes y x no n x ================================= \end{verbatim} } \newpage \subsection{Particle (Q)} {\small \begin{verbatim} 5 Positions **** ---- ---- ---- PoS Type Form Cltc **** ---- ---- ---- = ============== ============== = RO SL CS BG ET HU P ATT VAL C x x x x - - = ============== ============== = 1 Type negative z x x infinitive n x subjunctive s x aspect a x future f x general g x comparative c x verbal v x interrogative q x modal o x - -------------- -------------- - 2 Formation simple s x compound c x - -------------- -------------- - 3 Clitic no n x yes y x ================================= \end{verbatim} } \newpage \section{List of values with respective codes} The values presented within the tables are, in the following, listed in alphabetical order; the first column gives the name of the value, the second column its code and the third lists attributes for which the value is appropriate. 
{\scriptsize
\begin{verbatim}
1s2s             2   Definiteness
abessive         5   Case
ablative         b   Case
accusative       a   Case
active           a   Voice
adessive         3   Case
aditive          7   Case
adjectival       a   Syntactic_Type, Type
allative         t   Case
aorist           a   Tense
approx           a   Form
article          a   Type
aspect           a   Type
attributive      a   Referent_Type
auxiliary        a   Type
both             b   Form
cardinal         c   Type
causal           o   Type
causalis         c   Case
collect          l   Type
common           c   Type
comparative      c   Degree, Type
compound         c   Formation
conditional      c   VForm
coordinating     c   Type
copula           c   Type
correlat         c   Coord_Type
count            t   Number
dative           d   Case
definite         f   Class, Type
definite1        1   Class
definite2        2   Class
definite34       3   Class
delative         h   Case
demonstrative    d   Class, Type
determinal       m   Type
digit            d   Form
direct           r   Case
distributive     u   Case
dual             d   Number, Owner_Number
elative          e   Case
emphatic         h   Type
essive           w   Case
essive_formal    f   Case
exclamative      e   Type
factive          y   Case
feminine         f   Gender, Owner_Gender
first            1   Owner_Person, Person
fractal          f   Type
full_art         f   Definiteness
future           f   Tense, Type
general          g   Type
genitive         g   Case
gerund           g   VForm
illative         x   Case
imperative       m   VForm
imperfect        i   Tense
indefinite       i   Class, Type
indicative       i   VForm
inessive         2   Case
infinitive       n   Type, VForm
instrumental     i   Case
int_rel          w   Type
interrogative    q   Class, Type
komitative       k   Case
letter           l   Form
locative         l   Case
main             m   Type
masculine        m   Gender, Owner_Gender
modal            o   Type
modifier         m   Type
mood             m   Type
multiple         m   Type
multiplicative   6   Case
negative         z   Sub_Type, Type
neuter           n   Gender, Owner_Gender
no               n   Animate, Clitic, Clitic_s, Definiteness, Negative
nominal          n   Formation, Syntactic_Type
nominative       n   Case
oblique          o   Case
ordinal          o   Type
other            o   Type
participle       p   VForm
particle         p   Type
partitive        1   Case
passive          p   Voice
past             s   Tense
personal         p   Referent_Type, Type
pluperfect       l   Tense
plural           p   Number, Owned_Number, Owner_Number
positive         p   Degree, Sub_Type
possessive       s   Referent_Type, Type
postnomin        o   Modific_Type
postposition     t   Type
prenomin         e   Modific_Type
preposition      p   Type
present          p   Tense
proper           p   Type
qualificative    f   Type
quantitative     q   Referent_Type
quotative        q   VForm
reciprocal       y   Type
reflexive        x   Type
relative         r   Class, Type
repetit          r   Coord_Type
roman            r   Form
second           2   Owner_Person, Person
sentence         p   Coord_Type
short_art        s   Definiteness
simple           s   Coord_Type, Formation
singular         s   Number, Owned_Number, Owner_Number
sociative        q   Case
special          s   Type
strong           s   Pronoun_Form
subjunctive      s   Type, VForm
sublative        s   Case
subordinating    s   Type
superessive      p   Case
superlative      s   Degree
supine           u   VForm
temporalis       m   Case
terminative      9   Case
third            3   Owner_Person, Person
transgressive    t   VForm
translative      4   Case
verbal           v   Type
vocative         v   Case
weak             w   Pronoun_Form
words            w   Coord_Type
yes              y   Animate, Clitic, Clitic_s, Definiteness, Negative
\end{verbatim}
}
\newpage
\section{List of attributes}
In this section, all the attributes presented in the tables are listed in alphabetical order. For some attributes which are not self-explanatory, a brief description of their semantics is provided.
\vspace{2ex}
\begin{center}
\begin{tabular}{|l l|}
\hline
{\bf Attribute} & {\bf Relevant to Category} \\
\hline
Animate & Adj Noun Num Pron Verb\\
Case & Abbr Adj Adpos Art Det Noun Num Pron Verb\\
Class & Num\\
Clitic & Adj Adpos Adv Art Conj Det Noun Num Part Pron Verb\\
Clitic\_s & Pron\\
Coord\_Type & Conj\\
Definiteness & Abbr Adj Noun Num Pron Verb\\
Degree & Adj Adv \\
Form & Num\\
Formation & Adj Adpos Conj Interj Part\\
Gender & Abbr Adj Art Det Noun Num Pron Verb\\
Modific\_Type & Det\\
Negative & Verb\\
Number & Abbr Adj Adv Art Conj Det Noun Num Pron Verb\\
Owned\_Number & Adj Noun Num Pron\\
Owner\_Gender & Det Pron\\
Owner\_Number & Adj Det Noun Num Pron\\
Owner\_Person & Adj Noun Num Pron\\
Person & Adv Conj Det Pron Verb\\
Pronoun\_Form & Pron\\
Referent\_Type& Pron\\
Sub\_Type & Conj\\
Syntactic\_Type& Abbr Pron\\
Tense & Verb\\
Type & Adj Adpos Adv Art Conj Det Interj Noun Num Part Pron Verb\\
VForm & Verb\\
Voice & Verb\\
\hline
\end{tabular}
\end{center}
\begin{description}
\item[Class:] distinguishes subtypes of Numerals in
Czech which have distinct syntactic distributions: e.g.\ subclasses for 1, 2, 3\&4, etc.\ are distinguished.
\item[Clitic\_s:] the 'yes' value of the Clitic\_s attribute denotes Czech pronouns having the clitic morpheme 's' appended as a suffix.
\item[Definiteness:] corresponds to the definite and indefinite article in English, which is expressed in Bulgarian by suffixes; these differ according to the gender and number of a word. For singular masculine there are two forms, the full article and the short article: the full article is used when a sing.masc. form is the syntactic subject of the clause, and the short article is used otherwise. The distinction full vs. short is not made for feminine, neuter and plural forms. Definiteness is also used in Romanian.
\item[Form:] used to distinguish different forms of numerals (Roman, digit, 'letter') and, in Bulgarian, also to mark numerals that refer to male persons (but not children) or to mixed groups of males and females.
\item[Formation:] refers to the graphical components: simple, i.e.\ consisting of one word; compound, i.e.\ consisting of more than one word.
\item[Modific\_Type:] refers to the prenominal or postnominal position of Determiners, which have different forms in the two positions in Romanian.
\item[Negative:] the value 'yes' encodes negative verbal word-forms in Czech, Slovene and Estonian.
\item[Owned\_Number:] in the Hungarian system, different word-forms are distinguished for nominals on the basis of the so-called 'anaphoric possessive' number, i.e.\ the number of the thing(s) possessed by the nominal in question.
\item[Owner\_Gender:] used to encode the Gender of the possessor in Pronouns and (in Romanian) Determiners.
\item[Owner\_Number:] used to specify the possessor number in Pronouns, as well as (in Romanian) in Determiners, and (in Hungarian) in Adjectives and Nouns.
\item[Owner\_Person:] used to specify the possessor person (in Hungarian) in Adjectives and Nouns.
\item[Pronoun\_Form:] used to encode weak and strong pronouns in Romanian. \item[Referent\_Type:] used to distinguish reflexive personal from reflexive possessive pronouns in the Slavic languages. In Bulgarian, it also describes a subdivision on the basis of semantic features which have effect on the morpho-syntactic paradigm, e.g.\ quantitative: the pronoun refers to quantity, etc. \item[Sub\_Type:] used in Romanian to distinguish negative from positive Conjunctions. \item[Syntactic\_Type:] used to distinguish the nominal and adjectival function of Pronouns in Slovene and Czech. Also used in Abbreviations to signal the Part of Speech of the abbreviation; currently used only by Romanian and Estonian. \end{description} \chapter{Language Specific Applications} \label{chp:LSA} \section{Application to Bulgarian} \markright{COP project 106 \mte {\hfill}Deliverable D1.1 F --- Bulgarian{\hfill}} The application to Bulgarian has been carried out by Radoslav Pavlov and Ludmila Dimitrova (IM-BAS), Lydia Sinapova (IIT-BAS), and Kiril Iv. Simov (LML-BAS). \begin{small} \begin{verbatim} 1. Noun (N) = ============== ============== ============= = P ATT VAL Example C = ============== ============== ============= = 1 Type common kniga c proper Ivan p - -------------- -------------- ------------- - 2 Gender masculine stol m feminine masa f neuter vreteno n - -------------- -------------- ------------- - 3 Number singular momtche s plural stolove p l.s. count (dva) stola t - -------------- -------------- ------------- - 4 Case nominative narod n vocative narode v * ***************************** ************* * 5 Definiteness no utchitel n yes zhenata y l.s. short_art moliva s l.s. 
full_art molivyt f - -------------- -------------- ------------- - 6 Clitic - - - -------------- -------------- ------------- - 7 Animate - - - -------------- -------------- ------------- - 8 Owner_Number - - - -------------- -------------- ------------- - 9 Owner_Person - - ---------------- -------------- ------------- - 10Owned_Number - - =============================== ============= = Combinations === ====== ======== ======== === ==== ==== ==== ==== ==== ======= POS Type Gend Numb Case Def Clit Anim OwnN OwnP OwdN Example === ====== ====== ===== ==== === ==== ==== ==== ==== ==== ======= N [cp] [mfn] [sp] - n - - - - - 1. N [cp] m s - [sf] - - - - - 2. N [cp] [mf] - v - - - - - - 3. N [cp] [mfn] p - y - - - - - 4. N [cp] [fn] s - y - - - - - 5. N c m t - - - - - - - 6. === ====== ======= ==== ==== === ==== ==== ==== ==== ==== ======= Examples: 1. narod, zhena, selo, Ivan, Penka 2. naroda, narodyt 3. narode, zheno 4. narodite, zhenite, selata 5. zhenata, seloto 6. naroda 2. Verb (V) = ============== ============== ============== = P ATT VAL Example C = ============== ============== ============== = 1 Type main govorya m auxiliary sym a - -------------- -------------- -------------- - 2 VForm indicative govorya i imperative govorete m participle govoril p gerund govorejki g - -------------- -------------- -------------- - 3 Tense present govorya p imperfect govoreh i past govoreno s l.s. 
aorist govorih a - -------------- -------------- -------------- - 4 Person first govorya 1 second govorish 2 third govori 3 - -------------- -------------- -------------- - 5 Number singular govorya s plural govoryat p - -------------- -------------- -------------- - 6 Gender masculine govoril m feminine govorila f neuter govorilo n ************************************************ 7 Voice active govorest a passive govoreno p - -------------- -------------- -------------- - 8 Negative - - -------------- -------------- -------------- - 9 Definiteness no govoril n yes govorilite y short_art govoriliya s full_art govoriliyat f - -------------- -------------- -------------- - 10Clitic - - -------------- -------------- -------------- - 11Case - - -------------- -------------- -------------- - 12Animate - - -------------- -------------- -------------- - 13Clitic_s - ============================================== = Combinations === ===== ==== ==== ==== ===== ==== === === === === ==== == === ======= PoS Type VFrm Tens Pers Numb Gend Voic Neg Def Cl1 Case An Cl2 Example === ===== ==== ==== ==== ===== ==== === === === === ==== == === ======= V [ma] i [pai][123] [sp] - - - - - - - - 1. V [ma] m - 2 [sp] - - - - - - - - 2. V m p [pai] - s [mfn] a - n - - - - 3. V m p [pa] - s m a - [sf] - - - - 4. V m p [pa] - s [fn] a - y - - - - 5. V m p [pai] - p - a - n - - - - 6. V m p [pa] - p - a - y - - - - 7. V m p s - s [mfn] p - n - - - - 8. V m p s - s m p - [sf] - - - - 9. V m p s - s [fn] p - y - - - - 10. V m p s - p - p - [ny] - - - - 11. V a p [ai] - s [mfn] a - n - - - - 12. V a p [a] - s m a - [sf] - - - - 13. V a p [a] - s [fn] a - y - - - - 14. V a p [ai] - p - a - n - - - - 15. V a p [a] - p - a - y - - - - 16. V [ma] g - - - - - - - - - - - 17. === ===== ==== ==== ==== ===== ==== === === === === ==== == === ======= Examples: 1. govorya, sym 2. govori, bydete 3. govorest, govorila 4. govorestiya, govorestiyat 5. govorestta, govoriloto 6. govoresti, govoreli 7. 
govoresti, govorilite 8. govoren, govorena 9. govoreniya, govoreniyat 10. govorenata, govorenoto 11. govoreni, govorenite 12. bil, bila 13. biliya, biliyat 14. bilata, biloto 15. bili 16. bilite 17. govorejki, bidejki 3. Adjective (A) = ============== ============== =============== == P ATT VAL Example CC = ============== ============== =============== == 1 Type - - - -------------- -------------- --------------- -- 2 Degree - - - -------------- -------------- --------------- -- 3 Gender masculine bedniyat m feminine bedna f neuter bednoto n - -------------- -------------- --------------- -- 4 Number singular bedniya s plural bednite p - -------------- -------------- --------------- -- 5 Case - * ************************************************ 6 Definiteness no bedna n yes bednata y l.s. short_art bedniya s l.s. full_art bedniyat f - -------------- -------------- --------------- -- 7 Clitic - - -------------- -------------- --------------- -- 8 Animate - - -------------- -------------- --------------- -- 9 Formation - - -------------- -------------- --------------- -- 10Owner_Number - - -------------- -------------- --------------- -- 11Owner_Person - ---------------- -------------- --------------- -- 12Owned_Number - ================================================== Combinations === ==== ==== ==== ==== ==== === ==== === === === === === ======= PoS Type Degr Gend Numb Case Def Clit An Frm OnN OnP OdN Example === ==== ==== ==== ==== ==== === ==== === === === === === ======= A - - s [mfn] - n - - - - - - beden, bedna A - - s m - [sf] - - - - - - bedniya, bedniyat A - - s [fn] - y - - - - - - bednata, bednoto A - - p - - [ny] - - - - - - bedni, bednite === ==== ==== ==== ==== ==== === ==== === === === === === ======= 4. 
Pronoun (P) = ============== ============== =========== = P ATT VAL Example C = ============== ============== =========== = 1 Type personal az p demonstrative tozi d indefinite nyakoj i possessive moj s interrogative koi q relative kojto r reflexive sebe x negative nikoj z general vseki g - -------------- -------------- ----------- - 2 Person first az 1 second ti 2 third toj 3 - -------------- -------------- ----------- - 3 Gender masculine moj m feminine moya f neuter moe n - -------------- -------------- ----------- - 4 Number singular az s plural nie p - -------------- -------------- ----------- - 5 Case nominative toj n dative nemu d accusative nego a - -------------- -------------- ----------- - 6 Owner_Number - - -------------- -------------- ----------- - 7 Owner_Gender - ********************************************* 8 Clitic no sebe_si n yes mi y - -------------- -------------- ----------- - 9 Referent_Type personal koj p possessive tchij s attributive kakva a quantitative kolko q - -------------- -------------- ----------- - 10Syntactic_Type - - -------------- -------------- ----------- - 11Definiteness no svoj n yes svoyata y short_art svoya s full_art svoyat f - -------------- -------------- ----------- - 12Animate - - -------------- -------------- ----------- - 13Clitic_s - - -------------- -------------- ----------- - 14Pronoun_Form - - -------------- -------------- ----------- - 15Owner_Person - - -------------- -------------- ----------- - 16Owned_Number - ============================================= Combinations Personal pronouns === ==== ==== ==== === ==== === === === ==== ==== === ======= PoS Type Pers Gnd Nmb Case OnN OnG Clt RefT SynT Def Example === ==== ==== ==== === ==== === === === ==== ==== === ======= P p [12] - s n - - - - - - 1. P p 3 [mfn] s n - - - - - - 2. P p [123] - p n - - - - - - 3. P p [12] - s [ad] - - [yn] - - - 4. P p 3 [mnf] s [ad] - - [yn] - - - 5. P p [123] - p [ad] - - [yn] - - - 6. Examples: 1. az, ti 2. toj, tya, to 3. 
nie, vie, te 4. mene, me, mi, mene_me 5. nego, go, nego_go 6. nas, ni, nas_ni, nam Demonstrative pronouns === ==== ==== ==== === ==== === === === ==== ==== === ======= PoS Type Pers Gnd Nmb Case OnN OnG Clt RefT SynT Def Example === ==== ==== ==== === ==== === === === ==== ==== === ======= P d - [mfn] s - - - - p - - tozi,tova P d - - p - - - - p - - tezi P d - - - - - - - q - - tolkova Indefinite, Interrogative, Relative, Negative pronouns === ==== ==== ==== === ==== === === === ==== ==== === ======= PoS Type Pers Gnd Nmb Case OnN OnG Clt RefT SynT Def Example === ==== ==== ==== === ==== === === === ==== ==== === ======= P [iqrz] - m s [nad] - - - p - - 1. P [iqrz] - m s - - - - [as] - - 2. P [iqrz] - [fn] s - - - - [pas] - - 3. P [iqrz] - - p - - - - [pas] - - 4. P [iqrz] - - - - - - - q - - 5. Examples: 1. nyakoj, nyakogo, nyakomu, nikoj 2. nyakakyv, netchij, nikakyv 3. nyakoya, nyakakva, nechiya 4. nyakoi, nyakavi, netchii 5. nyakolko, nikolko, kolko Possessive pronouns === ==== ==== ==== === ==== === === === ==== ==== === ======= PoS Type Pers Gnd Nmb Case OnN OnG Clt RefT SynT Def Example === ==== ==== ==== === ==== === === === ==== ==== === ======= P s [123] m s - - - n - - [nsf] 1. P s [123] [fn] s - - - n - - [ny] 2. P s [123] - p - - - n - - [ny] 3. P s [123] - [sp] - - - y - - - 4. Examples: 1. moj, moya, moyat 2. moya, moyata, tvoya 3. moi, moite, tvoi 4. mi, ti, mu, j, ni, vi, im Reflexive pronouns === ==== ==== ==== === ==== === === === ==== ==== === ======= PoS Type Pers Gnd Nmb Case OnN OnG Clt RefT SynT Def Example === ==== ==== ==== === ==== === === === ==== ==== === ======= P x - - - [ad] - - [ny] p - - 1. P x - m s - - - n s - [nsf] 2. P x - [fn] s - - - n s - [ny] 3. P x - - p - - - n s - [ny] 4. P x - - - - - - y s - - 5. Examples: 1. sebe, se, si, sebe_si 2. svoj, svoyat, svoya 3. svoya, svoyata 4. svoi, svoite 5. 
si General pronouns === ==== ==== ==== === ==== === === === ==== ==== === ======= PoS Type Pers Gnd Nmb Case OnN OnG Clt RefT SynT Def Example === ==== ==== ==== === ==== === === === ==== ==== === ======= P g - m s [nad] - - - p - - 1. P g - m s - - - - a - - 2. P g - [fn] s - - - - [pa] - - 3. P g - - p - - - - [pa] - - 4. P g - m s - - - - q - [sf] 5. P g - [fn] s - - - - q - [ny] 6. P g - - p - - - - q - [ny] 7. Examples: 1. vseki, vsekigo, vsekimu 2. vsyakakyv 3. vsyaka, vsyakakva 4. vsyakoi, vsyakakvi 5. vsitchkiya, vsitchikiyat 6. vsitchka, vsitchkata 7. vsitchki, vsitchkite 5. Determiner (D) Not applicable. 6. Article (T) Not applicable. 7. Adverb (R) = ============== ============== ============ = P ATT VAL Example C = ============== ============== ============ = 1 Type general tuk g l.s. adjectival umno a - -------------- --------------- ----------- - 2 Degree - ********************************************** 3 Clitic - - -------------- -------------- ------------ - 4 Number - - -------------- -------------- ------------ - 5 Person - ============================================== Combinations ===== ====== ==== ==== ==== ===== ====================== POS Type Degr Clct Numb Pers Example ===== ====== ==== ==== ==== ===== ====================== R g - - - - tuk, mnogo R a - - - - umno, veselo, studeno ======================================================== Note: Adverbs of type adjectival have the same form as adjectives in Gender = neuter, Person = 3, Number = singular. 8. 
Adposition (S) = ============== ============== ============= == P ATT VAL Example C = ============== ============== ============= == 1 Type preposition na, v p - -------------- -------------- ------------- -- 2 Formation - ************************************************ 3 Case - - -------------- -------------- ------------- -- 4 Clitic - ================================================ Combinations === ===== ==== ==== ==== ============================ POS Type Form Case Cltc Examples === ===== ==== ==== ==== ============================ S p - - - na, za, v, okolo, spored === ===== =========================================== 9. Conjunction (C) = ============== ============== ============== == P ATT VAL Example C = ============== ============== ============== == 1 Type coordinating a, ala, ili c subordinating tche, da s ************************************************* 2 Formation simple a, i, tche s compound i_da, za_da c - -------------- -------------- -------------- -- 3 Coord_Type - - -------------- -------------- -------------- -- 4 Sub_Type - - -------------- -------------- -------------- -- 5 Clitic - - -------------- -------------- -------------- -- 6 Number - - -------------- -------------- -------------- -- 7 Person - ================================================= Combinations === ==== ==== ==== ==== ==== ==== ===== ========================== PoS Type Form CTyp SubT Clit Numb Pers Examples === ==== ==== ==== ==== ==== ==== ===== ========================== C c [sc] - - - - - i, a, ami, tche, tyj_tche C s [sc] - - - - - tche, da, kato, za_da === ==== ==== ==== ==== ==== ==== ===== ========================== 10. 
Numeral (M) = ============== ============== ============ = P ATT VAL Example C = ============== ============== ============ = 1 Type cardinal edin c ordinal vtori o - -------------- -------------- ------------ - 2 Gender masculine edin m feminine edna f neuter edno n - -------------- -------------- ------------ - 3 Number singular edin s plural edni p - -------------- -------------- ------------ - 4 Case - - ********************************************** 5 Form digit 1984 d roman IX r letter dva l l.s. m-form dvama m l.s. approx stotina a - -------------- -------------- ------------ - 6 Definiteness no edin n yes ednata y short_art ediniya s full_art ediniyat f - -------------- -------------- ------------ - 7 Clitic - - -------------- -------------- ------------ - 8 Class - - -------------- -------------- ------------ - 9 Animate - - -------------- -------------- ------------ - 10Owner_Number - - -------------- -------------- ------------ - 11Owner_Person - - -------------- -------------- ------------ - 12Owned_Number - =============================================== Combinations === ===== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== =========== PoS Type Gend Numb Case Frm Def Clit Clas Anim OwnN OwnP OwdN Example === ===== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== =========== Cardinal M c - - - d - - - - - - - 1. M c - - - r - - - - - - - 2. M c m s - l [nsf] - - - - - - 3. M c [fn] s - l [ny] - - - - - - 4. M c - p - l [ny] - - - - - - 5. M c [mfn] p - l [ny] - - - - - - 6. M c m p - m [ny] - - - - - - 7. M c - p - l [ny] - - - - - - 8. M c - p - a [ny] - - - - - - 9. Ordinal M o m s - - [nsf] - - - - - - 10. M o [fn] s - - [ny] - - - - - - 11. M o - p - - [ny] - - - - - - 12. === ===== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== ==== =========== Examples: 1. 1984 2. IX 3. edin, ediniya, ediniyat 4. edna, ednata, edno, ednoto 5. edni, ednite 6. dva, dvata, dve, dvete 7. dvama, dvamata, trima 8. tri, trite, chetiri 9. 
stotina, stotinata 10. pyrvi, pyrviya, pyrviyat 11. pyrva, pyrvata 12. pyrvi, pyrvite 11. Interjection (I) = ============== ============== ============ == P ATT VAL Example C = ============== ============== ============ == 1 Type - - -------------- -------------- ------------ -- 2 Formation simple ah s compound bozhe moj! c ============================================ == Combinations === ==== ========== ================================================= POS Type Formation Examples === ==== ========== ================================================= I - s o, hm-m, ha-ha, ah-lele, uvi I - c diavol_da_go_vzeme, ama_rabota, bozhe_moj ===================================================================== 12. Residual (X) = ============== ============== = P ATT VAL C = ============== ============== = P ATT - ================================= 13. Abbreviation (Y) = ============== ============== ============ = P ATT VAL Example C = ============== ============== ============ = 1 Syntactic_Type - - -------------- -------------- ------------ - 2 Gender - -------------- -------------- ------------ - 3 Number - -------------- -------------- ------------ - 4 Case - -------------- -------------- ------------ - 5 Definiteness ============================================== Note: As there are no Attributes for Abbreviations, the table above has no examples. Combinations ==== ==== ==== ==== === =========================== PoS SynT Gend Numb Case Def Examples ==== ==== ==== ==== === =========================== Y - - - - - t.n., vzh., lat., razg... =================================================== 14.
Particle (Q)

= ============== ============== =========== ==
P ATT            VAL            Example     C
= ============== ============== =========== ==
1 Type           negative       ne, ni      z
                 general        a, be       g
                 comparative    po, naj     c
                 verbal         da, ste     v
                 interrogative  li, dali    q
                 modal          da, dano    o
- -------------- -------------- ----------- --
2 Formation      simple         a, ne       s
                 compound       hajde_de    c
- -------------- -------------- ----------- --
3 Clitic         -
==============================================

Combinations

=== ============ ======= =======================
POS Type         Form Cltc Examples
=== ============ ======= =======================
Q   [go]         [sc] -    a, be, neka,hajde_de
Q   [cvzq]       s    -    po, naj, ne, li, nali
=== ============ ======= =======================

Observations

In Bulgarian, the comparison particles "po" (comparative) and "naj"
(superlative) precede adjectives and adverbs:
"po-hubav", "naj-hubav", "po-bavno", "naj-bavno".
\end{verbatim}
\end{small}

\newpage
\section{Application to Czech}
\markright{COP project 106 \mte {\hfill}Deliverable D1.1 F --- Czech{\hfill}}

The application to Czech was prepared by Vladim\'{\i}r Petkevi\v{c},
Faculty of Philosophy, Charles University, Prague.

Acknowledgements:\\
The application is based chiefly on the morphosyntactic tables for Czech
prepared by Hana Skoumalov\'{a}, which are gratefully acknowledged.
Any remaining errors are solely the author's responsibility.

All Czech diacritical characters are encoded as follows:
\begin{enumerate}
\item The Czech 'hachek' diacritic (\v{ }) is encoded by writing the
corresponding plain letter followed by '{\tt <}'. This applies to the
letters {\tt c, d, e, n, r, s, t, z}; for instance: {\tt t<}
\item In Czech, each vowel (a, e, i, o, u, y) can carry an acute accent
marking its length. We encode the accent by {\tt '} {\em following\/}
the vowel, e.g. {\tt e'}.
In addition to {\tt '} there is also the ring sign ($^{o}$) over the letter u.
Here we shall use this notation: {\tt u0}. Thus, there are the following vocal diacritical alternatives: {\tt a' e' i' o' u' u0 y'} \end{enumerate} \begin{small} \begin{verbatim} 1. Noun (N) 1.1. Lexicon = ============== ============== = ================= ========================= P ATT VAL C Example Czech term = ============== ============== = ================= ========================= 1 Type common c kniha obecne' jme'no proper p Petr vlastni' jme'no - -------------- -------------- - ----------------- ------------------------- 2 Gender masculine m otec masculinum feminine f kniha femininum neuter n slunce neutrum - -------------- -------------- - ----------------- ------------------------- 3 Number singular s kniha singula'r plural p knihy plura'l dual d rukama dua'l - -------------- -------------- - ----------------- ------------------------- 4 Case nominative n kniha nominativ genitive g knihy genitiv dative d knize dativ accusative a knihu akuzativ vocative v kniho! vokativ locative l knize loka'l instrumental i knihou instrumenta'l * ************** ************** * ----------------- ------------------------- 5 Definiteness - - -------------- -------------- - 6 Clitic - - -------------- -------------- - ----------------- ------------------------- 7 Animate l.s. no n hrad nez Adjective ------------------- ----- ------- -------------- ------------------------ Type Form Example Comment ----- ------- -------------- ------------------------ BELI : beli ha'z+beli property of living in the house FAJTA : fajta ma's+fajta of some other kind FELE : fe'le bu'tor+fe'le similar to furniture FORMA : forma toja's+forma egg shaped SZERU : szeru" tej+szeru" milk+y IKEP : i ha'z+i home (e.g. 
made) SKEP : s, as, os, es, o:s gyerek+es child+ish UKEP : u', u", ju', ju" arc+u' (red)-face+d FFOSZ : tlan,tlen, atlan,etlen, talan,telen 'devoid of', '-less' MER : nyi kana'l+nyi spoon+full Noun,A -> Noun -------------- ----- ------- -------------- ------------------------ Type Form Example Comment ----- ------- -------------- ------------------------ COL : sa'g, bara't+sa'g friend+ship se'g Noun -> Noun -------------- ----- ------- -------------- ------------------------ Type Form Example Comment ----- ------- -------------- ------------------------ DIM : cska,acska, ecske,o:cske, ocska utca'+cska little street FEM : ne' Kova'cs+ne' Mrs. Kova'cs Noun -> Verb -------------- ----- ------- -------------- ------------------------ Type Form Example Comment ----- ------- -------------- ------------------------ FI : z,az,oz,ez, o:z auto'+z go by car Verb -> Adjective ------------------- ----- ------- -------------- ------------------------ Type Form Example Comment ----- ------- -------------- ------------------------ IFOSZT: atlan, etlen felel+etlen sg not being answered MIF : o', o" felel+o" sy who answers MIB : t,ott,ett, o:tt felel+t the answered (question) MIA : ando', endo" felel+endo" sg that should be answered NIVALO: anivalo', enivalo', nivalo' ne'z+nivalo' sg that should be seen Verb -> Adverb ---------------- ----- ------- -------------- ------------------------ Type Form Example Comment ----- ------- -------------- ------------------------ HIN : va,ve olvas+va (while ) reading (the book) Numeral-> Adjective ------------------- ----- ------- -------------- ------------------------ Type Form Example Comment ----- ------- -------------- ------------------------ KIEM : ik hatod+ik six+th LAGOS : lagos,leges ma'sod+lagos second+ary Verb -> Noun -------------- ----- ------- -------------- ------------------------ Type Form Example Comment ----- ------- -------------- ------------------------ IF : a's, e's olvas+a's read+ing (gerund) DES : hatne'k, hetne'k 
                                   olvas+hatne'k   the intention of reading

Adj -> Verb
--------------
-------- ------------------------- --------------- ------------------------
Type     Form                      Example         Comment
-------- ------------------------- --------------- ------------------------
FAK    : i't                       sze'p+i't       make it pretty (in compounds only)
MI     : od,ed,o:d                 va'llas+od+ik   becomes strong
MIGY   : kod,ked, ko:d             okos+kod+ik     plays the smart (frequently)

Verb -> Verb
--------------
-------- ------------------------- --------------- ------------------------
Type     Form                      Example         Comment
-------- ------------------------- --------------- ------------------------
MUV    : at,et,tat,tet             olvas+tat       makes him read
GYAK   : gat,get,ogat, eget,o:get  olvas+gat       he reads frequently
HAT    : hat,het                   olvas+hat       he may read
VISSZ  : o'd, o"d                  old+o'dik       dissolves
SZENV  : tatik, tetik              olvas+tatik     makes the book being read
\end{verbatim}
\end{small}

Example:
If 'szemetele's' (littering, the action of throwing away litter) is not in
the dictionary, we derive it from the verb 'litter':
szemetel[V] + e's[IF] (where IF=Verb2Noun). Instead of giving the verb an
extra attribute expressing that it has a derivational suffix, we simply
give the result of the analysis+conversion: szemetele's[N].

In Hungarian, some derivational suffixes may follow an inflectional suffix.
In such cases the inflectional suffix and the derivation together form a
compound derivation. A new stem is generated from the
stem + inflection + derivation segments, and the resulting part of speech
is determined by the derivation.

\begin{small}
\begin{verbatim}
--------------------- ----- ------------------ ------------------------
Type                  Form  Example            Comment
--------------------- ----- ------------------ ------------------------
Nc-sn--ns1- +FAM    : e'k   ap +m+e'k          some people with my father
Nc-su--u--- +IKEP   : i     asztal+onke'nt+i   sg. done by each every table
Afc-sn--n---- +KIEM : ik    nagy+obb+ik        the bigger one
\end{verbatim}
\end{small}

Some adverbs may take case endings. Since case inflections derive
adverbs from nouns, these constructions can be handled as derivations.
That is, the stem is the stem + inflection combination and the part of
speech is adverb.

\begin{small}
\begin{verbatim}
------------------ ----- -------------- ------------------------
Type               Form  Example        Comment
------------------ ----- -------------- ------------------------
Ag---- + ablativ   t"l   akkor+t"l      since then
\end{verbatim}
\end{small}

\subsection{Compounding}

Compounding is handled in a very similar way to derivation.
The rightmost word class is always the resulting one. If the compound
also contains a derivation, the result is the word class that the
derivation determines.

Examples:
\begin{enumerate}
\item Hungarian is similar to German, but there is a competence limit:
we do not put together more than two nouns. These two words may
themselves be compounds, but they must be lexicalized forms.
For example, rendo"regyenruha (police uniform) is valid since it is put
together from rendo"r (police) and egyenruha (uniform), where
rendo"r=rend (order) + o"r (guard) and
egyenruha=egyen (uni) + ruha (clothes).
\item Of course the situation is more complicated, since nouns can be
derived from verbs, adjectives or other nouns, and these derivations can be
parts of compounds: e.g. auto'buszmega'llo' (bus stop), where meg+a'll+o'
is derived from the verb a'll (stand). Although mega'llo' is lexicalized,
so dictionaries might contain it, the morphological rule is productive, so
compounding works well for constructions like tandi'jbefizete's (paying the
tuition fee)= tan(tuition)+ di'j(fee)+ be(in) +fizet(pay) +e's(ment).
\item Compounding, like derivation, will not be reflected in the
\multext\ format morphological analysis; a converter will produce an
acceptable segmentation in the corpus. The word class (POS) for these
compounds will be the word class of the resulting compound
(e.g. tandi'jbefizete's/N).
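
The rule used here (the rightmost segment decides the word class, and a
derivational suffix decides it when one is present) can be sketched in
Python as follows. This is only an illustrative sketch under stated
assumptions, not part of the \multext\ tools: the names
{\tt DERIVATION\_RESULT}, {\tt result\_class} and {\tt tagged} are
hypothetical, the segmentations are supplied by hand, and only three
derivation types from the tables above are listed.

\begin{small}
\begin{verbatim}
# Hypothetical sketch: a word is a list of (form, tag) segments. A tag is
# either a word-class letter (N, V, A) or a derivation type from the tables.
DERIVATION_RESULT = {
    "IF": "N",   # Verb -> Noun, e.g. olvas+a's
    "FAK": "V",  # Adj -> Verb, e.g. sze'p+i't
    "COL": "N",  # Noun,A -> Noun, e.g. bara't+sa'g
}


def result_class(segments):
    """Word class of the whole form, decided by the rightmost segment."""
    _, tag = segments[-1]
    # A derivational suffix overrides the class of what precedes it;
    # any other tag is already a word class and is returned unchanged.
    return DERIVATION_RESULT.get(tag, tag)


def tagged(segments):
    """Join the segment forms and attach the resulting word class."""
    word = "".join(form for form, _ in segments)
    return word + "/" + result_class(segments)


police_uniform = [('rendo"r', "N"), ("egyenruha", "N")]
fee_payment = [("tandi'j", "N"), ("befizet", "V"), ("e's", "IF")]

print(tagged(police_uniform))
print(tagged(fee_payment))
\end{verbatim}
\end{small}

The segmentation itself is not computed here; the sketch only shows how
the resulting word class follows once the segments are known.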
\end{enumerate}

\newpage
\section{Application to Romanian}
\markright{COP project 106 \mte {\hfill}Deliverable D1.1 F --- Romanian{\hfill}}

The application to Romanian was prepared by D.Tufi\c{s} and A.M.Barbu.

The Romanian diacritic characters are encoded as follows:
\begin{itemize}
\item
% Breve in printed version goes to '(', i.e. 'a('
\u{a} as \verb!a(!
\item
% Cedilla in printed version goes to ',', i.e. 's,' 't,'
\c{s} and \c{t} as \verb!s,! and \verb!t,!
\item
% Circumflex in printed version goes to '>', i.e. 'a>' 'i>'
\^{a} and \^{\i} as \verb!a>! and \verb!i>!
\end{itemize}

\begin{small}
\begin{verbatim}
1. Noun (N)

In Romanian the following attribute-value pairs are applicable to nouns:

1.1 Type

Attribute      value              Ro. example
Type           common (c)         carte
               proper (p)         Ion

1.2 Gender

Attribute      value              Ro. example
Gender         masculine (m)      ba(iatul
               feminine (f)       casa
               neuter (n)         fir (m.sg.),fire (f.pl.)

In Romanian the declension of a neuter noun always follows a masculine
paradigm in the singular and a feminine one in the plural. Specific
implementations could take advantage of this rule and, by organizing the
paradigmatic space into partial paradigms (masc-sing, masc-pl, fem-sing,
fem-pl), dispense with the neuter value of the gender attribute.

1.3 Number

Attribute      value              Ro. example
Number         singular (s)       fata(
               plural (p)         fete

1.4 Case

There are five functional cases in Romanian (nominative, genitive, dative,
accusative and vocative) but they are expressed by at most three syncretic
forms for each number:
  1. nominative-accusative;
  2. genitive-dative;
  3. vocative.
The ambiguity of a syncretic case form may be resolved at the syntactic
level. Taking this syncretism into account, we considered the following
attribute-value combinations.

Attribute      value              Ro. example
Case           direct (r)         omul
               oblique (o)        omului
               vocative (v)       omule

At the word level, the distinction between the three syncretic forms is
possible only for the definite nouns.
Masculine indefinite nouns are not differentiated by case (except,
sometimes, for the vocative). The feminine indefinite nouns have different
forms in the singular for nominative-accusative, genitive-dative and
vocative, but the genitive-dative singular forms are also
indistinguishable from the plural forms:

fata(     - nominative-accusative, vocative singular, indefinite
fato      - vocative singular, indefinite
fete      - genitive-dative singular, nominative-accusative,
            genitive-dative, vocative plural, indefinite
ba(iat    - nominative-accusative, genitive-dative, vocative singular,
            indefinite
ba(iete   - vocative singular, indefinite
ba(iet,i  - nominative-accusative, genitive-dative, vocative, plural,
            indefinite

1.5 Definiteness

In Romanian, nouns can be marked for definiteness with the enclitic
definite article.

Attribute      value              Ro. example
Definiteness   yes (y)            omul
               no (n)             om

1.6 Clitic

See discussion in 2.10

Attribute      value              Ro. example
Clitic         no (n)             sora
               yes (y)            soru-mea

Combinations

Here, and in all the sections to follow, the trailing hyphens are deleted.

=========================
Tag        Example
=========================
Ncmsrn     frate
Ncmson     frate
Ncmsvn     frate
Ncmsry     fratele
Ncmsoy     fratelui
Ncmprn     frat,i
Ncmpon     frat,i
Ncmpvn     frat,i
Ncmpry     frat,ii
Ncmpoy     frat,ilor
Ncmpvy     frat,ilor
Ncfsrn     sora(
Ncfsvn     soro
Ncfson     surori
Ncfprn     surori
Ncfpon     surori
Ncfpvn     surori
Ncfsoyy    sora((-sii)
Ncfpry     surorile
Ncfpoy     surorilor
Ncfpvy     surorilor
Ncfsryy    sora((-sa)
Ncmsrn     creion
Ncmson     creion
Ncmsryy    creionu-(i)
Ncmsoy     creionului
Ncfprn     creioane
Ncfpon     creioane
Ncfpry     creioanele
Ncfpoy     creioanelor
Npfsr      Ioana
Npfso      Ioanei
Npmsrn     Bucures,ti
Npmsry     Bucures,tiul

2. Verb (V)

The following attribute-value pairs are applicable to verbs in Romanian:

2.1 Type

Attribute      value              Ro. example
Type           main (m)           a vedea
               auxiliary (a)      a avea, a fi, a voi
               modal (o)          a putea, a trebui
               copulative (c)     a fi, a deveni

2.2 VForm

Traditionally, Romanian linguistics distinguishes between predicative and
non-predicative moods.
This distinction may easily be mapped onto the finite/non-finite dichotomy:
indicative, subjunctive and imperative are finite; infinitive, participle
and gerund are non-finite (only synthetic (non-compound) moods were
mentioned; we use the opposition synthetic-analytic to distinguish between
concatenative (synthetic) and compound (analytic) morpho-lexical
phenomena).

Attribute      value              Ro. example
VForm          indicative (i)     vine
               subjunctive (s)    vina(
               imperative (m)     vino
               infinitive (n)     veni
               participle (p)     venit
               gerund (g)         venind

As only synthetic forms were considered, the values conditional and
presumptive for the VForm attribute were left out in Romanian.
Another value for VForm which was left out is the supine. It appears
mostly with a preposition, except for a few intransitive verbs when they
are subordinated to the impersonal verb a trebui (must). Only the
preposition allows a supine to be distinguished from a masculine singular
participle.

2.3 Tense

The synthetic values for the attribute Tense, as listed in the table
below, apply to the indicative mood. The value p (present) is also used
for the subjunctive and the infinitive. All the other moods either have no
tense or have compound tense forms.

Attribute      value              Ro. example
Tense          present (p)        va(d
               imperfect (i)      vedeam
               past (s)           va(zui
               pluperfect (l)     va(zusem

2.4 Person

Attribute      value              Ro. example
Person         first (1)          va(d
               second (2)         vezi
               third (3)          vede

The following features are pertinent to those moods which permit an
adjectival use, i.e. participle and gerund. However, the adjectival use of
the gerund is extremely rare (o ma>na( tremura>nda( - a shaking hand) and
therefore gender and number apply mainly to the participle.

2.5 Number

Attribute      value              Ro. example
Number         singular (s)       ba(tut
               plural (p)         ba(tut,i

2.6 Gender

Attribute      value              Ro.
example
Gender         masculine (m)      ba(tut
               feminine (f)       ba(tuta(
               neuter (n)         ba(tut-ba(tute

2.7, 2.8, 2.9 Voice, Negation, Definiteness

As these feature values are either realized as compound forms or
irrelevant for Romanian, they were omitted in our word-level encoding.

2.10 Clitic

Cliticization in Romanian is not restricted to the verb-pronoun
relationship. It may also be observed between the (main) verb and the
auxiliary, a noun or adjective and a pronoun, a noun or adjective and the
copula, a pronoun and an auxiliary, a preposition and an (indefinite)
article, numeral or (indefinite) pronoun, a negative adverb and a verb,
auxiliary or pronoun, and in some other cases (mainly created through the
contracted forms of the verb "a fi"-to be). We restrict ourselves to
considering only the graphically marked cliticizations. In such cases, the
two, three or (sometimes) four constituents of a cliticized word-form are
always separated by a hyphen. Omitting the hyphen in such cases is an
unacceptable error in written Romanian. The examples below illustrate the
specific types of cliticization we will consider:

da(-mi-l !            (you give +to me+it/him = give it/him to me)
la(satu-ne-ai         (left + us+ you have = you have left us)
sparge-t,i-s-ar lampa (break + your + s-part-conj + has lamp =
                       I wish your lamp would break)
sora(-mii             (sister + my = to my sister)
fat,a-i               (his/her face | the face is)
ros,u-i               (his/her red | the red is )
m-au                  (me + they have )
i>ntr-o gaura(        (in + a hole)
i>ntr-o ora(          (in + one hour)
i>ntr-unele           (into + some of)
n-aud                 (not + I hear = I do not hear)
n-am                  (not + I/we have = I have not )
nu-mi                 (not + to me )

The order in which the different constituents of a cliticized form appear
is governed by precise morphological rules. For instance, the auxiliaries
always appear in the last position.
The main verbs, except those beginning either with the letter "a" or the
letter i> and the contracted forms (-s and -i) of the verb "a fi"- to be,
always appear in the first position, nouns and adjectives always precede
the cliticized pronouns, the negative adverbial particles nu- and n- only
appear in the first position, and so on.
However, in order to reduce spurious ambiguity in morpho-lexical encoding,
we considered the attribute CLITIC as relevant only in those cases where
cliticization resulted in a graphemic modification of the cliticized term
(such as the epenthetic u in the gerund form shown in the table below).

Attribute      value              Ro. example
Clitic         no (n)             am ridicat
               yes (y)            ridica>ndu-l

Combinations

========================
Tag          Example
========================
Vmii1s       abandonam
Vmii2s       abandonai
Vmii3s       abandona
Vmii1p       abandonam
Vmii2p       abandonat,i
Vmii3p       abandonau
Vmis1s       abandonai
Vmis2s       abandonas,i
Vmis3s       abandona(
Vmis1p       abandonara(m
Vmis2p       abandonara(t,i
Vmis3p       abandona(
Vmil1s       abandonasem
Vmil2s       abandonases,i
Vmil3s       abandonase
Vmil1p       abandonasera(m
Vmil2p       abandonasera(t,i
Vmil3p       abandonasera(
Vmip1s       abandonez
Vmsp1s       abandonez
Vmip2s       abandonezi
Vmsp2s       abandonezi
Vmip3s       abandoneaza(
Vmip3p       abandoneaza(
Vmsp3s       abandoneze
Vmsp3p       abandoneze
Vmsp1p       abandona(m
Vmsp2p       abandonat,i
Vmm-2s       abandoneaza(
Vmm-2p       abandonat,i
Vmnp         abandona
Vmp--sm      abandonat
Vmp--sm---y  abandonatu
Vmp--sf      abandonata(
Vmp--pf      abandonate
Vmp--pm      abandonat,i
Vmg          abandona>nd
Vmg-------y  abandona>ndu
Va--1s       as,
Voip         trebuie
Vcip1s       sunt
Vcip3p       sunt
Vcip1s----y  -s
Vcip3p----y  -s

3. Adjective (A)

3.1 Type

Although it is not common practice in Romanian linguistics, one could make
the distinction between qualificative and determinative adjectives.
However, the attribute-value pairs proposed in this section are
appropriate for the qualificative adjectives.

Attribute      value              Ro.
example
Type           qualificative (f)  frumos

3.2 Degree

The default value is positive; adjectives also have comparative and
superlative degrees, but in most cases these are expressed by means of
analytical forms (e.g. comp. mai bun (better), superl. cel mai bun (the
best)). A few adjectives have intrinsic etymological comparative or
superlative meanings (e.g. comparatives: anterior, major; superlatives:
optim, maxim, extrem etc.). The prefixes super-, extra-, ultra- etc. are
quite productive in forming quasi-analytic superlatives.
Adjectives are characterized by gender, number and case.

Attribute      value              Ro. example
Degree         positive (p)       frumos
               comparative (c)    ulterior
               superlative (s)    extrem

3.3 Gender

Attribute      value              Ro. example
Gender         masculine (m)      bun
               feminine (f)       buna(
               neuter (n)         sg.bun/pl.bune

3.4 Number

Attribute      value              Ro. example
Number         singular (s)       bun
               plural (p)         buni

3.5 Case

The adjectives present the same case syncretism as the nouns, except for a
few adjectives that have an additional special form for the
Genitive-Dative cases in the plural (e.g. G.D.pl. multor).

Attribute      value              Ro. example
Case           direct (r)         bunul
               oblique (o)        bunului
               vocative (v)       bunule

3.6 Definiteness

In a noun-adjective construction, the definite article may attach
enclitically either to the adjective or to the modified noun (never to
both of them). If present, the definite article attaches to the right of
the first word in the sequence.

  Bunul om  (The kind man)
  Omul bun. (The kind man)

Attribute      value              Ro. example
Definiteness   yes (y)            bunul
               no (n)             bun

3.7 Clitic

See discussion on clitics in 2.10

Combinations

====================
Tag        Example
====================
Afpmsrn    bun
Afpmson    bun
Afpmsvn    bun
Afpmprn    buni
Afpmpon    buni
Afpmpvn    buni
Afpmsry    bunul
Afpmsoy    bunului
Afpmpry    bunii
Afpmpoy    bunilor
Afpfsrn    buna(
Afpfsvn    buna(
Afpfson    bune
Afpfprn    bune
Afpfpon    bune
Afpfpvn    bune
Afpfsry    buna
Afpfsoy    bunei
Afpfpry    bunele
Afpfpoy    bunelor
Afcmsrn    ulterior
Afcmson    ulterior
Afsmsrn    extrem
Afp        gri

4.
Pronoun (P)

4.1 Type

In Romanian it is worth differentiating the negative pronoun from other
indefinite pronouns: a negative pronoun cannot be an argument of a verb
unless the verb itself is negated too
(e.g. Nu am va(zut pe nimeni / *Am va(zut pe nimeni).

Attribute      value              Ro. example
Type           demonstrative (d)  acesta
               indefinite (i)     oricine
               possessive (s)     (al) meu
               int_rel (w)        ce
               personal (p)       eu
               reflexive (x)      se
               negative (z)       nimeni
               emphatic (h)       i>nsumi

4.2 Person

Attribute      value              Ro. example
Person         first (1)          eu
               second (2)         tu
               third (3)          el

4.3 Gender

Attribute      value              Ro. example
Gender         masculine (m)      el
               feminine (f)       ea
               neuter (n)         sg.acesta/pl.acestea

4.4 Number

Attribute      value              Ro. example
Number         singular (s)       eu
               plural (p)         noi

4.5 Case

The second person of the personal pronoun also has a vocative case, in
both singular and plural. The direct and oblique values are needed for the
syncretic case forms of the pronouns other than the personal ones.

Attribute      value              Ro. example
Case           nominative (n)     el
               genitive (g)       (al) lui
               dative (d)         lui
               accusative (a)     (pe) el
               vocative (v)       tu, voi!
               direct (r)         acesta
               oblique (o)        acestuia

4.6 Owner_Number

This attribute is meaningful for the Possessive pronouns and refers to the
grammatical number of the possessor.

Attribute      value              Ro. example
Owner_Number   singular (s)       meu
               plural (p)         nostru

The grammatical number of the possessed object(s) is expressed by the
attribute Number (described above).

4.7 Owner_Gender

This attribute is irrelevant for Romanian.

4.8 Clitic

See discussion on clitics in 2.10.

4.9, 4.10, 4.11, 4.12, 4.13 Referent_Type, Syntactic_Type, Definiteness,
Animate, Clitic_s

These attributes are irrelevant for Romanian.

4.14 Pronoun_Form

For Romanian we need an attribute (called Pronoun_Form) to make the
distinction between strong and weak forms of the same pronoun. All the
weak forms can be adjoined to the adjacent words either proclitically or
enclitically.
In such cases the junction is always graphically marked by a hyphen
between the pronoun and the neighboring word. The hyphen also marks
possible elisions from either the pronoun or the adjacent word.
Although in traditional grammar books the demonstrative, int_rel and
indefinite pronouns are not characterised by person, in our dictionaries
they are recorded (for reasons beyond morpho-lexical encoding) as 3rd
person (the same as nouns). However, for the automatic tagging this value
has been marked as irrelevant.

Attribute      value              Ro. example
Pronoun_Form   strong (s)         lui
               weak (w)           i>i, i-

Combinations

====================================
Tag                Example
====================================
Pp1msn--------s    eu
Pp1msd--------w    mi
Pp1msd--------s    mie
Pp1msd--y-----s    mi-
Pd-msr             acesta
Pd-mso             acestuia
Pi-mpr             tot,i
Ps1fsrs            mea
Pw-mso             ca(rui
Pn-msr             nimeni
Ph1msr             i>nsumi
Ph1fsr             i>nsa(mi
Px3msa--------s    sine
Px3msa--------w    se
Px3msa--y-----w    s-

5. Determiner (D)

5.1 Type

The need for a negative value of the determiner's Type attribute is argued
along the same lines as in the section on the pronoun's Type.
In Romanian the negative determiner is expressed by the unit
nici + indefinite article (e.g. nici un, nici o).
In Romanian, there are specific forms for the so-called emphatic
determiner, which may accompany both a noun and a personal pronoun:
fata i>nsa(s,i (the girl herself), also ea i>nsa(s,i (she herself).

Attribute      value              Ro. example
Type           demonstrative (d)  acest
               indefinite (i)     orice
               possessive (s)     meu
               int_rel (w)        ce
               negative (z)       nici un
               emphatic (h)       i>nsus,i

5.2 Person

This attribute is meaningful for the Possessive determiners and refers to
the grammatical person of the possessor.

Attribute      value              Ro. example
Person         first (1)          meu
               second (2)         ta(u
               third (3)          sa(u

5.3 Gender

Attribute      value              Ro. example
Gender         masculine (m)      meu
               feminine (f)       mea
               neuter (n)         sg.meu/pl.mele

5.4 Number

Attribute      value              Ro. example
Number         singular (s)       meu
               plural (p)         mei

5.5 Case

Attribute      value              Ro.
example
Case           direct (r)         aceasta
               oblique (o)        acestei

5.6 Owner_Number

This attribute is meaningful for the Possessive determiners and refers to
the grammatical number of the possessor.

Attribute      value              Ro. example
Owner_Number   singular (s)       meu
               plural (p)         nostru

5.7 Owner_Gender

This attribute is irrelevant for Romanian.

5.8 Clitic

See discussion on clitics in 2.10.

Attribute      value              Ro. example
Clitic         no (n)             mama mea
               yes (y)            maica(-mea

5.9 Modific_Type

As mentioned in the corresponding section on Pronoun, the Modific_Type
attribute is relevant for some determiners too. The prenominal determiner
always precedes the noun (e.g. acest ba(iat - this boy), whereas the
postnominal determiner appears only after the noun
(e.g. ba(iatul acesta - this boy).

Attribute      value              Ro. example
Modific_Type   prenominal (e)     acest
               postnominal (o)    acesta

Combinations

=======================
Tag          Example
=======================
Dd-mso---e   acestui
Dd-mso---o   acestuia
Di-mpr       tot,i
Ds1fsrs      mea
Dw-msr       care
Dw-mso       ca(rui
Dz-msr       nici_un
Dh1msr       i>nsumi
Dh1fsr       i>nsa(mi

Although in traditional grammar books the demonstrative, indefinite and
int_rel determiners are not characterised by person, in our dictionaries
they are recorded (for reasons beyond morpho-lexical encoding) as 3rd
person (the same as nouns). However, for the automatic tagging this value
has been marked as irrelevant.

6. Article (T)

6.1 Type

Although it has only a few members, the article in Romanian has four
types, unlike in most European languages. Besides the two recommended
types, definite and indefinite, which have the generally known semantic
value, Romanian uses two additional types of articles, which are
semantically subordinated to the definite article but which have special
forms and meanings:
- the possessive article (also called genitival article) is an element in
  the structure of the possessive pronoun, of the ordinal numeral (e.g.
al meu (mine) and al treilea (the third)), and of the indefinite genitive forms of the nouns (e.g. capitol al ca(rt,ii (chapter of the book)). - the demonstrative article links a definite noun to its determinants, links a numeral or an adjective to a noun, and it is a constituent part of the relative superlative (e.g. fata cea mare (the elder girl), cel lenes, (the lazy), respectively prietenul cel mai bun (the best friend)). Notice that the definite article has only enclitic forms, except for one proclitical form (lui + proper noun: lui Ion). Attribute value Ro. example Type definite (f) lui indefinite (i) un possessive (s) al demonstrative (d) cel 6.2 Gender Attribute value Ro. example Gender masculine (m) un feminine (f) o neuter (n) sg.cel/pl.cele 6.3 Number Attribute value Ro. example Number singular (s) un plural (p) nis,te 6.4 Case Attribute value Ro. example Case direct(r) cel oblique (o) celui 6.5 Clitic The inflected forms of the foreign-origin words (mainly nouns) not fully assimilated, are usually written with a hyphen between the base-form and the inflectional ending. In our encoding, we classified these endings (which are supposed to be split by the segmenter) as clitic articles (clitic attribute is always "y") which can be either definite (type=f, "-istul") or indefinite (type=i, "ist") and are characterised by gender (gender=m, "ist"; gender=f, "ista("), number (number=s, "ist"; number=p, "is,ti") and case (case=r, "istul"; case=o, "istului"). Combinations ===================== Tag Example ===================== Tfmso lui Tffso lui Timsr un Tsmpr ai Tdfso celei Timsry -ist Timsoy -ist Tfmsry -istul Tfmsoy -istului 7. Adverb (R) 7.1 Type The distinction proposed here considers the principal syntactic properties of the adverbs. For Romanian, the general type includes most of the pronominal adverbs (demonstrative: aici (here), indefinite: oriunde (anywhere)). 
As argued before for pronouns and determiners, a distinct negative value
is needed for adverbs as well (nica(ieri - nowhere, niciodata( - never).
The particle type covers those adverbs which can dislocate verbal compound
forms (ex. Ea a tot ca>ntat -- She has ever sung) or mark degrees
(ex. circa (about), foarte (very), prea (too)). Such adverbs are cam, mai,
prea, s,i, tot, foarte etc.
A useful distinction in Romanian concerns the adverbs which can have a
predicative role, that is, they can govern a subordinate sentence
(ex. Fires,te ca( o s,tiu -- Certainly I know it). Here (for uniformity
within a multilingual environment), they are squeezed into the modifier
class.
No formal distinction is made between the interrogative adverbs and the
relative ones.

Attribute      value              Ro. example
Type           general (g)        bine, acolo
               particle (p)       mai, cam
               negative (z)       nica(ieri
               modifier (m)       fires,te, poate
               int_rel (w)        cum

7.2 Degree

In Romanian, the comparative and superlative of adverbs are formed
analytically with mai (put,in), cel mai (put,in), foarte:
ex. mai repede (faster), cel mai devreme (at the earliest).
Nonetheless there are some adverbs with comparative or superlative meaning
(ex. optim, ulterior, definitiv). These adverbs can be used for expressing
the absolute superlative of other adverbs or adjectives:
ex. extrem de bine, formidabil de frumos.

Attribute      value              Ro. example
Degree         positive (p)       bine
               comparative (c)    ulterior
               superlative (s)    extrem

7.3 Clitic

See discussion on clitics in 2.10.

Combinations

=====================
Tag        Example
=====================
Rgp        repede
Rgs        extraordinar
Rgc        ulterior
Rp         mai
Rz         nica(ieri
Rm         probabil
Rw         cum

8. Adposition (S)

8.1 Type

Only the preposition type is pertinent to Romanian, although some
intercalating adpositions may be seen as a sort of circumposition, for
instance i>ntre...s,i... (between..and...).

Attribute      value              Ro.
example
Type           preposition (p)    la, pe, i>n

8.2 Formation

In Romanian there is a distinct class of compound prepositions. Each of
them forms a formal and semantic unit, although graphically they stay
unfused, e.g. de la, pe la, de pe, etc.

Attribute      value              Ro. example
Formation      simple (s)         la, pe, i>n
               compound (c)       de la

8.3 Case

This attribute marks the subcategorisation properties of the adpositions.

Attribute      value              Ro. example
Case           genitive (g)       i>naintea
               dative (d)         datorita(
               accusative (a)     la

8.4 Clitic

See discussion on clitics in 2.10.

Combinations

======================
Tag        Example
======================
Spsa       i>n
Spsay      i>ntr-
Spsd       datorita(
Spca       de_la

9. Conjunction (C)

9.1 Type

The distinction between coordinating and subordinating conjunctions is
pertinent to Romanian as well.

Attribute      value              Ro. example
Type           coordinating (c)   s,i, dar
               subordinating (s)  ca(, daca(

9.2 Formation

As with prepositions, we can distinguish two kinds of conjunctions in
Romanian:
- simple conjunctions: e.g. s,i,dar,des,i etc.
- conjunctions formed periphrastically, with some word/phrase combined
  with a conjunction: din moment ce, fa(ra( sa(, fat,a( de cum etc.

Attribute      value              Ro. example
Formation      simple (s)         deoarece
               compound (c)       de_vreme_ce

9.3 Coord_Type

In Romanian, there are three kinds of conjunctions depending on their
usage, as such or together with other conjunctions or adverbs:
- simple, between conjuncts: Ion s,i Maria (John and Mary);
- repetitive, before each conjunct: ori Ion ori Maria ori...
  (either John or Mary or...)
- correlative, before a conjoined phrase, it requires specific
  coordinators between conjuncts: ata>t mama ca>t s,i tata
  (both mother and father).
With respect to the place of the conjunctions, most of them stay before
the conjunct, except for: as,adar, deci (so), dar, i>nsa( (but),
daca( (if), which also appear with expressive value inside the conjoined
sentence.

Attribute      value              Ro. example
Coord_Type     simple (s)         s,i,deoarece
               repetit (r)        fie...fie...
correlat (c) ati>t...ca>t s,i 9.3 Sub_Type In Romanian, each conjunction requires another mood, so that the diversity may be controlled by subcategorisation rules. Attribute value Ro. example Sub-Type negative (z) nici positive (p) dar This attribute distinguishes among the positive and negative conjunctions, providing means to control verbal double negation, (as in case of the negative pronouns, determiners and adverbs): nici NU am venit, nimeni NU vorbes,te, nici_un tren N-a trecut, nica(ieri N-am va(zut 9.5 Clitic See discussion on clitics in 2.10. Attribute value Ro. example Clitic no (n) ca( as,a yes (y) c-as,a Combinations ====================== Tag Example ====================== Ccsps s,i Ccrps fie...fie Csrzs nici...nici Csspc de_vreme_ce 10. Numeral (M) 10.1 Type Traditional Romanian grammars usually distinguish seven numeral types, where five of them have specific forms and the other two are obtained by composition. The first group is made up by the following numeral types: cardinal (trei-three), ordinal (al treilea-the third), fractional (treime-one third), multiple (i>ntreit-trine), collective (ama>ndoi-both). The second group contains the numeral types which are composed by means of other parts of speech: distributive (ca>te trei-...each three...), adverbial (de trei ori-thrice) and again the collective numeral which also has compound forms (tot,i trei-all three). Nonetheless, as the numerals of the second group have a weak syntactic cohesion, namely each composition element may be regarded as an element of the sentence, with its own grammatical function, these last numeral types are irrelevant for the morphosyntactic annotation. Attribute value Ro. 
example Type cardinal (c) trei ordinal (o) (al) treilea fractal (f) treime multiple (m) i>ntreit collect(l) tustrei In Romanian (as in many other languages) several numerals have noun behaviour (some grammarians classify such numerals as nouns) with gender and declension of their own, which they preserve even in the composition of the superior order numerals; these are, for instance, suta( (hundred), mie (thousand), milion (million) and miliard (billion). In a sentence most numerals may fulfill the function of other parts of speech like noun, determiner or adverb. 10.2 Gender Attribute value Ro. example Gender masculine (m) doi, primul feminine (f) doua(, prima neuter (n) (un) milion, (doua() milioane 10.3 Number Attribute value Ro. example Number singular (s) primul plural (p) primii 10.4 Case Attribute value Ro. example Case direct(r) primul oblique (o) primului 10.5 Form Attribute value Ro. example Form digit (d) 1690 letter (l) unsprezece both (b) 5 mii roman (r) XIV 10.6 Definiteness By virtue of their noun or adjective value, some numerals may take the enclitic article (prim/primul - first/the first). Consequently for the Romanian, definiteness attribute helps distinguish the enclitic forms from the other forms. Attribute value Ro. example Definiteness yes (y) primul no (n) prim 10.7 Clitic See discussion on clitics in 2.10. Combinations ======================= Tag Example ======================= Mcmprl doi Mcmpol doi Momsrl doilea Momsol doilea Mlmpr ama>ndoi Momsrlyy primu-i Mffpoly treimilor 11. Interjections (I) In Romanian there are no relevant subcategories of interjections. ==================== Tag Example ==================== I oh,ah,au ==================== 12. Residual (X) No attributes are defined for this category. ========================== Tag Example ========================== X show, a+b, retro- ========================== 13. Abbreviation (Y) The Syntactic_Type attribute is useful for specifying the grammatical category of an abbreviation. 
Although the values for this attribute could range over the part of
speech categories in the language, in Romanian most abbreviations fall
into the noun class.

13.1 Syntactic_Type

Attribute       value               Ro. example

Syntactic_Type  nominal (n)         d-na (doamna)
                verbal (v)          v. (vezi)
                adjectival (a)      ant. (anterior)
                adverbial (r)       f. (foarte)

13.2 Gender

Attribute       value               Ro. example

Gender          masculine (m)       d-lui
                feminine (f)        d-na
                neuter (n)          apt.

13.3 Number

Attribute       value               Ro. example

Number          singular (s)        d-na
                plural (p)          d-nele

13.4 Case

Attribute       value               Ro. example

Case            direct (r)          d-na
                oblique (o)         d-nei

13.5 Definiteness

Attribute       value               Ro. example

Definiteness    yes (y)             d-nele
                no (n)              d-ne

Combinations

======================
Tag        Example
======================
Ynmsry     d-ul
Ynfsoy     d-nei
Ynnsry     apt.

14. Particle (Q)

14.1 Type

Attribute       value               Ro. example

Type            negation (z)        nu, n-
                infinitive (n)      a
                subjunctive (s)     sa(
                aspect (a)          fi
                future (f)          o

14.2 Formation

This attribute is not relevant for Romanian.

14.3 Clitic

See discussion on clitics in 2.10.

Attribute       value               Ro. example

Clitic          yes (y)             n-am
                no (n)              nu am

Combinations

=================
Tag        Example
=================
Qz         nu
Qz-y       n-
Qn         a
Qs         sa(
Qa         fi
Qf         o
\end{verbatim}
\end{small}

\newpage
\section{Application to Slovene}
\markright{COP project 106 \mte {\hfill}Deliverable D1.1 F --- Slovene {\hfill}}

The application to Slovene has been elaborated by Toma\v{z} Erjavec and Peter Holozan.

Acknowledgements:\\
The authors thank Velimir Gjurin, France \v{Z}agar, Vladim\'{\i}r Petkevi\v{c}, Lydia Sinapova and David Stermole for their much appreciated comments and suggestions. All errors of course remain our own.

All Slovene diacritical characters used have been encoded in the following way:
\begin{enumerate}
\item The Slovene 'hachek' diacritic (\v{ }) is marked by the corresponding nondiacritic counterpart followed by '{\tt <}'. The following possibilities exist:\\
{\tt c< s< z<}
\end{enumerate}

\begin{small}
\begin{verbatim}
1.
Noun (N) 1.1 Lexicon = ============== ============== = ================= ========================= P ATT VAL C Example Slovene term = ============== ============== = ================= ========================= 1 Type common c stol obc