The Romanian word-form lexicon was created from a 35,000-lemma lexicon by means of our EGLU natural language processing platform (D. Tufis: ''A Generic Platform for Developing Language Resources and Applications'', in Proceedings of the Second International TELRI Seminar, Kaunas, April 1997). Several words in the corpus were not in the EGLU lexicon; most of them were manually lemmatised, introduced into the unification-based lexicon and later expanded to the full paradigm of each new lemma. The Romanian word-form lexicon actually consists of two parts: the main one contains only words attested by the Explanatory Dictionary of Romanian (DEX); all the other words appearing in the corpus were entered in an auxiliary lexicon.
The auxiliary lexicon contains, among other things, proper names, technical terms and the weird (made-up) words from Orwell's 1984 (the Newspeak dialect).
The table below provides information on the data content of the main dictionary that is used for the corpus analysis.
The MSDs (Morpho-Syntactic Descriptions) are a set of codes developed in the MULTEXT-EAST project (Tomaz Erjavec, Monica Monachini (eds.): ``MULTEXT-EAST: Specifications and Notation for Lexicon Encoding'', WP1, Task 1.1 final report, September 1997). The morpho-syntactic descriptions are provided as strings, using a linear encoding. In this notation, each position in a string of characters corresponds to an attribute, and the specific character in each position indicates the value of the corresponding attribute. That is, the positions in a string of characters are numbered 0, 1, 2, etc., and are used in the following way:
By convention, trailing hyphens are not included in the lexical MSDs. Such specifications provide a simple and relatively compact encoding, and are similar in intent to the feature-structure encodings used in unification-based grammar formalisms.
The attributes, and especially their values, were chosen considering only word-level encoding. This is why some values described in grammar textbooks but assuming compounding (such as compound tenses) were not considered in the MSD encoding.
For a given word-form several MSDs might be applicable (which accounts for homography). The set of all the MSDs applicable to a given word defines the MSD-ambiguity class for that word. The Romanian lexicon contains 1326 MSD-ambiguity classes (for more details on the Romanian dictionary and corpora encoding and several relevant statistics see: D. Tufis et al., ``Corpora and Corpus-Based Morpho-Lexical Processing'', in Tufis D., Andersen P. (eds.): Recent Advances in Romanian Language Technology, Editura Academiei, Bucharest, 1997).
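The positional encoding can be illustrated with a small decoder. This is only a sketch: the attribute order for nouns used below (Type, Gender, Number, Case, Definiteness, Clitic) is an assumption made for illustration, the authoritative tables being those of the MULTEXT-EAST specification, and the names `NOUN_ATTRIBUTES`, `pad_msd` and `decode_msd` are hypothetical.

```python
# Assumed attribute layout for nouns (positions 1..6); position 0 is the
# category letter. The real layout is defined in the MULTEXT-EAST
# specification and may differ in detail.
NOUN_ATTRIBUTES = ["Type", "Gender", "Number", "Case", "Definiteness", "Clitic"]


def pad_msd(msd, width):
    """Restore the trailing hyphens that lexical MSDs omit by convention."""
    return msd.ljust(width, "-")


def decode_msd(msd, attributes):
    """Map each position after the category letter to its attribute name."""
    padded = pad_msd(msd, len(attributes) + 1)
    features = {"Category": padded[0]}
    for name, value in zip(attributes, padded[1:]):
        if value != "-":  # a hyphen leaves the attribute unspecified
            features[name] = value
    return features


print(decode_msd("Ncfsrn", NOUN_ATTRIBUTES))
print(decode_msd("Ncms-n", NOUN_ATTRIBUTES))
```

The resulting dictionary is a flat feature structure, which is the sense in which the linear encoding resembles the encodings used in unification-based formalisms.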
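The notion of an MSD-ambiguity class can be made concrete with a short sketch. The lexicon entries below are invented for illustration (in particular the verbal MSD given for "zi" is only a placeholder), and the function name `ambiguity_classes` is hypothetical.

```python
from collections import defaultdict


def ambiguity_classes(lexicon_entries):
    """Group word-forms by the set of MSDs that apply to them.

    lexicon_entries: iterable of (word_form, msd) pairs.
    Returns a dict mapping each ambiguity class (a frozenset of MSDs)
    to the sorted list of word-forms that belong to it.
    """
    msds_by_word = defaultdict(set)
    for word_form, msd in lexicon_entries:
        msds_by_word[word_form].add(msd)

    words_by_class = defaultdict(list)
    for word_form, msds in msds_by_word.items():
        words_by_class[frozenset(msds)].append(word_form)
    return {cls: sorted(words) for cls, words in words_by_class.items()}


# Toy entries, invented for illustration only.
toy_lexicon = [
    ("zi", "Ncfsrn"), ("zi", "Vmm-2s"),
    ("casă", "Ncfsrn"),
    ("masă", "Ncfsrn"),
]
classes = ambiguity_classes(toy_lexicon)
for msd_class, words in classes.items():
    print(sorted(msd_class), words)
```

Counting the keys of the returned dictionary over the full lexicon would give the number of ambiguity classes.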
The corpus used in the experiments and evaluation reported here consisted of the full texts of two books: Orwell's 1984 and Plato's Republic. A brief overview of these texts is given below:
The existing SGML mark-up (TEI conformant) was stripped off and the text was tokenised. Please note that a token is not necessarily a word: one orthographic word may be split into several tokens (the Romanian "da-mi-l", "give it to me", is split into 3 tokens), or several orthographic words may be combined into one token (the Romanian words "de la" are combined into the single token "de_la"). Each lexical unit in the tokenised text was automatically annotated with all its applicable MSDs and then disambiguated by hand. This was the main resource for training and evaluating the tagger; we called it the MSD-tagged corpus. The figure below exemplifies the MSD-tagged corpus.
Într- Spsay o Tifsr zi Ncfsrn senină Afpfsrn şi Ccssp friguroasă Afpfsrn de Spsa aprilie Ncms-n , pe Spsa când Rw ceasurile Ncfpry băteau Vmii3p ora Ncfsry treisprezece Mc-p-l , Winston Npms-n Smith Np , cu Spsa bărbia Ncfsry înfundată Afpfsrn în Spsa piept Ncms-n pentru Spsa a Qn scăpa Vmnp de Spsa vântul Ncmsry care Dw3--r---e -l Pp3msa--y-----w lua Vmii3s pe Spsa sus Rgp , se Px3--a--------w strecură Vmis3s iute Rgp prin Spsa uşile Ncfpry de Spsa sticlă Ncfsrn ...
The tagset for Romanian contains 79 tags for different morpho-syntactic categories, plus 8 tags for punctuation. The tagset was derived by a trial-and-error procedure from the 674 morpho-syntactic description codes (MSDs) defined for encoding the Romanian lexicon. By analysing how the MSDs cluster in the ambiguity classes, we started to eliminate those attributes in the MSD codes whose removal would reduce the cardinality of the MSD ambiguity classes. If one thinks of an MSD in terms of a (flat) feature structure, then this elimination of attributes in a given MSD represents a generalisation of the corresponding MSD.
The generalisation process was performed according to the following general guidelines:
Let us consider a few examples to clarify this methodological procedure. In the case of nouns and adjectives the attributes `case' and `number' were preserved, since when the latter modifies the former their case and number must agree. Since nouns and adjectives are often ambiguous with respect to these attributes when considered in isolation, but rarely when considered in co-occurrence, preserving these attributes proved to be extremely helpful for the performance of the tagger. On the other hand, gender (which is also subject to the agreement requirement) is very rarely ambiguous. The gender of a noun, adjective, pronoun, determiner, article, numeral, or participle is almost always recoverable from the word-form itself, and in those very rare cases when it is not, the immediate context (plus the agreement rule) does the job. We have not found in our corpora even one instance where this would not hold. Gender was preserved in the initial tagset, but when it was removed the accuracy of the tagger increased (and the number of tags decreased significantly).
The `definiteness' attribute is also fully recoverable from the word-form, but given the grammatical constraints, keeping it was very useful in helping the tagger to discriminate among nouns, adjectives and participles (e.g. frumosul baiat versus frumosul baiatului, or iubitul baiat versus iubitul baiatului, or baiatul iubit).
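The effect of eliminating attributes can be sketched as a projection of each MSD onto a subset of its positions. The example assumes that, for nouns, positions 3, 4 and 5 hold number, case and definiteness; this layout, the toy ambiguity classes and the names `KEPT_POSITIONS`, `generalise` and `reduce_class` are all illustrative assumptions, not the actual tagset generator.

```python
# For each category letter, the attribute positions that are preserved.
# Assumed layout for nouns: 3 = number, 4 = case, 5 = definiteness.
KEPT_POSITIONS = {"N": [3, 4, 5]}


def generalise(msd, kept_positions):
    """Keep the category letter plus the positions listed for that category.

    Categories without an entry are left unchanged; positions beyond the
    end of a (trailing-hyphen-stripped) MSD are read as '-'.
    """
    positions = kept_positions.get(msd[0], range(1, len(msd)))
    return msd[0] + "".join(msd[i] if i < len(msd) else "-" for i in positions)


def reduce_class(msd_class, kept_positions):
    """Generalise every MSD in an ambiguity class; duplicates collapse."""
    return frozenset(generalise(msd, kept_positions) for msd in msd_class)


# Invented ambiguity classes: the first differs only in gender, the second in case.
gender_only = frozenset({"Ncmsrn", "Ncfsrn"})
case_only = frozenset({"Ncfsrn", "Ncfson"})
print(sorted(reduce_class(gender_only, KEPT_POSITIONS)))
print(sorted(reduce_class(case_only, KEPT_POSITIONS)))
```

In this toy setting the class that differs only in an eliminated attribute collapses to a single generalised code, while the class that differs in a preserved attribute keeps its cardinality.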
With verbs, the distinction was preserved between finite verbs (main and auxiliary) and non-finite verbs (infinitive, participle and gerund). For finite verbs the attribute `person' was preserved and only for auxiliaries the `number' attribute was kept. Eliminating the tense attribute dramatically decreased the number and the cardinality of the ambiguity classes. Eliminating the `mood' attribute further decreased the number and the cardinality of the ambiguity classes.
The most problematic categories were pronoun and determiner (this class contains what grammar books traditionally call adjectival pronouns and due to several commonalities it was merged with the pronominal class). As with the nominal categories, we opted for preserving the `case' and `number' attributes.
However, with only these two criteria the tagset generator classified most of the pronominal MSDs into functionally and distributionally unrelated classes. The only exceptions were the reflexive pronouns and determiners in accusative or dative, which clustered together (they are explicitly marked for these cases and are insensitive to number). The next step was to classify the pronominal and determiner MSDs according to their type, number and case. With these new criteria, the number of tags remained the same, but the clustering of MSDs was more linguistically motivated. Then, by observing the MSD ambiguity classes, we manually assigned different tags to a small number of MSDs, so that hard statistical disambiguation could be avoided. The most obvious case was the distinction between numeral and article readings for the words un (a/an or one, masculine) and o (a/an or one, feminine). To identify the numeral readings of these words some semantic criteria would have been necessary (for instance, defining a subclass of ``measure unit'' nouns), so we decided to classify them only as articles.
Similar considerations applied for all the other word classes. In the table below the actual tagset is given.
1. A = Adjective
2. AN = Adjective, indefinite
3. APN = Adjective, plural, indefinite
4. APON = Adjective, plural, oblique, indefinite
5. APOY = Adjective, plural, oblique, definite
6. APRY = Adjective, plural, direct, definite
7. ASN = Adjective, singular, indefinite
8. ASON = Adjective, singular, oblique, indefinite
9. ASOY = Adjective, singular, oblique, definite
10. ASRY = Adjective, singular, direct, definite
11. ASVN = Adjective, singular, vocative, indefinite
12. ASVY = Adjective, singular, vocative, definite
13. C = Conjunction
14. CVR = Conjunction or Adverb
15. I = Interjection
16. M = Numeral
17. NP = Proper Noun
18. NN = Common Noun, singular
19. NPN = Common Noun, plural, indefinite
20. NPOY = Common Noun, plural, oblique, definite
21. NPRN = Common Noun, plural, direct, indefinite
22. NPRY = Common Noun, plural, direct, definite
23. NPVY = Common Noun, plural, vocative, definite
24. NSN = Common Noun, singular, indefinite
25. NSON = Common Noun, singular, oblique, indefinite
26. NSOY = Common Noun, singular, oblique, definite
27. NSRN = Common Noun, singular, direct, indefinite
28. NSRY = Common Noun, singular, direct, definite
29. NSVN = Common Noun, singular, vocative, indefinite
30. NSVY = Common Noun, singular, vocative, definite
31. NSY = Common Noun, singular, definite
32. PI = Quantifier Pronoun or Determiner (indefinite or negative)
33. PXA = Reflexive Pronoun, accusative
34. PXD = Reflexive Pronoun, dative
35. PSP = Pronoun or Determiner, possessive or emphatic, plural
36. PSS = Pronoun or Determiner, possessive or emphatic, singular
37. PPPA = Personal Pronoun, plural, accusative, weak form
38. PPPD = Personal Pronoun, plural, dative
39. PPSA = Personal Pronoun, singular, accusative
40. PPSD = Personal Pronoun, singular, dative
41. PPSN = Personal Pronoun, singular, nominative, non-third person
42. PPSO = Personal Pronoun, singular, oblique
43. PPSR = Personal Pronoun, singular, direct
44. PPPO = Personal Pronoun, plural, oblique
45. PPPR = Personal Pronoun, plural, direct
46. RELO = Pronoun or Determiner, relative, oblique
47. RELR = Pronoun or Determiner, relative, direct
48. DMPO = Pronoun or Determiner, demonstrative, plural, oblique
49. DMSO = Pronoun or Determiner, demonstrative, singular, oblique
50. DMPR = Pronoun or Determiner, demonstrative, plural, direct
51. DMSR = Pronoun or Determiner, demonstrative, singular, direct
52. QN = Infinitival Particle
53. QS = Subjunctive Particle
54. QF = Future Particle
55. QZ = Negative Particle
56. R = Adverb
57. S = Preposition
58. TP = Article, indefinite or possessive, plural
59. TPO = Article, non-possessive, plural, oblique
60. TPR = Article, non-possessive, plural, direct
61. TS = Article, definite or possessive, singular
62. TSO = Article, non-possessive, singular, oblique
63. TSR = Article, non-possessive, singular, direct
64. V1 = Verb, main, 1st person
65. V2 = Verb, main, 2nd person
66. V3 = Verb, main, 3rd person
67. VA1 = Verb, auxiliary, 1st person
68. VA1P = Verb, auxiliary, 1st person, plural
69. VA1S = Verb, auxiliary, 1st person, singular
70. VA2P = Verb, auxiliary, 2nd person, plural
71. VA2S = Verb, auxiliary, 2nd person, singular
72. VA3 = Verb, auxiliary, 3rd person
73. VA3P = Verb, auxiliary, 3rd person, plural
74. VA3S = Verb, auxiliary, 3rd person, singular
75. VG = Verb, gerund
76. VN = Verb, infinitive
77. VP = Verb, participle
78. X = Residual
79. Y = Abbreviation
The training corpus (see example below) was generated from the MSD-annotated tokenised corpus by substituting the MSDs with their corresponding corpus tags (described in the previous section). We called this kind of resource an MSD-corpus.
Într- S o TSR zi NSRN senină ASRN şi CVR friguroasă ASRN de S aprilie NSN , COMMA pe S când R ceasurile NPRY băteau V3 ora NSRY treisprezece M , COMMA Winston NSN Smith N , COMMA cu S bărbia NSRY înfundată ASRN în S piept NSN pentru S a Qn scăpa VN de Sp vântul NSRY care PR -l PSA lua V3 pe S sus R , COMMA se PA strecură V3 iute R prin S uşile NPRY de S sticlă NSRN ...
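The substitution step can be sketched as a table lookup over (token, MSD) pairs. The mapping below is only a partial one, read off by aligning the two example fragments given in the text; the full MSD-to-tag table is not reproduced here. The function names are hypothetical, and the reader assumes a strictly alternating "token MSD" stream, so punctuation handling is left out.

```python
# Partial mapping obtained by aligning the MSD-tagged example with the
# tag-substituted example; not the complete table.
MSD_TO_TAG = {
    "Tifsr": "TSR", "Ncfsrn": "NSRN", "Afpfsrn": "ASRN",
    "Ncfpry": "NPRY", "Ncfsry": "NSRY", "Vmii3p": "V3",
    "Vmnp": "VN", "Rgp": "R",
}


def read_pairs(text):
    """Split an alternating 'token MSD token MSD ...' stream into pairs."""
    fields = text.split()
    if len(fields) % 2 != 0:
        raise ValueError("expected an even number of fields (token, MSD)")
    return list(zip(fields[0::2], fields[1::2]))


def substitute_tags(pairs, msd_to_tag):
    """Replace each MSD with its corpus tag; unmapped MSDs raise KeyError."""
    return [(token, msd_to_tag[msd]) for token, msd in pairs]


fragment = "o Tifsr zi Ncfsrn senină Afpfsrn"
for token, tag in substitute_tags(read_pairs(fragment), MSD_TO_TAG):
    print(token, tag)
```

Failing loudly on an unmapped MSD, rather than silently assigning a default tag, is a design choice of this sketch and makes gaps in the mapping table visible early.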
Of the training corpus, 80% was retained for training proper, and the remaining 20% (the first parts of both 1984 and Republic) was used for validation.
The training process is merely a reformatting exercise. The data contained in the training corpus is sorted and filtered twice, once to extract the lexicon (if it does not exist already) and once for the tag trigrams. The extraction of the relevant information is done mostly with standard Unix text processing tools and two special-purpose programs. As a final step, the three-letter word endings are extracted from the lexicon and a `guess list' is created from them.
The tagger is ready for use with the new resources as soon as the data files are accessible to the server program (it is not even necessary to restart the server for this). The output format of the tagger, which we call the tag-corpus, is a vertical text with all possible tags; each tag has a probability value assigned, and the tags are sorted so that the most likely tag comes first (see example below).
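A minimal sketch of such a split is given below. It assumes that "the first parts" means a contiguous initial segment of each book, measured in tokens, and that the 20% share is applied to each book separately; neither detail is stated in the text, and `split_text` is a hypothetical helper.

```python
def split_text(tokens, validation_share=0.2):
    """Return (training, validation), taking the validation part from the start."""
    cut = int(len(tokens) * validation_share)
    return tokens[cut:], tokens[:cut]


# Stand-in token lists; the real input would be the tagged tokens of each book.
books = {"1984": list(range(10)), "Republic": list(range(100, 120))}
training, validation = [], []
for title, tokens in books.items():
    train_part, validation_part = split_text(tokens)
    training.extend(train_part)
    validation.extend(validation_part)

print(len(training), len(validation))
```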
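The three extraction steps can be sketched in a few lines. This is an illustration of the idea only, not the Unix pipeline or the special-purpose programs mentioned above; the data structures, the absence of sentence-boundary padding in the trigrams, and the decision to skip words shorter than three letters are assumptions.

```python
from collections import Counter, defaultdict


def extract_lexicon(tagged_sentences):
    """Count, for every word-form, how often it occurs with each tag."""
    lexicon = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            lexicon[word][tag] += 1
    return lexicon


def extract_tag_trigrams(tagged_sentences):
    """Count sequences of three consecutive tags within each sentence."""
    trigrams = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        for i in range(len(tags) - 2):
            trigrams[tuple(tags[i:i + 3])] += 1
    return trigrams


def build_guess_list(lexicon, suffix_length=3):
    """Pool the tag counts of all lexicon words sharing the same ending."""
    guesses = defaultdict(Counter)
    for word, tag_counts in lexicon.items():
        if len(word) >= suffix_length:  # shorter words are skipped (assumption)
            guesses[word[-suffix_length:]].update(tag_counts)
    return guesses


sentences = [[("o", "TSR"), ("zi", "NSRN"), ("senină", "ASRN"),
              ("şi", "CVR"), ("friguroasă", "ASRN")]]
lexicon = extract_lexicon(sentences)
trigrams = extract_tag_trigrams(sentences)
guesses = build_guess_list(lexicon)
print(dict(guesses["ină"]), len(trigrams))
```

An unknown word would then be looked up in the guess list by its last three letters to obtain candidate tags.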
... o [TSR:-4.7998][PSA:-7.1013][Qf:-9.839][VA3S:-11.4813][I:-11.8867] zi [NSRN:-1.0662][V2:-10.6697] ...
This makes it possible to judge the confidence with which a tag has been assigned: if the probability difference between the first tag and the next tag(s) is comparatively large, the decision is more certain than when the probabilities are nearly equal.
Apart from assessing confidence, this can also be used as a starting point for manual correction of the text, since words whose candidate tags have similar probabilities are more likely to contain errors.
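A possible way to exploit these values is sketched below. The line format is taken from the example output; that the values are log-probabilities (higher means more likely), the threshold of 3.0, and the function names are assumptions made for illustration.

```python
import re

# Matches one "[TAG:score]" candidate in the tagger output.
CANDIDATE = re.compile(r"\[([^:\]]+):(-?\d+(?:\.\d+)?)\]")


def parse_line(line):
    """Split 'word [TAG:score][TAG:score]...' into (word, [(tag, score), ...])."""
    word, _, rest = line.strip().partition(" ")
    candidates = [(tag, float(score)) for tag, score in CANDIDATE.findall(rest)]
    return word, candidates


def margin(candidates):
    """Score gap between the best and second-best tag (inf if only one tag)."""
    if len(candidates) < 2:
        return float("inf")
    return candidates[0][1] - candidates[1][1]


def flag_uncertain(lines, threshold):
    """Return the words whose two best scores differ by less than threshold."""
    flagged = []
    for line in lines:
        word, candidates = parse_line(line)
        if margin(candidates) < threshold:
            flagged.append(word)
    return flagged


output = [
    "o [TSR:-4.7998][PSA:-7.1013][Qf:-9.839][VA3S:-11.4813][I:-11.8867]",
    "zi [NSRN:-1.0662][V2:-10.6697]",
]
print(flag_uncertain(output, threshold=3.0))
```

The flagged words form a worklist for manual correction; a suitable threshold would have to be tuned on validation data.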