
Language Resources

The Romanian word-form lexicon was created from a 35,000-lemma lexicon by means of our EGLU natural language processing platform (D. Tufis: "A Generic Platform for Developing Language Resources and Applications", in Proceedings of the Second International TELRI Seminar, Kaunas, April 1997). Several words in the corpus were not in the EGLU lexicon; most of these were manually lemmatised, introduced into the unification-based lexicon and later expanded to the full paradigm of each new lemma. The Romanian word-form lexicon actually consists of two parts: the main one contains only words attested by the Explanatory Dictionary of Romanian (DEX); all other words appearing in the corpus were entered in an auxiliary lexicon.

The auxiliary lexicon contains, among other things, proper names, technical terms and the weird (made-up) words from Orwell's 1984 (the Newspeak dialect).

The table below provides information on the data content of the main dictionary that is used for the corpus analysis.


[IMAGE ]

The MSDs (Morpho-Syntactic Descriptions) are a set of codes developed in the MULTEXT-EAST project (Tomaz Erjavec, Monica Monachini (eds.), "MULTEXT-EAST: Specifications and Notation for Lexicon Encoding", WP1, Task 1.1 final report, September 1997). The morpho-syntactic descriptions are provided as strings, using a linear encoding. In this notation, each position in a string of characters corresponds to an attribute, and the specific character in each position indicates the value of that attribute. That is, the positions in a string of characters are numbered 0, 1, 2, etc., and are used in the following way:

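• the character at position 0 encodes the part of speech;
• each character at position 1, 2, ..., n encodes the value of one attribute (person, gender, number, etc.), using a one-character code;
• if an attribute does not apply, the corresponding position in the string contains the special marker "-" (hyphen).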

By convention, trailing hyphens are not included in the lexical MSDs. Such specifications provide a simple and relatively compact encoding, and are similar in intent to the feature-structure encodings used in unification-based grammar formalisms.
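The positional encoding can be illustrated with a minimal Python sketch. It assumes that position 0 holds the category code and that a hyphen marks an attribute that does not apply; the attribute names in `NOUN_ATTRIBUTES` are hypothetical labels chosen for illustration, since the actual inventory is defined in the MULTEXT-EAST specifications and not in this section.

```python
# Hypothetical attribute names for positions 1..n of a noun MSD.
# The authoritative list lives in the MULTEXT-EAST specifications.
NOUN_ATTRIBUTES = ["type", "gender", "number", "case", "definiteness"]


def decode_msd(msd, attribute_names):
    """Return {attribute: value} for one MSD string.

    A hyphen, including the trailing hyphens dropped by convention,
    is reported as None ("attribute does not apply").
    """
    # Restore the trailing hyphens that lexical MSDs omit.
    padded = msd.ljust(len(attribute_names) + 1, "-")
    decoded = {"category": padded[0]}  # assumption: position 0 is the category
    for name, value in zip(attribute_names, padded[1:]):
        decoded[name] = None if value == "-" else value
    return decoded


if __name__ == "__main__":
    print(decode_msd("Ncms-n", NOUN_ATTRIBUTES))
    print(decode_msd("Np", NOUN_ATTRIBUTES))
```

Padding before decoding keeps short MSDs such as "Np" and full-length ones such as "Ncms-n" in the same shape, which is convenient when attributes are later compared position by position.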

The attributes, and especially their values, were chosen with only word-level encoding in mind. This is why some values that are described in grammar textbooks but presuppose compounding (such as compound tenses) were not considered in the MSD encoding.

For a given word-form several MSDs may be applicable (this is how homography is accounted for). The set of all MSDs applicable to a given word defines the MSD-ambiguity class of that word. The Romanian lexicon contains 1326 MSD-ambiguity classes (for more details on the Romanian dictionary and corpus encoding, and for several relevant statistics, see D. Tufis et al., "Corpora and Corpus-Based Morpho-Lexical Processing", in Tufis D., Andersen P. (eds.): Recent Advances in Romanian Language Technology, Editura Academiei, Bucharest, 1997).
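As a rough illustration of the definition, the sketch below groups word-forms by the set of MSDs that apply to them. The function name and the toy entries are assumptions made for the example; they are not taken from the actual lexicon or from its file format.

```python
from collections import defaultdict


def ambiguity_classes(entries):
    """Map each MSD-ambiguity class to the word-forms that share it.

    `entries` is an iterable of (word_form, msd) pairs; an ambiguity class
    is represented as a frozenset of MSD strings.
    """
    msds_by_form = defaultdict(set)
    for word_form, msd in entries:
        msds_by_form[word_form].add(msd)

    classes = defaultdict(set)
    for word_form, msds in msds_by_form.items():
        classes[frozenset(msds)].add(word_form)
    return dict(classes)


if __name__ == "__main__":
    # Toy entries for illustration only.
    toy_lexicon = [
        ("o", "Tifsr"), ("o", "I"),
        ("zi", "Ncfsrn"),
        ("ora", "Ncfsry"),
    ]
    for msd_set, forms in ambiguity_classes(toy_lexicon).items():
        print(sorted(msd_set), sorted(forms))
```

Counting the keys of the returned dictionary gives the number of ambiguity classes; for the full lexicon that count is the figure of 1326 reported above.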

The corpus used in the experiments and evaluation reported here consisted of the full texts of two books: Orwell's 1984 and Plato's Republic. A brief overview of these texts is given below:


[IMAGE ]

The existing SGML mark-up (TEI-conformant) was stripped off and the text was tokenised. Please note that a token is not necessarily a word: one orthographic word may be split into several tokens (the Romanian "da-mi-l", give it to me, is split into 3 tokens), or several orthographic words may be combined into one token (the Romanian words "de la" are combined into the single token "de_la"). Each lexical unit in the tokenised text was automatically annotated with all its applicable MSDs and then disambiguated by hand. This was the main resource, which we called the MSD-tagged corpus, for training and evaluating the tagger. The figure below shows an excerpt from the MSD-tagged corpus.

Într- 	Spsay
o	Tifsr
zi	Ncfsrn
senină	Afpfsrn
şi	Ccssp
friguroasă	Afpfsrn
de	Spsa
aprilie	Ncms-n
,	
pe	Spsa
când	Rw
ceasurile	Ncfpry
băteau	Vmii3p
ora	Ncfsry
treisprezece	Mc-p-l
,	
Winston	Npms-n
Smith	Np
,	
cu	Spsa
bărbia	Ncfsry
înfundată	Afpfsrn
în	Spsa
piept	Ncms-n
pentru	Spsa
a	Qn
scăpa	Vmnp
de	Spsa
vântul	Ncmsry
care	Dw3--r---e
-l	Pp3msa--y-----w
lua	Vmii3s
pe	Spsa
sus	Rgp
,	
se	Px3--a--------w
strecură	Vmis3s
iute	Rgp
prin	Spsa
uşile	Ncfpry
de	Spsa
sticlă	Ncfsrn
...
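A vertical text of this kind is straightforward to read programmatically. The following sketch assumes one token per line with a tab between the token and its annotation, and represents punctuation lines that carry no MSD with `None`; the function name `read_vertical` is hypothetical.

```python
def read_vertical(lines):
    """Parse 'token<TAB>annotation' lines into (token, annotation) pairs.

    Blank lines are skipped. A token without an annotation (for example
    a comma in the excerpt above) gets None as its annotation.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        token, _, annotation = line.partition("\t")
        pairs.append((token.strip(), annotation.strip() or None))
    return pairs


if __name__ == "__main__":
    sample = ["Într- \tSpsay", "o\tTifsr", "zi\tNcfsrn", ",\t"]
    for token, msd in read_vertical(sample):
        print(token, msd)
```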

The Tagset

The tagset for Romanian contains 79 tags for different morpho-syntactic categories, plus 8 tags for punctuation. The tagset was derived by a trial-and-error procedure from the 674 morpho-syntactic description codes (MSDs) defined for encoding the Romanian lexicon. By analysing how the MSDs cluster in the ambiguity classes, we progressively eliminated from the MSD codes those attributes whose removal reduced the cardinality of the MSD-ambiguity classes. If one thinks of an MSD as a (flat) feature structure, then eliminating attributes from a given MSD amounts to generalising it.
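The effect of eliminating attributes can be sketched as follows. This is only an illustration of the idea, not the tagset generator used in the project: `generalise` keeps a chosen subset of positions, and `max_class_size` reports the largest ambiguity class that remains. The toy lexicon, the placeholder word-form and the assumption that position 3 of an adjective MSD encodes gender are all made up for the example.

```python
def generalise(msd, kept_positions, width):
    """Project an MSD onto the positions in `kept_positions`.

    The MSD is first padded with hyphens to `width` characters, so that
    omitted trailing hyphens do not shift the positions.
    """
    padded = msd.ljust(width, "-")
    return "".join(padded[i] for i in kept_positions)


def max_class_size(lexicon, kept_positions, width):
    """Largest number of distinct generalised codes for any word-form.

    `lexicon` maps a word-form to the set of MSDs applicable to it.
    """
    return max(
        len({generalise(msd, kept_positions, width) for msd in msds})
        for msds in lexicon.values()
    )


if __name__ == "__main__":
    # Toy entry: one form whose two MSDs differ only at position 3.
    toy = {"form": {"Afpmsrn", "Afpfsrn"}}
    print(max_class_size(toy, tuple(range(7)), 7))  # all positions kept
    print(max_class_size(toy, (0, 4, 5, 6), 7))     # positions 1-3 dropped
```

Comparing the two calls shows the trade-off the procedure explores: dropping a position merges codes that differ only there, which shrinks the ambiguity classes at the price of a coarser description.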

The generalisation process was performed according to the following general guidelines:

¥

Let us consider a few examples to clarify this methodological procedure. For nouns and adjectives the attributes `case' and `number' were preserved because, when an adjective modifies a noun, their case and number must agree. Since nouns and adjectives are often ambiguous with respect to these attributes when considered in isolation, but rarely when considered in co-occurrence, preserving these attributes proved extremely helpful for the performance of the tagger. On the other hand, gender (which is also subject to the agreement requirement) is very rarely ambiguous. The gender of a noun, adjective, pronoun, determiner, article, numeral or participle is almost always recoverable from the word-form itself, and in those very rare cases when it is not, the immediate context (plus the agreement rule) does the job. We have not found in our corpora even one instance where this would not hold. Gender was preserved in the initial tagset, but when it was removed the number of tags decreased significantly and the accuracy of the tagger increased.

The `definiteness' attribute is also fully recoverable from the word-form but, given the grammatical constraints, keeping it was very useful in helping the tagger to discriminate among nouns, adjectives and participles (e.g. frumosul baiat versus frumosul baiatului, iubitul baiat versus iubitul baiatului, or baiatul iubit).

With verbs, the distinction was preserved between finite verbs (main and auxiliary) and non-finite verbs (infinitive, participle and gerund). For finite verbs the attribute `person' was preserved; the `number' attribute was kept only for auxiliaries. Eliminating the `tense' attribute dramatically decreased the number and the cardinality of the ambiguity classes. Eliminating the `mood' attribute decreased them further.

The most problematic categories were pronoun and determiner (this class contains what grammar books traditionally call adjectival pronouns and due to several commonalities it was merged with the pronominal class). As with the nominal categories, we opted for preserving the `case' and `number' attributes.

However, with only these two criteria the tagset generator classified most of the pronominal MSDs into functionally and distributionally unrelated classes. The only exceptions were the reflexive pronouns and determiners in the accusative or dative, which clustered together (they are explicitly marked for these cases and are insensitive to number). The next step was to classify the pronominal and determiner MSDs according to their type, number and case. With these new criteria the number of tags remained the same, but the clustering of the MSDs was more linguistically motivated. Then, by observing the MSD-ambiguity classes, we manually assigned different tags to a small number of MSDs, so that hard statistical disambiguation could be avoided. The most obvious case was the distinction between the numeral and article readings of the words un (a/an or one, masculine) and o (a/an or one, feminine). Identifying the numeral readings of these words would have required semantic criteria (for instance, defining a subclass of "measure unit" nouns), so we decided to classify them only as articles.

Similar considerations applied for all the other word classes. In the table below the actual tagset is given.

1.   A    = Adjective
2.   AN   = Adjective, indefinite
3.   APN  = Adjective, plural, indefinite
4.   APON = Adjective, plural, oblique, indefinite
5.   APOY = Adjective, plural, oblique, definite
6.   APRY = Adjective, plural, direct, definite
7.   ASN  = Adjective, singular, indefinite
8.   ASON = Adjective, singular, oblique, indefinite
9.   ASOY = Adjective, singular, oblique, definite
10.  ASRY = Adjective, singular, direct, definite
11.  ASVN = Adjective, singular, vocative, indefinite
12.  ASVY = Adjective, singular, vocative, definite
13.  C    = Conjunction
14.  CVR  = Conjunction or Adverb
15.  I    = Interjection
16.  M    = Numeral
17.  NP   = Proper Noun
18.  NN   = Common Noun, singular
19.  NPN  = Common Noun, plural, indefinite
20.  NPOY = Common Noun, plural, oblique, definite
21.  NPRN = Common Noun, plural, direct, indefinite
22.  NPRY = Common Noun, plural, direct, definite
23.  NPVY = Common Noun, plural, vocative, definite
24.  NSN  = Common Noun, singular, indefinite
25.  NSON = Common Noun, singular, oblique, indefinite
26.  NSOY = Common Noun, singular, oblique, definite
27.  NSRN = Common Noun, singular, direct, indefinite
28.  NSRY = Common Noun, singular, direct, definite
29.  NSVN = Common Noun, singular, vocative, indefinite
30.  NSVY = Common Noun, singular, vocative, definite
31.  NSY  = Common Noun, singular, definite
32.  PI   = Quantifier Pronoun or Determiner (Indefinite or negative)
33.  PXA  = Reflexive Pronoun, accusative
34.  PXD  = Reflexive Pronoun, dative
35.  PSP  = Pronoun or Determiner, possessive or emphatic, plural
36.  PSS  = Pronoun or Determiner, possessive or emphatic, singular
37.  PPPA = Personal Pronoun, plural, accusative, weak form
38.  PPPD = Personal Pronoun, plural, dative
39.  PPSA = Personal Pronoun, singular, accusative
40.  PPSD = Personal Pronoun, singular, dative
41.  PPSN = Personal Pronoun, singular, nominative, non-third person
42.  PPSO = Personal Pronoun, singular, oblique
43.  PPSR = Personal Pronoun, singular, direct
44.  PPPO = Personal Pronoun, plural, oblique
45.  PPPR = Personal Pronoun, plural, direct
46.  RELO = Pronoun or Determiner, relative, oblique
47.  RELR = Pronoun or Determiner, relative, direct
48.  DMPO = Pronoun or Determiner, demonstrative, plural, oblique
49.  DMSO = Pronoun or Determiner, demonstrative, singular, oblique
50.  DMPR = Pronoun or Determiner, demonstrative, plural, direct
51.  DMSR = Pronoun or Determiner, demonstrative, singular, direct
52.  QN   = Infinitival Particle
53.  QS   = Subjunctive Particle
54.  QF   = Future Particle
55.  QZ   = Negative Particle
56.  R    = Adverb
57.  S    = Preposition
58.  TP   = Article, indefinite or possessive, plural
59.  TPO  = Article, non-possessive, plural, oblique
60.  TPR  = Article, non-possessive, plural, direct
61.  TS   = Article, definite or possessive, singular
62.  TSO  = Article, non-possessive, singular, oblique
63.  TSR  = Article, non-possessive, singular, direct
64.  V1   = Verb, main, 1st person
65.  V2   = Verb, main, 2nd person
66.  V3   = Verb, main, 3rd person
67.  VA1  = Verb, auxiliary, 1st person
68.  VA1P = Verb, auxiliary, 1st person, plural
69.  VA1S = Verb, auxiliary, 1st person, singular
70.  VA2P = Verb, auxiliary, 2nd person, plural
71.  VA2S = Verb, auxiliary, 2nd person, singular
72.  VA3  = Verb, auxiliary, 3rd person
73.  VA3P = Verb, auxiliary, 3rd person, plural
74.  VA3S = Verb, auxiliary, 3rd person, singular
75.  VG   = Verb, gerund
76.  VN   = Verb, infinitive
77.  VP   = Verb, participle
78.  X    = Residual
79.  Y    = Abbreviation

The Training Corpus

The training corpus (see example below) was generated from the MSD-annotated tokenised corpus by substituting the MSDs with their corresponding corpus tags (described in the previous section). We called this kind of resource an MSD-corpus.

Într- 	S
o	TSR
zi	NSRN
senină	ASRN
şi	CVR
friguroasă	ASRN
de	S
aprilie	NSN
,	COMMA
pe	S
când	R
ceasurile	NPRY
băteau	V3
ora	NSRY
treisprezece	M
,	COMMA
Winston	NSN
Smith	N
,	COMMA
cu	S
bărbia	NSRY
înfundată	ASRN
în	S
piept	NSN
pentru	S
a	Qn
scăpa	VN
de	Sp
vântul	NSRY
care	PR
-l	PSA
lua	V3
pe	S
sus	R
,	COMMA
se	PA
strecură	V3
iute	R
prin	S
uşile	NPRY
de	S
sticlă	NSRN
...
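The substitution step amounts to a table look-up. In the sketch below, the entries of `MSD_TO_TAG` are read off the two excerpts shown in this section (the same tokens with their MSDs and with their corpus tags); the complete MSD-to-tag table is not reproduced here, and the function name and the choice to raise an error on an unknown MSD are assumptions.

```python
# Partial mapping read off the two corpus excerpts; the full table that
# maps the MSDs to the corpus tags is not given in this section.
MSD_TO_TAG = {
    "Tifsr": "TSR",
    "Ncfsrn": "NSRN",
    "Afpfsrn": "ASRN",
    "Ncms-n": "NSN",
    "Ncfpry": "NPRY",
    "Ncfsry": "NSRY",
    "Vmii3p": "V3",
    "Mc-p-l": "M",
    "Rw": "R",
}


def msd_corpus_to_tag_corpus(pairs, msd_to_tag):
    """Replace the MSD of each (token, msd) pair with its corpus tag.

    An MSD missing from the table raises KeyError, so that gaps in the
    mapping are noticed instead of being passed on silently.
    """
    return [(token, msd_to_tag[msd]) for token, msd in pairs]


if __name__ == "__main__":
    excerpt = [("ceasurile", "Ncfpry"), ("băteau", "Vmii3p"), ("ora", "Ncfsry")]
    print(msd_corpus_to_tag_corpus(excerpt, MSD_TO_TAG))
```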

Of the training corpus, 80% was retained for training proper, and the remaining 20% (the first parts of both 1984 and the Republic) was used for validation.

The Training Process

The training process is merely a reformatting exercise. The data contained in the training corpus are sorted and filtered twice: once to extract the lexicon (if it does not already exist) and once to extract the tag trigrams. The relevant information is extracted mostly with standard Unix text-processing tools and two special-purpose programs. As the final step, the three-letter word endings are extracted from the lexicon and a `guess list' is created from them.
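The three extraction steps can be approximated in a few lines of Python. This is a stand-in for the Unix tools and special-purpose programs mentioned above, not a description of them: the data structures, the function name, the decision to store tag counts in the guess list, and the fact that trigrams are counted across sentence boundaries are all simplifying assumptions.

```python
from collections import Counter, defaultdict


def extract_resources(tagged_tokens, suffix_length=3):
    """Derive a lexicon, tag-trigram counts and a guess list.

    `tagged_tokens` is a list of (word_form, tag) pairs in corpus order.
    """
    # Lexicon: for each word-form, how often each tag was observed.
    lexicon = defaultdict(Counter)
    for word, tag in tagged_tokens:
        lexicon[word][tag] += 1

    # Tag trigrams: counts of three consecutive tags.
    tags = [tag for _, tag in tagged_tokens]
    trigrams = Counter(
        tuple(tags[i:i + 3]) for i in range(len(tags) - 2)
    )

    # Guess list: tag counts pooled by word ending, taken from the lexicon.
    guess_list = defaultdict(Counter)
    for word, tag_counts in lexicon.items():
        if len(word) >= suffix_length:
            guess_list[word[-suffix_length:]].update(tag_counts)

    return lexicon, trigrams, guess_list


if __name__ == "__main__":
    corpus = [("ceasurile", "NPRY"), ("băteau", "V3"),
              ("ora", "NSRY"), ("treisprezece", "M")]
    lexicon, trigrams, guess_list = extract_resources(corpus)
    print(dict(trigrams))
    print({ending: dict(counts) for ending, counts in guess_list.items()})
```

Pooling counts by ending is one plausible way to let a tagger propose tags for word-forms that are missing from the lexicon; how the guess list is actually used by the tagger is not specified in this section.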

The tagger is ready for use with the new resources as soon as the data files are accessible to the server program (it is not even necessary to restart the server). The output format of the tagger, which we call the tag-corpus, is a vertical text listing all possible tags for each token; each tag has a probability value assigned, and the tags are sorted so that the most likely tag comes first (see example below).

...
o [TSR:-4.7998][PSA:-7.1013][Qf:-9.839][VA3S:-11.4813][I:-11.8867] 
zi [NSRN:-1.0662][V2:-10.6697] 
...

This makes it possible to judge the confidence with which a tag has been assigned: if the difference in probability between the first tag and the next tag(s) is comparatively large, the decision is more certain than when the probabilities are nearly equal.

Apart from assessing confidence, this can also be used as a starting point for manual correction of the text, since words whose top tags have similar probabilities are more likely to contain errors.
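A possible way to exploit this is sketched below: each output line is parsed into its candidate tags, and the gap between the two best scores serves as a simple confidence margin. The regular expression is inferred from the two sample lines above, the scores are assumed to be log-probabilities (higher means more likely), and the function names and the threshold are hypothetical.

```python
import re

# Matches one "[TAG:score]" group as it appears in the sample output.
CANDIDATE = re.compile(r"\[([^:\]]+):(-?\d+(?:\.\d+)?)\]")


def parse_tagger_line(line):
    """Split an output line into the token and its (tag, score) candidates."""
    token, _, rest = line.strip().partition(" ")
    candidates = [(tag, float(score)) for tag, score in CANDIDATE.findall(rest)]
    return token, candidates


def confidence_margin(candidates):
    """Score gap between the best and second-best tag.

    A token with a single candidate is treated as fully certain.
    """
    if len(candidates) < 2:
        return float("inf")
    return candidates[0][1] - candidates[1][1]


def needs_review(line, threshold=1.0):
    """Flag a token for manual checking when the margin is small.

    The threshold is an arbitrary illustrative value.
    """
    _, candidates = parse_tagger_line(line)
    return confidence_margin(candidates) < threshold


if __name__ == "__main__":
    for line in ["o [TSR:-4.7998][PSA:-7.1013][Qf:-9.839]",
                 "zi [NSRN:-1.0662][V2:-10.6697]"]:
        token, candidates = parse_tagger_line(line)
        print(token, candidates[0][0], confidence_margin(candidates))
```

Sorting the tokens of a text by this margin, smallest first, would give a reviewer the most doubtful decisions at the top of the list.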



Multext-East