
The Algorithm

QTAG works by combining two sources of information: a dictionary of words with their possible tags and the corresponding frequencies, and a matrix of tag sequences, also with associated frequencies. Both resources can easily be generated from a pre-tagged corpus.
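As a rough illustration of how both resources could be derived from a pre-tagged corpus, the following sketch counts word/tag pairs and tag sequences. The function name `build_resources`, the toy corpus, and the choice of sequences of three tags (suggested by the three-token window) are assumptions made for this example, not details taken from QTAG itself.

```python
from collections import Counter, defaultdict

def build_resources(tagged_corpus):
    """Collect word/tag and tag-sequence frequencies from (word, tag) pairs."""
    lexicon = defaultdict(Counter)   # word -> Counter mapping tag -> frequency
    trigrams = Counter()             # (tag1, tag2, tag3) -> frequency
    for word, tag in tagged_corpus:
        lexicon[word][tag] += 1
    tags = [tag for _, tag in tagged_corpus]
    for i in range(len(tags) - 2):
        trigrams[tuple(tags[i:i + 3])] += 1
    return lexicon, trigrams

# Toy corpus, invented for illustration only
corpus = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
          ("the", "DET"), ("bark", "NOUN"), ("peels", "VERB")]
lexicon, trigrams = build_resources(corpus)
```

A real corpus would also need a policy for case, punctuation, and sentence boundaries, none of which is specified here.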

The tagger works on a window of three tokens, which is padded with two dummy words at the beginning and at the end of the text. Each token that is read is added to the window, which is then shifted one position to the left. The token that 'falls' out of the window is assigned its final tag.
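The movement of the window can be sketched as follows. The `DUMMY` marker, the function name `windows`, and the use of a bounded `deque` are assumptions of this example. With exactly two trailing dummy words the last real token is still inside the window when the input ends, so a real implementation would need some final flush step, which the text does not describe.

```python
from collections import deque

DUMMY = "<dummy>"  # hypothetical padding marker

def windows(tokens, size=3, pad=2):
    """Yield (leaving_token, window) each time the full window shifts."""
    window = deque(maxlen=size)
    for tok in [DUMMY] * pad + list(tokens) + [DUMMY] * pad:
        # The leftmost token falls out only once the window is full.
        leaving = window[0] if len(window) == size else None
        window.append(tok)  # a full deque with maxlen drops its leftmost item
        if len(window) == size:
            yield leaving, tuple(window)

for leaving, window in windows(["a", "b", "c", "d"]):
    print(leaving, window)
```

In this sketch the token reported as `leaving` is the one that would receive its final tag at that step.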

The tagging procedure is as follows:

read the next token
look it up in the dictionary
if not found, guess possible tags
for each possible tag:
   1. calculate the probability that the token has the specified tag
   2. calculate the probability that the tag follows the two tags preceding it in the window
   3. calculate the joint probability of the individual tag assignment together with the contextual probability
repeat the computation for the tags of the other two tokens in the window, but using different values for the contextual probability: the probability of the tag being surrounded by the two other tags, and the probability of its being followed by them, respectively.
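The three numbered calculations might look like the sketch below. It assumes that both probabilities are estimated as plain relative frequencies and that the joint probability is their product; the text does not state either detail, and the function names and toy counts are invented for illustration. Guessing tags for unknown words is left out because the method is not described here.

```python
from collections import Counter

def lexical_prob(lexicon, word, tag):
    """Relative frequency of `tag` among all taggings of `word`."""
    counts = lexicon.get(word, Counter())
    total = sum(counts.values())
    return counts[tag] / total if total else 0.0

def contextual_prob(trigrams, t1, t2, tag):
    """Relative frequency of `tag` after the tag pair (t1, t2)."""
    total = sum(n for (a, b, _), n in trigrams.items() if (a, b) == (t1, t2))
    return trigrams[(t1, t2, tag)] / total if total else 0.0

def score_tags(lexicon, trigrams, word, t1, t2):
    """Joint score per candidate tag (assumed here to be a simple product)."""
    return {tag: lexical_prob(lexicon, word, tag)
                 * contextual_prob(trigrams, t1, t2, tag)
            for tag in lexicon.get(word, {})}

# Toy frequencies, invented for illustration only
lexicon = {"bark": Counter({"NOUN": 3, "VERB": 1})}
trigrams = Counter({("DET", "ADJ", "NOUN"): 8, ("DET", "ADJ", "VERB"): 2})
scores = score_tags(lexicon, trigrams, "bark", "DET", "ADJ")
```

The repeated computations for the other two window positions would reuse `lexical_prob` but look up the candidate tag in the middle or first slot of the tag sequence instead of the last.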

For each recalculation (three for each token) the resulting probabilities are combined to give the overall probability of the tag being assigned to the token. As these values become very small very quickly, they are represented internally as logarithms to the base 10. For output, the tags are sorted according to their probability, and the difference in probability between the tags gives some measure of confidence that the assigned tag is correct.
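A minimal sketch of the logarithmic representation and the ranking step follows. It assumes that "combined" means multiplying the three per-position probabilities (that is, adding their base-10 logarithms); the candidate values and the names `combine_log10` and `margin` are hypothetical.

```python
import math

def combine_log10(probs):
    """Add base-10 logs, which equals the log of the product of the inputs."""
    # A zero probability has no finite logarithm; map it to -inf here.
    return sum(math.log10(p) if p > 0 else float("-inf") for p in probs)

# Hypothetical per-position probabilities for two candidate tags of one token
candidates = {"NOUN": [0.6, 0.5, 0.4], "VERB": [0.05, 0.1, 0.2]}

# Sort candidates from most to least probable
ranked = sorted(((combine_log10(p), tag) for tag, p in candidates.items()),
                reverse=True)
best, runner_up = ranked[0], ranked[1]

# Gap between the two best log scores, used as a rough confidence measure
margin = best[0] - runner_up[0]
```

A larger `margin` means the best tag is further ahead of its nearest competitor; since the scores are logarithms, a margin of 1 corresponds to a factor of ten in probability.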
