There are two main approaches to part-of-speech tagging: rule-based and probabilistic. The tagger presented in this document belongs to the purely probabilistic ones. That means that for disambiguating tags within a text it uses only probabilities, and no rule-based mechanism.
The first step in any tagging process is to look up the token to be tagged in a dictionary. If the token cannot be found, the tagger has to have some fallback mechanism, such as a morphological component or some heuristic methods. The difficult task is dealing with ambiguity: only in trivial cases will there be exactly one tag per word.
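The lookup-plus-fallback step can be sketched as follows. This is a minimal illustration, not QTAG's actual lexicon or heuristics: the dictionary entries and the suffix rules are invented for the example.

```python
# Hypothetical lexicon mapping tokens to their candidate tags;
# the entries and tag names are illustrative only.
LEXICON = {
    "the": ["DT"],
    "run": ["VB", "NN"],   # ambiguous: base-form verb or noun
    "dog": ["NN"],
}

def lookup_tags(token):
    """Return candidate tags for a token, guessing for unknown words."""
    if token in LEXICON:
        return LEXICON[token]
    # Fallback: crude morphological heuristics for unknown tokens
    # (a stand-in for a real morphological component).
    if token.endswith("ing"):
        return ["VBG"]
    if token.endswith("s"):
        return ["NNS", "VBZ"]
    return ["NN"]          # default guess: common noun
```

For a known ambiguous word such as "run", the lookup returns both candidate tags, and it is the disambiguation stage described below that must choose between them.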
This is where the two approaches differ: while the rule-based approach applies linguistic knowledge (usually encoded in rules) in order to rule out illegal tag combinations, a probabilistic tagger determines which of the possible tag sequences is the most probable, using a language model based on the frequencies of transitions between different tags.
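The contrast can be made concrete with a toy example for a determiner followed by an ambiguous word. Both the rule and the transition frequencies below are invented for illustration:

```python
# Candidate tag pairs for a two-word phrase like "the run".
candidates = [("DT", "VB"), ("DT", "NN")]

# Rule-based style: forbid illegal combinations outright
# (here, a base-form verb directly after a determiner).
ILLEGAL_PAIRS = {("DT", "VB")}
rule_based = [c for c in candidates if c not in ILLEGAL_PAIRS]

# Probabilistic style: rank the pairs by transition frequency
# (numbers invented; a real model derives them from a corpus).
TRANSITION_FREQ = {("DT", "NN"): 0.47, ("DT", "VB"): 0.01}
probabilistic = max(candidates, key=lambda c: TRANSITION_FREQ.get(c, 0.0))
```

Both approaches prefer the noun reading here, but by different means: the rule eliminates the verb reading categorically, while the probabilistic model merely ranks it as far less likely.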
A rule-based language model can be created by a human using linguistic knowledge, but it is not possible to `hand-code' a probabilistic language model. These models are generally created from training data, i.e. they learn by example. This is the way QTAG works.
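Learning by example here amounts to counting. The following sketch, under the assumption of a simple trigram model, estimates transition probabilities from a tagged corpus by counting tag triplets and normalising; the toy corpus is invented:

```python
from collections import defaultdict

def train_transitions(tagged_sentences):
    """Estimate P(t3 | t1, t2) from tag sequences by counting trigrams."""
    counts = defaultdict(lambda: defaultdict(int))
    for tags in tagged_sentences:
        padded = ["<s>", "<s>"] + tags            # sentence-start padding
        for t1, t2, t3 in zip(padded, padded[1:], padded[2:]):
            counts[(t1, t2)][t3] += 1
    # Turn raw counts into relative frequencies per context.
    probs = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values())
        probs[ctx] = {t: c / total for t, c in nxt.items()}
    return probs

# Invented two-sentence training corpus of tag sequences.
corpus = [["DT", "NN", "VBZ"], ["DT", "NN", "VBD"]]
model = train_transitions(corpus)
# On this toy corpus, P(VBZ | DT, NN) = 0.5
```

A real training corpus would of course be far larger, and a production tagger would also need smoothing for trigrams that never occur in the training data; both are omitted here for brevity.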
The tagger used for the experiment presented here was QTAG, a probabilistic tagger which had previously been used at Corpus Research in combination with an email interface.[+] It was re-implemented in order to make it language-independent.
The basic algorithm is fairly straightforward: first, the tagger looks up all possible tags for the current word, together with their respective probabilities. This information (which holds for the word in isolation) is then combined with the probability of each tag occurring in a sequence preceded by the two previous tags. The tag with the highest combined score is selected. Two further processing steps also take into account the scores of the tag as the second and first element of the triplet as the following two words are evaluated.
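The central scoring step of this algorithm can be sketched as follows. This is a simplified single-pass view that combines a word's lexical tag probabilities with the trigram transition probability given the two preceding tags; the two later re-scoring passes mentioned above are omitted, and all probabilities are invented:

```python
def score_tags(lexical_probs, transition_probs, prev2, prev1):
    """Pick the tag maximising P(tag | word) * P(tag | prev2, prev1)."""
    scores = {}
    for tag, p_lex in lexical_probs.items():
        # Small floor probability for unseen transitions (naive smoothing).
        p_trans = transition_probs.get((prev2, prev1), {}).get(tag, 1e-6)
        scores[tag] = p_lex * p_trans
    best = max(scores, key=scores.get)
    return best, scores

# Invented probabilities for an ambiguous noun/verb word
# following a determiner and an adjective.
lexical = {"NN": 0.6, "VB": 0.4}                        # P(tag | word)
transitions = {("DT", "JJ"): {"NN": 0.8, "VB": 0.05}}   # P(tag | context)
best, scores = score_tags(lexical, transitions, "DT", "JJ")
# In this context the noun reading gets the higher combined score
```

After a determiner and an adjective, the transition model strongly favours a noun, so the combined score selects "NN" even though the word alone would already lean that way; in less clear-cut cases, it is this combination of lexical and contextual evidence that resolves the ambiguity.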