To enable the tagger to work with Romanian data, all hard-coded elements specific to English had to be removed. This mainly concerned the built-in morphological component that dealt with unknown words, ie those not contained in the lexicon. All other information is held in external resource files whose format is not language-specific.
A separate issue is the handling of non-ASCII characters. The approach chosen was to represent them by their SGML entity names, a literal description enclosed between an ampersand and a semicolon, eg &auml;. As the input data for the tagger is assumed to be tokenised (ie tokens are separated by white space), the impact of an extended character on segmentation was not considered.
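The entity encoding described above can be sketched in a few lines of Java. This is only an illustration: the class name EntityEncoder and the small character table are assumptions, and the table covers just a handful of ISO 8879 entity names rather than the full set the tagger's data would need.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the tagger described in the text.
class EntityEncoder {
    // Illustrative subset of ISO 8879 entity names (assumption: the real table is larger).
    private static final Map<Character, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put('\u00e4', "auml");   // a with diaeresis
        ENTITIES.put('\u00e2', "acirc");  // a with circumflex
        ENTITIES.put('\u00ee', "icirc");  // i with circumflex
        ENTITIES.put('\u0103', "abreve"); // a with breve
        ENTITIES.put('\u015f', "scedil"); // s with cedilla
        ENTITIES.put('\u0163', "tcedil"); // t with cedilla
    }

    // Replaces every character found in the table by "&name;".
    // Characters without an entry are copied through unchanged.
    static String encode(String token) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            String name = ENTITIES.get(c);
            if (name != null) {
                out.append('&').append(name).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Because the replacement contains no white space, an encoded token remains a single token for a tagger that splits its input on white space.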
To make the tagger more flexible for future use with further languages, the internal lexicon and context information were moved out of the processing core, and the tagger was re-implemented following a client-server model: the client requests all necessary data from the server through a special protocol, which makes it easy to switch to a different tagset (ie a different language).
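The text does not specify the protocol, so the following Java sketch only illustrates the division of labour: the client holds no language data and obtains lexicon entries by querying a server. All names (ResourceServer, InMemoryServer, TaggerClient), the query format "LEX word", the reply format "TAG=prob TAG=prob" and the tag labels are assumptions made for this example; the server here lives in memory, whereas the real one is a separate program.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the server side: answers one textual query per call.
interface ResourceServer {
    String request(String query);
}

// In-memory server holding the lexicon of one language (assumption: a real
// server would be reached over a network connection instead).
class InMemoryServer implements ResourceServer {
    private final Map<String, String> lexicon = new HashMap<>();

    InMemoryServer add(String word, String tagProbs) {
        lexicon.put(word, tagProbs);
        return this;
    }

    @Override
    public String request(String query) {
        // Assumed query format: "LEX <word>"; the reply is empty for unknown words.
        if (query.startsWith("LEX ")) {
            return lexicon.getOrDefault(query.substring(4), "");
        }
        return "";
    }
}

// The client knows nothing about any particular language or tagset.
class TaggerClient {
    private final ResourceServer server;

    TaggerClient(ResourceServer server) {
        this.server = server;
    }

    // Asks the server for the tag probabilities of a word.
    Map<String, Double> lookup(String word) {
        Map<String, Double> result = new LinkedHashMap<>();
        String reply = server.request("LEX " + word);
        if (reply.isEmpty()) {
            return result;
        }
        for (String pair : reply.split(" ")) {
            String[] kv = pair.split("=");
            result.put(kv[0], Double.parseDouble(kv[1]));
        }
        return result;
    }
}
```

In this sketch, switching to another language or tagset amounts to constructing the client with a different server; the client code itself does not change.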
The hard-coded morphology was replaced by another resource: a list of the final three letters of each word in the lexicon, together with the respective tag probabilities. This mechanism seems sufficiently general to yield good results, at little cost, even in languages with a rather complex morphology.
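A minimal Java sketch of such a suffix resource follows. The text does not say how the probabilities are derived, so the sketch assumes that tag counts of all lexicon words sharing the same final three letters are summed and then normalised; the class name SuffixGuesser, its methods and the tag labels are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical unknown-word guesser based on three-letter word endings.
class SuffixGuesser {
    // suffix -> (tag -> accumulated count)
    private final Map<String, Map<String, Double>> counts = new HashMap<>();

    // Final three letters of the word, or the whole word if it is shorter.
    static String suffix(String word) {
        return word.length() <= 3 ? word : word.substring(word.length() - 3);
    }

    // Records one lexicon entry: the word was seen with this tag `count` times.
    void observe(String word, String tag, double count) {
        counts.computeIfAbsent(suffix(word), k -> new HashMap<>())
              .merge(tag, count, Double::sum);
    }

    // Returns normalised tag probabilities for an unknown word,
    // or an empty map if its suffix never occurred in the lexicon.
    Map<String, Double> guess(String unknownWord) {
        Map<String, Double> probs = new HashMap<>();
        Map<String, Double> tagCounts = counts.get(suffix(unknownWord));
        if (tagCounts == null) {
            return probs;
        }
        double total = 0.0;
        for (double c : tagCounts.values()) {
            total += c;
        }
        for (Map.Entry<String, Double> e : tagCounts.entrySet()) {
            probs.put(e.getKey(), e.getValue() / total);
        }
        return probs;
    }
}
```

Nothing in this lookup is tied to a particular language: the same code serves any lexicon from which a suffix list has been built.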
As the tagging program had to be changed anyway, it was re-implemented in Java so that the client can run on different machines more easily. The server component is still written in C. The three relevant (executable) byte-code files total less than ten kilobytes.
Furthermore, since the tagger is implemented as a class, it can easily be used as a module in other linguistic applications.