§segmentation
|
The texts are manually segmented into
translation units, encoded as "anonymous blocks", and
then automatically segmented into sentences, words,
punctuation marks and whitespace.
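
As an illustration of this hierarchy, the following minimal sketch counts the units in one translation unit and rebuilds the sentence text. The element names (ab, s, w, pc, c) are assumed TEI-style names and the fragment is invented for the example; neither is taken from the corpus files.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one translation unit ("anonymous block") holding
# two sentences. Element names ab/s/w/pc/c are assumptions, not confirmed
# by the corpus description.
SAMPLE = (
    "<ab>"
    "<s><w>Hello</w><c> </c><w>world</w><pc>.</pc></s>"
    "<s><w>Bye</w><pc>!</pc></s>"
    "</ab>"
)

def count_units(xml_text):
    """Count sentences, words, punctuation marks and whitespace elements."""
    root = ET.fromstring(xml_text)
    counts = {"s": 0, "w": 0, "pc": 0, "c": 0}
    for elem in root.iter():
        if elem.tag in counts:
            counts[elem.tag] += 1
    return counts

def sentence_text(sentence):
    """Rebuild the running text of a sentence from its child elements."""
    return "".join(sentence.itertext())

if __name__ == "__main__":
    print(count_units(SAMPLE))
    for s in ET.fromstring(SAMPLE).iter("s"):
        print(sentence_text(s))
```

Because whitespace is encoded explicitly, joining the text of all child elements is enough to recover the original sentence string in this sketch.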
|
§interpretation
|
The text has been automatically
tokenised, part-of-speech tagged and lemmatised. For
Slovene, the ToTrTaLe tool was used, while English was
processed with TreeTagger using the Penn Treebank model.
Two tags are given for each word. For Slovene, @ctag gives
the reduced SPOOK tag, while @ana gives the complete JOS
morphosyntactic tag. For English, @ctag gives the
original TreeTagger (Penn) PoS tag, while @ana gives the
equivalent SPOOK tag to which it maps.
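
A minimal sketch of how the two tags might be read from an English sentence follows. It assumes that words are w elements carrying ctag and ana attributes, and that the lemma is stored in a lemma attribute; the sample fragment is invented, and the ana values are placeholders rather than real SPOOK tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical English fragment. NNS and VBP are Penn Treebank tags;
# the ana values are placeholders, NOT real SPOOK tags. The w element
# and the lemma attribute are assumptions made for illustration.
SAMPLE = (
    "<s>"
    '<w lemma="dog" ctag="NNS" ana="PLACEHOLDER-1">dogs</w>'
    "<c> </c>"
    '<w lemma="bark" ctag="VBP" ana="PLACEHOLDER-2">bark</w>'
    "<pc>.</pc>"
    "</s>"
)

def read_tags(xml_text):
    """Return (form, lemma, ctag, ana) for every word element."""
    root = ET.fromstring(xml_text)
    rows = []
    for w in root.iter("w"):
        # .get() returns None when an attribute is absent
        rows.append((w.text, w.get("lemma"), w.get("ctag"), w.get("ana")))
    return rows

if __name__ == "__main__":
    for form, lemma, ctag, ana in read_tags(SAMPLE):
        print(form, lemma, ctag, ana, sep="\t")
```

The same function would apply to Slovene text, where @ctag would hold the reduced SPOOK tag and @ana the complete JOS morphosyntactic tag.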
|