Available for research purposes upon receipt of agreement.
In published work based on this resource, please cite the appropriate publication listed on the home page of the project.
The novels have their markup normalised: a) structure
annotation with div and p (attributes xml:id and type); b)
segmentation annotation with s (attribute xml:id); c) tokenisation
annotation with w and c (attribute type); d) linguistic
annotation with the w attributes lemma and ana.
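The four annotation layers above can be read with a few lines of Python. This is only a minimal sketch: the fragment below is a made-up example, not taken from the corpus, and it assumes the elements carry no XML namespace. The xml:id values, the type values, the lemmas and the MSD strings are all illustrative.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment; ids, type values and MSDs are illustrative only.
SAMPLE = """
<div xml:id="d1" type="part">
  <p xml:id="d1.p1">
    <s xml:id="d1.p1.s1">
      <w lemma="it" ana="#Pp3ns">It</w>
      <w lemma="be" ana="#Vmis3s">was</w>
      <w lemma="cold" ana="#Afp">cold</w>
      <c type="punct">.</c>
    </s>
  </p>
</div>
""".strip()


def read_sentences(xml_text):
    """Return a list of (sentence id, tokens) pairs.

    Each token is a tuple (tag, text, lemma, ana); lemma and ana are
    None for punctuation tokens (c), which carry no linguistic annotation.
    """
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        tokens = []
        for tok in s:
            if tok.tag == "w":
                tokens.append(("w", tok.text, tok.get("lemma"), tok.get("ana")))
            elif tok.tag == "c":
                tokens.append(("c", tok.text, None, None))
        sentences.append((s.get(XML_ID), tokens))
    return sentences


sentences = read_sentences(SAMPLE)
for sent_id, tokens in sentences:
    print(sent_id, [text for _, text, _, _ in tokens])
```

If the actual files declare a default namespace, the tag names in the sketch would need to be qualified accordingly.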
Segmentation into paragraphs follows the printed sources;
it is therefore not 1-1 with the English original. Segmentation into
sentences was performed automatically and then hand-validated.
Tokenisation into words and punctuation symbols was performed
on the basis of MULTEXT-East lexica, mostly with the MULTEXT tool
'mtseg', and then hand-validated.
The linguistic interpretation of the text consists of
marking up the word tokens with their context-disambiguated lemma and
MULTEXT-East morphosyntactic description. The texts have
undergone differing amounts of validation, so error rates vary
between them.
The MULTEXT-East morphosyntactic descriptions (MSDs) follow
the revised MULTEXT-East/Mondilex common tables of lexical
specifications. The lexical MSDs have been converted to an fslib,
a feature-structure library, while their decomposition into features
is given in an flib, a feature library.
The words in the texts have their MSD encoded as the value
of the ana (#IDREF) attribute. This attribute refers to an fs, which, in
turn, refers via its feats attribute (#IDREFS) to the f elements that define it.
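The chain of references from ana to fs to f can be followed as in the sketch below. It rests on several assumptions: the library fragment is invented for illustration, the wrapper element lib and the element names flib and fslib simply mirror the terms used above, each f is assumed to carry a name attribute and a symbol child holding the value, and references are assumed to be written either as bare ids or with a leading '#'.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical libraries; ids, feature names and values are illustrative.
LIBRARIES = """
<lib>
  <flib>
    <f xml:id="f.cat.noun" name="Category"><symbol value="Noun"/></f>
    <f xml:id="f.type.common" name="Type"><symbol value="common"/></f>
    <f xml:id="f.num.sg" name="Number"><symbol value="singular"/></f>
  </flib>
  <fslib>
    <fs xml:id="Ncns" feats="#f.cat.noun #f.type.common #f.num.sg"/>
  </fslib>
</lib>
""".strip()


def build_index(root):
    """Map every xml:id in the tree to its element."""
    return {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}


def resolve_msd(ana, index):
    """Follow ana -> fs -> f and return the features as a dict."""
    fs = index[ana.lstrip("#")]
    features = {}
    for ref in fs.get("feats", "").split():
        f = index[ref.lstrip("#")]
        symbol = f.find("symbol")
        features[f.get("name")] = symbol.get("value") if symbol is not None else None
    return features


index = build_index(ET.fromstring(LIBRARIES))
features = resolve_msd("#Ncns", index)
print(features)
```

Building one id index up front keeps each lookup constant-time, which matters when every word token in a novel has to be resolved.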