This work is licenced under the Creative
Commons Attribution 3.0 licence. You should give the
original authors of the digital resource credit. In scientific
publications this means citing the relevant publication or
publications describing the work on this digital resource. The
bibliography is available from the page http://nl.ijs.si/imp.
Sampling for this corpus was performed in two
steps. First, complete documents were selected from the available
historical books and newspapers. In the second step, individual
pages were sampled at random (subject to certain constraints)
from these documents, to arrive at 1100 pages.
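
The two-step procedure can be illustrated with a minimal Python
sketch. The constraints actually used are not stated here, so the
per-document cap (max_per_doc), the function name and the document
names below are purely hypothetical stand-ins.

```python
import random


def sample_pages(documents, total_pages, max_per_doc, seed=0):
    """Sample (doc_id, page_number) pairs from already selected documents.

    documents: dict mapping a document id to its number of pages.
    max_per_doc: hypothetical constraint, not taken from the corpus
    documentation; it caps how many pages one document may contribute.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    pool = []
    for doc_id, n_pages in documents.items():
        pages = range(1, n_pages + 1)
        k = min(max_per_doc, n_pages)
        # Draw at most max_per_doc candidate pages from this document.
        pool.extend((doc_id, page) for page in rng.sample(pages, k))
    if total_pages > len(pool):
        raise ValueError("not enough candidate pages under the constraint")
    # Draw the final sample from the constrained candidate pool.
    return sorted(rng.sample(pool, total_pages))


# Step 1 (document selection) is assumed to have produced this mapping.
docs = {"book_a": 300, "newspaper_b": 12, "book_c": 80}
picked = sample_pages(docs, total_pages=20, max_per_doc=10)
```

Drawing the final sample from a capped pool keeps long books from
dominating the result, which is one plausible reason for constraining
the random selection.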
The transcriptions were hand-corrected to
correspond to the facsimile. Errors in the originals have not
been corrected but are marked up with the tag "Xt".
The texts are segmented into "anonymous blocks",
which are then typed as paragraphs, headings, captions, etc. The
blocks are then (automatically) segmented into sentences, and
these into words, punctuation marks and whitespace.
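
As a rough illustration of the last step, the Python sketch below
splits a sentence into words, punctuation marks and whitespace. It
is not the segmenter used for the corpus; the regular expression and
the token labels are assumptions made for this example.

```python
import re

# One alternative per token type; every character matches exactly one.
TOKEN_RE = re.compile(r"(?P<word>\w+)|(?P<space>\s+)|(?P<punct>[^\w\s])")


def tokenise(sentence):
    """Return (type, text) pairs covering the whole sentence."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(sentence)]


tokens = tokenise("Kaj je to?")
# Joining the token texts gives back the original sentence unchanged.
restored = "".join(text for _, text in tokens)
```

Keeping whitespace as tokens of its own means the original text can
be reconstructed exactly from the token sequence.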
Word-level linguistic annotation comprises the
normalised form of the historical word (lower-case and vowel
diacritics removed), the modernised form of the word, the lemma
and its coarse-grained morphosyntactic description, i.e. its PoS
tag. Extinct words are assigned a gloss giving their closest
contemporary equivalents and the source from which the gloss was
taken. These annotations were first assigned automatically and
then manually corrected.
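
The normalised form described above (lower-casing and removing
vowel diacritics) can be approximated with the following Python
sketch. The vowel set and the function name are assumptions; the
exact rules applied to the corpus may differ.

```python
import unicodedata

VOWELS = set("aeiou")  # assumed vowel inventory for this sketch


def normalise(word):
    """Lower-case a word and strip diacritics from vowels only."""
    # Decompose so that each diacritic becomes a separate combining mark.
    decomposed = unicodedata.normalize("NFD", word.lower())
    kept = []
    after_vowel = False
    for ch in decomposed:
        if unicodedata.combining(ch):
            if after_vowel:
                continue  # drop a mark attached to a vowel
            kept.append(ch)  # keep marks on consonants, e.g. the caron in "č"
        else:
            after_vowel = ch in VOWELS
            kept.append(ch)
    # Recompose the remaining base letters and marks.
    return unicodedata.normalize("NFC", "".join(kept))


example = normalise("Béséda")
```

Only marks on vowels are removed, so consonants such as "č", "š"
and "ž" keep their carons.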
The two-letter language codes follow ISO 639 and
are defined in the language usage element. An exception is the IANA
code "sl-bohoric" designating Slovene written in the Bohorič
alphabet.
Coarse-grained morphosyntactic descriptions
follow the IMP morphosyntactic specification; see http://nl.ijs.si/imp/msd