goo300k

Dept. of Knowledge Technologies, JSI

TEI Header

§file description
§title statement
§title

goo300k reference corpus of historical Slovene
§principal researcher
§name Tomaž Erjavec
§address

Dept. of Knowledge Technologies

Jožef Stefan Institute

Jamova cesta 39

SI-1000 Ljubljana

Slovenia

§statement of responsibility
§name Maja Žorga Dulmin
§responsibility

Linguistic annotation leader.
§statement of responsibility
§name Darja Fišer
§responsibility

Linguistic annotation, preparation of annotator materials.
§statement of responsibility
§name Tina Benčina
§name Katja Cingerle
§name Metod Čepar (ZRC SAZU)
§name Alenka Jelovšek (ZRC SAZU)
§name Urška Kamenšek
§name Nina Mikulin
§name Zala Šmid
§responsibility

Linguistic annotation.
§statement of responsibility
§name Erich Prunč (Karl-Franzens University, Graz)
§responsibility

Principal for digital library AHLib (corpus units "FPG").
§statement of responsibility
§name Alenka Kavčič Čolić (NUK)
§responsibility

NUK principal for EU IMPACT project.
§statement of responsibility
§name Ines Vodopivec, Maša Kodrič, Daša Pokorn (NUK)
§responsibility

OCR correction and markup of text areas in PAGE XML (corpus units "NUK" and "FPGN").
§statement of responsibility
§name Miran Hladnik (FF UNI LJ)
§responsibility

Principal for Wikisource project "Slovene literature classics" (corpus units "WIKI").
§statement of responsibility
§name Domen Kermc
§responsibility

Oversight of transcription correction and conversion of MediaWiki to TEI format (corpus units "WIKI").
§statement of responsibility
§name Matija Ogrin (ZRC SAZU)
§responsibility

ZRC SAZU principal for Google Award project.
§statement of responsibility
§name Kozma Ahačič (ZRC SAZU)
§responsibility

ZRC SAZU principal for OCR correction and annotation.
§statement of responsibility
§name Metod Čepar, Alenka Jelovšek (ZRC SAZU)
§responsibility

OCR correction (corpus units "ZRC").
§edition statement
§edition 1.0
§extent 1100 pages, 293919 words
§publication statement
§distributor
§address

Department of Knowledge Technologies

Jožef Stefan Institute

Jamova cesta 39

SI-1000 Ljubljana

Slovenia

§publication place http://nl.ijs.si/imp
§availability

This work is licenced under the Creative Commons Attribution 3.0 licence. You should give the original authors of the digital resource credit. In scientific publications this means citing the relevant publication or publications describing the work on this digital resource. The bibliography is available from the page http://nl.ijs.si/imp.

§date 2014-01-10
§source description
§citation list
§bibliographic citation

corresponds to = goo168-ZRC_00001-1584
author Dalmatin, Jurij
date 1584
§bibliographic citation

corresponds to = goo168-ZRC_00002-1695
author Janez Svetokriški
date 1695
§bibliographic citation

corresponds to = goo18B-NUK_13105-1768
author Canisius, Petrus; Parhamer, Ignaz; Pohlin, Marko
date 1768
§bibliographic citation

corresponds to = goo18B-NUK_13130-1769
author Sailer, Sebastian; Pohlin, Marko
date 1769
§bibliographic citation

corresponds to = goo18B-NUK_10187-1777
author Pohlin, Marko
date 1777
§bibliographic citation

corresponds to = goo168-ZRC_00003-1784
author Japelj, Jurij
date 1784
§bibliographic citation

corresponds to = goo18B-NUK_13067-1789
author Breznik, Anton
date 1789
§bibliographic citation

corresponds to = goo18B-NUKR10214-1790
author Linhart, Anton Tomaž
date 1790
§bibliographic citation

corresponds to = goo18B-NUK_10224-1794
author Japelj, Jurij
date 1794
§bibliographic citation

corresponds to = goo18B-NUKR10221-1799
author Vodnik, Valentin
date 1799
§bibliographic citation

corresponds to = ioo19A-NUK_10029-1800
date 1800
§bibliographic citation

corresponds to = ioo19A-NUK_07541-1810
author Alvian, Fecit Franciscus
date 1810
§bibliographic citation

corresponds to = goo19A-NUK_10220-1811
author Lhomond, Charles Francois; Vodnik, Valentin
date 1811
§bibliographic citation

corresponds to = goo19A-FPGN04488-1830
author Schmid, Christoph von
date 1830
§bibliographic citation

corresponds to = goo19A-FPGN04554-1836
author Schmid, Christoph von
date 1836
§bibliographic citation

corresponds to = goo19A-FPGN04557-1841
author Schmid, Christoph von
date 1841
§bibliographic citation

corresponds to = goo19A-NUKP14041-1843
date 1843
§bibliographic citation

corresponds to = goo19A-NUKP14041-1844
date 1844
§bibliographic citation

corresponds to = ioo19A-NUKP14041-1845
date 1845
§bibliographic citation

corresponds to = ioo19A-NUKP14041-1846
date 1846
§bibliographic citation

corresponds to = goo19A-FPG_00008-1847
author Zschokke, Heinrich
date 1847
§bibliographic citation

corresponds to = ioo19A-NUKP14041-1847
date 1847
§bibliographic citation

corresponds to = goo19A-FPG_00009-1848
author Zschokke, Heinrich
date 1848
§bibliographic citation

corresponds to = goo19A-FPG_04401-1848
author Schiller, Friedrich
date 1848
§bibliographic citation

corresponds to = ioo19A-NUKP14041-1848
date 1848
§bibliographic citation

corresponds to = ioo19A-FPGN06523-1849
author Campe, Joachim Heinrich
date 1849
§bibliographic citation

corresponds to = ioo19A-NUKP14041-1849
date 1849
§bibliographic citation

corresponds to = goo19B-FPG_00012-1850
author N.N.
date 1850
§bibliographic citation

corresponds to = goo19B-FPG_04260-1850
author Zschokke, Heinrich
date 1850
§bibliographic citation

corresponds to = goo19B-FPGN00016-1850
author Schmid, Christoph von
date 1850
§bibliographic citation

corresponds to = goo19B-FPG_00017-1851
author Schmid, Christoph von
date 1851
§bibliographic citation

corresponds to = goo19B-NUKP14041-1851
date 1851
§bibliographic citation

corresponds to = goo19B-FPG_00018-1852
author Jais, Aegidius
date 1852
§bibliographic citation

corresponds to = goo19B-FPG_00020-1852
author N.N.
date 1852
§bibliographic citation

corresponds to = goo19B-FPG_00021-1853
author Wilhelm, J.
date 1853
§bibliographic citation

corresponds to = goo19B-FPG_00026-1853
author Schmid, Christoph von
date 1853
§bibliographic citation

corresponds to = goo19B-FPGN00023-1853
author Bauberger, Wilhelm
date 1853
§bibliographic citation

corresponds to = goo19B-FPGN04375-1853
author Beecher-Stowe, Harriet Elisabeth
date 1853
§bibliographic citation

corresponds to = goo19B-FPG_00030-1854
author Anibas, Georg
date 1854
§bibliographic citation

corresponds to = goo19B-FPG_00031-1854
author Jais, Aegidius
date 1854
§bibliographic citation

corresponds to = goo19B-FPG_00040-1855
author Schmid, Christoph von
date 1855
§bibliographic citation

corresponds to = goo19B-NUKP14041-1855
date 1855
§bibliographic citation

corresponds to = goo19B-FPG_00045-1856
author Rohlwes, Johann Nikolaus
date 1856
§bibliographic citation

corresponds to = goo19B-NUKP14041-1856
date 1856
§bibliographic citation

corresponds to = goo19B-FPG_00047-1857
author Donin, Ludwig
date 1857
§bibliographic citation

corresponds to = goo19B-FPG_00049-1857
author Schmid, Christoph von
date 1857
§bibliographic citation

corresponds to = goo19B-NUKP14041-1857
date 1857
§bibliographic citation

corresponds to = goo19B-FPG_00061-1861
author Schiller, Friedrich
date 1861
§bibliographic citation

corresponds to = goo19B-FPG_00063-1862
author Schiller, Friedrich
date 1862
§bibliographic citation

corresponds to = goo19B-FPG_00066-1863
author Andersen, Hans Christian
date 1863
§bibliographic citation

corresponds to = goo19B-FPG_00070-1864
author Hoffmann, Franz
date 1864
§bibliographic citation

corresponds to = goo19B-FPG_00082-1866
author Schiller, Friedrich
date 1866
§bibliographic citation

corresponds to = goo19B-FPGN00080-1866
author N.N.
date 1866
§bibliographic citation

corresponds to = goo19B-FPGN00084-1866
author Schmid, Christoph von
date 1866
§bibliographic citation

corresponds to = goo19B-FPG_00086-1867
author Raupach, Ernst
date 1867
§bibliographic citation

corresponds to = goo19B-FPG_06328-1867
author Schamberger, František Ferdinand
date 1867
§bibliographic citation

corresponds to = goo19B-FPGN00085-1867
author Fellöcker, Sigmund
date 1867
§bibliographic citation

corresponds to = goo19B-FPGN00088-1867
author Wiseman, Nicholas Patrick Stephen
date 1867
§bibliographic citation

corresponds to = goo19B-FPG_00093-1869
author Schödler, Friedrich Karl Ludwig
date 1869
§bibliographic citation

corresponds to = goo19B-FPG_00094-1869
author Schödler, Friedrich Karl Ludwig
date 1869
§bibliographic citation

corresponds to = goo19B-FPG_00095-1869
author Schödler, Friedrich Karl Ludwig
date 1869
§bibliographic citation

corresponds to = goo19B-FPG_00106-1871
author Schödler, Friedrich Karl Ludwig
date 1871
§bibliographic citation

corresponds to = goo19B-FPG_00110-1872
author Schmid, Christoph von
date 1872
§bibliographic citation

corresponds to = goo19B-FPG_00125-1875
author Schödler, Friedrich Karl Ludwig
date 1875
§bibliographic citation

corresponds to = goo19B-FPG_00128-1875
author Trientl, Adolf
date 1875
§bibliographic citation

corresponds to = goo19B-FPG_04640-1875
author Schödler, Friedrich Karl Ludwig
date 1875
§bibliographic citation

corresponds to = goo19B-FPG_00155-1880
author Schmid, Christoph von
date 1880
§bibliographic citation

corresponds to = goo19B-FPG_02738-1881
author Musäus, Johann Karl August
date 1881
§bibliographic citation

corresponds to = goo19B-FPG_04299-1883
author Schmid, Christoph von
date 1883
§bibliographic citation

corresponds to = goo19B-FPG_07725-1883
author Mosenthal, Salomon Hermann
date 1883
§bibliographic citation

corresponds to = goo19B-FPG_00188-1884
author Schmid, Christoph von
date 1884
§bibliographic citation

corresponds to = goo19B-FPG_00192-1885
author Anibas, Georg
date 1885
§bibliographic citation

corresponds to = goo19B-FPG_00195-1885
author Močnik, Franc
date 1885
§bibliographic citation

corresponds to = goo19B-FPG_00202-1886
author Seeberg, A.
date 1886
§bibliographic citation

corresponds to = goo19B-FPG_00203-1886
author Baumbach, Rudolf
date 1886
§bibliographic citation

corresponds to = goo19B-FPG_05783-1886
author Kümmel, Max
date 1886
§bibliographic citation

corresponds to = goo19B-FPG_00211-1887
author Huber, Josef
date 1887
§bibliographic citation

corresponds to = goo19B-FPG_00212-1887
author Grimm, Jakob and Wilhelm
date 1887
§bibliographic citation

corresponds to = goo19B-FPG_00214-1887
author Mich, Josef
date 1887
§bibliographic citation

corresponds to = goo19B-FPG_00234-1891
author Marquardt, Paul
date 1891
§bibliographic citation

corresponds to = goo19B-FPG_00235-1891
author Goethe, Hermann
date 1891
§bibliographic citation

corresponds to = goo19B-FPG_00237-1891
author Morre, Carl
date 1891
§bibliographic citation

corresponds to = goo19B-FPG_00261-1896
author Wiedemann, Franz
date 1896
§bibliographic citation

corresponds to = goo19B-FPG_00265-1897
author Anderl, Adalbert
date 1897
§bibliographic citation

corresponds to = goo19B-FPG_02747-1898
author Wagner, Richard
date 1898
§bibliographic citation

corresponds to = goo19B-FPG_04194-1898
author May, Karl
date 1898
§bibliographic citation

corresponds to = goo19B-FPG_04233-1898
author Krauss, Victor von
date 1898
§bibliographic citation

corresponds to = goo19B-FPG_07908-1898
author May, Karl
date 1898
§bibliographic citation

corresponds to = goo19B-FPG_06534-1899
author Spillmann, Joseph
date 1899
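
The corpus-unit identifiers in the citation list above look regular enough to be split mechanically. The following is a minimal Python sketch, not part of the corpus distribution. It assumes, purely by inspection of the listed identifiers, that each one consists of a three-letter prefix, a three-character period code, a hyphen, a source siglum padded to four characters with an underscore, a five-digit number, a hyphen, and a four-digit year. The names UnitId and parse_unit_id and the field names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class UnitId(NamedTuple):
    """Fields of a corpus-unit identifier (field names are illustrative)."""
    prefix: str   # e.g. "goo" or "ioo"
    period: str   # e.g. "168", "18B", "19A", "19B"
    siglum: str   # e.g. "ZRC", "NUK", "NUKP", "FPG", "FPGN"
    number: str   # five-digit number, kept as text to preserve leading zeros
    year: int     # matches the "date" line of the citation


# Pattern inferred from the identifiers listed in this header (an assumption).
_UNIT_ID = re.compile(
    r"^([a-z]{3})(\d{2}[0-9A-Z])-([A-Z]{3}[A-Z_])(\d{5})-(\d{4})$"
)


def parse_unit_id(text: str) -> Optional[UnitId]:
    """Split an identifier into its parts, or return None if it does not fit."""
    match = _UNIT_ID.match(text)
    if match is None:
        return None
    prefix, period, siglum, number, year = match.groups()
    # The underscore only pads three-letter sigla to four characters.
    return UnitId(prefix, period, siglum.rstrip("_"), number, int(year))
```

Calling parse_unit_id("goo18B-NUK_13105-1768") yields the siglum "NUK" and the year 1768; strings that do not fit the assumed pattern yield None rather than raising.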
§encoding description
§project description

Google research award ‘Language Models for Historical Slovene’ (2011–2012).

§project description

EU project IMPACT: ‘Improving Access to Text’ (2010–2011).

National and University Library and Jožef Stefan Institute: Production of Ground Truth Dataset of Historical Slovene (for "NUK" and "FPGN" sigla).

§project description

AHLib Austrian Academy of Sciences and Jožef Stefan Institute: selection of books, obtaining the images (for "FPG" sigla).

§sampling declaration

Sampling for this corpus was performed in two steps. First, complete documents were selected from the available historical books and newspapers. In the second step individual pages were randomly (but subject to certain constraints) sampled from the documents, to arrive at 1100 pages.
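
The second sampling step can be illustrated with a short Python sketch. It is not the procedure actually used for the corpus: the header does not state which constraints applied, so the per-document cap (max_per_doc) is a hypothetical stand-in, and the function name, parameters and document identifiers are invented for illustration.

```python
import random
from typing import Dict, List, Tuple


def sample_pages(doc_pages: Dict[str, int], target: int,
                 max_per_doc: int, seed: int = 0) -> List[Tuple[str, int]]:
    """Randomly pick `target` (document, page) pairs.

    doc_pages maps a document id to its number of pages (step one: the
    selected documents). The only constraint modelled here is an assumed
    cap of `max_per_doc` pages from any single document.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    candidates: List[Tuple[str, int]] = []
    for doc_id, n_pages in sorted(doc_pages.items()):
        pages = range(1, n_pages + 1)
        k = min(max_per_doc, n_pages)
        # Draw at most `max_per_doc` distinct pages from this document.
        candidates.extend((doc_id, page) for page in rng.sample(pages, k))
    if len(candidates) < target:
        raise ValueError("not enough pages available under the constraint")
    # Draw the final sample from the capped candidate pool.
    return sorted(rng.sample(candidates, target))
```

With a cap in place no single long document can dominate the sample, which is one plausible reading of "subject to certain constraints"; other constraints (for example by period or text type) would need additional logic.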

§editorial practice declaration
§correction principles

The transcriptions were hand-corrected to correspond to the facsimile. Errors in the originals have not been corrected but are marked up with the tag "Xt".

§quotation

Quotation marks have been left in the text, and are not explicitly marked up.

§segmentation

The texts are segmented into "anonymous blocks", which are then typed as paragraphs, headings, captions, etc. The blocks are then (automatically) segmented into sentences and these into words, punctuation marks and whitespace.

§interpretation

Word-level linguistic annotation comprises the normalised form of the historical word (lower-cased and with vowel diacritics removed), the modernised form of the word, the lemma and its coarse-grained morphosyntactic description, i.e. its PoS tag. Extinct words are assigned a gloss giving the closest contemporary equivalents and the source from which this gloss was taken. These annotations were first assigned automatically and then manually corrected.
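
The normalisation step described above can be sketched in Python as follows. This is an illustration under assumptions, not the project's own tool: it takes "vowel diacritics" to mean combining marks on the letters a, e, i, o, u, and it leaves diacritics on consonants (such as č, š, ž) untouched. The function name normalise is hypothetical.

```python
import unicodedata

VOWELS = set("aeiou")


def normalise(word: str) -> str:
    """Lower-case a word and strip diacritics from vowels only."""
    # Decompose so that each diacritic becomes a separate combining mark.
    decomposed = unicodedata.normalize("NFD", word.lower())
    kept = []
    after_vowel = False
    for ch in decomposed:
        if unicodedata.combining(ch):
            if after_vowel:
                continue  # drop a mark attached to a vowel
            kept.append(ch)  # keep marks on consonants, e.g. the caron of č
        else:
            kept.append(ch)
            after_vowel = ch in VOWELS
    # Recompose so that kept marks form single characters again.
    return unicodedata.normalize("NFC", "".join(kept))
```

Working on the decomposed form means the same code handles input in either precomposed or decomposed Unicode; characters without a decomposition, such as the long s of the Bohorič alphabet, pass through unchanged.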

§standard values

The two-letter language codes follow ISO 639 and are defined in the language usage element. An exception is the IANA code "sl-bohoric" designating Slovene written in the Bohorič alphabet.

Coarse-grained morphosyntactic descriptions follow the IMP morphosyntactic specification, cf. http://nl.ijs.si/imp/msd

§tagging declaration
§namespace

name = http://www.tei-c.org/ns/1.0
§tag usage

gi = facsimile occurs = 89
facsimile
§tag usage

gi = surface occurs = 1100
surface
§tag usage

gi = graphic occurs = 4400
graphic
§tag usage

gi = text occurs = 89
text
§tag usage

gi = body occurs = 89
text body
§tag usage

gi = div occurs = 1100
text division
description

Each division represents one page.
§tag usage

gi = lb occurs = 21124
line break
§tag usage

gi = ab occurs = 8397
anonymous block
§tag usage

gi = gap occurs = 586
gap
§tag usage

gi = s occurs = 22065
s-unit
§tag usage

gi = w occurs = 402650
word
description

Linguistically annotated word token. Attributes are @lemma (base form) and @ana (IMP PoS tag).
§tag usage

gi = pc occurs = 64867
punctuation character
§tag usage

gi = c occurs = 295145
character
description

This element contains only space characters to denote whitespace in text.
§tag usage

gi = choice occurs = 106998
choice
description

Marks a choice between the original and modernised form of a word.
§tag usage

gi = orig occurs = 106998
original form
description

The original form of the word.
§tag usage

gi = reg occurs = 106998
regularization
description

Modernised form of a word.
§tag usage

gi = desc occurs = 9178
description
description

For a facsimile surface gives a short description of the page, and for archaic words contains the gloss of the word and the source of the gloss.
§tag usage

gi = gloss occurs = 8078
gloss
description

Gloss of an archaic word.
§tag usage

gi = bibl occurs = 8078
bibliographic citation
description

Source of the gloss for an archaic word.
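
The element counts in the tag usage list above can be checked against a TEI file with a few lines of Python. The sketch below is an assumption-laden illustration: it uses only the standard library, the function names are hypothetical, and the small sample document in the usage note is invented rather than taken from the corpus. Only the namespace is taken from the tagging declaration.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict, Tuple

# Namespace as given in the tagging declaration of this header.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def count_tei_elements(xml_text: str) -> Counter:
    """Count elements in the TEI namespace by local name (the 'gi')."""
    root = ET.fromstring(xml_text)
    prefix = "{" + TEI_NS + "}"
    counts: Counter = Counter()
    for element in root.iter():  # includes the root element itself
        tag = element.tag
        if isinstance(tag, str) and tag.startswith(prefix):
            counts[tag[len(prefix):]] += 1
    return counts


def mismatches(declared: Dict[str, int],
               actual: Counter) -> Dict[str, Tuple[int, int]]:
    """Return {gi: (declared, found)} for every count that differs."""
    return {gi: (n, actual.get(gi, 0))
            for gi, n in declared.items() if actual.get(gi, 0) != n}
```

For an invented fragment such as <text xmlns="http://www.tei-c.org/ns/1.0"><body><div><ab><s><w>a</w><c> </c><w>b</w><pc>.</pc></s></ab></div></body></text>, count_tei_elements reports two w elements and one each of the others; passing the declared "occurs" values to mismatches then lists any element whose count disagrees.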
§classification declarations
§taxonomy

id = Text.taxonomy
§category

id = Text.medium
category description
term

medium
category

id = Text.manuscript
category description
term

manuscript
category

id = Text.book
category description
term

book
category

id = Text.magazine
category description
term

magazine
category

id = Text.newspaper
category description
term

newspaper
§category

id = Text.type
category description
term

text type
category

id = Text.fiction
category description
term

fiction
category

id = Text.prose
category description
term

prose
category

id = Text.drama
category description
term

drama
category

id = Text.poetry
category description
term

poetry
category

id = Text.nonfiction
category description
term

non-fiction
category

id = Text.religious
category description
term

religious
§category

id = Text.status
category description
term

original/translation
category

id = Text.original
category description
term

original
category

id = Text.translation
category description
term

translation
§feature system declaration
§feature library

§feature

name = CATEGORY id = N0-en corresponds to = S0-sl
symbolic value

value = Noun
§feature

name = Type id = N1.c-en corresponds to = S1.o-sl
symbolic value

value = common
§feature

name = Type id = N1.p-en corresponds to = S1.l-sl
symbolic value

value = proper
§feature

name = Gender id = N2.m-en corresponds to = S2.m-sl
symbolic value

value = masculine
§feature

name = Gender id = N2.f-en corresponds to = S2.z-sl
symbolic value

value = feminine
§feature

name = Gender id = N2.n-en corresponds to = S2.s-sl
symbolic value

value = neuter
§feature

name = CATEGORY id = V0-en corresponds to = G0-sl
symbolic value

value = Verb
§feature

name = Type id = V1.m-en corresponds to = G1.g-sl
symbolic value

value = main
§feature

name = Type id = V1.a-en corresponds to = G1.p-sl
symbolic value

value = auxiliary
§feature

name = Aspect id = V2.e-en corresponds to = G2.d-sl
symbolic value

value = perfective
§feature

name = Aspect id = V2.p-en corresponds to = G2.n-sl
symbolic value

value = progressive
§feature

name = Aspect id = V2.b-en corresponds to = G2.v-sl
symbolic value

value = biaspectual
§feature

name = CATEGORY id = A0-en corresponds to = P0-sl
symbolic value

value = Adjective
§feature

name = Type id = A1.g-en corresponds to = P1.p-sl
symbolic value

value = general
§feature

name = Type id = A1.s-en corresponds to = P1.s-sl
symbolic value

value = possessive
§feature

name = Type id = A1.p-en corresponds to = P1.d-sl
symbolic value

value = participle
§feature

name = Degree id = A2.p-en corresponds to = P2.n-sl
symbolic value

value = positive
§feature

name = Degree id = A2.c-en corresponds to = P2.p-sl
symbolic value

value = comparative
§feature

name = Degree id = A2.s-en corresponds to = P2.s-sl
symbolic value

value = superlative
§feature

name = CATEGORY id = R0-en corresponds to = R0-sl
symbolic value

value = Adverb
§feature

name = Type id = R1.g-en corresponds to = R1.s-sl
symbolic value

value = general
§feature

name = Type id = R1.r-en corresponds to = R1.d-sl
symbolic value

value = participle
§feature

name = Degree id = R2.p-en corresponds to = R2.n-sl
symbolic value

value = positive
§feature

name = Degree id = R2.c-en corresponds to = R2.r-sl
symbolic value

value = comparative
§feature

name = Degree id = R2.s-en corresponds to = R2.s-sl
symbolic value

value = superlative
§feature

name = CATEGORY id = P0-en corresponds to = Z0-sl
symbolic value

value = Pronoun
§feature

name = CATEGORY id = M0-en corresponds to = K0-sl
symbolic value

value = Numeral
§feature

name = Form id = M1.d-en corresponds to = K1.a-sl
symbolic value

value = digit
§feature

name = Form id = M1.r-en corresponds to = K1.r-sl
symbolic value

value = roman
§feature

name = Form id = M1.l-en corresponds to = K1.b-sl
symbolic value

value = letter
§feature

name = CATEGORY id = S0-en corresponds to = D0-sl
symbolic value

value = Preposition
§feature

name = CATEGORY id = C0-en corresponds to = V0-sl
symbolic value

value = Conjunction
§feature

name = CATEGORY id = Q0-en corresponds to = L0-sl
symbolic value

value = Particle
§feature

name = CATEGORY id = I0-en corresponds to = M0-sl
symbolic value

value = Interjection
§feature

name = CATEGORY id = Y0-en corresponds to = O0-sl
symbolic value

value = Abbreviation
§feature

name = CATEGORY id = X0-en corresponds to = N0-sl
symbolic value

value = Residual
§feature

name = Type id = X1.f-en corresponds to = N1.j-sl
symbolic value

value = foreign
§feature

name = Type id = X1.t-en corresponds to = N1.t-sl
symbolic value

value = typo
§feature

name = Type id = X1.p-en corresponds to = N1.p-sl
symbolic value

value = program
§feature-value library

§feature structure

id = Ncm corresponds to = Som
CATEGORY = Noun, Type = common, Gender = masculine
§feature structure

id = Ncf corresponds to = Soz
CATEGORY = Noun, Type = common, Gender = feminine
§feature structure

id = Ncn corresponds to = Sos
CATEGORY = Noun, Type = common, Gender = neuter
§feature structure

id = Npm corresponds to = Slm
CATEGORY = Noun, Type = proper, Gender = masculine
§feature structure

id = Npf corresponds to = Slz
CATEGORY = Noun, Type = proper, Gender = feminine
§feature structure

id = Npn corresponds to = Sls
CATEGORY = Noun, Type = proper, Gender = neuter
§feature structure

id = Va corresponds to = Gp
CATEGORY = Verb, Type = auxiliary
§feature structure

id = Vme corresponds to = Ggd
CATEGORY = Verb, Type = main, Aspect = perfective
§feature structure

id = Vmp corresponds to = Ggn
CATEGORY = Verb, Type = main, Aspect = progressive
§feature structure

id = Vmb corresponds to = Ggv
CATEGORY = Verb, Type = main, Aspect = biaspectual
§feature structure

id = Agp corresponds to = Ppn
CATEGORY = Adjective, Type = general, Degree = positive
§feature structure

id = Agc corresponds to = Ppp
CATEGORY = Adjective, Type = general, Degree = comparative
§feature structure

id = Ags corresponds to = Pps
CATEGORY = Adjective, Type = general, Degree = superlative
§feature structure

id = App corresponds to = Pdn
CATEGORY = Adjective, Type = participle, Degree = positive
§feature structure

id = Asp corresponds to = Psn
CATEGORY = Adjective, Type = possessive, Degree = positive
§feature structure

id = Rgp corresponds to = Rsn
CATEGORY = Adverb, Type = general, Degree = positive
§feature structure

id = Rgc corresponds to = Rsr
CATEGORY = Adverb, Type = general, Degree = comparative
§feature structure

id = Rgs corresponds to = Rss
CATEGORY = Adverb, Type = general, Degree = superlative
§feature structure

id = Rr corresponds to = Rd
CATEGORY = Adverb, Type = participle
§feature structure

id = P corresponds to = Z
CATEGORY = Pronoun
§feature structure

id = Md corresponds to = Ka
CATEGORY = Numeral, Form = digit
§feature structure

id = Mr corresponds to = Kr
CATEGORY = Numeral, Form = roman
§feature structure

id = Ml corresponds to = Kb
CATEGORY = Numeral, Form = letter
§feature structure

id = S corresponds to = D
CATEGORY = Preposition
§feature structure

id = C corresponds to = V
CATEGORY = Conjunction
§feature structure

id = Q corresponds to = L
CATEGORY = Particle
§feature structure

id = I corresponds to = M
CATEGORY = Interjection
§feature structure

id = Y corresponds to = O
CATEGORY = Abbreviation
§feature structure

id = X corresponds to = N
CATEGORY = Residual
§feature structure

id = Xf corresponds to = Nj
CATEGORY = Residual, Type = foreign
§feature structure

id = Xt corresponds to = Nt
CATEGORY = Residual, Type = typo
§feature structure

id = Xp corresponds to = Np
CATEGORY = Residual, Type = program
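
The feature library and feature-value library above define the PoS tags positionally: the first letter gives the category and each following letter fills the next attribute of that category. A minimal Python sketch of decoding the English tags is given below. The tables are transcribed from this header; the function name decode_msd is hypothetical, and the decoder is more permissive than the feature-value library, since it accepts any valid combination of codes (for instance "Apc") even if that tag is not listed.

```python
from typing import Dict, List, Tuple

# First letter of a tag -> category, as in the feature library.
CATEGORIES: Dict[str, str] = {
    "N": "Noun", "V": "Verb", "A": "Adjective", "R": "Adverb",
    "P": "Pronoun", "M": "Numeral", "S": "Preposition", "C": "Conjunction",
    "Q": "Particle", "I": "Interjection", "Y": "Abbreviation", "X": "Residual",
}

DEGREE = {"p": "positive", "c": "comparative", "s": "superlative"}

# Category -> ordered attribute slots with their one-letter codes.
ATTRIBUTES: Dict[str, List[Tuple[str, Dict[str, str]]]] = {
    "N": [("Type", {"c": "common", "p": "proper"}),
          ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"})],
    "V": [("Type", {"m": "main", "a": "auxiliary"}),
          ("Aspect", {"e": "perfective", "p": "progressive",
                      "b": "biaspectual"})],
    "A": [("Type", {"g": "general", "s": "possessive", "p": "participle"}),
          ("Degree", DEGREE)],
    "R": [("Type", {"g": "general", "r": "participle"}),
          ("Degree", DEGREE)],
    "M": [("Form", {"d": "digit", "r": "roman", "l": "letter"})],
    "X": [("Type", {"f": "foreign", "t": "typo", "p": "program"})],
}


def decode_msd(tag: str) -> Dict[str, str]:
    """Expand a tag such as 'Ncm' into its attribute-value pairs."""
    if not tag or tag[0] not in CATEGORIES:
        raise ValueError(f"unknown category in tag {tag!r}")
    features = {"CATEGORY": CATEGORIES[tag[0]]}
    slots = ATTRIBUTES.get(tag[0], [])
    codes = tag[1:]
    if len(codes) > len(slots):
        raise ValueError(f"too many positions in tag {tag!r}")
    # Trailing positions may be omitted, as in 'Va' or 'Rr'.
    for (name, values), code in zip(slots, codes):
        if code not in values:
            raise ValueError(f"unknown {name} code {code!r} in tag {tag!r}")
        features[name] = values[code]
    return features
```

For example, decode_msd("Ncm") gives CATEGORY = Noun, Type = common, Gender = masculine, matching the corresponding feature structure in the library.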
§text-profile description
§language usage
§language

ident = sl
§term

Slovene
§language

ident = sl-bohoric
§term

Slovene written using the Bohorič alphabet
§language

ident = sl-dajnko
§term

Slovene in Dajnko alphabet
§language

ident = sl-metelko
§term

Slovene in Metelko alphabet
§language

ident = de
§term

German
§language

ident = la
§term

Latin
§language

ident = en
§term

English
§revision description
§change Tomaž Erjavec: Corpus merge & driver file.
§date 2014-01-10


Date: 2014-01-10

Copyright for the text of this edition is governed by the Creative Commons Attribution 3.0 licence.