Natural language server
Dept. of Knowledge Technologies
Jožef Stefan Institute



IMP language resources for historical Slovene

On this page you can find openly accessible language resources for historical Slovene. Most are from the 19th century, although older books are also included, all the way back to a sample from Dalmatin's Bible from 1584. If you want to read the books, have a look at the digital library with over 650 units or 45,000 pages; the glossary shows archaic spellings of words and their glosses in contemporary Slovene. The concordancer supports the use of corpus-based methods for diachronic studies of Slovene, while the integral resources encoded in XML are useful for the development of Human Language Technologies for processing historical Slovene, and can be downloaded from the CLARIN.SI repository. The motivation, compilation and structure of the IMP resources for historical Slovene are explained in detail in the bibliography of the project.

Digital Library

Indexes sorted by: title, author, year (showing title pages) and id
The digital library converted to an annotated (automatically modernised, lemmatised and tagged) corpus, IMP, in NoSketch Engine

The digital library contains over 650 units (books, newspapers and some manuscripts) from the end of the 16th century to 1918, with the majority from 1850 onwards. Each document contains a header (metadata), a facsimile and a transcription. The identifiers of the documents mark their origin:

WIKI: the largest part of the library, comprising books, newspaper articles and installments, as well as some manuscripts from the Wikisource project ‘Slovene literary classics’, containing literature by Slovene authors (1776–1918);
FPG: the AHLib collection of books translated into Slovene from German (1848–1918);
NUK: older books (1750-1820) and selected issues of one newspaper (1850-1900) prepared in the scope of the EU IMPACT project by the National and University Library of Slovenia (NUK);
ZRC: small samples of three religious texts (1584, 1695, 1784) prepared by the Scientific Research Center of the Slovene Academy of Sciences and Arts.

Corpora of historical Slovene

The reference corpus of historical Slovene goo300k contains the text from 1,100 pages (about 300,000 tokens) sampled from the IMP collection with hand-validated linguistic annotation. The current version of the corpus is documented in its TEI header. The larger IMP corpus contains the complete IMP text collection and is automatically annotated. On the CLARIN.SI installation of NoSketch Engine you can also search the goo300k and IMP corpora.

The corpora are structurally annotated for texts (with bibliographic information) and pages (with links to the facsimile and the page in the digital library). Each word token (e.g. "lubesni") in the corpora is annotated with:

modernised form ("ljubezni");
lemma ("ljubezen");
MSD tag ("Ncm"); the tagset is defined in the IMP morphosyntactic specifications.
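
As a rough illustration of the per-token annotation listed above, the following Python sketch models one token as a small record. The class name, field names and the helper function are assumptions made for this example only; they are not taken from the IMP XML schema or the tabular format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated word token (illustrative field names, not the IMP schema)."""
    form: str        # historical word form as transcribed
    modernised: str  # modernised form
    lemma: str       # lemma
    msd: str         # MSD tag


def has_archaic_spelling(token: Token) -> bool:
    """True when the transcribed form differs from its modernised form."""
    return token.form.lower() != token.modernised.lower()


# The example token given in the text above.
example = Token(form="lubesni", modernised="ljubezni", lemma="ljubezen", msd="Ncm")
print(example, has_archaic_spelling(example))
```

A record like this makes it easy to filter a corpus for tokens whose spelling has changed, which is one typical starting point for diachronic studies.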

Lexicon of historical Slovene

The lexicon was created automatically from an annotated corpus (4 million tokens) and contains only attested and manually verified word-forms and examples of use. A dictionary entry contains the modern-day lemma with associated information, i.e. the part-of-speech and, for archaic words, their closest modern equivalents. The lemma is followed by its modern word-forms, and each of these lists all its attested historical word-forms with examples of usage.
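
The nesting described above (lemma, then modern word-form, then historical word-form, then examples) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function name, the tuple layout and the placeholder example strings are hypothetical, and the part-of-speech and modern equivalents of a real entry are left out. It is not the procedure actually used to build the lexicon.

```python
from collections import defaultdict


def build_lexicon(tokens):
    """Group annotated tokens into lemma -> modern form -> historical form -> examples.

    `tokens` is an iterable of (historical_form, modernised_form, lemma, example)
    tuples; this layout is an assumption made for the sketch.
    """
    lexicon = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for historical, modern, lemma, example in tokens:
        lexicon[lemma][modern][historical].append(example)
    return lexicon


# Placeholder input: the word forms come from the example on this page,
# the usage examples are dummy strings, not corpus citations.
sample = [
    ("lubesni", "ljubezni", "ljubezen", "placeholder context A"),
    ("lubesni", "ljubezni", "ljubezen", "placeholder context B"),
]

lexicon = build_lexicon(sample)
for lemma, modern_forms in lexicon.items():
    for modern, historical_forms in modern_forms.items():
        for historical, examples in historical_forms.items():
            print(lemma, modern, historical, len(examples))
```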

The current version of the lexicon is described in its TEI header and is available for browsing as a set of statically mounted HTML pages with hyperlinks into the concordancer, digital library and relevant on-line dictionaries. As the complete lexicon is rather large and contains words that will not be of interest to most users, the lexicon is made available in several sizes:

small lexicon
contains only archaic words, where each word has associated contemporary equivalents;
medium lexicon
contains only those entries where at least one word-form has an archaic spelling;
large lexicon
contains all entries except for uninteresting ones, such as numerals, typos, etc.;
complete lexicon
contains all entries.

Download

The IMP resources are also available for download via the CLARIN.SI repository:

Reference corpus of historical Slovene goo300k 1.2:
hdl.handle.net/11356/1025
Digital library and corpus of historical Slovene IMP 1.1:
hdl.handle.net/11356/1031
Word IMP corpus n-grams 2.0:
hdl.handle.net/11356/1194
Lexicon of historical Slovene imp25k 1.1:
hdl.handle.net/11356/1032

If you use the above resources, please cite them as given in the CLARIN.SI repository. We would also appreciate it if you cited the following reference publication:

Tomaž Erjavec. 2015. The IMP historical Slovene language resources. Language resources and evaluation 49/3, 753-775. doi: 10.1007/s10579-015-9294-7.

The encodings of the library, corpus and lexicon all follow the IMP XML schema, which is based on the TEI P5 Guidelines. The corpus and lexicon are also available in a simpler (tabular) format containing the basic information from the source XML.

You can read more about the purpose, compilation and structure of the resources in the IMP bibliography.

Acknowledgements

We wish to thank the following people for collaborating in the compilation of the IMP language resources: Kozma Ahačič, Tina Benčina, Katja Cingerle, Metod Čepar, Darja Fišer, Miran Hladnik, Alenka Jelovšek, Urška Kamenšek, Alenka Kavčič Čolić, Domen Kermc, Maša Kodrič, Simon Krek, Nina Mikulin, Matija Ogrin, Daša Pokorn, Erich Prunč, Zala Šmid, Ines Vodopivec and Maja Žorga Dulmin.

The work was supported by EU IP IMPACT “Improving Access to Text” and the Google research award in the humanities “Developing Language Models of Historical Slovene”.

Bibliography

Tomaž Erjavec. 2015. The IMP historical Slovene language resources. Language resources and evaluation 49/3, 753-775. doi: 10.1007/s10579-015-9294-7. [PDF]

Yves Scherrer, Tomaž Erjavec. 2015. Modernising historical Slovene words. Natural Language Engineering. doi: 10.1017/S1351324915000236.

Tomaž Erjavec. 2014. The IMP project: developing resources for historical Slovene. Talk given at the First ENeL Workshop. September 29, 2014, Bled [slides].

Yves Scherrer, Tomaž Erjavec. 2013. Modernizing historical Slovene words with character-based SMT. 4th Workshop on Balto-Slavic Natural Language Processing, August 8-9, Sofia, Bulgaria. ACL 2013. pp. 58-62.

Tomaž Erjavec. 2012. The goo300k corpus of historical Slovene. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul.

Tom Kenter, Tomaž Erjavec, Maja Žorga Dulmin, Darja Fišer. 2012. Lexicon construction and corpus annotation of historical language with the CoBaLT editor. In Proceedings of the EACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Avignon, France, April. Association for Computational Linguistics.

Ines Jerele, Tomaž Erjavec, Daša Pokorn, Alenka Kavčič-Čolić. 2012. Optical Character Recognition of Historical Texts: End-User Focused Research for Slovenian Books and Newspapers from the 18th and 19th Century. Review of the National Center for Digitization 21/2012, Faculty of Mathematics, Belgrade.

Tomaž Erjavec. Automatic linguistic annotation of historical language: ToTrTaLe and XIX century Slovene. Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, 2011, Portland.

Tomaž Erjavec, Christoph Ringlstetter, Maja Žorga, Annette Gotscharek. A lexicon for processing archaic language: the case of XIXth century Slovene. Proceedings of WoLeR: ESSLLI Workshop on Lexical Resources, 2011, Ljubljana.

Tomaž Erjavec, Christoph Ringlstetter, Maja Žorga, Annette Gotscharek. Towards a Lexicon of XIXth Century Slovene. Proceedings of the Seventh Language Technologies Conference Ljubljana, 2010.

Tomaž Erjavec, Ines Jerele, Maša Kodrič. 2011. Izdelava korpusa starejših slovenskih besedil v okviru projekta IMPACT. In: KRANJC, Simona (ed.). Meddisciplinarnost v slovenistiki, (Obdobja, Simpozij, = Symposium, 30). Ljubljana: Znanstvena založba Filozofske fakultete, 2011, pp. 41-47.

Tomaž Erjavec. Slovenska prevodna književnost 1848-1918 : digitalna knjižnica in korpus AHLib. In: KRANJC, Simona (ed.). Meddisciplinarnost v slovenistiki, (Obdobja, Simpozij, = Symposium, 30). Ljubljana: Znanstvena založba Filozofske fakultete, 2011, pp. 33-40. [COBISS.SI-ID 25362215]

Local copy of papers.

Further links

AHLib: digital library and corpus of historical Slovene books
JOS: language resources for contemporary Slovene
Text Encoding Initiative and TEI P5