IMP language resources for historical Slovene
- Indexes sorted by: title, author, year (showing title pages) and id
- The digital library converted to the annotated (automatically modernised, lemmatised and tagged) IMP corpus in NoSketch Engine
The digital library contains over 650 units (books, newspapers and some manuscripts) from the end of the 16th century to 1918 with the majority from 1850 onwards. Each document contains the header (metadata), facsimile and transcription. The identifiers of the documents mark their origin:
- WIKI: The largest part of the library, comprising books, newspaper articles and installments, as well as some manuscripts from the Wikisource project ‘Slovene literary classics’, containing literature by Slovene authors (1776-1918);
- FPG: the AHLib collection of books translated into Slovene from German (1848–1918);
- NUK: older books (1750-1820) and selected issues of one newspaper (1850-1900) prepared in the scope of the EU IMPACT project by the National and University Library of Slovenia (NUK);
- ZRC: small samples of three religious texts (1584, 1695, 1784) prepared by the Scientific Research Center of the Slovene Academy of Sciences and Arts.
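As a rough illustration of how the identifier prefixes can be used, the sketch below maps a document identifier to its source collection. The prefixes and descriptions come from the list above; the assumption that an identifier starts with its prefix, the function name `origin_of` and the sample identifiers are hypothetical and not taken from the IMP documentation.

```python
from typing import Optional

# Prefixes and descriptions follow the list above. That an identifier
# *starts* with its prefix is an assumption made for this sketch.
ORIGINS = {
    "WIKI": "Wikisource project 'Slovene literary classics'",
    "FPG": "AHLib collection of books translated into Slovene from German",
    "NUK": "National and University Library of Slovenia (EU IMPACT project)",
    "ZRC": "Scientific Research Center of the Slovene Academy of Sciences and Arts",
}


def origin_of(doc_id: str) -> Optional[str]:
    """Return the source collection for a document identifier, or None."""
    normalised = doc_id.strip().upper()
    for prefix, description in ORIGINS.items():
        if normalised.startswith(prefix):
            return description
    return None


# Hypothetical identifiers, used only to exercise the function.
print(origin_of("FPG00001"))
print(origin_of("UNKNOWN-1"))
```

Returning `None` for unknown prefixes keeps the lookup safe to use on identifiers from other collections.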
Corpora of historical Slovene
The reference corpus of historical Slovene goo300k contains the text from 1,100 pages (about 300,000 tokens) sampled from the IMP collection, with hand-validated linguistic annotation. The current version of the corpus is documented in its TEI header. The larger IMP corpus contains the complete IMP text collection and is automatically annotated. Both corpora can be searched on our installation of NoSketch Engine at nl.ijs.si.
The corpora are structurally annotated for texts (with bibliographic information) and pages (with links to the facsimile and to the page in the digital library). Each word token (e.g. "lubesni") in the corpora is annotated with:
- modernised form ("ljubezni");
- lemma ("ljubezen");
- MSD tag ("Ncm"); the tagset is defined in the IMP morphosyntactic specifications.
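The four pieces of information attached to each token can be pictured as a small record. The sketch below is only illustrative: the class name `Token`, the tab-separated layout and the column order are assumptions, and the actual tabular format of the IMP corpora may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str        # historical form as printed, e.g. "lubesni"
    modernised: str  # modernised form, e.g. "ljubezni"
    lemma: str       # lemma, e.g. "ljubezen"
    msd: str         # MSD tag from the IMP morphosyntactic specifications


def parse_line(line: str) -> Token:
    """Parse one line; tab separation and column order are assumed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    return Token(*fields)


# The example values are the ones given in the list above.
token = parse_line("lubesni\tljubezni\tljubezen\tNcm\n")
print(token)
```

Keeping the historical and modernised forms side by side makes it easy to check whether a token was changed by modernisation (`token.word != token.modernised`).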
Lexicon of historical Slovene
The lexicon was created automatically from an annotated corpus (4 million tokens) and contains only attested and manually verified word-forms and examples of use. A dictionary entry contains the modern-day lemma with associated information, i.e. the part-of-speech and, for archaic words, their closest modern equivalents. The lemma is followed by its word-forms, and each of these lists all its attested historical word-forms with examples of usage.
The current version of the lexicon is described in its TEI header and is available for browsing as a set of statically mounted HTML pages with hyperlinks into the concordancer, digital library and relevant on-line dictionaries. As the complete lexicon is rather large and contains words that will not be of interest to most users, the lexicon is made available in several sizes:
- small lexicon: contains only archaic words, where each word has associated contemporary equivalents;
- medium lexicon: contains only those entries where at least one word-form has an archaic spelling;
- large lexicon: contains all entries except for uninteresting ones, such as numerals, typos, etc.;
- complete lexicon: contains all entries.
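To make the entry structure and the four sizes more concrete, here is a minimal sketch of how such a selection could work. It is not the procedure used to build the published lexicons: the `Entry` class, the `uninteresting` flag, the rule that an "archaic spelling" is a historical form differing from its modern word-form, and the placeholder entries are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    lemma: str                      # modern-day lemma
    pos: str                        # part-of-speech
    # Closest modern equivalents; assumed non-empty only for archaic words.
    equivalents: List[str] = field(default_factory=list)
    # Modern word-form -> attested historical word-forms.
    forms: Dict[str, List[str]] = field(default_factory=dict)
    uninteresting: bool = False     # hypothetical flag: numerals, typos, etc.


def is_archaic_word(entry: Entry) -> bool:
    return bool(entry.equivalents)


def has_archaic_spelling(entry: Entry) -> bool:
    # Assumption: a spelling is archaic if it differs from the modern form.
    return any(hist != modern
               for modern, attested in entry.forms.items()
               for hist in attested)


def select(entries: List[Entry], size: str) -> List[Entry]:
    """Return the entries belonging to the lexicon of the given size."""
    if size == "small":
        return [e for e in entries if is_archaic_word(e)]
    if size == "medium":
        return [e for e in entries if has_archaic_spelling(e)]
    if size == "large":
        return [e for e in entries if not e.uninteresting]
    if size == "complete":
        return list(entries)
    raise ValueError(f"unknown lexicon size: {size}")


# Placeholder entries; only "ljubezen"/"lubesni" comes from the text.
entries = [
    Entry("ljubezen", "Noun", forms={"ljubezni": ["lubesni"]}),
    Entry("archaic-lemma", "Noun", equivalents=["modern-equivalent"],
          forms={"archaic-form": ["archaic-form"]}),
    Entry("1848", "Numeral", forms={"1848": ["1848"]}, uninteresting=True),
]
for size in ("small", "medium", "large", "complete"):
    print(size, [e.lemma for e in select(entries, size)])
```

Each size is expressed as an independent filter over the complete lexicon, which mirrors the way the sizes are described above.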
The IMP resources are also available for download via the CLARIN.SI repository:
- Reference corpus of historical Slovene goo300k 1.2:
- Digital library and corpus of historical Slovene IMP 1.1:
- Lexicon of historical Slovene imp25k 1.1:
If you use the above resources, please cite them as given in the CLARIN.SI repository. We would also appreciate it if you cited the following reference publication:
The library, corpus and lexicon are all encoded according to the IMP XML schema, which is based on the TEI P5 Guidelines. The corpus and lexicon are also available in a simpler (tabular) format containing the basic information from the source XML.
You can read more about the purpose, compilation and structure of the resources in the IMP bibliography.
We wish to thank the following people for collaborating in the compilation of the IMP language resources: Kozma Ahačič, Tina Benčina, Katja Cingerle, Metod Čepar, Darja Fišer, Miran Hladnik, Alenka Jelovšek, Urška Kamenšek, Alenka Kavčič Čolić, Domen Kermc, Maša Kodrič, Simon Krek, Nina Mikulin, Matija Ogrin, Daša Pokorn, Erich Prunč, Zala Šmid, Ines Vodopivec and Maja Žorga Dulmin.
The work was supported by EU IP IMPACT “Improving Access to Text” and the Google research award in the humanities “Developing Language Models of Historical Slovene”.
Yves Scherrer, Tomaž Erjavec. 2015. Modernising historical Slovene words. Natural Language Engineering. doi: 10.1017/S1351324915000236.
Yves Scherrer, Tomaž Erjavec. 2013. Modernizing historical Slovene words with character-based SMT. 4th Workshop on Balto-Slavic Natural Language Processing, August 8-9, Sofia, Bulgaria. ACL 2013. pp. 58-62.
Tomaž Erjavec. 2012. The goo300k corpus of historical Slovene. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul.
Tom Kenter, Tomaž Erjavec, Maja Žorga Dulmin, Darja Fišer. 2012. Lexicon construction and corpus annotation of historical language with the CoBaLT editor. In Proceedings of the EACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Avignon, France, April. Association for Computational Linguistics.
Ines Jerele, Tomaž Erjavec, Daša Pokorn, Alenka Kavčič-Čolić. 2012. Optical Character Recognition of Historical Texts: End-User Focused Research for Slovenian Books and Newspapers from the 18th and 19th Century. Review of the National Center for Digitization 21/2012, Faculty of Mathematics, Belgrade.
Tomaž Erjavec. Automatic linguistic annotation of historical language: ToTrTaLe and XIX century Slovene. Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, 2011, Portland.
Tomaž Erjavec, Christoph Ringlstetter, Maja Žorga, Annette Gotscharek. A lexicon for processing archaic language: the case of XIXth century Slovene. Proceedings of WoLeR: ESSLLI Workshop on Lexical Resources, 2011, Ljubljana.
Tomaž Erjavec, Ines Jerele, Maša Kodrič. 2011. Izdelava korpusa starejših slovenskih besedil v okviru projekta IMPACT. In: KRANJC, Simona (ed.). Meddisciplinarnost v slovenistiki, (Obdobja, Simpozij, = Symposium, 30). Ljubljana: Znanstvena založba Filozofske fakultete, 2011, pp. 41-47.
Tomaž Erjavec. Slovenska prevodna književnost 1848-1918 : digitalna knjižnica in korpus AHLib. In: KRANJC, Simona (ed.). Meddisciplinarnost v slovenistiki, (Obdobja, Simpozij, = Symposium, 30). Ljubljana: Znanstvena založba Filozofske fakultete, 2011, pp. 33-40. [COBISS.SI-ID 25362215]
- AHLib: digital library and corpus of historical Slovene books
- JOS: language resources for contemporary Slovene
- SSJ project: Communication in Slovene
- Text Encoding Initiative and TEI P5