Datasets for modernising historical Slovene words
2015-09-01
Tomaž Erjavec

Here you can find the lexicons described and used in the experiment reported in:

@Article{csmt-histosl,
  author  = {Yves Scherrer and Toma\v{z} Erjavec},
  title   = {{Modernising historical Slovene words}},
  journal = {Natural language engineering},
  year    = {2015},
  doi     = {10.1017/S1351324915000236},
  url     = {http://dx.doi.org/10.1017/S1351324915000236}
}

The historical language resources are available under CC-BY: if you use them, please refer to http://hdl.handle.net/11356/1025 and, in published work, please cite the reference above.

The lexicon of contemporary Slovene used in the experiment is available from http://nl.ijs.si/ssj/sloleks/ as "sloleks-wfl4.tbl.gz". It is available under CC-BY-NC-SA. For further information please consult the README in the directory.

The historical dataset:
- goo18B.wfl  training word-form lexicon for second half of 18th century Slovene
- goo19A.wfl  training word-form lexicon for first half of 19th century Slovene
- goo19B.wfl  training word-form lexicon for second half of 19th century Slovene
- foo18B.wfl  testing word-form lexicon for second half of 18th century Slovene
- foo19A.wfl  testing word-form lexicon for first half of 19th century Slovene
- foo19B.wfl  testing word-form lexicon for second half of 19th century Slovene

The lexicons have been automatically extracted from the IMP historical corpora and contain hand-validated entries. They are encoded as tab-separated UTF-8 files with the following columns:
1) wform: the word-form as it appears in the corpus, but lowercased
2) nform: the normalised word-form, converted to contemporary alphabet
3) mform: the modernised word-form; if it is not in the contemporary lexicon it has a * suffix; if this is an orthographic normalisation of an otherwise extinct (archaic) word, the suffix is ! (or *!)
4) frequency in the corpus

Example from goo18B:
bętesh	betež	betež!*	1

The word-form "bętesh" is normalised as "betež" and has the modernised form "betež", which is an archaic word and does not appear in the lexicon of contemporary Slovene. It appears once in the training corpus for the second half of the 18th century.
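
The column layout described above can be read with a few lines of Python. This is only a minimal sketch, not part of the distributed dataset: the names WflEntry, parse_wfl_line and read_wfl are hypothetical, and it assumes that the * and ! markers occur only at the end of the mform column, in either order (the column description writes "*!" while the example entry has "!*").

```python
from typing import Iterator, NamedTuple


class WflEntry(NamedTuple):
    """One lexicon row (hypothetical helper type, not part of the dataset)."""
    wform: str        # word-form as in the corpus, lowercased
    nform: str        # normalised word-form in the contemporary alphabet
    mform: str        # modernised word-form with the marker suffixes removed
    in_lexicon: bool  # False when the mform carried a "*" marker
    archaic: bool     # True when the mform carried a "!" marker
    freq: int         # frequency in the corpus


def parse_wfl_line(line: str) -> WflEntry:
    """Split one tab-separated row and decode the mform markers."""
    wform, nform, mform, freq = line.rstrip("\r\n").split("\t")
    # Assumption: markers are a trailing run of "*" and "!" in any order.
    bare = mform.rstrip("*!")
    markers = mform[len(bare):]
    return WflEntry(wform, nform, bare, "*" not in markers, "!" in markers, int(freq))


def read_wfl(path: str) -> Iterator[WflEntry]:
    """Yield the entries of a .wfl file, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_wfl_line(line)


# The example row from goo18B quoted above.
entry = parse_wfl_line("bętesh\tbetež\tbetež!*\t1")
```

For instance, list(read_wfl("goo18B.wfl")) would load the whole training lexicon for the second half of the 18th century, provided the file is in the working directory.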