Datasets for modernising historical Slovene words
2013-06-27
Tomaž Erjavec

Here you can find the lexicons described and used in the experiment
reported in:

@InProceedings{HistoMT-BSNLP13,
  author    = {Yves Scherrer and Tomaž Erjavec},
  title     = {{Modernising historical Slovene words with character-based SMT}},
  booktitle = {Proceedings of the 4th Workshop on Balto-Slavic Natural Language Processing (BSNLP 2013)},
  year      = {2013},
  month     = {August},
  date      = {8-9},
  address   = {Sofia, Bulgaria},
  publisher = {ACL},
}

The lexicon of contemporary Slovene used in the experiment is available from
http://nl.ijs.si/ssj/sloleks/sloleks-wfl4.tbl.gz

Two lexicons of historical Slovene are given here:
- goo-wfl3.tbl which is the training lexicon
- foo-wfl3.tbl which is the testing lexicon
  - the lines with starred forms should be removed to get
    a testing lexicon identical to the one used in the paper

Lexicons are encoded as tab-separated UTF-8 files with the following columns:
1) nform: the normalised word-form (lower-cased, vowel diacritics removed
   and possibly converted from the old into the contemporary alphabet)
2) mform: the modernised word-form (if the form is not in the contemporary
   lexicon it has a * suffix)
3) the frequency of the pair with space-separated frequencies in 50-year
   periods. For example, "18B:33 19A:25" means that the entry appeared
   33 times in texts from 1750-1799 and 25 times in texts from 1800-1849.

OOV: The out-of-vocabulary words (mforms), i.e. those that are not present
in the contemporary Slovene lexicon Sloleks, are marked by suffixing an
asterisk to the mform, e.g. "aphanitov afanitov* ...". Such forms have been
used for training, but not for testing.
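
Reading the lexicons: the following is a minimal Python sketch, not part of the original distribution, of how the column layout and the OOV convention described above might be handled. The function names (parse_line, read_lexicon) and the sample rows are hypothetical and only for illustration; the sample frequencies are invented. The sketch assumes that every "PERIOD:COUNT" token in the third column is a per-period frequency and ignores any other token found there.

```python
import io


def parse_line(line):
    """Split one lexicon line into (nform, mform, is_oov, period_freqs)."""
    fields = line.rstrip("\n").split("\t")
    nform, mform = fields[0], fields[1]
    # A trailing asterisk marks an mform that is missing from Sloleks.
    is_oov = mform.endswith("*")
    if is_oov:
        mform = mform[:-1]
    # Assumption: per-period frequencies look like "18B:33"; any other
    # token in the frequency column is skipped.
    freqs = {}
    for token in " ".join(fields[2:]).split():
        if ":" in token:
            period, count = token.split(":", 1)
            freqs[period] = int(count)
    return nform, mform, is_oov, freqs


def read_lexicon(handle, drop_oov=False):
    """Read all entries from an open text handle, optionally dropping
    the starred (OOV) entries, as is needed for the testing lexicon."""
    entries = []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        entry = parse_line(line)
        if drop_oov and entry[2]:
            continue
        entries.append(entry)
    return entries


# Made-up rows in the documented layout; the frequencies are invented.
sample = (
    "aphanitov\tafanitov*\t19A:2\n"
    "primer\tprimer\t18B:33 19A:25\n"
)

all_entries = read_lexicon(io.StringIO(sample))
test_entries = read_lexicon(io.StringIO(sample), drop_oov=True)
print(len(all_entries), len(test_entries))
```

With the real files, the same function could be applied to a handle opened with open("foo-wfl3.tbl", encoding="utf-8") and drop_oov=True to remove the starred lines from the testing lexicon.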