Datasets for modernising historical Slovene words
			      2013-06-27
			    Tomaž Erjavec

Here you can find the lexicons described and used in the experiment
reported in:

@InProceedings{HistoMT-BSNLP13,
  author = {Yves Scherrer and Tomaž Erjavec},
  title = {{Modernising historical Slovene words with character-based SMT}},
  booktitle = {Proceedings of the 4th Workshop on Balto-Slavic Natural Language Processing (BSNLP 2013)},
  year = {2013},
  month = {August},
  date = {8-9},
  address = {Sofia, Bulgaria},
  publisher = {ACL},
 } 

The lexicon of contemporary Slovene used in the experiment is
available from http://nl.ijs.si/ssj/sloleks/sloleks-wfl4.tbl.gz

Two lexicons of historical Slovene are given here:

- goo-wfl3.tbl which is the training lexicon

- foo-wfl3.tbl which is the testing lexicon - to obtain a testing
  lexicon identical to the one used in the paper, remove the lines
  with starred forms

Lexicons are encoded as tab-separated UTF-8 files with the following columns:

1) nform: the normalised word-form (lower-cased, vowel diacritics removed
   and possibly converted from the old into the contemporary alphabet)

2) mform: the modernised word-form (if the form is not in the contemporary 
   lexicon it has a * suffix) 

3) the frequency of the pair, given as space-separated frequencies
   per 50-year period. For example, "18B:33 19A:25" means that the
   entry appeared 33 times in texts from 1750-1799 and 25 times in
   texts from 1800-1849.
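
The column layout above can be read with a few lines of Python. This
is only a minimal sketch: it assumes each line has exactly the three
tab-separated columns listed above and that the third column holds
only "period:count" items as in the example. The names Entry and
parse_line are hypothetical, and the sample line in the usage note
is illustrative rather than taken from the lexicons.

```python
from typing import Dict, NamedTuple


class Entry(NamedTuple):
    nform: str              # normalised word-form
    mform: str              # modernised word-form, without the * suffix
    oov: bool               # True if the mform carried a * suffix
    freqs: Dict[str, int]   # period label -> frequency, e.g. {"18B": 33}


def parse_line(line: str) -> Entry:
    """Parse one lexicon line (assumed layout: nform, mform, frequencies)."""
    nform, mform, freq_field = line.rstrip("\n").split("\t", 2)
    oov = mform.endswith("*")
    if oov:
        mform = mform[:-1]  # drop the single trailing asterisk
    freqs: Dict[str, int] = {}
    for item in freq_field.split():
        # Assumption: every item looks like "18B:33"; anything else is skipped.
        if ":" not in item:
            continue
        period, count = item.rsplit(":", 1)
        freqs[period] = int(count)
    return Entry(nform, mform, oov, freqs)
```

For instance, parse_line("aphanitov\tafanitov*\t18B:33 19A:25") would
give an Entry with mform "afanitov", oov set to True, and the
frequencies 33 for 18B and 25 for 19A.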

OOV: The out-of-vocabulary words (mforms), i.e. those that are not
present in the contemporary Slovene lexicon Sloleks, are marked by
suffixing an asterisk to the mform, e.g. "aphanitov afanitov* ...".
Such forms have been used for training, but not for testing.
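
Removing the starred lines from the testing lexicon could look
roughly like the sketch below. It assumes the mform is the second
tab-separated field of every line; the function names drop_oov and
filter_file, and any output file name passed to filter_file, are
hypothetical and not part of this distribution.

```python
from typing import Iterable, List


def drop_oov(lines: Iterable[str]) -> List[str]:
    """Keep only lines whose mform (second field) has no * suffix."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: the second field is the mform; starred means OOV.
        if len(fields) >= 2 and fields[1].endswith("*"):
            continue
        kept.append(line)
    return kept


def filter_file(src: str, dst: str) -> None:
    """Write a copy of the lexicon at src to dst without starred lines."""
    with open(src, encoding="utf-8") as fin:
        kept = drop_oov(fin)
    with open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(kept)
```

Calling filter_file("foo-wfl3.tbl", "foo-wfl3.test.tbl") would then
write the filtered lexicon to a new file (the second name is only an
example) and leave the original untouched.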