Feature published in ELSNews 7.3, p.16, July 1998.

Language Resources in Central and Eastern Europe

Tomaz Erjavec,
Jozef Stefan Institute,
Ljubljana

While Language Resources (LRs) for CEE languages are, in general, less developed than those for EU languages, recent years have seen a marked upsurge in available and publicised CEE resources. In many cases this is due to EU projects. The Copernicus Programme in particular not only provided funding for resource-oriented projects (e.g. MULTEXT-East, Onomastica) but also, either directly or indirectly (Awareness Seminars, ELSNet goes East), raised awareness of their importance in CEE, not least among the funding bodies in these countries.

Of course, these language resources have not been developed from scratch in the last few years; but the recent focus on LRs in language technology has meant that they have been standardised, publicised and, crucially, made more widely available. ELRA, for instance, is starting to offer CEE resources in addition to the EU-language ones. Of particular importance to CEE LRs has been the Copernicus Concerted Action Trans-European Language Resources Infrastructure (TELRI, [http://www.ids-mannheim.de/telri/]). TELRI has connected CEE language technology centres with each other and with EU centres; it has produced a double CD-ROM containing multilingual LRs for almost all CEE languages; and it has initiated the TRACTOR resource collection. TELRI(-II), which will concentrate on the TRACTOR initiative, is to run for another three years. TELRI has now been established as a permanent association, based in Germany, to maintain the action in the longer term.

As previously in the EU, the maturity of language technologies is influencing resource development in CEE. In Slovenia, for example, a publishing house recently went ahead -- without government funding -- with a project to collect a large reference corpus of Slovene; they now feel it is indispensable for producing quality dictionaries.

Finally, CEE LRs are being produced outside of their 'home countries' as well; for example, at the LREC conference there was a presentation of an on-line corpus of Bosnian texts [1], from the University of Oslo, while a US project at CLR, New Mexico, has the Serbo-Croatian language included in its multilingual onomasticon [2].

It is difficult to give an overview of the kinds of resources that exist for CEE languages, because the situation differs too much from country to country. But in what follows I will give a general outline. This outline excludes Russian, which because of its very large number of speakers and its specific history has a special status among CEE languages (for a dedicated survey on Russian resources see [http://infomage.mipt.rssi.ru:8080/sections/lingvo.html]). A great number of LRs have been produced in Russia; but unfortunately many are now being irretrievably lost, as there is no funding to maintain them, or the organisations that created them no longer exist.

In general, significant corpora have been or are being produced for a number of CEE languages, often TEI-annotated and PoS-tagged (e.g. Romanian, Hungarian). In many cases, recent EU funding helped with corpus projects, such as the Bulgarian corpora and resource tools that were produced by the Linguistic Modelling Laboratory of the Bulgarian Academy of Sciences [http://www.lml.acad.bg/]. The largest to date is the Czech National Corpus, which currently contains almost 100 million words. Furthermore, a large treebank is being annotated for Czech, with both efforts being funded by the Grant Agency of the Czech Republic. Some corpora are also freely available, or even have on-line querying, e.g. a portion of the above-mentioned Czech National Corpus [http://ucnk.ff.cuni.cz/] and the corpus of Estonian written texts [http://www.cl.ut.ee/]. On the whole, however, freely available or PoS-tagged corpora are still scarce, and treebanks, large parallel corpora and sense-tagged corpora non-existent.

Estonia deserves to be mentioned in the context of machine-readable dictionaries: in a happy marriage of (Soros) funding, copyright holders and language technology experts, they offer free WWW searches on a number of their dictionaries. However, such availability of machine-readable dictionaries is an exception rather than the rule. On the other hand, there is a growing number of (usually morphological) lexica available: for example, Bulgarian lexica are already being offered by ELRA, and lexica for six CEE languages have been produced by the MULTEXT-East project and made available on the TELRI CD-ROM.

Speech processing is quite well-developed in a number of CEE countries. Speech resources have only recently become the focus of attention, often via EU projects. But now, due to the growing interest of large industries (e.g. Siemens), speech corpora for a variety of settings (e.g. studio, telephone line), purposes (basic phonetic research, speech recognition, speech synthesis), and languages (e.g. Polish, Slovak) are being produced.

Lest the above sound too optimistic, it should be remembered that CEE LR development lags significantly behind EU languages. Quite a few CEE languages do not have their equivalent of the Brown corpus, for example. One reason is that government funding in CEE countries tends to be scarce, and EU funds insufficient. Moreover, the language industries have a harder time developing in these countries, and multinational/multilingual industries invest less in them. At LREC this was demonstrated quite well by the chart presented by Microsoft representative David Brooks, which showed the four bands in which they prioritise European languages for localisation. English came first, followed by EU languages; the third band was 'major' CEE languages, i.e. those with a sufficient number of speakers/GNP, and the last 'minor' CEE languages. It is probably up to the EU to balance these categories with financial as well as political support.

References

[1] Diana Santos: Providing Access to Language Resources through the WorldWideWeb: the Oslo Corpus of Bosnian Texts. In: Proceedings of the First International Conference on Language Resources and Evaluation, LREC'98, Granada 1998.
[2] Svetlana Sheremetyeva, Jim Cowie, Sergei Nirenburg, Remi Zajac: Multilingual Onomasticon as a Multipurpose NLP Resource. In: Proceedings of the First International Conference on Language Resources and Evaluation, LREC'98, Granada 1998.
