Booklet of abstracts of the
Sixth Language Technologies Conference
IS-LTC 2008

Papers in Slovene

Jerneja Žganec Gros, Aleš Mihelič, Mario Žganec, Uliana Dorofeeva, Nikola Pavešić

An Efficient Unit-Selection Method for Embedded Concatenative Speech Synthesis

Memory and processing power requirements are important factors when designing TTS systems for embedded devices. We describe an accelerated unit-selection method, which we designed for an embedded implementation of a polyphone concatenative TTS system. The results of objective measurements of computational speed, along with the results of subjective listening tests designed according to ITU-T recommendations, are provided at the end of the paper.

Andrej Žgank, Marko Kos, Bojan Kotnik, Mirjam Sepesy Maučec, Tomaž Rotovnik, Zdravko Kačič

Improved version of UMB Broadcast News Slovenian continuous speech recognition system

This paper presents the next version of a Slovenian continuous speech recognition system for the Broadcast News domain. The UMB Broadcast News system is currently the most complex Slovenian speech recognition system. It is built on the Slovenian BNSI Broadcast News speech and text database. Several new, complex modules were incorporated into the new UMB BN system. The major modifications were made in the areas of acoustic segmentation, feature extraction and acoustic modeling. The system was evaluated on the complete BNSI database evaluation set, which contains spoken material in diverse acoustic conditions. The new methods improved the performance of the UMB BN Slovenian continuous speech recognition system.

Matej Grašič, Marko Kos, Zdravko Kačič

Impact of prior speech/non speech segmentation on speaker segmentation

This paper addresses the impact of speech/non-speech pre-segmentation on the performance of speaker turn detection/segmentation. A GMM approach to speech/non-speech segmentation and classification is presented. For speaker segmentation, i.e. speaker turn detection, the BIC segmentation approach was used. The methods were evaluated within the Broadcast News domain, using the Slovenian BNSI database.

Darinka Verdonik

Annotating discourse marker type

With the demand for more powerful NLP applications and for the use of corpora in pragmatic and discourse studies comes a need for discourse and pragmatic attributes in language resources. In this paper, I focus on the annotation of discourse markers. I propose a classification of discourse markers which consists of four categories: ideational markers, interactional markers, markers of production processes and interpretation markers. The classification is a foundation for further corpus-based analysis of discourse markers and for the evaluation of language resources in NLP applications.

Darinka Verdonik, Andrej Žgank, Agnes Pisanski Peterlin

Validating the annotation of discourse markers in Turdis-2 and BNSIint corpora

The annotation of discourse markers in a corpus may sometimes depend on annotator interpretation. To assess to what extent the results of a corpus analysis of discourse marker use depend on annotator interpretation, and to evaluate the precision of the annotation scheme used in the annotation of discourse markers in Slovene, we validated the annotation of a representative sample of the corpus material. The results showed which discourse markers show greater variability and which discourse markers could be used to upgrade the annotation scheme.

Kristina Hmeljak Sangawa, Tomaž Erjavec

Extracting examples from a parallel corpus for a Japanese-Slovene dictionary

A Japanese-Slovene learners' dictionary is being produced in cooperation between the Faculty of Arts of the University of Ljubljana and the Jožef Stefan Institute. The process of building the dictionary relies on users' collaboration and feedback and on using language technologies and resources. The paper reports on the augmentation of the dictionary with example sentences, automatically extracted from a Japanese-Slovene parallel corpus that was built for this purpose. The corpus was compiled using Japanese-Slovene text produced at the Faculty of Arts as part of student course-work and parallel texts collected from the Web. We present the compilation and annotation of the parallel corpus, the method of selecting examples to be included in the dictionary, and give an informal evaluation of the results. The methodology presented can serve as a model for low-cost production of lexicographic material.

Darja Fišer, Tomaž Erjavec

Presentation and analysis of Slovene wordnet

The paper presents the first freely available Slovene semantic lexicon, called sloWNet, which was developed automatically from already existing freely available corpus and lexical resources. In the construction process, polysemous words were disambiguated with a word-aligned multilingual parallel corpus and already existing wordnets for these languages. Translations for monosemous words, on the other hand, were obtained from bilingual sources. SloWNet contains almost 20,000 literals or 17,000 synsets, which are mostly nominal. The paper focuses on the analysis of the wordnet with respect to what kind of concepts are found in sloWNet, which domains they belong to, what resource they were created from and what relations hold between them. We take a closer look at hypernymy, the most common relation in the wordnet, and compute the length of hyponymy chains. The second part of the analysis compares the wordnet vocabulary with the jos100k corpus by examining to what extent nouns from the corpus are covered in sloWNet and how well the senses of polysemous words are represented in sloWNet.
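The chain-length computation mentioned in the abstract can be illustrated with a minimal sketch. It assumes, for simplicity, that every synset has at most one hypernym; the `hypernym_of` mapping, the `chain_length` helper and the toy synset names are hypothetical and are not taken from sloWNet.

```python
from typing import Dict, Optional


def chain_length(synset: str, hypernym_of: Dict[str, Optional[str]]) -> int:
    """Count the hypernymy links from `synset` up to the root of its hierarchy."""
    length = 0
    seen = {synset}
    current = hypernym_of.get(synset)
    while current is not None:
        if current in seen:
            # Guard against malformed data that contains a cycle.
            raise ValueError(f"hypernymy cycle detected at {current!r}")
        seen.add(current)
        length += 1
        current = hypernym_of.get(current)
    return length


# Toy hierarchy (assumption, for illustration only): child -> hypernym.
toy_wordnet = {"entity": None, "animal": "entity", "dog": "animal", "poodle": "dog"}
lengths = {synset: chain_length(synset, toy_wordnet) for synset in toy_wordnet}
```

A real wordnet allows multiple hypernyms per synset, in which case one would typically report the longest or the shortest path to a root rather than a single value.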

Peter Holozan

Automatic lexicon extraction from parallel corpus using Interlingua and sense disambiguation

The main problem of non-statistical machine translation is the huge amount of manual work needed to build a dictionary. This motivated an attempt to extract the lexicon automatically from a parallel corpus, with the help of an analyser for translation to Interlingua that uses sense disambiguation in the process. Some of the problems encountered are described together with possible solutions. A sample of the extracted lexicon is provided.

Tomaž Erjavec, Simon Krek

The JOS Language Resources: Morphosyntactic Specifications and Annotated Corpora

The JOS morphosyntactic resources for Slovene consist of the specifications and two corpora: jos100k, a 100,000 word balanced monolingual sampled corpus annotated with hand validated morphosyntactic descriptions (MSDs) and lemmas, and jos1M, the 1 million word partially hand validated corpus. The two corpora have been sampled from the 620 million word Slovene reference corpus FidaPLUS. The JOS resources have a standardised encoding, with the MULTEXT-East-type morphosyntactic specifications and the corpora encoded according to the Text Encoding Initiative Guidelines P5. JOS resources are available as a dataset for research under the Creative Commons licence and are meant to facilitate developments of HLT for Slovene.

Špela Arhar, Nina Ledinek

The JOS morphosyntactic tags: the revision and upgrade of the tagset for automatic morphosyntactic annotation of Slovene

The paper presents the revised and upgraded tagset for morphosyntactic annotation of Slovene, which is one of the first results of the JOS project – "Linguistic annotation of the Slovene language". The JOS tagset was designed with the aim of becoming the standard tagset for morphosyntactic tagging of Slovene, and it is now available to be examined and evaluated by the interested professional public. The paper starts with a discussion of the underlying reasons for the revision and with the arguments for choosing the MULTEXT-East tagset as the basis of the upgrade. We describe the revision process and underline the most problematic issues in morphosyntactic annotation. The revision and upgrade are exemplified by the changes in the morphosyntactic features and values for the verb. The paper concludes with information on the availability of the new tagset and its documentation.

Vesna Mikolič, Ana Beguš, Davorin Dukič, Miha Koderman

The Use of the Multilingual Corpus of Tourist Texts and Its Influence on the Annotation of the Corpus

The article first presents the project 'Multilingual corpus of tourist texts: information source and analytical database of Slovene natural and cultural heritage'. The aim of the project is to build a comparable and partly parallel corpus of tourist texts in Slovene, Italian and English. The corpus will be used as a translation resource and for research in linguistics and tourism. The article then describes the procedure of metatextual and morphological annotation of the Multilingual corpus of tourist texts on the basis of the planned uses of the corpus.

Špela Vintar, Tomaž Erjavec

iKorpus and terminology extraction for Islovar

The Slovene specialized vocabulary of Computer Science is represented in the bilingual online dictionary project Islovar. To support the editorial and terminographical efforts involved in the making of Islovar, a corpus of Computer Science texts has been compiled out of several years' DSI conference proceedings. Since the corpus contained only one text type, the scientific article, and from a single source, it was never used in a thorough and systematic term extraction experiment. Firstly, this paper describes an upgraded, enlarged and tagged version of the corpus, now called iKorpus, and secondly we present the results of automatic term extraction performed on this corpus. We give a comparison between the terms currently included in the Islovar dictionary and the extracted term candidates, and the paper concludes with a discussion of the results and future perspectives.

Papers in English

Rok Gajšek, Anja Podlesek, Luka Komidar, Gregor Sočan, Boštjan Bajec, Vitomir Štruc, Valentin Bucik, France Mihelič

AvID: Audio – Video Emotional Database

Initial attempts at the design, recording strategies and collection of a multi-modal emotional speech database are presented. Our goal is to obtain a database that enables experiments both in speaker identification/verification and in detection of the emotional state of persons involved in communication. We pay special attention to gathering data that involve spontaneous emotions, in order to obtain more realistic training and testing conditions for experiments. Spontaneous emotions were induced with specially planned scenarios, including playing computer games and taking adaptive intelligence tests. So far, multi-modal speech from speakers has been recorded and basic evaluations of the data have been carried out.

Vesna Šatev, Nicolas Nikolov

Using the Web as a Corpus for Extracting Abbreviations in the Serbian Language

In this paper we discuss the results of extracting abbreviations in the Serbian language by using the web as a corpus. The results are compared to those retrieved by using the standard corpus of contemporary Serbian language. Using the web as a corpus is a very recent trend. It is a valuable source of data for research in computational linguistics and information extraction. Still, there are no adequate tools for searching the web, which are geared to linguistic needs. We chose crawling as a process for collecting data from the web, in order to extract abbreviations in the Serbian language. We show that, by using the web as a corpus, a higher number of abbreviations can be found and they are more recent.

Damir Ćavar, Ivo-Pavao Jazbec, Siniša Runjaić

Interoperability and Rapid Bootstrapping of Morphological Parsing and Annotation Automata

We discuss the design and development of a finite state transducer for morphological segmentation, annotation, and lemmatization that allows for merging of three major functionalities into one high-performance monolithic automaton. It is designed to be flexible, extensible, and applicable to any language that allows for purely morphotactic modeling on the lexical level of morphological structure. The annotation schema used in an initial Croatian language model is a direct mapping from the GOLD ontology of linguistic concepts and features, which increases the potential for interoperability, but also opens up advanced possibilities for a DL-based post-processing.

Jelena Tomašević, Gordana Pavlović-Lažetić

Productivity of concepts in Serbian Wordnet

Wordnet is an online lexical database designed for use under program control. It is based on word meaning rather than word forms. All of the words that can express a given sense are grouped together in a synonym set (synset) representing a concept. All concepts are linked with semantic relations, forming a semantic network. The network is basically a forest consisting of concept hierarchies rooted in top ontology concepts. In this paper we describe several measures for determining the productivity of a concept, in order to find those concepts that most effectively represent the hierarchy they belong to. They are different from top ontology concepts (which are too general), and could be considered as ontological concepts associated with classes characterized by the hierarchies rooted in them. Determining the most productive concepts may be applied to text classification in different ways. Information retrieval and information extraction could be made more efficient if they were based on this kind of classification.

Jelena Tomašević, Gordana Pavlović-Lažetić

A Readability Checker with Supervised Learning using Deep Syntactic and Semantic Indicators

Checking for readability or simplicity of texts is important for many institutional and individual users. Formulas for approximately measuring text readability have a long tradition. Usually, they exploit surface-oriented indicators like sentence length, word length, word frequency, etc. However, in many cases, this information is not adequate to realistically approximate the cognitive difficulties a person can have in understanding a text. Therefore we additionally use deep syntactic and semantic indicators. The syntactic information is represented by a dependency tree, the semantic information by a semantic network. Both representations are automatically generated by a deep syntactico-semantic analysis. A global readability score is determined by applying a nearest neighbor algorithm to 3,000 ratings of 300 test persons. The evaluation showed that the deep syntactic and semantic indicators lead to results quite comparable to those of most surface-based indicators. Finally, a graphical user interface has been developed which highlights difficult-to-read text passages, depending on the individual indicator values, and displays a global readability score.
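As a rough illustration of nearest-neighbor scoring, the sketch below averages the human ratings of the k rated texts whose indicator vectors are closest to a new text. The two indicators, the toy ratings, the choice of Euclidean distance and the `knn_score` helper are assumptions made for this example; the abstract does not specify the feature set or the distance measure.

```python
import math
from typing import List, Sequence, Tuple

RatedText = Tuple[Sequence[float], float]  # (indicator vector, human rating)


def knn_score(features: Sequence[float], rated: List[RatedText], k: int = 3) -> float:
    """Average the ratings of the k rated texts closest to `features`."""
    if k <= 0 or not rated:
        raise ValueError("need k > 0 and at least one rated text")
    # Sort rated texts by Euclidean distance of their indicator vectors.
    nearest = sorted(rated, key=lambda item: math.dist(features, item[0]))[:k]
    return sum(rating for _, rating in nearest) / len(nearest)


# Toy data (assumption): (mean sentence length, mean word length) -> rating,
# where a higher rating means harder to read.
rated_texts = [
    ((10.0, 4.0), 1.0),
    ((12.0, 4.5), 2.0),
    ((30.0, 6.0), 6.0),
    ((35.0, 7.0), 7.0),
]
score = knn_score((11.0, 4.2), rated_texts, k=2)
```

In practice the indicators would be normalised to comparable ranges first, since otherwise the indicator with the largest numeric scale dominates the distance.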

Jernej Vičič

Rapid development of data for shallow transfer RBMT translation systems for highly inflective languages

The article describes a new way of constructing rule-based machine translation (RBMT) systems, in particular shallow-transfer RBMT suited for related languages, and presents methods that automate parts of the construction process. The methods were evaluated in a case study: the construction of a fully functional machine translation system for the closely related language pair Slovenian - Serbian. Slovenian and Serbian belong to the group of South Slavic languages that were spoken mostly in the former Yugoslavia. The economies of the nations where these languages are spoken are closely connected, and the younger generations that grew up after the breakup of Yugoslavia have difficulties in mutual communication, so there is considerable interest in the construction of such a translation system. The system is based on Apertium (Oller and Forcada, 2006), an open-source shallow-transfer RBMT toolkit. A thorough evaluation of the translation system is presented, and the conclusions present the strong and the weak points of this approach and explore the grounds for further work.

Primož Jakopin, Aleksandra Bizjak Končar

Part-of-Speech Tagging of Slovenian, 12 years after

The paper begins with a brief overview of the efforts and accomplishments in the field of part-of-speech tagging of Slovenian texts. Quite a few research institutions have participated, as has the most prominent Slovenian language technology enterprise. An overview of the POS-tagged 1.3 mil. word text corpus at the Fran Ramovš Institute of the Slovenian Language ZRC SAZU follows. The tags of the machine-tagged texts have been verified by linguists and serve as a resource for the POS-tagging of the 240 mil. word Nova beseda corpus.

Jan Rupnik, Miha Grčar, Tomaž Erjavec

Improving morphosyntactic tagging of Slovene by tagger combination

Part-of-speech (PoS) or, better, morphosyntactic tagging is the process of assigning morphosyntactic categories to words in a text, an important pre-processing step for most human language technology applications. PoS-tagging of Slovene texts is a challenging task since the size of the tagset is over one thousand tags (as opposed to English, where the size is typically around sixty) and the state-of-the-art tagging accuracy is still below the desired level. The paper describes an experiment aimed at improving tagging accuracy for Slovene by combining the outputs of two taggers – a proprietary rule-based tagger developed by the Amebis HLT company, and TnT, a tri-gram HMM tagger, trained on a hand-annotated corpus of Slovene. The two taggers have comparable accuracy, but there are many cases where, if the predictions of the two taggers differ, one of the two does assign the correct tag. We investigate training a classifier on top of the outputs of both taggers that predicts which of the two taggers is correct. We experiment with selecting different classification algorithms and constructing different feature sets for training and show that some cases yield a meta-tagger with a significant increase in accuracy compared to that of either tagger in isolation.
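The idea of a classifier that arbitrates between two disagreeing taggers can be sketched as follows. This is a deliberately simple stand-in, not the classifiers or feature sets used in the paper: the only feature is the pair of proposed tags, and the "classifier" is a majority vote over training cases. The function names and the MULTEXT-East-style tag strings are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

TagPair = Tuple[str, str]


def train_chooser(examples: Iterable[Tuple[str, str, str]]) -> Dict[TagPair, str]:
    """Learn, per disagreeing (tag_a, tag_b) pair, which tagger is right more often.

    Each example is (tag from tagger A, tag from tagger B, gold tag).
    """
    votes = defaultdict(Counter)
    for tag_a, tag_b, gold in examples:
        if tag_a == tag_b:
            continue  # agreement gives no signal about which tagger to trust
        if gold == tag_a:
            votes[(tag_a, tag_b)]["a"] += 1
        elif gold == tag_b:
            votes[(tag_a, tag_b)]["b"] += 1
    return {pair: counts.most_common(1)[0][0] for pair, counts in votes.items()}


def combine(tag_a: str, tag_b: str, chooser: Dict[TagPair, str], default: str = "a") -> str:
    """Return the agreed tag, or the tag of the tagger the chooser trusts."""
    if tag_a == tag_b:
        return tag_a
    choice = chooser.get((tag_a, tag_b), default)
    return tag_a if choice == "a" else tag_b


# Toy training data (assumption): tagger B is usually right on this confusion.
training = [
    ("Ncmsn", "Ncmsa", "Ncmsa"),
    ("Ncmsn", "Ncmsa", "Ncmsa"),
    ("Ncmsn", "Ncmsa", "Ncmsn"),
]
chooser = train_chooser(training)
final_tag = combine("Ncmsn", "Ncmsa", chooser)
```

A richer feature set (word form, neighbouring tags, and so on) and a general-purpose learning algorithm would replace the lookup table, but the combination logic stays the same: trust agreement, and let the learned model decide on disagreement.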

Željko Agić, Marko Tadić, Zdravko Dovedan

Combining Part-of-Speech Tagger and Inflectional Lexicon for Croatian

This paper investigates several methods of combining the output of a second-order hidden Markov model part-of-speech/morphosyntactic tagger and a high-coverage inflectional lexicon for Croatian. Our primary motivation was to improve the overall tagging accuracy of Croatian texts by using our newly developed tagger. We also wanted to compare its tagging results – both standalone and utilizing the morphological lexicon – to the ones previously described in (Agić and Tadić, 2006), provided by the TnT statistical tagger, which we used as a reference point, bearing in mind that both implement the same tagging procedure. We first explain the basic idea behind the experiment, its motivation and its importance from the perspective of processing the Croatian language. We also describe all the tools and language resources used in the experiment, including their operating paradigms and the relevant input and output format details. With the basics presented, we describe in theory all the possible methods of combining these resources and tools with respect to their paradigm, input and production capabilities, and then put these ideas to the test using the F-measure evaluation framework. The results are then discussed in detail, and conclusions and future work plans are presented.
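For reference, the F-measure is the weighted harmonic mean of precision and recall. The sketch below computes it from raw counts; the `f_measure` helper and the example counts are illustrative assumptions, and the abstract does not state how true and false positives were counted in the experiment.

```python
def f_measure(true_pos: int, false_pos: int, false_neg: int, beta: float = 1.0) -> float:
    """F-beta score from raw counts; beta = 1 weights precision and recall equally."""
    if true_pos == 0:
        return 0.0  # avoids division by zero when nothing was predicted correctly
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Example counts (assumption): 8 correct tags, 2 spurious, 2 missed.
score = f_measure(true_pos=8, false_pos=2, false_neg=2)
```

With these counts precision and recall are both 0.8, so the balanced F-measure is also 0.8.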

Page updated 2008-10-12, et
