Research project J6-7094:
Slovene scientific texts: resources and description

The development and use of Slovene academic language at universities and in research is one of the central questions of Slovene language policy. The problem is highlighted in the National Programme for Language Policy of the Republic of Slovenia 2014–2018, and a number of European studies also draw attention to the impact that the knowledge and development of academic discourse have on language vitality. It is therefore of fundamental importance to develop contemporary reference language resources that will help empower Slovene academic language and to undertake comprehensive research based on a representative sample of such language.

In recent years, Slovene universities have started to establish institutional repositories of scientific publications, containing various types of texts from PhD theses to scientific and professional papers. An important milestone is the establishment of the National Portal for Open Science, http://openscience.si/, launched in 2013, which aggregates access to the digital libraries of individual universities. The portal already offers access to over 123,000 Slovene language publications from a wide range of disciplines. These publications are a highly valuable but so far completely unused source of data on Slovene academic writing, including terminological data.

The goal of the project is to put this unused resource to work in several ways. First, it will compile a corpus of Slovene academic writing containing texts harvested from the Open Science portal. The texts will be extracted from their source format (usually PDF), which involves developing methods for text clean-up and structure extraction, and up-converted to a uniform and standardised XML representation. The corpus will be linguistically annotated, with new tools and resources developed to improve the quality of the annotations. Text classification methods will be developed as well, in order to enhance the usability of the Open Science portal by allowing better faceted search and recommender systems for university librarians entering the publications into the repositories.
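
As an illustration of the clean-up and up-conversion step, the following minimal Python sketch assumes the PDF text has already been extracted to a plain string. The function names (clean_text, to_xml) and the element names (text, title, body, p) are hypothetical and chosen only for this example; they are not the project's actual schema or tools.

```python
import re
import xml.etree.ElementTree as ET


def clean_text(raw: str) -> list[str]:
    """Return a list of cleaned paragraphs from raw extracted text."""
    # Re-join words split by a hyphen at a line break ("teh-\nnologije").
    # Naive assumption: this also merges genuine hyphenated compounds
    # that happen to fall at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Treat one or more blank lines as a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse remaining line breaks and repeated spaces inside paragraphs.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return [p for p in cleaned if p]


def to_xml(doc_id: str, title: str, paragraphs: list[str]) -> str:
    """Serialise one document to a simple, hypothetical XML layout."""
    root = ET.Element("text", {"id": doc_id})
    ET.SubElement(root, "title").text = title
    body = ET.SubElement(root, "body")
    for paragraph in paragraphs:
        ET.SubElement(body, "p").text = paragraph
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    raw = "Jezikovne teh-\nnologije za\nslovenščino.\n\n\nDrugi odstavek."
    print(to_xml("doc1", "Primer", clean_text(raw)))
```

A real pipeline would additionally need to remove running headers, footnotes and page numbers and to recover section structure, none of which this sketch attempts.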

The corpus will serve as the basis for studies in terminology extraction. The extracted term candidates will be exported to a public online dictionary viewer and editor, so that Slovene scientific communities from a range of subject fields will be able to engage in the management of their terminologies. A very important aspect of the work undertaken in the project will be the first empirically based study of Slovene academic discourse, founded on the compiled corpus. Data usability studies and in-depth interviews will also be conducted in an attempt to determine how academic writing in Slovene proceeds and what obstacles it faces.
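
To give a rough idea of what extracting term candidates from a corpus can look like, here is a deliberately simple Python sketch that counts recurring word pairs and skips those containing function words. This is a swapped-in frequency heuristic, not the project's method; approaches of this kind usually rely on lemmatised, morphosyntactically tagged text and statistical termhood measures. The stopword list and the function name term_candidates are assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real list would be far longer.
STOPWORDS = {"in", "je", "za", "na", "se", "ki", "so", "v", "z", "s", "da"}


def term_candidates(texts, n=2, min_freq=2):
    """Return (n-gram, frequency) pairs that occur at least min_freq times."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Skip n-grams that contain a function word.
            if any(token in STOPWORDS for token in gram):
                continue
            counts[" ".join(gram)] += 1
    return [(gram, freq) for gram, freq in counts.most_common() if freq >= min_freq]


if __name__ == "__main__":
    sample = [
        "Korpusno jezikoslovje je veda.",
        "Korpusno jezikoslovje in jezikovne tehnologije.",
        "Jezikovne tehnologije za slovenščino.",
    ]
    for gram, freq in term_candidates(sample):
        print(freq, gram)
```

Because the sketch works on surface word forms, inflected variants of the same Slovene term are counted separately, which is one reason linguistic annotation of the corpus matters for this task.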

The project will make its results as widely available as possible: the produced language resources and tools will be made freely and openly available to the wider research community, which will also advance the state of the art in corpus linguistics, digital humanities, and language technologies for Slovene. The resources will be archived in the repository of the research infrastructure CLARIN.SI, which will undertake the maintenance of the corpus after the close of the project. Furthermore, the project will engage with the Slovene scientific community through two workshops.

The project will be conducted by ten researchers from four academic institutions with distinct but complementary expertise to attain its goals: to strengthen Slovene academic language; to make Slovene better equipped for functioning in the information society; and to promote open dissemination of scientific results.