Research project J6-7094: Slovene scientific texts: resources and description

The development and use of Slovene academic language at universities and in research is one of the central questions of Slovene language policy. The problem is highlighted in the National Programme for Language Policy of the Republic of Slovenia 2014–2018, and a number of European studies also draw attention to the impact that the knowledge and development of academic discourse have on language vitality. It is therefore of fundamental importance to develop contemporary reference language resources that will help empower Slovene academic language and to undertake comprehensive research based on a representative sample of such language.

Slovene universities have established institutional repositories of scientific publications, containing various types of texts from PhD theses to scientific and professional papers. An important milestone is the establishment of the National Portal for Open Science, launched in 2013, which aggregates access to the digital libraries of individual universities and other institutions. The portal offers access to over 123,000 Slovene-language publications from a wide range of disciplines. These publications are a highly valuable but so far completely unused source of data on Slovene academic writing, including terminological data.

The goal of the project was to overcome these limitations in several ways. First, a corpus of Slovene academic writing was compiled, containing PhD, MSc/MA and BSc/BA theses harvested from the Open Science portal. The texts were extracted from their source PDF format, which involved developing methods for text clean-up and structure extraction, and up-conversion to a uniform and standardised TEI representation. The corpus was linguistically annotated, with new tools and resources developed to improve the quality of the annotations.
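
The clean-up and up-conversion steps can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the function names `clean_paragraphs` and `to_tei` are hypothetical, the dehyphenation rule is deliberately naive, and the output is only a TEI-like skeleton that has not been validated against the TEI schema.

```python
import re
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def clean_paragraphs(raw_text):
    """Split raw PDF-extracted text into cleaned paragraphs (hypothetical helper)."""
    paragraphs = []
    # Blank lines are assumed to separate paragraphs.
    for block in re.split(r"\n\s*\n", raw_text):
        # Rejoin words hyphenated across a line break, e.g. "kor-\npus" -> "korpus".
        # Naive: this would also merge genuine hyphenated compounds split at a line end.
        block = re.sub(r"(\w)-\n(\w)", r"\1\2", block)
        # Collapse remaining line breaks and runs of whitespace into single spaces.
        block = re.sub(r"\s+", " ", block).strip()
        if block:
            paragraphs.append(block)
    return paragraphs


def to_tei(title, paragraphs):
    """Wrap a title and paragraphs in a minimal TEI-like skeleton (hypothetical helper)."""
    ET.register_namespace("", TEI_NS)

    def tag(name):
        return f"{{{TEI_NS}}}{name}"

    tei = ET.Element(tag("TEI"))
    header = ET.SubElement(tei, tag("teiHeader"))
    file_desc = ET.SubElement(header, tag("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tag("titleStmt"))
    ET.SubElement(title_stmt, tag("title")).text = title
    body = ET.SubElement(ET.SubElement(tei, tag("text")), tag("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, tag("p")).text = paragraph
    return ET.tostring(tei, encoding="unicode")


if __name__ == "__main__":
    raw = "Kor-\npus je  velik.\n\nDrugi\nodstavek."
    print(to_tei("Primer", clean_paragraphs(raw)))
```

A real conversion would also need to recover headings, footnotes, tables and front matter, and to supply the remaining header elements that the TEI schema requires.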

The corpus served as the basis for studies in terminology extraction. The extracted term candidates will be exported to a public online dictionary viewer and editor, so that Slovene scientific communities from a range of subject fields will be able to engage in the management of their terminologies. An important aspect of the work undertaken in the project was the first empirically based study of Slovene academic discourse, founded on the compiled corpus. Data usability studies and in-depth interviews were also conducted in an attempt to determine the process of, and obstacles to, academic writing in Slovene.

The project made its results as widely available as possible: the produced language resources and tools are made freely and openly available to the wider research community, which also improves the state of the art of corpus linguistics, digital humanities, and language technologies for Slovene. The complete corpus, as well as its three subcorpora, is available for analysis via the CLARIN.SI concordancers and for download from the CLARIN.SI repository:

Corpus KAS (complete corpus): http://hdl.handle.net/11356/1244
Corpus KAS-dr (PhD theses): http://hdl.handle.net/11356/1265
Corpus KAS-mag (MSc/MA theses): http://hdl.handle.net/11356/1266
Corpus KAS-dipl (BSc/BA theses): http://hdl.handle.net/11356/1267

The project was conducted by ten researchers from four academic institutions with distinct but complementary expertise to attain its goals: to strengthen Slovene academic language; to make Slovene better equipped for functioning in the information society; and to promote open dissemination of scientific results.

The project is presented in the publication:

ERJAVEC, Tomaž, FIŠER, Darja, LJUBEŠIĆ, Nikola. The KAS corpus of Slovenian academic writing. Language Resources & Evaluation, 2020. https://doi.org/10.1007/s10579-020-09506-4.

The paper is freely available for reading at https://rdcu.be/b7GrB.