Resources, methods, and tools for the understanding, identification, and classification of various forms of socially unacceptable discourse in the information society

Description of the research project

Socially unacceptable discourse (SUD), such as hate, discriminatory, offensive or threatening speech, is by no means a new phenomenon. It has, however, recently gained significant momentum due to a number of substantial societal, cultural and economic changes. SUD is a global and multifaceted societal problem, a fact reflected in warnings issued by all key international organisations, such as the Council of Europe, the European Union, the Organisation for Security and Cooperation in Europe, and the United Nations. As shown by the few linguistic analyses of hate speech carried out in Slovenia (Červ and Kalin Golob 2012, Campos Fereira et al. 2012), the scope of unacceptable discourse practices far exceeds the framework of linguistics and can only be successfully tackled in cooperation with social scientists and legal experts. What is more, the boom in information and communication technology and computer-mediated communication (CMC), together with the speed at which information spreads on the Internet, has allowed such discourse practices to gain an unprecedented reach and impact that can only be efficiently mitigated with automatic approaches.

The proposed project combines state-of-the-art quantitative and qualitative multidisciplinary approaches, such as the methods of corpus linguistics, critical discourse analysis, legal analysis and sociological survey methods, to research the use and perception of socially unacceptable communication in its sociocultural context. In addition, through the use of exploratory and inferential statistics as well as statistical modelling, the proposed novel data-driven approaches to unstructured and semi-structured datasets will push the frontiers of the traditional humanities and social sciences. The project has a strong empirical basis, as most research is performed on large corpora of Slovene SUD and CMC, which will be automatically annotated for the phenomena of interest. To develop the required automatic annotation methods, we will produce manually annotated datasets and use machine learning (ML) methods to train predictive models of the researched phenomena.

Project goals

First comprehensive interdisciplinary treatment of linguistic, sociological, legal and technological dimensions of different forms of SUD
Improved understanding of the characteristics of SUD as a linguistic phenomenon and the social contexts in which explicit and implicit forms of discriminatory language occur
Improved understanding of the differences between legal and illegal forms of on-line communication