Towards countering hate speech against journalists on social media

A PREPRINT

Polychronis Charitidis, DataScouting, 30 Vakchou Street, 54629 Thessaloniki, Greece, pcharitidis@datascouting.com
Stavros Doropoulos, DataScouting, 30 Vakchou Street, 54629 Thessaloniki, Greece, doro@datascouting.com
Stavros Vologiannidis, International Hellenic University, Terma Magnisias, 62124 Serres, Greece, svol@teicm.gr
Ioannis Papastergiou, DataScouting, 30 Vakchou Street, 54629 Thessaloniki, Greece, ipapaste@datascouting.com
Sophia Karakeva, DataScouting, 30 Vakchou Street, 54629 Thessaloniki, Greece, soka@datascouting.com

May 4, 2020

ABSTRACT

The damaging effects of hate speech on social media have become evident in recent years, and several organizations, researchers and social media platforms have tried to curb them in various ways. Despite these efforts, social media users are still affected by hate speech. The problem is even more apparent for social groups that promote public discourse, such as journalists. In this work, we focus on countering hate speech that is targeted at journalistic social media accounts. To accomplish this, a group of journalists assembled a definition of hate speech, taking into account the journalistic point of view and the types of hate speech that are usually targeted against journalists. We then compile a large pool of tweets referring to journalism-related accounts in multiple languages. In order to annotate the pool of unlabeled tweets according to the definition, we follow a concise annotation strategy that involves active learning annotation stages. The outcome of this paper is a novel, publicly available collection of Twitter datasets in five different languages. Additionally, we experiment with state-of-the-art deep learning architectures for hate speech detection and use our annotated datasets to train and evaluate them.
Finally, we propose an ensemble detection model that outperforms all individual models.

arXiv:1912.04106v2 [cs.IR] 30 Apr 2020

1 Introduction

Hate speech is not a new phenomenon. However, before the advent of email, online comments and networking platforms, the threshold to utter it to any effect at all was much higher. People had to draft a letter, buy postage, and send their missive through the mail. Since then, the formulation and dissemination of hate speech have become easy, instant, potentially ubiquitous, public, and therefore much more damaging. In fact, it not only poisons and thus effectively undermines free and open discourse on the Internet, which is bad enough in itself, but also constitutes a threat to the individuals and organizations it is directed at.

The increasing propagation of hate speech through social media has drawn the attention of governments and organizations. In May 2016, the European Commission agreed with Facebook, Microsoft, Twitter, and YouTube on a code of conduct [1] to prevent and counter the spread of illegal online hate speech. Several other large companies joined the code of conduct later. Although these initiatives mitigate hate speech incidents, the elimination of hate speech needs further work.

Following this paradigm, the research community has made efforts towards countering hate speech. The latest literature contains an increasing number of works that deal with the problem of automatic hate speech detection. Hate speech detection methodologies aim to classify social media posts into those that contain hate speech and those that do not, while some works even try to identify the type of hate speech.

In this paper, we present some of the outcomes of the DACHS (“A Data-driven Approach to Countering Hate Speech”) project. While all victims of hate speech are equally in need of protection and defense, for the purpose of DACHS we focus on journalism as a test case.
As professional arbiters of the public sphere, journalists run afoul of hate speech originators practically by default. Journalists are multipliers of societal discourse, and their relative prominence and high audience reach make them vulnerable to hate speech. The report in [2] highlights the rapid spread of hate speech against journalists, which infringes on their freedom of expression. To assist their work and further promote free speech, the DACHS project aims to counter hate speech directed at journalists. One of the main goals of DACHS is to build a Twitter alert monitoring mechanism that notifies journalists about cases where hateful tweets are posted in their Twitter feed. Journalists can also receive email reports with statistics about hate speech in their timeline at specified time intervals. Optionally, journalists can suggest or flag tweets that they consider to be hate speech. During DACHS, hate speech against journalists on Twitter is studied in 5 languages: English, French, German, Spanish, and Greek.

This work makes the following contributions. First, it defines hate speech from a journalistic point of view, taking into account examples of hate speech directed at journalists. The definition was formed after extensive discussions with journalists from the European Journalism Centre, and its main attributes are that it is simple, concise and suitable for large-scale annotation. Second, it presents a concise two-stage annotation strategy. Both stages sample tweets from the collected unlabeled data and generate batches of tweets to be submitted for human annotation. The first stage generates the initial batch, using keywords and existing hate speech datasets to filter the tweets that are then annotated. The second stage generates all subsequent batches by means of active learning. This strategy is used to annotate a large pool of journalist-related tweets, generating large-scale hate speech datasets in multiple languages.
The datasets were made publicly available to assist further research in the field. The third contribution of this work is that it uses these datasets to train various state-of-the-art deep learning models, including an ensemble model that outperforms all individual models. To the best of our knowledge, this is the first work that studies hate speech in multiple languages.

The rest of the paper is structured as follows. Section 2 includes a brief overview of the current state-of-the-art that addresses hate speech. This includes an overview of different hate speech definitions, existing hate speech datasets and automated detection methods. In Section 3 we present the definition of hate speech that is used throughout the paper. Sections 4 and 5 present the data collection and annotation methodologies, respectively. Section 6 describes the experimental study and demonstrates the results of this work. In the last section, we conclude this work and present some future steps related to hate speech detection.

Dataset                 Source     Size    Language  Labels
Wulczyn et al. [3]      Wikipedia  100k    EN        offensive
Founta et al. [4]       Twitter    80k     EN        offensive/abusive/hate speech
Davidson et al. [5]     Twitter    25k     EN        hate speech/offensive
Waseem et al. [6]       Twitter    16k     EN        sexism/racism
Sharma et al. [7]       Twitter    9k      EN        multiple hate speech classes
Gibert et al. [8]       Other      10k     EN        hate speech
ElSherief et al. [9]    Twitter    28k     EN        multiple hate speech classes
Kwok et al. [10]        Twitter    24k     EN        racism
Ross et al. [11]        Twitter    541     DE        racism
Wiegand et al. [12]     Twitter    9.5k    DE        insult/abuse/profanity
Del Vigna et al. [13]   Facebook   17.5k   IT        multiple hate speech classes

Table 1: Related dataset information

2 Related Work

2.1 Definitions

One of the challenges in studying negative online behavior, and hate speech in particular, is the lack of a clear, common definition [14].
Generally speaking, hate speech could be described as the expression of hatred towards an individual or group of individuals because of a characteristic they share, or a group to which they belong. In [3] the term personal attack is used to describe offensive online behavior, while other studies focus on offensive or abusive speech and online harassment [15, 5, 16]. Other works, like [17], address particular types of online harassment or hate, like misogyny. The actual term hate speech is used in many previous works [18, 19, 20, 10, 7, 6, 8]. Even though these definitions share many common characteristics, there are distinct differences even between definitions that use the same term to describe negative online behavior. The authors in [21] argue that even more formal definitions of illegal hate speech, like the EU definition [22] or the United Nations definition [23], contain words that are open to interpretation. Illegal hate speech is further examined in another European project that takes into account the heterogeneity and complexity of different legislations. Furthermore, the authors in [24] thoroughly discuss and define hate crimes in cyberspace.

In our case, our intention is to find and work with a notion of hate speech that takes into consideration the journalistic point of view and at the same time is easy to understand. To this end, we examined the related literature and created a new definition that is in line with the project’s requirements.

Footnotes: https://hatedetection.com/ (DACHS project), https://ejc.net/ (European Journalism Centre)

2.2 Datasets

With the advent of social media, research on hate speech has intensified over the last few years. A critical step to achieve further progress in the detection of online hate speech is the availability of large-scale datasets. There have been relatively few efforts focusing on the creation of hate speech datasets from social media. Davidson et al.
[5] collected Twitter data using a hate speech lexicon compiled with the help of Hatebase.org in English. They employed crowd-sourcing to label tweets into three categories: hate speech, offensive language, and those with neither. Waseem et al. [25, 6] provide a hate speech dataset, which contains 16k tweets, and describe the respective annotation procedure, in which an initial manual search was conducted on Twitter to collect common slurs and terms about religion, sexual orientation, gender, and ethnic minorities. The dataset was then manually annotated regarding the existence of sexism or racism. Sharma et al. [7] collected a set of 9k tweets containing harmful speech and they manually annotated them in three classes based on their degree of hateful intent. The authors of [8] crawled data from a white supremacy forum to extract and to manually annotate over 10k sentences as hate speech or not. Authors in [9] describe a multi-step classification process and they provide a comprehensive hate speech dataset containing more than 28k tweets with various types of hate related to sexual orientation, gender, ethnicity, etc. Finally, in older works, many researchers have relied on creating their own hand-coded hate speech datasets as in [10, 18]. The majority of hate speech related studies focus on the English language. However, in [11, 12], hate speech against refugees is studied in the German language. The authors in [13] crawled Facebook comments from public Italian pages and annotated them with a variety of hate categories to distinguish different notions of hate speech. In addition, there are a lot of datasets addressing offensive and toxic online behavior. Kaggle’s Toxic Comment Classification Challenge dataset [26] consists of 150k Wikipedia comments annotated for toxic behavior. Kaggle hosts additional large scale toxic speech datasets like [27, 28]. 
Studies in [3, 4] use crowdsourcing to provide abuse-related annotation on 100k English Wikipedia comments and 80k tweets, respectively. Finally, smaller datasets, as in [16, 19], focus on the annotation of toxic versus non-toxic online comments. Table 1 presents a summary of the datasets available in the related literature.

2.3 Detection methods

Existing hate speech detection methods address the problem as a supervised classification task [29]. Traditional methods rely on manually designing and encoding features of textual data into feature vectors that are used as inputs to algorithms such as Naive Bayes, Logistic Regression, SVM and Random Forest. These methods are adopted by numerous hate and offensive speech detection studies, such as [19, 5, 20, 10, 30, 18, 25, 6, 31, 32]. These studies experiment with various features, including bag-of-words representation, character-level and word-level n-gram features, syntactic features, linguistic features, and comment embedding features.

Following the more recent deep-learning paradigm, several studies use neural networks to detect hateful and toxic content. Neural networks learn abstract feature representations from input data through multiple stacked layers. The key difference from traditional models is that deep learning models automate the feature extraction process and the multi-layer structure provides more efficient feature representations. Many studies have shown that deep learning and neural network methods outperform traditional methods on hate speech detection tasks [33, 15]. The most popular network architectures are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). In the context of hate speech classification, a CNN extracts meaningful features from word or character combinations [34, 33, 15, 35], while an RNN learns word or character dependencies in sequences of words [32, 34, 13, 36]. Combinations of CNN and RNN models are also applied in [37].
Footnote to Section 2.1 (European project on illegal hate speech): http://mandola-project.eu/

Type  Tweet
1     @USER They shud DEPORT that mot******ker Back to Iran
2     @USER Please somebody, kill him but before you do torture him to death!
1     @USER Whats the matter you cowards have someone disagree with you and your coward journalists and boom they are taken off platforms for daring to have an opinion against you.... I hope you all rot in hell Wall Street Journal for the cowards you are!
2     @USER Kill the NBC journalists !!!
3     @USER Let me fix this for you. Chicken good...Queers Bad. Solved it for you.
4     @USER @USER ACCORDING TO WHITE PEOPLE... When whites kill = Lone Wolf Mental Illness (Even though they’ve killed all over the globe for Centuries). When black people kill = Entire black race is violent .
4     @USER @USER Just cut the diplomatic ties to UK. We Germans call them island monkeys. Or island apes.

Table 2: Hate speech tweets and the type of hateful attack that corresponds to the second bullet of the definition in Section 3

3 Definition

To consistently annotate a large Twitter corpus, there is a need for a clear and simple definition. We define hate speech in a way that makes it easy for annotators to label tweets, but also for other non-expert groups, like journalists, to further enhance the dataset or provide feedback. The proposed definition was formed after extensive discussions with journalists from the European Journalism Centre through a small focus group, with continuous evaluation and feedback from journalists. After several meetings and a review of hundreds of hateful tweets, it was decided that the presence of hate speech should be determined by answering two simple key questions. These questions refer to the tweet content and are presented below:

- Does it target a person or group?
- Does it contain a hateful attack?
  1. Violent speech
  2. Support for death/disease/harm
  3. Statement of inferiority relating to a group they identify with (like LGBTQI)
  4. Call for segregation

A positive answer to both questions should make the annotator flag the post as hate speech. The second question can refer to any of the four subcategories listed above. This definition was evaluated with a larger group of journalists and proved to be concise and easy to understand.

To give a better intuition about the hate speech definition, we provide some examples of annotated tweets in the English language. The examples are listed in Table 2. Note that all of these tweets comply with the first requirement of the above definition, meaning that they target a person or a group. Table 2 also identifies the type of hateful attack in each tweet, using the four classes described in the second bullet of the definition.

4 Data collection

One of the goals of this work is to create multilingual hate speech datasets consisting of tweets that originate from a journalistic context, along with binary annotations about hate speech. To generate these datasets, a large list of unlabeled tweets is collected. As a side note, during data collection and management all requirements of the EU General Data Protection Regulation were followed.

The data collection process started with the creation of a list of journalism-related Twitter accounts. In a subsequent step, we retrieve the tweets related to these accounts to assemble the pool of unlabeled data. Apart from English, which is the most common language in the related literature, this process is applied to French, German, Greek and Spanish.

A straightforward way to compose such a list is to manually identify well-known accounts of journalists and news outlets and focus the data collection on those accounts. However, preliminary experiments showed that the volume of data that could be collected following this approach is limited, at least when using the standard version of the Search API, which does not provide historical data.
Footnote (Twitter Search API): https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html

Language  Accounts  Tweets
EN        12,306    92,324,248
DE        3,436     12,436,132
ES        1,765     49,453,601
FR        2,794     34,118,951
GR        3,577     2,147,668

Table 3: Total number of journalism-related Twitter accounts and total number of tweets retrieved per language

To overcome this issue we identify a larger number of journalist-related accounts by using Twitter lists. Twitter lists are curated groups of Twitter accounts and are usually centered around specific themes. In order to find Twitter lists related to journalism, we collected the Twitter accounts of well-known news outlets and journalism-related organizations and automated the process of fetching all list members.

Despite the fact that the majority of the collected accounts are related to either journalists or news outlets, we also noticed the presence of non-journalistic accounts, which could potentially dilute the journalistic focus of our data collection. To tackle this issue, we manually filtered out the irrelevant accounts. Since annotating the full list of accounts for relevance to journalism would involve considerable manual effort, we prioritized the annotation of the most popular accounts, since those accounts usually attract a larger number of tweets. In the first iteration of account collection, we annotated the accounts using the following classes: a) journalist, b) news outlet, c) irrelevant. The initial seed consists of 200 manually collected journalism-related accounts for each language. Table 3 lists the number of journalism-related accounts per language.

Having a validated list of journalism-related Twitter accounts for each language, we set up, as the next step, a mechanism to collect tweets from the feeds of these accounts. To this end, the Twitter Search API is used, which returns a sample of tweets posted in the past 7 days.
The API is rate-limited at 180 requests per 15-min window when user authentication is used and at 450 requests per 15-min window when application authentication is used. We used application authentication, keeping in mind that each call can return a maximum of 100 tweets. Despite these limitations, by effectively utilizing the API within the imposed call rate limits, we managed to collect a number of tweets that is sufficient for creating a sizeable, journalism-oriented hate speech database.

Typically, a Twitter Search API query consists of a series of keywords that should be contained in the set of returned tweets, along with account-based search operators. We opted for search queries that do not restrict the tweet contents and only used account-based search operators to limit results. More specifically, we used the to: (e.g., to:BBCNews) and the “@” (e.g., @BBCNews) operators in order to collect tweets authored in reply to, and tweets mentioning, those specific accounts.

Specifically, every 15 minutes, the module sequentially performed $N = 450 - N_{safe}$ API requests, evenly distributed across the 15-minute window. $N_{safe} \in [0, 450)$ is a safety parameter that is used to avoid pushing the API to its limits. We use $N_{safe} = 200$ and thus a maximum of $N = 250$ calls is performed per 15-min window. Since each call is associated with one account, and the number of calls that can be performed within each 15-min window is much smaller than the pool of target accounts, an account prioritization/selection mechanism is implemented. This prioritization approach fetched data from all accounts and measured an estimate of the rate of incoming tweets. Using this estimate, the available API calls are distributed among accounts that are expected to have received a sufficient number of new tweets since the last time they were queried.
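The request budget and the rate-based account selection described in this section can be sketched as follows. This is a minimal illustration rather than the authors' collector: the helper names, the starting and fallback rates, the handling of a response that contains a single tweet, and the choice to join the two operators with `OR` in one query are all assumptions made for the example.

```python
import random

API_LIMIT = 450            # requests per 15-minute window (application authentication)
N_SAFE = 200               # safety margin quoted in the text
WINDOW_SECONDS = 15 * 60   # length of one rate-limit window
INITIAL_RATE = 1.0         # assumed "high" starting rate, in tweets per second
LOW_RATE = 1e-6            # assumed "very low" rate after an empty response


def calls_per_window(api_limit=API_LIMIT, n_safe=N_SAFE):
    """Return N = api_limit - n_safe, the number of requests issued per window."""
    if not 0 <= n_safe < api_limit:
        raise ValueError("n_safe must lie in [0, api_limit)")
    return api_limit - n_safe


def seconds_between_calls(n_calls, window=WINDOW_SECONDS):
    """Spread n_calls evenly across one window."""
    return window / n_calls


def build_query(screen_name):
    """Replies to or mentions of one account (the OR-combination is an assumption)."""
    return f"to:{screen_name} OR @{screen_name}"


class AccountState:
    """Per-account bookkeeping: estimated incoming rate and last fetch time."""

    def __init__(self, screen_name, rate=INITIAL_RATE, last_fetched=0.0):
        self.screen_name = screen_name
        self.rate = rate
        self.last_fetched = last_fetched


def estimated_new_tweets(state, now):
    """Estimated tweets received since the account was last fetched."""
    return state.rate * (now - state.last_fetched)


def pick_account(states, now, threshold=100, rng=random):
    """Pick one account at random among those expected to exceed the threshold."""
    eligible = [s for s in states if estimated_new_tweets(s, now) > threshold]
    return rng.choice(eligible) if eligible else None


def update_rate(state, tweet_times, now):
    """Re-estimate the rate from the timestamps (in seconds) of the returned tweets."""
    if not tweet_times:
        state.rate = LOW_RATE
    else:
        span = max(tweet_times) - min(tweet_times)
        if span > 0:
            state.rate = len(tweet_times) / span
        # span == 0 (e.g. a single tweet): keep the previous rate (assumption)
    state.last_fetched = now
```

With the values from the text, `calls_per_window()` gives 250 requests, which corresponds to one request roughly every 3.6 seconds.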
Specifically, whenever a new API call is performed, an estimated number of available new tweets is calculated for each account based on the account’s estimated incoming tweet rate and the time that has passed since the last time it was fetched. Then, one account is randomly selected among those whose estimated number of available new tweets is larger than a user-specified threshold (for which reasonable values should lie close to 100, the maximum number of results per request). After a call is performed, the incoming tweet rate for each account is updated by dividing the number of returned tweets by the time difference between the newest and the oldest tweet (a very low rate is assigned in case a call returns zero results). Moreover, the last fetched timestamp is recorded to facilitate the calculation of the estimated number of incoming tweets. Initially, all accounts are assigned a fixed, high incoming tweet rate to ensure that all accounts will be fetched at least once. This approach ensured that all accounts were queried proportionally to their actual, dynamic incoming tweet rate, thus maximizing the number of tweets that can be collected.

We apply this approach for every language, mining tweets from the corresponding pool of journalistic Twitter accounts, and store the retrieved tweets in a MongoDB database. To ensure that the tweet language matches the language of the query accounts, we inspect the tweet metadata. The total number of collected unlabeled tweets per language is listed in Table 3. We denote the unlabeled pool of tweets by $U$. The unlabeled pool of tweets for each language is denoted by $U_i$, where $i \in \{EN, DE, ES, FR, GR\}$.

Footnote (Twitter lists): https://help.twitter.com/en/using-twitter/twitter-lists

Notice that there are significant differences in the number of tweets retrieved per language.
This is expected due to the different number of query accounts used to retrieve tweets, the language popularity and the Twitter usage per country. The total collection period was approximately 6 months (1/10/2018 - 8/5/2019).

5 Annotation process

In this Section, we describe the annotation process for labelling the pool of tweets $U$. The purpose of this process is to create a labeled dataset denoted by $L$. Note that $L$ consists of 5 different language datasets denoted by $L_i^k$, where $i \in \{EN, DE, ES, FR, GR\}$ and $k$ denotes the current number of annotated batches included in the dataset. The annotation process consists of two sampling stages, the initial sampling stage and the active learning sampling stage.

In the initial sampling stage, we use two approaches described in the following subsection, and select a defined number of tweets from the $U_i$ pool to generate the initial annotation batch for each language $i$. We denote annotation batches for each language by $B_i^k$, where $k$ is the number of the generated batch. For the initial sampling stage $k = 1$. Each $B_i^1$ in this stage is submitted for human annotation. After the annotation of the initial $B_i^1$, the first labeled datasets $L_i^1$ are generated.

The active learning sampling stage generates all the subsequent $B_i^k$ with $k > 1$, for further annotation and dataset expansion, through an iterative process. In the first iteration, we use $L_i^1$ generated from the initial sampling stage, and we employ an active learning sampling approach to generate $B_i^2$ from $U_i$. We then submit $B_i^2$ for annotation and we expand $L_i^1$ to $L_i^2$ by appending the new annotated batch. We repeat this process until there are no more tweets available in $U_i$ or we exhaust the annotation budget.

5.1 Initial sampling stage

By observing Table 3, we notice that there is a large number of unlabeled tweets in each language.
Submitting such a sizable amount of data for human annotation is not only overwhelming for the annotators, but also makes the whole process very expensive and hard to supervise. Additionally, we expect that the number of positive annotations will be very small compared to the negative annotations. To validate this, we conduct a brief preliminary annotation round, annotating 2000 random tweets per language and inspecting the number of positive annotations. As expected, we observe an apparent scarcity of positive annotations. Table 4 shows that for the preliminary batch in the English language, among 2000 annotations, only 0.4% are positive. Based on these observations, it is evident that we need to develop a process that generates annotation batches of manageable size with a higher positive annotation ratio.

To mitigate the lack of positive annotations, we consider two different approaches for generating the annotation batches of tweets ($B_i^1$). The first, and the most popular approach in the literature, is keyword-based sampling, which creates the annotation batch from tweets containing specific keywords. In the hate speech context, these keywords are usually offensive words or words that can be used to express hatred. The second approach, which is novel in the literature, is to use existing hate speech datasets in order to train hate speech detection models and, in a subsequent step, apply these models to the pool of unlabeled tweets and sample the tweets with higher hate speech probability. For this task we train a CNN model, which is described later in Section 6.2. In a sense, this approach shares similarities with transfer learning [38], where we essentially transfer the established hate speech definition from another work to sample our data. We refer to this approach as dataset-based sampling. One evident shortcoming of this approach is that it requires the availability of such datasets, something that is possible only for a few languages.
Both of these approaches introduce bias in the sampled tweets [39] and consequently in the generated datasets. It is obvious that dataset generation with such extreme imbalance between classes, as the one presented in this work, requires a sampling approach that is inclined towards the minority class. Although bias is inevitable, there are ways in which it can be moderated. To moderate bias for the keyword-based approach, we compile a large list of keywords. These lists included offensive slang, phrases and words that could potentially express hate when used in the appropriate context, and keywords related to several forms of possible discrimination (religion, gender, refugees etc). For the dataset-based approach, we adopt three different approaches to tackle bias. First, we use datasets from works that have a less strict definition of hate speech compared to ours. Second, in cases where the dataset-based approach is not applicable due to lack of existing datasets, we combine multiple different datasets with diverse definitions and scopes. Third, we use a loose classification threshold in the resulting CNN model, sampling tweets from a wide range of hate speech probability. Finally, to further reduce bias in the generated annotation batch, we add randomly sampled tweets to the batches produced by both the keyword-based and the dataset-based approaches.

We apply the dataset-based sampling approach to generate $B_{EN}^1$ and $B_{DE}^1$ by using some of the related datasets in Table 1. We apply the keyword-based sampling for Spanish, French and Greek due to lack of related datasets.

Specifically, for the English language, we train a CNN model using the dataset presented in [5]. Although their definition of hate speech differs from the one described in this work, the dataset is suitable for training, since [5] establishes a loose definition of hate speech compared to ours (i.e., a lot of offensive and insulting tweets are considered as hate speech).
For the case of the German language, two different datasets are combined. The first dataset [11] includes tweets referring to refugees, with binary hate speech annotations. Additionally, we use the dataset of GermEval 2018 [12], one of a series of shared task evaluation campaigns that focus on natural language processing for the German language. Just like the English case, a CNN model is trained using the combined data.

To generate the first annotation batch for each one of these languages, we applied the corresponding CNN model to the pool of unlabeled tweets, calculating the hate speech probability for each unlabeled tweet. Then we randomly sample 8000 tweets with hate speech probability that falls in the range [0.2, 1.0] for each language. This wide range is chosen to reduce the bias of the generated batch, as discussed previously.

For the rest of the languages (Spanish, French, Greek), we employ the keyword-based approach. For each language we assemble a large list of keywords, including 500 to 1000 keywords per language. These lists included offensive slang, phrases and words that could potentially express hate when used in the appropriate context, and keywords related to several kinds of possible discrimination (religion, gender, refugees etc). We use these keywords to sample from the pool of unlabeled tweets and fetch up to 8000 tweets per language.

For each language, 2000 additional randomly selected tweets are concatenated to the corresponding batch, in order to further mitigate the bias imposed by the keyword-based and dataset-based approaches. This leads to the generation of the initial $B_i^1$, which contain 10000 tweets per language. Note that tweets in $B_i^1$ are removed from $U_i$, so that they are not retrieved again during the creation of the next batches. After the generation of the initial batches, they are submitted to the corresponding annotators to produce $L_i^1$. We describe the manual annotation process in the next subsection.
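The assembly of an initial batch under dataset-based sampling can be sketched as below. This is an illustrative sketch under stated assumptions: the function name, the dictionary-based representation of the pool, and the fixed random seed are hypothetical, and the hate speech probabilities are assumed to have been computed beforehand by whatever model is used (a CNN in the paper).

```python
import random


def build_initial_batch(pool, scores, n_model=8000, n_random=2000,
                        low=0.2, high=1.0, seed=0):
    """Build one initial annotation batch and remove its tweets from the pool.

    pool   : dict mapping tweet id -> tweet text (the unlabeled pool U_i)
    scores : dict mapping tweet id -> hate speech probability from a classifier
    """
    rng = random.Random(seed)

    # Tweets whose predicted probability falls inside the wide range [low, high].
    candidates = [tid for tid in pool if low <= scores[tid] <= high]
    model_part = rng.sample(candidates, min(n_model, len(candidates)))

    # Extra random tweets, drawn from the rest of the pool, to moderate bias.
    chosen = set(model_part)
    remaining = [tid for tid in pool if tid not in chosen]
    random_part = rng.sample(remaining, min(n_random, len(remaining)))

    batch = model_part + random_part

    # Remove batch members from the pool so later batches cannot pick them again.
    for tid in batch:
        del pool[tid]
    return batch
```

For keyword-based sampling, the same structure applies with the probability filter replaced by a keyword match; with the defaults above the batch has 10000 tweets whenever the pool is large enough.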
Table 4 shows that the annotated $B_i^1$ exhibits a significant increase in the ratio of positive annotations compared to the preliminary annotation batch. This is expected because keyword-based and dataset-based sampling favour the hate speech class.

            Batch type
Language i  Preliminary  $B_i^1$  $B_i^2$
EN          0.4%         1.9%     6.33%
DE          0.2%         1.13%    3.47%
ES          0.12%        0.89%    2.51%
FR          0.3%         2.36%    7.37%
GR          0.13%        0.70%    1.32%

Table 4: Ratio of positive annotations in different batches and languages. The preliminary batch includes 2k randomly sampled tweets for each language. $B_i^1$ include 10k tweets selected with the keyword-based or dataset-based approach in the initial sampling stage. $B_i^2$ include 10k tweets from the first batch of the active learning sampling stage

5.2 Manual annotation

After generating each batch of tweets for each language, we proceed to the manual annotation task. This process is described in this subsection.

The related literature offers many annotation methodologies that can be used on a large corpus of data. For this work, we follow the findings reported in [40], which claims that it is better to allocate the annotation budget to label as many examples as possible when the annotation quality is above a specified threshold. Based on this, we perform the annotation task with one annotation per tweet, thus utilizing the annotation budget to annotate as many tweets as possible.

For each language we used experienced annotators, proficient in the corresponding language, familiar with social media and acquainted with the colloquial nature of online conversations. Additionally, we follow a strict quality assurance methodology, assigning a supervisor role in the process. The supervisor’s job is to validate the annotations following a quality control methodology. In total, five annotators (one for each language) and one supervisor are employed for the annotation task.
The supervisor and the annotator of each language perform preliminary annotations to make sure that the definition of hate speech is correctly understood. This process also includes discussions about the particularities of each language. For each language, the corresponding annotator is asked to flag hate speech in tweets with a yes or no answer. When the annotator is unsure about the answer, he or she can flag the tweet, and at a second stage the tweet is discussed with the supervisor.

Figure 1: A schematic representation of the annotation process (dotted arrows used only for the initial sampling stage)

To ensure the quality of the annotation, we follow the quality control methodology described in ISO 2859 and ANSI/ASQ Z1.4-2003. We use level II, i.e., the normal severity level, and thus a lot of 1,000 annotations is considered of acceptable quality if the error rate does not exceed 4%. To determine this, a single sample of 80 annotations out of 1,000 tweets is used. Whenever an annotator completes 1,000 annotations, the supervisor evaluates 80 random samples of them. If more than 7 of the sampled annotations are erroneous, the whole lot is rejected, careful instructions are given to the annotator, and the annotator annotates the tweets anew.

5.3 Active learning sampling stage

To generate the initial batch of tweets for each language, we use the keyword-based and the dataset-based sampling approaches, and human annotators then label each language batch, as described in the previous subsections. At this point, $L_i^1$ contains 10k labeled tweets for each language. We use the current annotated sets to train new detection models in order to generate new batches $B_i^k$. This approach is similar to the dataset-based sampling approach, although there are some key differences. Detection models are created for all examined languages (and not just for English and German) with our own annotated datasets.
At the same time, these detection models detect hate speech that matches the definition presented in Section 3, rather than definitions transferred from other works. Although this is a valid approach that can be efficiently used to sample tweets from the unlabeled pool $U$, it does not guarantee that the sampled tweets add value to the models. For instance, if a hate speech classifier is highly confident about a tweet, then this tweet contributes little to the learning process and might even hurt generalization performance. High confidence also implies that the tweet is very similar to other tweets already present in the dataset that was used to train the classifier. Motivated by this assumption, an active learning mechanism is adopted to generate annotation batches of tweets that contribute to the ability of the models to learn.

Pool-based active learning [41] relies on an initial small set of labeled instances $L$ and a larger set of unlabeled ones $U$. Batches of informative training samples are iteratively selected from $U$ according to some selection mechanism and added to $L$ after an annotator has been queried for their actual labels. This setting is common in modern machine learning applications, where unlabeled data may be abundant but labels are difficult or expensive to obtain.

Initially, uncertainty sampling [41] is considered as the selection mechanism. In this setup, an active learner selects tweets whose posterior probability is near 0.5. This approach proved problematic for our scenario, since hate speech datasets are imbalanced and the probability distributions are therefore skewed towards the dominant class. With this in mind, and given that multiple deep learning models would be trained for comparison and evaluation anyway, the best approach proved to be the query-by-committee [42] algorithm.
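A small sketch may help to see why uncertainty sampling is awkward on skewed outputs. The scores below are invented for illustration, and the helper `uncertainty_sample` is a hypothetical name, not part of the described pipeline.

```python
def uncertainty_sample(probs, k):
    """Return the indices of the k tweets whose probability is closest to 0.5."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]


# Made-up scores from a classifier trained on imbalanced data:
# almost every tweet receives a low hate speech probability.
skewed = [0.01, 0.02, 0.03, 0.05, 0.08, 0.12, 0.9]
picked = uncertainty_sample(skewed, 3)
# picked == [5, 6, 4]: the "most uncertain" tweets score 0.12, 0.9 and 0.08,
# none of which is actually close to 0.5.
```

In this toy case the selection is driven by the few scores that happen to be least extreme rather than by genuine uncertainty, which is one way to read the problem described above.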
The query-by-committee approach involves maintaining a committee $C = \{\theta^{(1)}, \ldots, \theta^{(C)}\}$ of models, which are all trained on the current labeled set $L$ but represent competing hypotheses. This is applicable to our scenario since each deep learning architecture captures different semantic and syntactic components of the tweet, even when all models are trained on the same dataset. Each committee member is then allowed to vote on the labels of query candidates. The most informative tweet to be sent to annotators is considered to be the one that committee members disagree upon the most.

Train dataset (initial)   Train dataset (additional)   Test dataset   macro-F1
8000                      -                            2000           0.43
8000                      2000 (random)                2000           0.43
8000                      2000 (hate probability)      2000           0.45
8000                      2000 (active learning)       2000           0.47

Table 5: Different evaluation experiments in English language. We evaluate the models created with an initial train set and additional sets generated with different techniques.

The average Kullback-Leibler (KL) divergence [43]:

$$x^{*}_{KL} = \operatorname*{argmax}_{x} \frac{1}{C} \sum_{c=1}^{C} D\left(P_{\theta^{(c)}} \,\|\, P_{C}\right),$$

where:

$$D\left(P_{\theta^{(c)}} \,\|\, P_{C}\right) = \sum_{i} P_{\theta^{(c)}}(y_i \mid x) \log \frac{P_{\theta^{(c)}}(y_i \mid x)}{P_{C}(y_i \mid x)}$$

is used for measuring the level of disagreement among the classifiers. Here $\theta^{(c)}$ represents a particular model in the committee and $C$ represents the whole committee, thus

$$P_{C}(y_i \mid x) = \frac{1}{C} \sum_{c=1}^{C} P_{\theta^{(c)}}(y_i \mid x)$$

is the "consensus" probability that $y_i$ is the correct label. KL divergence [44] is an information-theoretic measure of the difference between two probability distributions. This disagreement measure therefore considers the most informative query to be the one with the largest average difference between the label distributions of the committee members and the consensus.

To apply this process to our work, for the $k$-th iteration we train various deep learning models for hate speech detection in each language, using the annotated tweets $L_i^{k-1}$ that are generated from the previous iteration $k-1$. We denote the committee of models for each language $i$ by $C_i$.
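The disagreement measure can be sketched for the binary case as follows. This is only an illustration under the assumption that every committee member outputs a single hate speech probability per tweet; the function names are hypothetical.

```python
import math


def mean_kl_disagreement(member_probs, eps=1e-12):
    """Average KL divergence of each committee member from the consensus.

    member_probs -- one hate speech probability per committee member,
                    all for the same tweet.
    """
    # Binary label distributions (hate, not hate) for each member.
    dists = [(p, 1.0 - p) for p in member_probs]
    n = len(dists)
    # Consensus distribution: the mean of the member distributions.
    consensus = [sum(d[i] for d in dists) / n for i in range(2)]
    total = 0.0
    for d in dists:
        for p, q in zip(d, consensus):
            if p > 0.0:  # the term 0 * log(0) is taken as 0
                total += p * math.log(p / max(q, eps))
    return total / n


def rank_by_disagreement(pool_probs):
    """Order tweet ids from most to least committee disagreement."""
    scores = {t: mean_kl_disagreement(p) for t, p in pool_probs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A committee that agrees (for example, all members predicting 0.3) yields a score of approximately zero, while members predicting 0.05 and 0.95 for the same tweet yield a clearly positive score, so that tweet is ranked first.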
Specifically, the active learning sampling stage is described in the following steps, and the entire annotation process is presented in Figure 1. For each language $i$:

1. We train the models described in Section 6.2 and generate the committee of classifiers $C_i$, using the previously annotated tweets $L_i^{k-1}$. Note that in this step, we do not include the ensemble model in the committee $C_i$.
2. For every tweet in the unlabeled pool $U_i$, we calculate the hate speech probability of each model $\theta^{(c)}$ and compute the average KL divergence.
3. For every tweet in the unlabeled pool $U_i$, we calculate the ensemble model's hate speech probability.
4. We sample from $U_i$ the 8000 tweets that have the highest KL divergence and an ensemble output probability higher than 0.2. We also randomly sample 2000 tweets to generate the batch $B_i^k$, with a total of 10000 tweets. Note that the sampled tweets are removed from $U_i$. Also, in the final iteration the batch might be smaller than 10000 if there are not enough tweets left in $U_i$.
5. We submit $B_i^k$ for annotation and we append the annotated batch to the pool of labeled tweets $L_i^k$.
6. The process is repeated from step 1, using $L_i^k$ to retrain the classifiers and generate the next batch for annotation. We stop this process if there are no available tweets in $U_i$ or the annotation budget is exhausted.

In Table 4, it is evident that the first annotation batch created with the active learning approach ($B_i^2$) has a higher percentage of hate speech than the initial or the preliminary batch. This is expected because in the active learning setup we use models that are trained on datasets annotated using our definition of hate speech.

Table 5 shows the experiments conducted to validate whether tweets sampled with this approach contribute to the model's ability to learn and generalize. In this experimental setting, the initial annotated batch of English tweets is split into 8000 training and 2000 testing tweets.
A simple CNN model is trained and the macro average F1 score is calculated on the test set. This process is repeated 3 times, each time adding a different annotated set to the training data. The first set consists of 2000 randomly sampled tweets, the second set consists of 2000 tweets sampled from the [0.2, 1.0] hate speech probability interval, and the third contains 2000 tweets sampled using the active learning approach. As Table 5 shows, random sampling does not improve the evaluation results at all. On the other hand, both the hate probability and the active learning approaches improve the macro-F1 score, with the latter being the best approach. This experiment suggests that active learning improves the learning process of the classifier by choosing more informative tweets for annotation.

5.4 Datasets

Language   Set     Hate speech positive   Hate speech negative
EN         train   5804                   68051
EN         test    1355                   16812
EN         total   7159                   84863
DE         train   1361                   33626
DE         test    340                    8707
DE         total   1702                   42033
ES         train   795                    29355
ES         test    199                    7339
ES         total   994                    36694
FR         train   2163                   29124
FR         test    541                    7281
FR         total   2704                   26405
GR         train   913                    48271
GR         test    228                    12069
GR         total   1141                   60340

Table 6: Number of positive and negative annotated tweets in different sets and languages available in our publicly available datasets

The final datasets for English (footnote 6), German (footnote 7), Spanish (footnote 8), French (footnote 9) and Greek (footnote 10) are hosted on the Zenodo platform and are available upon request. Each dataset contains tweet ids and their corresponding binary annotations. Table 6 shows the individual dataset statistics for each language. The datasets are provided in train/test sets, preserving the proportion of negative and positive samples.

6 Experimental study

In this section we present the experimental pipeline that is followed in order to train and evaluate the hate speech detection models using the datasets described in the previous section.
6.1 Tweet pre-processing

Due to the nature of Twitter data, there is a lot of noise among words. Posted links or mentions do not provide any useful information and need to be normalized. To achieve this, a state-of-the-art tweet normalization tool [45] is used to tokenize and transform each tweet into a sequence of words. The process involves Twitter handle normalization (e.g. @random_user becomes < user >), emoticon transformation (e.g. :( becomes < sad >), lower casing, as well as URL, email and number removal. Furthermore, only basic punctuation is retained (e.g. .,?;”).

6.2 Models

Following the latest trend in the literature, which shifts towards deep learning based methods, some of the latest state-of-the-art models for text classification are used. Deep learning models perform better than traditional methods in most NLP tasks, including hate speech detection [33, 15]. Furthermore, an ensemble learning architecture is proposed, since it combines the predictive power of the individual classifiers. Below we describe the deep learning architectures that are evaluated in this work.

CNN. A simple Convolutional Neural Network model, described in [35], acts as an n-gram feature extractor. Using window sizes of 2, 3 and 4, this CNN model can extract bi-gram, tri-gram and quad-gram features. The output of each CNN is then further down-sampled by a 1D max pooling layer with a pool size of 4 and a stride of 4 for further feature selection. After the concatenation of the pooling layers, another 1D max pooling layer is added and the output is fed to the final fully connected layer.

Footnotes:
6 https://zenodo.org/record/3520152#.XcL0OnUzY5k
7 https://zenodo.org/record/3520148#.XcL04XUzY5k
8 https://zenodo.org/record/3520150#.XcL1C3UzY5k
9 https://zenodo.org/record/3520156#.XcL1GHUzY5k
10 https://zenodo.org/record/3520157#.XcL1G3UzY5k

Skipped CNN (sCNN).
Extending the base CNN model in order to capture features of words that are not next to each other, Zhang et al. [35] proposed the Skipped CNN layer. Skipped CNN applies a mask to a kernel window, skipping intermediate words and associating words that are not directly adjacent. According to the authors, skipped CNNs can be considered as extractors of 'skip-gram'-like features.

CNN + GRU. The work in [37] added a GRU layer followed by a global max pooling layer on top of the CNN model. The GRU layer captures sequence feature relations and learns to identify dependencies between n-gram features.

LSTM. A bidirectional LSTM model is created. After the embedding layer, spatial dropout is introduced, which randomly masks 20% of the input words. To process the sequence of word embeddings, an LSTM layer with 128 units is used. Next, a global max pooling and a global average pooling layer are concatenated, flattening the output space by taking the highest and the average value in each timestep dimension, respectively. The produced feature vector is fed into the final fully connected layer.

LSTM + Attention (aLSTM). Attention mechanisms have been used with success in many NLP tasks, as in [46]. Intuitively, attention is a mechanism that learns to favor features that are more relevant to the classification task by assigning weights to them. This means that features that are not important to the task are multiplied by smaller weights, while predictive features are multiplied by higher weights. The attention layer is implemented based on [47] and is applied to the LSTM model. Instead of taking the max and average features in each timestep, an attention layer with 100 units is used to extract the important features of the LSTM layer. The output of the attention layer is then fed to the output layer.

Ensemble (E). Aken et al. [48] proposed an ensemble model based on the assumption that classification methods vary in their predictive power and may make specific errors.
The ensemble model in [48] is trained with gradient boosting decision trees. We instead use a simple dense neural network ensemble architecture: the output predictions of the individual models are fed as inputs to a dense layer with 20 neurons, a dropout layer with a ratio of 0.2 is applied, and the resulting features are forwarded to the final output layer. Intuitively, this small neural network learns to apply a weighted average based on the prediction probability of each individual classifier.

6.3 Implementation details

For all methods discussed in this work, we use Keras [49] with the Tensorflow [50] backend and the scikit-learn [51] library. Each model is trained for 10 epochs with a mini-batch of 64 tweets. Keras requires static input sequences, meaning that the maximum number of words in a tweet has to be predefined. The maximum sequence length for a tweet is set to 50 words, since experimentation showed that this does not affect performance. Zero padding is used for tweets with fewer than 50 words.

The first layer of every model is an embedding layer. We initialize the embedding layer using pre-trained word vectors for each language. After some preliminary experiments, the best pre-trained embedding choice for Greek and French is fastText embeddings [52], trained on Common Crawl and Wikipedia. For English, Spanish and German, Glove embeddings [53] achieve better evaluation results. Word2vec [54] pre-trained embeddings are also tested. Note that the evaluation results among different embedding approaches do not exhibit significant differences. Words that do not exist in the pre-trained embeddings are randomly initialized, and the embedding layer is further fine-tuned during the training process. Zero initialization is used to represent the padding token. For every model, the default parameters provided by the corresponding authors are used, unless stated otherwise.
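The padding and embedding initialization described above can be sketched in plain Python. This is only an illustration: right-padding, truncation of longer tweets, the 1-based vocabulary indexing and the random range for unknown words are all assumptions, and the function names are hypothetical.

```python
import random

MAX_LEN = 50  # maximum number of words per tweet, as stated in the text
PAD_ID = 0    # index reserved for the padding token (assumption)


def pad_tweet(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Truncate or right-pad a list of word indices to a fixed length."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Build the initial weights of the embedding layer.

    vocab      -- dict: word -> index (1-based; index 0 is the padding token)
    pretrained -- dict: word -> list of floats of length dim
    """
    rng = random.Random(seed)
    # Row 0 stays all zeros and represents the padding token.
    matrix = [[0.0] * dim for _ in range(len(vocab) + 1)]
    for word, idx in vocab.items():
        if word in pretrained:
            matrix[idx] = list(pretrained[word])
        else:
            # Word missing from the pre-trained vectors: random initialization
            # (the sampling range is an assumption).
            matrix[idx] = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
    return matrix
```

In a Keras model, a matrix of this kind would typically be passed as the initial weights of a trainable embedding layer, so that it can be fine-tuned during training.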
The l2 regularization parameter is set to be 1e for every layer. We treat hate speech detection as a binary classification problem. The final fully connected layer uses a sigmoid activation and outputs the hate speech probability. The binary cross-entropy loss function and the Adam optimizer are used to train the models.

6.4 Evaluation setup

In the related literature, evaluation of hate speech detection typically adopts the classic Precision, Recall and F1 metrics. Precision measures the percentage of true positives among the predicted hate speech tweets. Recall measures the percentage of true positives among the ground truth hate speech tweets, and F1 is the harmonic mean of the two. The three metrics are applied to each dataset class and an aggregated result is computed, using either micro-averaging or macro-averaging. The first approach sums up the individual true positives, false positives and false negatives identified by a model across all classes and calculates overall Precision, Recall and F1 scores from these totals. The second approach takes the average of the Precision, Recall and F1 of the different classes. Existing studies on hate speech detection have primarily reported their results using micro-average Precision, Recall and F1 [34, 33, 15, 25, 37].

As stated in [35] and as is evident from our dataset statistics in Table 6, hate speech datasets are usually highly imbalanced. In imbalanced datasets like the ones discussed in this paper, micro-averaging can hide the real performance on minority classes: a significantly lower or higher F1 score on a minority class is unlikely to cause a significant change in micro-F1 on the entire dataset. In a practical application like hate speech detection, reporting micro-F1 on the entire dataset does not properly reflect a model's performance on hateful content as opposed to non-hate.
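The gap between the two averaging schemes can be made concrete with a toy example. The labels below are invented for illustration and do not come from the datasets of this paper; the helper names are hypothetical.

```python
def f1_per_class(y_true, y_pred, label):
    """Return (F1, TP, FP, FN), treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return f1, tp, fp, fn


def micro_macro_f1(y_true, y_pred, labels=(0, 1)):
    stats = [f1_per_class(y_true, y_pred, lab) for lab in labels]
    # Macro: unweighted mean of the per-class F1 scores.
    macro = sum(s[0] for s in stats) / len(labels)
    # Micro: pool TP, FP and FN over all classes, then compute a single F1.
    tp = sum(s[1] for s in stats)
    fp = sum(s[2] for s in stats)
    fn = sum(s[3] for s in stats)
    micro = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return micro, macro


# Made-up test set: 95 non-hate tweets (0) and 5 hate tweets (1).
# The classifier recovers only one of the five hate tweets.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1] + [0] * 4
micro, macro = micro_macro_f1(y_true, y_pred)
hate_f1 = f1_per_class(y_true, y_pred, 1)[0]
```

Here micro-F1 is 0.96 even though the F1 on the hate class is only about 0.33; macro-F1 (about 0.66) exposes the weak minority-class performance far more clearly.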
Motivated by these observations, we use the standard Precision (P), Recall (R) and F1 measures for evaluation and report their macro averages (m-P, m-R, m-F1). Additionally, we provide the F1 obtained on the hate speech class (h-F1). To train and evaluate the models for hate speech detection, we use the training and test sets reported in Table 6, respectively.

6.5 Results

Table 7 shows the evaluation results for the hate speech class in each language. A first observation that highlights the imbalance between classes is that the F1 score for the hate class is significantly lower than the macro F1 scores. This is expected because the number of negative annotated tweets in the test dataset is significantly larger than the number of positive ones, as displayed in Table 6.

Language   metric   CNN    sCNN   CNN + GRU   LSTM   aLSTM   E
EN         m-P      0.81   0.83   0.80        0.77   0.79    0.80
EN         m-R      0.78   0.78   0.80        0.78   0.79    0.82
EN         m-F1     0.79   0.80   0.80        0.77   0.79    0.81
EN         h-F1     0.61   0.64   0.63        0.58   0.61    0.65
DE         m-P      0.64   0.67   0.68        0.65   0.67    0.67
DE         m-R      0.67   0.71   0.68        0.65   0.66    0.71
DE         m-F1     0.65   0.69   0.68        0.65   0.66    0.69
DE         h-F1     0.34   0.40   0.38        0.32   0.35    0.40
ES         m-P      0.69   0.69   0.70        0.74   0.68    0.70
ES         m-R      0.71   0.75   0.72        0.68   0.68    0.73
ES         m-F1     0.70   0.72   0.71        0.70   0.68    0.72
ES         h-F1     0.42   0.45   0.44        0.42   0.38    0.44
FR         m-P      0.81   0.81   0.83        0.80   0.80    0.84
FR         m-R      0.81   0.82   0.81        0.77   0.82    0.81
FR         m-F1     0.81   0.82   0.82        0.78   0.81    0.83
FR         h-F1     0.65   0.66   0.66        0.64   0.64    0.67
GR         m-P      0.81   0.87   0.87        0.86   0.86    0.87
GR         m-R      0.78   0.77   0.75        0.75   0.75    0.78
GR         m-F1     0.79   0.81   0.80        0.80   0.80    0.82
GR         h-F1     0.59   0.63   0.60        0.60   0.60    0.65

Table 7: The evaluation results for hate speech class

By inspecting each language separately, we notice that there are no significant performance differences between the models in terms of macro F1. However, in terms of individual models, the sCNN model seems to generally exhibit the best performance.
Some exceptions are observed, as in the case of the Spanish language, where the LSTM model performs better in terms of macro Precision, and in the Greek language, where the CNN model has better macro Recall. sCNN seems to be the most compelling feature extractor for hate speech, as it achieves the best F1 score for the hate class among individual models. This also corresponds to an overall better macro F1 for sCNN compared to the other methods. For the French language, the CNN+GRU model performs on par with the sCNN model.

Additionally, the combination of all individual models in the ensemble model (E) yields even better results in terms of macro F1. The ensemble model has the best macro F1 score, as it manages to perform well both in terms of macro Precision and macro Recall. The ensemble model also exhibits the best performance in terms of the hate speech F1 score. The only exception is observed in the Spanish language, where the sCNN model achieves a higher F1 score for the hate speech class.

Another observation is that the evaluation for English, French and Greek, specifically in terms of the F1 score on the positive class, is significantly better than for Spanish and German. This is potentially due to the fact that there are fewer positive samples in the related datasets. Our goal is to continue expanding the datasets and specifically address this issue for these two languages.

7 Conclusion

In this work, we try to tackle hate speech directed at journalists on social media. To accomplish this, we define hate speech in a way that takes into consideration the journalistic point of view and is simple enough to be used by non-specialists. Using this definition we create labeled datasets in five different languages. During data annotation, a comprehensive annotation strategy is followed. The generated datasets are made publicly available to assist further research efforts.
Furthermore, we use these datasets to train various state-of-the-art deep learning architectures and, at the same time, propose an ensemble model that outperforms all individual models.

Another major contribution of this study is its annotation pipeline. To increase the number of positive annotations, we employ keyword-based sampling as well as an approach that uses available datasets from other related works. Namely, the dataset-based sampling approach trains detection models on related datasets and samples tweets based on the output probability of the model. This approach is similar to transferring the definition of hate speech from other works and using it to sample tweets. It is particularly useful in cases where the definition of hate speech in these datasets has a wider scope than the proposed definition. These approaches introduce bias to the resulting datasets. Although bias is inevitable, we propose ways to mitigate it: lowering the sampling threshold of the detection models, combining multiple datasets and adding randomly sampled data.

Building on the initial annotation stage, we also present an active learning approach. The contribution of this stage is that, besides the high percentage of positive annotations in the resulting batches, these batches contain data that can contribute to the learning process of the models. Specifically, data sampled with this approach generally deviate from the already labeled data and consequently improve the generalization of the model.

We believe that the presented annotation process can be applied to other domains, beyond the scope of this work. Specifically, the two sampling stages can be very effective when annotating a large corpus of data. The only input required for these approaches is one or more relevant datasets or a curated list of keywords.

As future steps, we plan to keep expanding the datasets with new tweets.
To this end, we have developed an alert monitoring mechanism for journalists that supports further annotation of tweets. Using these tweets, we plan to retrain our models at frequent intervals. Additionally, we aim to investigate new active learning techniques in order to choose more informative tweets that improve the models' ability to learn and generalize. Another issue we will focus on is the imbalance between the positive and negative classes. To alleviate this, we will explore ways to fetch more hateful content in a less biased manner. Finally, we will experiment with state-of-the-art deep learning architectures for natural language processing such as BERT [55] or ULMFiT [56].

8 Acknowledgements

This work was supported by the Rights, Equality and Citizenship programme of the European Union (2014-2020) under grant agreement number 785679.

References

[1] V. Jourová, Code of conduct on countering illegal hate speech online: First results on implementation.
[2] UNESCO, World trends in freedom of expression and media development: global report 2017/2018, 2018.
[3] E. Wulczyn, N. Thain, L. Dixon, Ex Machina: Personal Attacks Seen at Scale, in: Proceedings of the 26th International Conference on World Wide Web, WWW ’17, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2017, pp. 1391–1399, event-place: Perth, Australia. doi:10.1145/3038912.3052591. URL https://doi.org/10.1145/3038912.3052591
[4] A. M. Founta, C. Djouvas, D. Chatzakou, I. Leontiadis, J. Blackburn, G. Stringhini, A. Vakali, M. Sirivianos, N. Kourtellis, Large scale crowdsourcing and characterization of twitter abusive behavior, in: Twelfth International AAAI Conference on Web and Social Media, 2018.
[5] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated hate speech detection and the problem of offensive language, in: Eleventh International AAAI Conference on Web and Social Media, 2017.
[6] Z. Waseem, D. Hovy, Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter, in: Proceedings of the NAACL Student Research Workshop, Association for Computational Linguistics, San Diego, California, 2016, pp. 88–93. doi:10.18653/v1/N16-2013. URL https://www.aclweb.org/anthology/N16-2013
[7] S. Sharma, S. Agrawal, M. Shrivastava, Degree based Classification of Harmful Speech using Twitter Data, in: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 106–112. URL https://www.aclweb.org/anthology/W18-4413
[8] O. d. Gibert, N. Perez, A. G. Pablos, M. Cuadros, Hate Speech Dataset from a White Supremacy Forum, 2018, pp. 11–20. URL https://aclweb.org/anthology/papers/W/W18/W18-5102/
[9] M. ElSherief, S. Nilizadeh, D. Nguyen, G. Vigna, E. Belding, Peer to Peer Hate: Hate Speech Instigators and Their Targets (Apr. 2018). URL https://arxiv.org/abs/1804.04649v1
[10] I. Kwok, Y. Wang, Locate the Hate: Detecting Tweets against Blacks, in: AAAI, 2013.
[11] B. Ross, M. Rist, G. Carbonell, B. Cabrera, N. Kurowsky, M. Wojatzki, Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis (Jan. 2017). doi:10.17185/duepublico/42132. URL https://arxiv.org/abs/1701.08118v1
[12] M. Wiegand, M. Siegel, J. Ruppenhofer, Overview of the germeval 2018 shared task on the identification of offensive language (2018).
[13] F. D. Vigna, A. Cimino, F. Dell’Orletta, M. Petrocchi, M. Tesconi, Hate me, hate me not: Hate speech detection on facebook, in: Proceedings of the First Italian Conference on Cybersecurity (ITASEC17), 2017.
[14] H. M. Saleem, K. P. Dillon, S. Benesch, D. Ruths, A Web of Hate: Tackling Hateful Speech in Online Social Spaces, arXiv:1709.10159 [cs] (Sep. 2017). URL http://arxiv.org/abs/1709.10159
[15] J. H. Park, P. Fung, One-step and Two-step Classification for Abusive Language Detection on Twitter, 2017, pp. 41–45. doi:10.18653/v1/W17-3006. URL https://aclweb.org/anthology/papers/W/W17/W17-3006/
[16] C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, Y. Chang, Abusive Language Detection in Online User Content, in: Proceedings of the 25th International Conference on World Wide Web, WWW ’16, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2016, pp. 145–153, event-place: Montréal, Québec, Canada. doi:10.1145/2872427.2883062. URL https://doi.org/10.1145/2872427.2883062
[17] E. A. Jane, “your a ugly, whorish, slut” understanding e-bile, Feminist Media Studies 14 (4) (2014) 531–546.
[18] W. Warner, J. Hirschberg, Detecting Hate Speech on the World Wide Web, in: Proceedings of the Second Workshop on Language in Social Media, Association for Computational Linguistics, Montréal, Canada, 2012, pp. 19–26. URL https://www.aclweb.org/anthology/W12-2103
[19] P. Burnap, M. L. Williams, Cyber Hate Speech on Twitter: An Application of Machine Classification and Statistical Modeling for Policy and Decision Making, Policy & Internet 7 (2) (2015) 223–242. doi:10.1002/poi3.85. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/poi3.85
[20] N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Radosavljevic, N. Bhamidipati, Hate Speech Detection with Comment Embeddings, in: Proceedings of the 24th International Conference on World Wide Web, WWW ’15 Companion, ACM, New York, NY, USA, 2015, pp. 29–30, event-place: Florence, Italy. doi:10.1145/2740908. URL http://doi.acm.org/10.1145/2740908.2742760
[21] S. Assimakopoulos, F. H. Baider, S. Millar, Online hate speech in the European Union: a discourse-analytic perspective, Springer, 2017.
[22] Council of the European Union, Council framework decision 2008/913/jha of 28 november 2008 on combating certain forms and expressions of racism and xenophobia by means of criminal law, Official Journal of the European Union L 328/55 (2008).
[23] OHCHR, Rabat plan of action on the prohibition of advocacy of national, racial or religious hatred that constitutes incitement to discrimination, hostility or violence, Office of the United Nations High Commissioner for Human Rights Report (2013).
[24] D. K. Citron, Hate crimes in cyberspace, Harvard University Press, 2014.
[25] Z. Waseem, Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter, in: Proceedings of the First Workshop on NLP and Computational Social Science, Association for Computational Linguistics, Austin, Texas, 2016, pp. 138–142. doi:10.18653/v1/W16-5618. URL https://www.aclweb.org/anthology/W16-5618
[26] Jigsaw, Toxic Comment Classification Challenge (2018). URL https://kaggle.com/c/jigsaw-toxic-comment-classification-challenge
[27] Quora, Quora Insincere Questions Classification (2019). URL https://kaggle.com/c/quora-insincere-questions-classification
[28] Impermium, Detecting Insults in Social Commentary (2012). URL https://kaggle.com/c/detecting-insults-in-social-commentary
[29] A. Schmidt, M. Wiegand, A Survey on Hate Speech Detection using Natural Language Processing, 2017, pp. 1–10. doi:10.18653/v1/W17-1101. URL https://aclweb.org/anthology/papers/W/W17/W17-1101/
[30] Y. Mehdad, J. Tetreault, Do Characters Abuse More Than Words?, 2016, pp. 299–303. doi:10.18653/v1/W16-3638. URL https://aclweb.org/anthology/papers/W/W16/W16-3638/
[31] G. Xiang, B. Fan, L. Wang, J. Hong, C. Rose, Detecting Offensive Tweets via Topical Feature Discovery over a Large Scale Twitter Corpus, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, ACM, New York, NY, USA, 2012, pp. 1980–1984, event-place: Maui, Hawaii, USA. doi:10.1145/2396761.2398556. URL http://doi.acm.org/10.1145/2396761.2398556
[32] L. Gao, R. Huang, Detecting Online Hate Speech Using Context Aware Models, 2017, pp. 260–266. doi:10.26615/978-954-452-049-6_036. URL https://aclweb.org/anthology/papers/R/R17/R17-1036/
[33] B. Gambäck, U. K. Sikdar, Using convolutional neural networks to classify hate-speech, in: Proceedings of the first workshop on abusive language online, 2017, pp. 85–90.
[34] P. Badjatiya, S. Gupta, M. Gupta, V. Varma, Deep Learning for Hate Speech Detection in Tweets, in: Proceedings of the 26th International Conference on World Wide Web Companion, WWW ’17 Companion, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2017, pp. 759–760, event-place: Perth, Australia. doi:10.1145/3041021.3054223. URL https://doi.org/10.1145/3041021.3054223
[35] Z. Zhang, L. Luo, Hate Speech Detection: A Solved Problem? The Challenging Case of Long Tail on Twitter, arXiv:1803.03662 [cs] (Feb. 2018). URL http://arxiv.org/abs/1803.03662
[36] A. M. Founta, D. Chatzakou, N. Kourtellis, J. Blackburn, A. Vakali, I. Leontiadis, A unified deep learning architecture for abuse detection, in: Proceedings of the 10th ACM Conference on Web Science, ACM, 2019, pp. 105–114.
[37] Z. Zhang, D. Robinson, J. Tepper, Detecting Hate Speech on Twitter Using a Convolution-GRU Based Deep Neural Network, in: A. Gangemi, R. Navigli, M.-E. Vidal, P. Hitzler, R. Troncy, L. Hollink, A. Tordai, M. Alam (Eds.), The Semantic Web, Lecture Notes in Computer Science, Springer International Publishing, 2018, pp. 745–760.
[38] L. Torrey, J. Shavlik, Transfer learning, in: Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, IGI Global, 2010, pp. 242–264.
[39] M. Wiegand, J. Ruppenhofer, T. Kleinbauer, Detection of abusive language: the problem of biased datasets, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 602–608.
[40] A. Khetan, Z. C. Lipton, A. Anandkumar, Learning from noisy singly-labeled data, arXiv preprint arXiv:1712.04577 (2017).
[41] D. D. Lewis, W. A. Gale, A sequential algorithm for training text classifiers, in: SIGIR’94, Springer, 1994, pp. 3–12.
[42] H. S. Seung, M. Opper, H. Sompolinsky, Query by committee, in: Proceedings of the fifth annual workshop on Computational learning theory, ACM, 1992, pp. 287–294.
[43] A. K. McCallum, K. Nigam, Employing em and pool-based active learning for text classification, Citeseer.
[44] S. Kullback, R. A. Leibler, On information and sufficiency, The annals of mathematical statistics 22 (1) (1951) 79–86.
[45] C. Baziotis, N. Pelekis, C. Doulkeridis, Datastories at semeval-2017 task 4: Deep lstm with attention for message-level and topic-based sentiment analysis, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 747–754.
[46] T. Luong, H. Pham, C. D. Manning, Effective Approaches to Attention-based Neural Machine Translation, 2015, pp. 1412–1421. doi:10.18653/v1/D15-1166. URL https://aclweb.org/anthology/papers/D/D15/D15-1166/
[47] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, B. Xu, Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification, 2016, pp. 207–212. doi:10.18653/v1/P16-2034. URL https://aclweb.org/anthology/papers/P/P16/P16-2034/
[48] B. v. Aken, J. Risch, R. Krestel, A. Löser, Challenges for Toxic Comment Classification: An In-Depth Error Analysis, 2018, pp. 33–42. URL https://aclweb.org/anthology/papers/W/W18/W18-5105/
[49] F. Chollet, et al., Keras, https://keras.io (2015).
[50] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, software available from tensorflow.org (2015). URL http://tensorflow.org/
[51] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[52] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning word vectors for 157 languages, in: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[53] J. Pennington, R. Socher, C. Manning, Glove: Global vectors for word representation, in: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.
[54] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in neural information processing systems, 2013, pp. 3111–3119.
[55] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[56] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, arXiv preprint arXiv:1801.06146 (2018).

Statistics, arXiv (Cornell University)

Towards countering hate speech against journalists on social media

ISSN
2468-6964
eISSN
ARCH-3347
DOI
10.1016/j.osnem.2020.100071

Abstract

TOWARDS COUNTERING HATE SPEECH AGAINST JOURNALISTS ON SOCIAL MEDIA
A PREPRINT

Polychronis Charitidis, DataScouting, 30 Vakchou Street, 54629 Thessaloniki, Greece, pcharitidis@datascouting.com
Stavros Doropoulos, DataScouting, 30 Vakchou Street, 54629 Thessaloniki, Greece, doro@datascouting.com
Stavros Vologiannidis, International Hellenic University, Terma Magnisias, 62124 Serres, Greece, svol@teicm.gr
Ioannis Papastergiou, DataScouting, 30 Vakchou Street, 54629 Thessaloniki, Greece, ipapaste@datascouting.com
Sophia Karakeva, DataScouting, 30 Vakchou Street, 54629 Thessaloniki, Greece, soka@datascouting.com

May 4, 2020

ABSTRACT

The damaging effects of hate speech on social media have been evident in recent years, and several organizations, researchers and social media platforms have tried to curb them in various ways. Despite these efforts, social media users are still affected by hate speech. The problem is even more apparent for social groups that promote public discourse, such as journalists. In this work, we focus on countering hate speech that is targeted at journalistic social media accounts. To accomplish this, a group of journalists assembled a definition of hate speech, taking into account the journalistic point of view and the types of hate speech that are usually targeted against journalists. We then compile a large pool of tweets referring to journalism-related accounts in multiple languages. In order to annotate the pool of unlabeled tweets according to the definition, we follow a concise annotation strategy that involves active learning annotation stages. The outcome of this paper is a novel, publicly available collection of Twitter datasets in five different languages. Additionally, we experiment with state-of-the-art deep learning architectures for hate speech detection and use our annotated datasets to train and evaluate them. Finally, we propose an ensemble detection model that outperforms all individual models.
arXiv:1912.04106v2 [cs.IR] 30 Apr 2020

1 Introduction

Hate speech is not a new phenomenon. However, before the advent of email, online comments and networking platforms, the threshold to utter it to any effect at all was much higher. People had to draft a letter, buy postage, and send their missive through the mail. Since then, the formulation and dissemination of hate speech have become easy, instant, potentially ubiquitous, public, and therefore much more damaging. In fact, it not only poisons and thus effectively undermines free and open discourse on the Internet, which is bad enough in itself, but also constitutes a threat to the individuals and organizations it is directed at.

The increasing propagation of hate speech through social media has drawn the attention of governments and organizations. In May 2016, the European Commission agreed with Facebook, Microsoft, Twitter, and YouTube on a code of conduct [1] to prevent and counter the spread of illegal online hate speech. Several other large companies joined the code of conduct later. Although these initiatives mitigate hate speech incidents, the elimination of hate speech needs further work.

Following this paradigm, the research community has made efforts towards countering hate speech. The latest literature contains an increasing number of works that deal with the problem of automatic hate speech detection. Hate speech detection methodologies aim to classify social media posts into those that contain hate speech and those that do not, while some works even try to identify the type of hate speech.

In this paper, we present some of the outcomes of the DACHS (“A Data-driven Approach to Countering Hate Speech”) project. While all victims of hate speech are equally in need of protection and defense, for the purpose of DACHS we focus on journalism as a test case. As professional arbiters of the public sphere, journalists run afoul of hate speech originators practically by default.
Journalists are multipliers of societal discourse, and their relative prominence and high audience reach make them vulnerable to hate speech. The report in [2] highlights the rapid spread of hate speech against journalists that infringes their freedom of expression. To assist their work and further promote free speech, the DACHS project aims to counter hate speech directed at journalists. One of the main goals of DACHS is to build a Twitter alert monitoring mechanism that notifies journalists about cases where hateful tweets are posted in their Twitter feed. They can also receive email reports with statistics about hate speech in their timeline at specified time intervals. Optionally, journalists can suggest or flag tweets that they consider to be hate speech. During DACHS, hate speech against journalists on Twitter is studied in 5 languages: English, French, German, Spanish, and Greek.

This work makes the following contributions. First, it defines hate speech from a journalistic point of view, taking into account examples of hate speech directed at journalists. The definition was formed after extensive discussions with journalists from the European Journalism Centre, and its main attributes are that it is simple, concise and suitable for large-scale annotation. Second, it presents a concise two-stage annotation strategy. Both stages sample tweets from the collected unlabeled data and generate batches of tweets to be submitted for human annotation. The first stage generates the initial batch, using keywords and existing hate speech datasets to filter the tweets that are then annotated. The second stage is responsible for generating all subsequent batches, making use of active learning. This strategy is used to annotate a large pool of journalist-related tweets, generating large-scale hate speech datasets in multiple languages. The datasets were made publicly available to assist further research in the field.
The third contribution of this work is that it uses these datasets to train various state-of-the-art deep learning models, including an ensemble model that outperforms all individual models. To the best of our knowledge, this is the first work that studies hate speech in multiple languages.

The rest of the paper is structured as follows. Section 2 includes a brief overview of the current state-of-the-art that addresses hate speech. This includes an overview of different hate speech definitions, existing hate speech datasets and automated detection methods. In Section 3 we present the definition of hate speech that is used throughout the paper. Sections 4 and 5 present the data collection and annotation methodologies respectively. Section 6 describes the experimental study and demonstrates the results of this work. In the last section, we conclude this work and present some future steps related to hate speech detection.

Dataset | Source | Size | Language | Labels
Wulczyn et al. [3] | Wikipedia | 100k | EN | offensive
Founta et al. [4] | Twitter | 80k | EN | offensive/abusive/hate speech
Davidson et al. [5] | Twitter | 25k | EN | hate speech/offensive
Waseem et al. [6] | Twitter | 16k | EN | sexism/racism
Sharma et al. [7] | Twitter | 9k | EN | multiple hate speech classes
Gibert et al. [8] | Other | 10k | EN | hate speech
ElSherief et al. [9] | Twitter | 28k | EN | multiple hate speech classes
Kwok et al. [10] | Twitter | 24k | EN | racism
Ross et al. [11] | Twitter | 541 | DE | racism
Wiegand et al. [12] | Twitter | 9.5k | DE | insult/abuse/profanity
Del Vigna et al. [13] | Facebook | 17.5k | IT | multiple hate speech classes
Table 1: Related dataset information

2 Related Work

2.1 Definitions

One of the challenges in studying negative online behavior, and hate speech in particular, is the lack of a clear, common definition [14]. Generally speaking, hate speech could be described as the expression of hatred towards an individual or group of individuals because of a characteristic they share, or a group to which they belong.
In [3] the term personal attack is used to describe offensive online behavior, while other studies focus on offensive or abusive speech and online harassment [15, 5, 16]. Other works, like [17], address particular types of online harassment or hate, such as misogyny. The actual term hate speech is used in many previous works [18, 19, 20, 10, 7, 6, 8]. Even though these definitions share many common characteristics, there are distinct differences even between definitions that use the same term to describe negative online behavior. The authors in [21] argue that even more formal definitions of illegal hate speech, like the EU definition [22] or the United Nations definition [23], contain words that are open to interpretation. Illegal hate speech is further examined in another European project that takes into account the heterogeneity and complexity of different legislations. Furthermore, the authors in [24] thoroughly discuss and define hate crimes in cyberspace.

In our case, the intention is to find and work with a notion of hate speech that takes into consideration the journalistic point of view and at the same time is easy to understand. To this end, we examined the related literature and created a new definition that is in line with the project’s requirements.

Footnote URLs from this page: https://hatedetection.com/, https://ejc.net/

2.2 Datasets

With the advent of social media, research on hate speech has intensified during the last few years. A critical step towards further progress in the detection of online hate speech is the availability of large-scale datasets. There have been relatively few efforts focusing on the creation of hate speech datasets from social media. Davidson et al. [5] collected English Twitter data using a hate speech lexicon compiled with the help of Hatebase.org. They employed crowd-sourcing to label tweets into three categories: hate speech, offensive language, and those with neither. Waseem et al. [25, 6] provide a hate speech dataset, which contains 16k tweets, and describe the respective annotation procedure, in which an initial manual search was conducted on Twitter to collect common slurs and terms about religion, sexual orientation, gender, and ethnic minorities. The dataset was then manually annotated regarding the existence of sexism or racism. Sharma et al. [7] collected a set of 9k tweets containing harmful speech and manually annotated them in three classes based on their degree of hateful intent. The authors of [8] crawled data from a white supremacy forum to extract and manually annotate over 10k sentences as hate speech or not. The authors in [9] describe a multi-step classification process and provide a comprehensive hate speech dataset containing more than 28k tweets with various types of hate related to sexual orientation, gender, ethnicity, etc. Finally, in older works, many researchers have relied on creating their own hand-coded hate speech datasets, as in [10, 18].

The majority of hate speech related studies focus on the English language. However, in [11, 12], hate speech against refugees is studied in the German language. The authors in [13] crawled Facebook comments from public Italian pages and annotated them with a variety of hate categories to distinguish different notions of hate speech.

In addition, there are many datasets addressing offensive and toxic online behavior. Kaggle’s Toxic Comment Classification Challenge dataset [26] consists of 150k Wikipedia comments annotated for toxic behavior. Kaggle hosts additional large-scale toxic speech datasets like [27, 28]. Studies in [3, 4] use crowdsourcing to provide abuse-related annotation on 100k English Wikipedia comments and 80k tweets respectively. Finally, smaller datasets, as in [16, 19], focus on the annotation of toxic versus non-toxic online comments. Table 1 presents a summary of the datasets available in the related literature.
2.3 Detection methods

Existing hate speech detection methods address the problem as a supervised classification task [29]. Traditional methods rely on manually designing and encoding features of textual data into feature vectors that are used as inputs to algorithms such as Naive Bayes, Logistic Regression, SVM and Random Forest. These methods are adopted by numerous hate and offensive speech detection studies, such as [19, 5, 20, 10, 30, 18, 25, 6, 31, 32]. These studies experiment with various features, including bag-of-words representation, character-level and word-level n-gram features, syntactic features, linguistic features, and comment embedding features.

Following the more recent deep-learning paradigm, several studies use neural networks to detect hateful and toxic content. Neural networks learn abstract feature representations from input data through multiple stacked layers. The key difference from traditional models is that deep learning models automate the feature extraction process, and the multi-layer structure provides more efficient feature representations. Many studies have shown that deep learning and neural network methods outperform traditional methods on hate speech detection tasks [33, 15].

The most popular network architectures are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). In the context of hate speech classification, a CNN extracts meaningful features from word or character combinations [34, 33, 15, 35], while an RNN learns word or character dependencies in sequences of words [32, 34, 13, 36]. Combinations of CNN and RNN models are also applied in [37].

Footnote URL from this page: http://mandola-project.eu/

Tweet | Type
@USER They shud DEPORT that mot******ker Back to Iran | 1
@USER Please somebody, kill him but before you do torture him to death! | 2
@USER Whats the matter you cowards have someone disagree with you and your coward journalists and boom they are taken off platforms for daring to have an opinion against you.... I hope you all rot in hell Wall Street Journal for the cowards you are! | 1
@USER Kill the NBC journalists !!! | 2
@USER Let me fix this for you. Chicken good...Queers Bad. Solved it for you. | 3
@USER @USER ACCORDING TO WHITE PEOPLE... When whites kill = Lone Wolf Mental Illness (Even though they’ve killed all over the globe for Centuries). When black people kill = Entire black race is violent . | 4
@USER @USER Just cut the diplomatic ties to UK. We Germans call them island monkeys. Or island apes. | 4
Table 2: Hate speech tweets and the type of hateful attack that corresponds to the second bullet of the definition in Section 3

3 Definition

To consistently annotate a large Twitter corpus, there is a need for a clear and simple definition. We define hate speech in a way that makes it easy for annotators to label tweets, but also for other non-expert groups, like journalists, to further enhance the dataset or provide feedback. The proposed definition was formed after extensive discussions with journalists from the European Journalism Centre through a small focus group and continuous evaluation and feedback from journalists. After looking at hundreds of hateful tweets and several meetings, it was decided that the presence of hate speech should be concluded by answering two simple key questions. These questions refer to the tweet content and they are presented below:

• Does it target a person or group?
• Does it contain a hateful attack?
  1. Violent speech
  2. Support for death/disease/harm
  3. Statement of inferiority relating to a group they identify with (like LGBTQI)
  4. Call for segregation

A positive answer to both bullets should make the annotator flag the post as hate speech. The second question can refer to any of the four subcategories that are listed above.
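The two-question rule above can be written down as a small decision function. The following is only an illustrative sketch: the names `Attack` and `is_hate_speech` are hypothetical and not part of the paper, and the answers to both questions are assumed to come from a human annotator rather than from any automatic check.

```python
from enum import Enum


class Attack(Enum):
    """The four subcategories of hateful attack listed in the definition."""
    VIOLENT_SPEECH = 1
    SUPPORT_FOR_DEATH_DISEASE_HARM = 2
    STATEMENT_OF_INFERIORITY = 3
    CALL_FOR_SEGREGATION = 4


def is_hate_speech(targets_person_or_group, attacks):
    """Flag a tweet only when both questions are answered positively.

    targets_person_or_group: annotator's answer to the first question.
    attacks: set of Attack values the annotator identified (may be empty).
    """
    return bool(targets_person_or_group) and len(attacks) > 0


# Example: a tweet that targets a group and contains a call for segregation.
print(is_hate_speech(True, {Attack.CALL_FOR_SEGREGATION}))
```

Keeping the attack types as a set means a tweet matching several subcategories is still flagged once, while the recorded types remain available for later analysis.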
This definition was evaluated with a larger group of journalists and proved to be concise and easy to understand.

To give a better intuition about the hate speech definition, we provide some examples of annotated tweets in English. The examples are listed in Table 2. Note that all of these tweets comply with the first requirement of the above definition, meaning that they target a person or a group. Table 2 also identifies the type of hateful attack in each tweet, using the four classes described in the second bullet of the definition.

4 Data collection

One of the goals of this work is to create multilingual hate speech datasets consisting of tweets that originate from a journalistic context, along with binary annotations about hate speech. To generate these datasets, a large collection of unlabeled tweets is gathered. As a side-note, during data collection and management the EU General Data Protection Regulation was followed.

The data collection process started with the creation of a list of journalism-related Twitter accounts. In a subsequent step, we retrieve the tweets related to these accounts to assemble the pool of unlabeled data. Apart from English, which is the most common language in the related literature, this process is applied to French, German, Greek and Spanish.

A straightforward way to compose such a list is to manually identify well-known accounts of journalists and news outlets and focus the data collection on those accounts. However, preliminary experiments showed that the volume of data that could be collected following this approach is limited, at least when using the standard version of the Search API, which does not provide historical data.
Footnote URL from this page: https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html

Language | Accounts | Tweets
EN | 12,306 | 92,324,248
DE | 3,436 | 12,436,132
ES | 1,765 | 49,453,601
FR | 2,794 | 34,118,951
GR | 3,577 | 2,147,668
Table 3: Total number of journalism-related Twitter accounts and total number of tweets retrieved per language

To overcome this issue we identify a larger number of journalist-related accounts by using Twitter lists. Twitter lists are curated groups of Twitter accounts and are usually centered around specific themes. In order to find Twitter lists related to journalism, we collected the Twitter accounts of well-known news outlets and journalism-related organizations and automated the process of fetching all list members. Although the majority of the collected accounts are related to either journalists or news outlets, we also noticed the presence of non-journalistic accounts, which could dilute the journalistic focus of our data collection. To tackle this issue, we manually filtered out the irrelevant accounts. Since annotating the full list of accounts for relevance to journalism would involve considerable manual effort, we prioritized the annotation of the most popular accounts, since those accounts usually attract a larger number of tweets. In the first iteration of account collection, we annotated the accounts using the following classes: a) journalist, b) news outlet, c) irrelevant. The initial seed consists of 200 manually collected journalism-related accounts for each language. Table 3 lists the number of journalism-related accounts per language.

Having a validated list of journalism-related Twitter accounts for each language, we set up, as the next step, a mechanism to collect tweets from the feeds of these accounts. To this end, the Twitter Search API is used, which returns a sample of tweets posted in the past 7 days.
The API is rate-limited at 180 requests per 15-min window when user authentication is used and at 450 requests per 15-min window when application authentication is used. We used application authentication, bearing in mind that each call can return a maximum of 100 tweets. Despite these limitations, by effectively utilizing the API within the imposed call rate limits, we manage to collect a number of tweets that is sufficient for creating a sizeable, journalism-oriented hate speech database.

Typically, the Twitter Search API query consists of a series of keywords that should be contained in the set of returned tweets, along with account-based search operators. We opted for search queries that do not restrict the tweet contents and only used account-based search operators to limit results. More specifically, we used the to: (e.g., to:BBCNews) and the “@” (e.g., @BBCNews) operators in order to collect tweets authored in reply to and tweets mentioning those specific accounts.

Specifically, every 15 minutes, the module sequentially performed $N = 450 - N_{safe}$ API requests, evenly distributed across the 15-minute window. $N_{safe} \in [0, 450)$ is a safety parameter that is used to avoid pushing the API to its limits. We use $N_{safe} = 200$ and thus a maximum of $N = 250$ calls is performed in every 15-min window. Since each call is associated with one account, and the number of calls that can be performed within each 15-min window is much smaller than the pool of target accounts, an account prioritization/selection mechanism is implemented. This prioritization approach fetched data from all accounts and measured an estimate of the rate of incoming tweets. Using this estimate, the available API calls are distributed among the accounts that are expected to have received a sufficient number of new tweets since the last time they were queried.
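The request budget and the account selection idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual module: the function names, the per-second rate unit and the default eligibility threshold of 100 (the maximum number of tweets per call) are choices made here for the example, and no real Twitter client is involved.

```python
import random

WINDOW_SECONDS = 15 * 60   # length of one rate-limit window
API_LIMIT = 450            # application-auth requests per window (from the text)
N_SAFE = 200               # safety margin used in the text
MAX_RESULTS = 100          # maximum tweets returned per call (from the text)


def calls_per_window(api_limit=API_LIMIT, n_safe=N_SAFE):
    """N = api_limit - N_safe, with N_safe restricted to [0, api_limit)."""
    if not 0 <= n_safe < api_limit:
        raise ValueError("n_safe must lie in [0, api_limit)")
    return api_limit - n_safe


def seconds_between_calls(n_calls, window=WINDOW_SECONDS):
    """Spacing that distributes n_calls evenly across one window."""
    return window / n_calls


def pick_account(rates, last_fetched, now, threshold=MAX_RESULTS, rng=random):
    """Randomly pick one account expected to have enough new tweets.

    rates: account -> estimated incoming tweets per second (assumed unit).
    last_fetched: account -> timestamp of the previous query, in seconds.
    Returns None when no account exceeds the threshold yet.
    """
    eligible = [
        account for account, rate in rates.items()
        if rate * (now - last_fetched[account]) > threshold
    ]
    return rng.choice(eligible) if eligible else None


n = calls_per_window()
print(n, seconds_between_calls(n))
```

With the values from the text this gives 250 calls per window, i.e. one call every 3.6 seconds; an account is only queried once its estimated backlog is large enough to make a call worthwhile.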
Specifically, whenever a new API call is performed, an estimated number of available new tweets is calculated for each account based on the account’s estimated incoming tweet rate and the time that has passed since the last time it was fetched. Then, one account is randomly selected among those whose estimated number of available new tweets is larger than a user-specified threshold (for which reasonable values should lie close to 100, the maximum number of results per request). After a call is performed, the incoming tweet rate for each account is updated by dividing the number of returned tweets by the time difference between the newest and the oldest tweet (a very low rate is assigned in case a call returns zero results). Moreover, the last fetched timestamp is recorded to facilitate the calculation of the estimated number of incoming tweets. Initially, all accounts are assigned a fixed, high incoming tweet rate to ensure that all accounts will be fetched at least once. This approach ensured that all accounts were queried proportionally to their actual, dynamic incoming tweet rate, thus maximizing the number of tweets that can be collected.

We apply this approach for every language, mining tweets from the corresponding pool of journalistic Twitter accounts, and store the retrieved tweets in a MongoDB database. To ensure that the tweet language matches the language of the query accounts, we inspect the tweet metadata. The total number of collected unlabeled tweets per language is listed in Table 3. We denote the unlabeled pool of tweets by $U$. The unlabeled pool of tweets for each language is denoted by $U_i$ where $i \in \{EN, DE, ES, FR, GR\}$.

Footnote URL from this page: https://help.twitter.com/en/using-twitter/twitter-lists

Notice that there are significant differences in the number of tweets retrieved per language.
This is expected due to the different number of query accounts used to retrieve tweets, the language popularity and the Twitter usage per country. The total collection period was approximately 6 months (1/10/2018 - 8/5/2019).

5 Annotation process

In this section, we describe the annotation process for labelling the pool of tweets $U$. The purpose of this process is to create a labeled dataset denoted by $L$. Note that $L$ consists of 5 different language datasets denoted by $L_i^k$, where $i \in \{EN, DE, ES, FR, GR\}$ and $k$ denotes the current number of annotated batches that are included in the dataset.

The annotation process consists of two sampling stages, the initial sampling stage and the active learning sampling stage. In the initial sampling stage, we use two approaches described in the following subsection, and select a defined number of tweets from the $U_i$ pool to generate the initial annotation batch for each language $i$. We denote annotation batches for each language by $B_i^k$, where $k$ is the number of the generated batch. For the initial sampling stage $k = 1$. Each $B_i^1$ in this stage is submitted for human annotation. After the annotation of the initial $B_i^1$, the first labeled datasets $L_i^1$ are generated.

The active learning sampling stage generates all the subsequent $B_i^k$ with $k > 1$, for further annotation and dataset expansion, through an iterative process. In the first iteration, we use $L_i^1$ generated from the initial sampling stage, and we employ an active learning sampling approach to generate $B_i^2$ from $U_i$. We then submit $B_i^2$ for annotation and we expand $L_i^1$ to $L_i^2$ by appending the new annotated batch. We repeat this process until there are no more tweets available in $U_i$ or we exhaust the annotation budget.

5.1 Initial sampling stage

By observing Table 3, we notice that there is a large number of unlabeled tweets in each language.
Submitting such sizable data for human annotation is not only overwhelming for the annotators, but also makes the whole process very expensive and hard to supervise. Additionally, we expect that the number of positive annotations will be very small compared to the negative annotations. To validate this, we conduct a brief preliminary annotation round, annotating 2000 random tweets per language and inspecting the number of positive annotations. As expected, we observe an apparent scarcity of positive annotations. Table 4 shows that for the preliminary batch in English, among 2000 annotations, only 0.4% are positive. Based on these observations, it is evident that we need a process that generates annotation batches of manageable size with a higher positive annotation ratio.

To mitigate the lack of positive annotations, we consider two different approaches for generating the annotation batches of tweets ($B_i^1$). The first, and the most popular approach in the literature, is keyword-based sampling, which creates the annotation batch from tweets containing specific keywords. In the hate speech context, these keywords are usually offensive words or words that can be used to express hatred. The second approach, which is novel in the literature, is to use existing hate speech datasets in order to train hate speech detection models and, in a subsequent step, apply these models to the pool of unlabeled tweets and sample the tweets with higher hate speech probability. For this task we train a CNN model, which is described later in Section 6.2. In a sense, this approach shares similarities with transfer learning [38], where we essentially transfer the established hate speech definition from another work to sample our data. We refer to this approach as dataset-based sampling. One evident shortcoming of this approach is that it requires the availability of such datasets, something that is possible only for a few languages.
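The keyword-based approach described above amounts to a filter over the unlabeled pool. The sketch below is one possible realization and makes several assumptions that are not taken from the paper: whole-word, case-insensitive matching, the helper names `build_keyword_pattern` and `keyword_sample`, and a mild placeholder keyword in the usage example.

```python
import re
from itertools import islice


def build_keyword_pattern(keywords):
    """Compile one case-insensitive pattern matching any keyword as a whole word."""
    # Longer keywords first, so multi-word phrases win over their prefixes.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


def keyword_sample(tweets, keywords, limit):
    """Return up to `limit` tweets that contain at least one keyword."""
    if not keywords:
        return []
    pattern = build_keyword_pattern(keywords)
    matches = (tweet for tweet in tweets if pattern.search(tweet))
    return list(islice(matches, limit))


pool = ["You are an idiot", "Nice article", "idiotic? no", "IDIOT!"]
print(keyword_sample(pool, ["idiot"], limit=10))
```

Whole-word matching keeps "idiotic" from matching the keyword "idiot"; a real keyword list would need language-specific handling of inflection and spelling variants, which this sketch does not attempt.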
Both of these approaches introduce bias in the sampled tweets [39] and consequently in the generated datasets. Dataset generation under such extreme class imbalance as the one presented in this work requires a sampling approach that is inclined towards the minority class. Although bias is inevitable, there are ways in which it can be moderated. To moderate bias for the keyword-based approach, we compile a large list of keywords. These lists included offensive slang, phrases and words that could potentially express hate when used in the appropriate context, and keywords related to several forms of possible discrimination (religion, gender, refugees etc.). For the dataset-based approach, we adopt three different measures to tackle bias. First, we use datasets from works that have a less strict definition of hate speech compared to ours. Second, in cases where the dataset-based approach is not applicable due to lack of existing datasets, we combine multiple different datasets with diverse definitions and scopes. Third, we use a loose classification threshold in the resulting CNN model, sampling tweets from a wide range of hate speech probability. Finally, to further reduce bias in the generated annotation batch, we concatenate randomly sampled tweets for both the keyword-based and dataset-based approaches.

We apply the dataset-based sampling approach to generate $B_{EN}^1$ and $B_{DE}^1$ by using some of the related datasets in Table 1. We apply keyword-based sampling for Spanish, French and Greek due to lack of related datasets.

Specifically, for the English language, we train a CNN model using the dataset presented in [5]. Although their definition of hate speech differs from the one described in this work, the dataset is suitable for training, since [5] establishes a loose definition of hate speech compared to ours (i.e., many offensive and insulting tweets are considered hate speech).
For the German language, two different datasets are combined. The first dataset [11] contains tweets referring to refugees, with binary hate speech annotations. Additionally, the dataset of GermEval 2018 [12] is used; GermEval is a series of shared task evaluation campaigns that focus on natural language processing for the German language. Just like in the English case, a CNN model is trained using the combined dataset.

To generate the first annotation batch for each of these languages, we apply the corresponding CNN model to the pool of unlabeled tweets and calculate the hate speech probability of each unlabeled tweet. Then, for each language, we randomly sample 8000 tweets whose hate speech probability falls in the range [0.2, 1.0]. This wide range is chosen to reduce the bias of the generated batch, as discussed previously.

For the rest of the languages (Spanish, French, Greek), we employ the keyword-based approach. For each language we assemble a large list of 500 to 1000 keywords. These lists include offensive slang, phrases and words that could potentially express hate when used in the appropriate context, and keywords related to several kinds of possible discrimination (religion, gender, refugees, etc.). We use these keywords to sample from the pool of unlabeled tweets and fetch up to 8000 tweets per language.

For each language, 2000 additional randomly selected tweets are added to the corresponding batch in order to further mitigate the bias imposed by the keyword-based and dataset-based approaches. This produces the initial batches $B_i^1$, which contain 10000 tweets per language. Note that the tweets in $B_i^1$ are removed from $U_i$ so that they are not retrieved again during the creation of the next batches. After the initial batches are generated, they are submitted to the corresponding annotators to produce $L_i^1$. We describe the manual annotation process in the next subsection.
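The construction of an initial batch with dataset-based sampling can be sketched as follows. This is a minimal illustration under stated assumptions: `probs` stands for the output of an already trained detection model, the function and parameter names are hypothetical, and the random tweets are assumed to be drawn from the tweets not already selected by the model.

```python
import random

def build_initial_batch(pool, probs, n_model=8000, n_random=2000,
                        low=0.2, high=1.0, seed=0):
    """Return (batch, new_pool) for one language.

    pool  : list of tweet ids in the unlabeled pool
    probs : dict mapping tweet id -> hate speech probability from a trained model
    """
    rng = random.Random(seed)
    # Tweets whose predicted probability falls in the wide [low, high] range
    candidates = [t for t in pool if low <= probs[t] <= high]
    chosen = rng.sample(candidates, min(n_model, len(candidates)))
    chosen_set = set(chosen)
    # Assumption: the random tweets come from what is left of the pool
    remaining = [t for t in pool if t not in chosen_set]
    extra = rng.sample(remaining, min(n_random, len(remaining)))
    batch = chosen + extra
    batch_set = set(batch)
    # Batch tweets leave the pool so later batches cannot retrieve them again
    new_pool = [t for t in pool if t not in batch_set]
    return batch, new_pool

# Toy pool of 100 tweets with made-up probabilities, scaled-down batch sizes
pool = list(range(100))
probs = {t: t / 100 for t in pool}
batch, new_pool = build_initial_batch(pool, probs, n_model=8, n_random=2)
```

With the sizes given in the text (8000 and 2000), the same function would yield batches of 10000 tweets whenever the pool is large enough.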
Table 4 shows that the annotated $B_i^1$ exhibits a significant increase in the ratio of positive annotations compared to the preliminary annotation batch. This is expected because keyword-based and dataset-based sampling favour the hate speech class.

Language i    Preliminary    B_i^1     B_i^2
EN            0.4%           1.9%      6.33%
DE            0.2%           1.13%     3.47%
ES            0.12%          0.89%     2.51%
FR            0.3%           2.36%     7.37%
GR            0.13%          0.70%     1.32%

Table 4: Ratio of positive annotations in different batches and languages. The preliminary batch includes 2k randomly sampled tweets for each language. $B_i^1$ includes 10k tweets selected with the keyword-based or the dataset-based approach in the initial sampling stage. $B_i^2$ includes 10k tweets from the first batch of the active learning sampling stage.

5.2 Manual annotation

After generating each batch of tweets for each language, we proceed to the manual annotation task, which is described in this subsection. The related literature offers many annotation methodologies for large corpora. For this work, we follow the findings reported in [40], which claims that it is better to allocate the annotation budget to labeling as many examples as possible when the annotation quality is above a specified threshold. Based on this, we collect one annotation per tweet and thus use the annotation budget to annotate as many tweets as possible. For each language we use experienced annotators who are proficient in the corresponding language, familiar with social media and acquainted with the colloquial nature of online conversations. Additionally, we follow a strict quality assurance methodology that assigns a supervisor role in the process. The supervisor's job is to validate the annotations following a quality control methodology. In total, five annotators (one for each language) and one supervisor are employed for the annotation task.
The supervisor and the annotator of each language perform preliminary annotations to make sure that the definition of hate speech is correctly understood. This process also includes discussions about the particularities of each language. For each language, the corresponding annotator is asked to flag hate speech in tweets with a yes or no answer. When unsure about the answer, the annotator can flag the tweet, which is then discussed with the supervisor in a second stage.

Figure 1: A schematic representation of the annotation process (dotted arrows used only for the initial sampling stage)

To ensure the quality of the annotation, we follow the quality control methodology described in ISO 2859 and ANSI/ASQ Z1.4-2003. We use level II, i.e., the normal severity level, and thus a lot of 1,000 annotations is considered to be of acceptable quality if the error rate does not exceed 4%. To determine this, a single sample of 80 annotations out of 1,000 tweets is used. Whenever an annotator completes 1,000 annotations, the supervisor evaluates 80 random samples of them. If more than 7 of the sampled annotations are erroneous, the whole lot is rejected, careful instructions are given to the annotator, and the annotator annotates the tweets of the lot anew.

5.3 Active learning sampling stage

To generate the initial batch of tweets for each language, we use the keyword-based and the dataset-based sampling approaches, and human annotators then label each language batch, as described in the previous subsections. At this point, $L_i^1$ contains 10k labeled tweets for each language. We use the current annotated sets to train new detection models in order to generate new batches $B_i^k$. This approach is similar to the dataset-based sampling approach, although there are some key differences. Detection models are created for all examined languages (and not just for English and German) with our own annotated datasets.
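The lot acceptance rule of the quality control procedure in Section 5.2 can be sketched as follows. Only the sample size (80 out of 1,000) and the rejection condition (more than 7 erroneous annotations) come from the text; the function name and the `is_erroneous` callback, which stands in for the supervisor's judgement, are hypothetical.

```python
import random

SAMPLE_SIZE = 80        # annotations inspected per lot of 1,000
ACCEPTANCE_NUMBER = 7   # the lot is rejected if more than 7 inspected annotations are wrong

def inspect_lot(lot, is_erroneous, rng):
    """Inspect a random sample of one lot and return (accepted, n_errors).

    lot          : list of annotation ids in the lot
    is_erroneous : hypothetical callback standing in for the supervisor's check
    rng          : random.Random instance used to draw the sample
    """
    sample = rng.sample(lot, SAMPLE_SIZE)
    n_errors = sum(1 for annotation in sample if is_erroneous(annotation))
    return n_errors <= ACCEPTANCE_NUMBER, n_errors

rng = random.Random(0)
lot = list(range(1000))   # ids of the 1,000 annotations in one lot
wrong = set(range(7))     # toy case: exactly 7 annotations in the whole lot are wrong
accepted, n_errors = inspect_lot(lot, lambda a: a in wrong, rng)
```

In the toy case the lot is always accepted, because at most 7 erroneous annotations can appear in any sample. A rejected lot would be returned to the annotator for re-annotation, as described in the text.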
At the same time, these detection models detect hate speech that matches the definition presented in Section 3, rather than definitions transferred from other works. Although this is a valid approach that can be used efficiently to sample tweets from the unlabeled pool $U$, it does not guarantee that the sampled tweets add value to the models. For instance, if a hate speech classifier is highly confident about a tweet, then this tweet will not contribute to the learning process and might even hurt generalization performance. High confidence also implies that the tweet is very similar to other tweets in the dataset that was used to train the classifier. Motivated by this assumption, we adopt an active learning mechanism to generate annotation batches of tweets that genuinely contribute to the ability of the models to learn.

Pool-based active learning [41] relies on an initial small set of labeled instances $L$ and a larger set of unlabeled ones $U$. Batches of informative training samples are iteratively selected from $U$ according to some selection mechanism and, after an annotator is queried for their actual labels, added to $L$. This approach is motivated by many modern machine learning applications, where unlabeled data may be abundant but labels are difficult or expensive to obtain.

Initially, uncertainty sampling [41] is considered as a selection mechanism. In this setup, an active learner selects tweets whose posterior probability is near 0.5. This approach proved problematic for our scenario, since hate speech datasets are imbalanced and thus the probability distributions are skewed towards the dominant class. With this in mind, and given that multiple deep learning models would be trained for comparison and evaluation anyway, the best approach proved to be the query-by-committee [42] algorithm.
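For reference, the uncertainty sampling rule mentioned above (select the tweets whose posterior probability is nearest to 0.5) can be written in a few lines. This is an illustrative sketch with a hypothetical function name and made-up probabilities, not the authors' code.

```python
def uncertainty_sample(probs, n):
    """Return the n tweet ids whose predicted hate probability is closest to 0.5.

    probs : dict mapping tweet id -> P(hate | tweet) from a single classifier
    """
    return sorted(probs, key=lambda t: abs(probs[t] - 0.5))[:n]

# Made-up probabilities: most mass sits near 0, as with a skewed classifier
probs = {"a": 0.02, "b": 0.48, "c": 0.91, "d": 0.55, "e": 0.01}
picked = uncertainty_sample(probs, 2)
```

The rule depends on a single model's probabilities, which is why skewed probability distributions are a concern for it.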
The query-by-committee approach involves maintaining a committee $\mathcal{C} = \{\theta^{(1)}, \ldots, \theta^{(C)}\}$ of models, which are all trained on the current labeled set $L$ but represent competing hypotheses. This is applicable to our scenario, since each deep learning model architecture captures different semantic and syntactic components of the tweet, even when all models are trained on the same dataset. Each committee member is then allowed to vote on the labels of query candidates. The most informative tweet to be sent to the annotators is considered to be the one that the committee members disagree upon the most.

Train dataset (initial)   Train dataset (additional)   Test dataset   macro-F1
8000                      -                            2000           0.43
8000                      2000 (random)                2000           0.43
8000                      2000 (hate probability)      2000           0.45
8000                      2000 (active learning)       2000           0.47

Table 5: Different evaluation experiments in the English language. We evaluate the models created with an initial train set and additional sets generated with different techniques.

The average Kullback-Leibler (KL) divergence [43] is used for measuring the level of disagreement among the classifiers:

$$x^{*}_{KL} = \operatorname*{argmax}_{x} \frac{1}{C} \sum_{c=1}^{C} D\left(P_{\theta^{(c)}} \,\|\, P_{\mathcal{C}}\right),$$

where

$$D\left(P_{\theta^{(c)}} \,\|\, P_{\mathcal{C}}\right) = \sum_{i} P_{\theta^{(c)}}(y_i \mid x) \log \frac{P_{\theta^{(c)}}(y_i \mid x)}{P_{\mathcal{C}}(y_i \mid x)}.$$

Here $\theta^{(c)}$ represents a particular model in the committee and $\mathcal{C}$ represents the whole committee, thus $P_{\mathcal{C}}(y_i \mid x) = \frac{1}{C} \sum_{c=1}^{C} P_{\theta^{(c)}}(y_i \mid x)$ is the "consensus" probability that $y_i$ is the correct label. KL divergence [44] is an information-theoretic measure of the difference between two probability distributions. This disagreement measure therefore considers the most informative query to be the one with the largest average difference between the label distribution of each committee member and the consensus.

To apply this process to our work, in the $k$-th iteration we train various deep learning models for hate speech detection in each language, using the annotated tweets $L_i^{k-1}$ that are generated in the previous iteration $k-1$. We denote the committee of models for each language $i$ by $\mathcal{C}_i$.
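The disagreement measure defined above can be computed directly for the binary case. The sketch below assumes that each committee member outputs a single hate speech probability p, so that its label distribution is [1 - p, p]; the function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Terms with p_i = 0 contribute nothing, following the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def committee_disagreement(hate_probs):
    """Average KL divergence between each committee member and the consensus.

    hate_probs : list with one P(hate | tweet) value per committee member
    """
    members = [[1.0 - p, p] for p in hate_probs]
    size = len(members)
    # Consensus distribution: the mean of the members' label distributions
    consensus = [sum(m[i] for m in members) / size for i in range(2)]
    return sum(kl_divergence(m, consensus) for m in members) / size

agree = committee_disagreement([0.3, 0.3, 0.3])   # members agree
mild = committee_disagreement([0.4, 0.6])         # mild disagreement
strong = committee_disagreement([0.1, 0.9])       # strong disagreement
```

A tweet on which all members agree scores (numerically close to) zero, and the score grows as the members' predictions move apart, so ranking tweets by this value surfaces the ones the committee disputes most.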
Specifically, the active learning sampling stage is described in the following steps, and the entire annotation process is presented in Figure 1. For each language $i$:

1. We train the models described in Section 6.2 and generate the committee of classifiers $\mathcal{C}_i$, using the previous batch of annotated tweets $L_i^{k-1}$. Note that in this step we do not include the ensemble model in the committee $\mathcal{C}_i$.
2. For every tweet in the unlabeled pool $U_i$, we calculate the hate speech probability of each model $\theta^{(c)}$ and compute the average KL divergence.
3. For every tweet in the unlabeled pool $U_i$, we calculate the ensemble model's hate speech probability.
4. We sample the 8000 tweets from $U_i$ that have the highest KL divergence among those whose ensemble output probability is higher than 0.2. We also randomly sample 2000 tweets to generate the batch $B_i^k$ with a total of 10000 tweets. Note that the sampled tweets are removed from $U_i$. Also, in the final iteration the size of the batch might be smaller than 10000 if there are not enough tweets left in $U_i$.
5. We submit $B_i^k$ for annotation and we append the annotated batch to the pool of labeled tweets $L_i^k$.
6. The process is repeated from step 1, using $L_i^k$ to retrain the classifiers and generate the next batch for annotation. We stop this process if there are no available tweets in $U_i$ or the annotation budget is exhausted.

Table 4 shows that the first annotation batch created with the active learning approach ($B_i^2$) has a higher percentage of hate speech than the initial or the preliminary batch. This is expected because in the active learning setup we use models that are trained on datasets annotated using our definition of hate speech.

Table 5 shows the experiments conducted to validate whether tweets sampled with this approach contribute to the model's ability to learn and generalize. In this experimental setting, the initial annotated batch of English tweets is split into 8000 training and 2000 testing tweets.
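Step 4 of the procedure above can be sketched as follows. The sketch assumes that the average KL divergence (`kl`) and the ensemble probability (`ens_prob`) of every pool tweet were already computed in steps 2 and 3; all names are hypothetical, and drawing the random tweets from the rest of the pool is an assumption.

```python
import random

def select_batch(pool, kl, ens_prob, n_informative=8000, n_random=2000,
                 min_prob=0.2, seed=0):
    """Return (batch, new_pool) for one active learning iteration.

    pool     : list of tweet ids in the unlabeled pool
    kl       : dict tweet id -> average KL divergence of the committee
    ens_prob : dict tweet id -> hate speech probability of the ensemble model
    """
    rng = random.Random(seed)
    # Keep only tweets the ensemble scores above the loose threshold
    eligible = [t for t in pool if ens_prob[t] > min_prob]
    # Highest committee disagreement first
    eligible.sort(key=lambda t: kl[t], reverse=True)
    informative = eligible[:n_informative]
    taken = set(informative)
    # Assumption: random tweets are drawn from the not yet selected pool
    rest = [t for t in pool if t not in taken]
    extra = rng.sample(rest, min(n_random, len(rest)))
    taken.update(extra)
    batch = informative + extra
    # Selected tweets are removed from the pool
    new_pool = [t for t in pool if t not in taken]
    return batch, new_pool

# Toy pool with made-up scores and scaled-down batch sizes
pool = ["t1", "t2", "t3", "t4", "t5", "t6"]
kl = {"t1": 0.30, "t2": 0.05, "t3": 0.25, "t4": 0.40, "t5": 0.01, "t6": 0.10}
ens_prob = {"t1": 0.6, "t2": 0.9, "t3": 0.3, "t4": 0.1, "t5": 0.5, "t6": 0.25}
batch, new_pool = select_batch(pool, kl, ens_prob, n_informative=2, n_random=1)
```

In the toy data, "t4" has the highest disagreement but is excluded from the informative part because its ensemble probability is below the threshold; it can still enter the batch through the random draw.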
A simple CNN model is trained and the macro-average F1 score is calculated on the test set. This process is repeated three times, each time adding a different annotated set to the training data. The first set consists of 2000 randomly sampled tweets, the second consists of 2000 tweets sampled from the [0.2, 1.0] hate speech probability interval, and the third contains 2000 tweets sampled using the active learning approach. As Table 5 shows, random sampling does not improve the evaluation results at all. On the other hand, both the hate probability and the active learning approaches improve the macro-F1 score, with the latter performing best (0.47 compared to 0.43 for the initial training set). In this experiment, active learning improves the learning process of the classifier by choosing more appropriate tweets for annotation.

5.4 Datasets

Language   Set      Hate speech
                    positive   negative
EN         train    5804       68051
EN         test     1355       16812
EN         total    7159       84863
DE         train    1361       33626
DE         test     340        8707
DE         total    1702       42033
ES         train    795        29355
ES         test     199        7339
ES         total    994        36694
FR         train    2163       29124
FR         test     541        7281
FR         total    2704       26405
GR         train    913        48271
GR         test     228        12069
GR         total    1141       60340

Table 6: Number of positive and negative annotated tweets in the different sets and languages of our publicly available datasets

The final datasets for the English, German, Spanish, French and Greek languages (footnotes 6 to 10, respectively) are hosted on the Zenodo platform and are available upon request. Each dataset contains tweet IDs and their corresponding binary annotations. Table 6 shows the individual dataset statistics for each language. The datasets are provided as train/test sets that preserve the proportion of negative and positive samples.

6 Experimental study

In this section we present the experimental pipeline that is followed in order to train and evaluate the hate speech detection models using the datasets described in the previous section.
6.1 Tweet pre-processing

Due to the nature of Twitter data, there is a lot of noise among words. Posted links or mentions do not provide any useful information and need to be normalized. To achieve this, a state-of-the-art tweet normalization tool [45] is used to tokenize each tweet and transform it into a sequence of words. The process involves normalization of Twitter handles (e.g. @random_user becomes <user>), emoji transformation (e.g. :( becomes <sad>), lower casing, as well as URL, email and number removal. Furthermore, only basic punctuation is retained (e.g. .,?;”).

6.2 Models

Following the latest trend in the literature, which shifts towards deep learning based methods, we use some of the latest state-of-the-art models for text classification. Deep learning models perform better than traditional methods in most NLP tasks, including hate speech detection [33, 15]. Furthermore, an ensemble learning architecture is proposed, since it combines the predictive power of the individual classifiers. Below we describe the deep learning architectures that are evaluated in this work.

CNN. A simple Convolutional Neural Network model described in [35] acts as an n-gram feature extractor. Using window sizes of 2, 3 and 4, this CNN model can extract bi-gram, tri-gram and quad-gram features. The output of each convolution is then down-sampled by a 1D max pooling layer with a pool size of 4 and a stride of 4 for further feature selection. After the concatenation of the pooling layers, another 1D max pooling layer is added and the output is fed to the final fully connected layer.

Footnotes 6 to 10 (datasets for English, German, Spanish, French and Greek, in the order listed):
https://zenodo.org/record/3520152#.XcL0OnUzY5k
https://zenodo.org/record/3520148#.XcL04XUzY5k
https://zenodo.org/record/3520150#.XcL1C3UzY5k
https://zenodo.org/record/3520156#.XcL1GHUzY5k
https://zenodo.org/record/3520157#.XcL1G3UzY5k

Skipped CNN (sCNN).
Extending the base CNN model in order to capture features of words that are not next to each other, Zhang et al. [35] proposed the skipped CNN layer. A skipped CNN applies a mask to a kernel window, skipping intermediate words and associating words that are not adjacent. According to the authors, skipped CNNs can be considered as extractors of 'skip-gram' like features.

CNN + GRU. The work in [37] adds a GRU layer followed by a global max pooling layer on top of the CNN model. The GRU layer captures sequence feature relations and learns to identify dependencies between n-gram features.

LSTM. A bidirectional LSTM model is created. After the embedding layer, spatial dropout is introduced, which randomly masks 20% of the input words. To process the sequence of word embeddings, an LSTM layer with 128 units is used. Next, a global max pooling and a global average pooling layer are concatenated, flattening the output space by taking the highest and the average value over the timestep dimension, respectively. The produced feature vector is fed into the final fully connected layer.

LSTM + Attention (aLSTM). Attention mechanisms have been used successfully in many NLP tasks, as in [46]. Intuitively, attention is a mechanism that learns to favor features that are more relevant to the classification task by assigning weights to them. This means that features that are not important to the task are multiplied by smaller weights, while predictive features are multiplied by higher weights. The attention layer is implemented based on [47] and is applied to the LSTM model. Instead of taking the max and average features over the timesteps, an attention layer with 100 units is used to extract the important features of the LSTM layer. The output of the attention layer is then fed to the output layer.

Ensemble (E). Aken et al. [48] proposed an ensemble model based on the assumption that classification methods vary in their predictive power and may make specific errors.
The ensemble model in [48] is trained with gradient boosting decision trees. In contrast, we use a simple dense neural network as the ensemble architecture: the output predictions of the individual models are forwarded as inputs to a dense layer with 20 neurons, a dropout layer with a ratio of 0.2 is applied, and the resulting features are forwarded to the final output layer. Intuitively, this small neural network learns to apply a weighted average based on the prediction probability of each individual classifier.

6.3 Implementation details

For all methods discussed in this work, we use Keras [49] with the Tensorflow [50] backend and the scikit-learn [51] library. Each model is trained for 10 epochs with a mini-batch of 64 tweets. Keras requires static input sequences, meaning that the maximum number of words in a tweet has to be predefined. The maximum sequence length for a tweet is therefore set to 50 words, since experimentation showed that this limit does not affect performance. Zero padding is used for tweets with fewer than 50 words.

The first layer of every model is an embedding layer. We initialize the embedding layer using pre-trained word vectors for each language. Preliminary experiments showed that the best pre-trained embedding choice for the Greek and French languages is fastText embeddings [52], trained on Common Crawl and Wikipedia. For the English, Spanish and German languages, GloVe embeddings [53] achieve better evaluation results. Word2vec [54] pre-trained embeddings are also tested. Note that the evaluation results of the different embedding approaches do not exhibit significant differences. Words that do not exist in the pre-trained embeddings are assigned randomly initialized vectors, and the embedding layer is further fine-tuned during the training process. The padding token is represented by a zero-initialized vector. For every model, the default parameters provided by the corresponding authors are used, unless stated otherwise.
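The fixed input length of 50 words with zero padding, described in the implementation details above, can be illustrated with a small sketch. The text does not state on which side tweets are padded or truncated; the version below pads and truncates at the end, which is an assumption (the Keras `pad_sequences` helper, by contrast, pads and truncates at the front by default). The function name is hypothetical.

```python
MAX_LEN = 50   # maximum number of words per tweet, as stated in the text
PAD_ID = 0     # index reserved for the padding token

def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return a list of exactly max_len token ids.

    Assumption: longer tweets are cut at the end and shorter ones are
    padded at the end with pad_id.
    """
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

short = pad_or_truncate([5, 9, 2])               # a 3-word tweet
long_ = pad_or_truncate(list(range(1, 61)))      # a 60-word tweet
```

Reserving index 0 for padding matches the zero-initialized padding vector mentioned in the text, since that index can then be mapped to an all-zero embedding.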
The l2 regularization parameter is set to be 1e for every layer. We treat hate speech detection as a binary classification problem. The final fully connected layer uses a sigmoid activation and outputs the hate speech probability. The binary cross-entropy loss function and the Adam optimizer are used to train the models.

6.4 Evaluation setup

In the related literature, the performance of hate speech detection is typically evaluated with the classic Precision, Recall and F1 metrics. Precision measures the percentage of true positives among the predicted hate speech tweets. Recall measures the percentage of true positives among the ground truth hate speech tweets, and F1 is the harmonic mean of the two. The three metrics are applied to each dataset class and an aggregated result is computed using either micro-averaging or macro-averaging. The first approach sums up the true positives, false positives and false negatives over all classes and calculates overall Precision, Recall and F1 scores from these totals, without distinguishing between classes. The second approach takes the average of the Precision, Recall and F1 scores of the different classes. Existing studies on hate speech detection have primarily reported their results using micro-average Precision, Recall and F1 [34, 33, 15, 25, 37].

As stated in [35] and as our dataset statistics in Table 6 make clear, hate speech datasets are usually highly imbalanced. In imbalanced datasets, like the ones discussed in this paper, micro-averaging can inherently hide the real performance on minority classes: a significantly lower or higher F1 score on a minority class is unlikely to cause a significant change in micro-F1 on the entire dataset. In a practical application like hate speech detection, reporting micro-F1 on the entire dataset will not properly reflect a model's performance on hateful content as opposed to non-hate.
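The difference between the two averaging schemes can be made concrete with a small worked example. The sketch below uses made-up labels (95 non-hate and 5 hate tweets, of which the classifier finds only one) and hypothetical function names; it illustrates the definitions above and does not reproduce any result of this work.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def macro_micro_f1(y_true, y_pred, classes=(0, 1)):
    """Return (per-class F1 dict, macro-F1, micro-F1)."""
    counts = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fp, fn)
    per_class = {c: prf(*counts[c])[2] for c in classes}
    # Macro: average the per-class F1 scores, so every class weighs the same
    macro = sum(per_class.values()) / len(classes)
    # Micro: pool the counts over all classes, then compute a single F1
    tp, fp, fn = (sum(v[i] for v in counts.values()) for i in range(3))
    micro = prf(tp, fp, fn)[2]
    return per_class, macro, micro

# 95 non-hate tweets (0) and 5 hate tweets (1); only one hate tweet is detected
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1] + [0] * 4
per_class, macro, micro = macro_micro_f1(y_true, y_pred)
```

In this toy case the hate class F1 is about 0.33, the macro-F1 about 0.66 and the micro-F1 0.96, so the micro average largely hides the weak performance on the minority class.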
Motivated by these observations, we use the standard Precision (P), Recall (R) and F1 measures for evaluation and report their macro averages (m-P, m-R, m-F1). Additionally, we provide the F1 score obtained on the hate speech class (h-F1). To train and evaluate the models for hate speech detection, we use the training and test sets reported in Table 6, respectively.

6.5 Results

Table 7 shows the evaluation results for hate speech detection in each language. A first observation that highlights the imbalance between classes is that the F1 score for the hate class is significantly lower than the macro F1 score. This is expected because the number of negatively annotated tweets in the test dataset is significantly larger than the number of positively annotated ones, as displayed in Table 6.

Language   Metric   CNN    sCNN   CNN + GRU   LSTM   aLSTM   E
EN         m-P      0.81   0.83   0.80        0.77   0.79    0.80
EN         m-R      0.78   0.78   0.80        0.78   0.79    0.82
EN         m-F1     0.79   0.80   0.80        0.77   0.79    0.81
EN         h-F1     0.61   0.64   0.63        0.58   0.61    0.65
DE         m-P      0.64   0.67   0.68        0.65   0.67    0.67
DE         m-R      0.67   0.71   0.68        0.65   0.66    0.71
DE         m-F1     0.65   0.69   0.68        0.65   0.66    0.69
DE         h-F1     0.34   0.40   0.38        0.32   0.35    0.40
ES         m-P      0.69   0.69   0.70        0.74   0.68    0.70
ES         m-R      0.71   0.75   0.72        0.68   0.68    0.73
ES         m-F1     0.70   0.72   0.71        0.70   0.68    0.72
ES         h-F1     0.42   0.45   0.44        0.42   0.38    0.44
FR         m-P      0.81   0.81   0.83        0.80   0.80    0.84
FR         m-R      0.81   0.82   0.81        0.77   0.82    0.81
FR         m-F1     0.81   0.82   0.82        0.78   0.81    0.83
FR         h-F1     0.65   0.66   0.66        0.64   0.64    0.67
GR         m-P      0.81   0.87   0.87        0.86   0.86    0.87
GR         m-R      0.78   0.77   0.75        0.75   0.75    0.78
GR         m-F1     0.79   0.81   0.80        0.80   0.80    0.82
GR         h-F1     0.59   0.63   0.60        0.60   0.60    0.65

Table 7: The evaluation results for hate speech class

By inspecting each language separately, we notice that there are no significant performance differences between the models in terms of macro F1. However, among the individual models, the sCNN model generally seems to exhibit the best performance.
Some exceptions are observed, as in the case of the Spanish language, where the LSTM model performs better in terms of macro Precision, and in the Greek language, where the CNN model has a better macro Recall. sCNN seems to be the most compelling feature extractor for hate speech, as it achieves the best F1 score for the hate class among the individual models. This also corresponds to an overall better macro F1 for sCNN compared to the other methods. For the French language, the CNN + GRU model performs on par with the sCNN model.

Additionally, the combination of all individual models in the ensemble model (E) yields even better results in terms of macro F1. The ensemble model has the best macro F1 score, as it manages to perform well in terms of both macro Precision and macro Recall. The ensemble model also exhibits the best performance in terms of the hate speech F1 score. The only exception is observed in the Spanish language, where the sCNN model achieves a higher F1 score for the hate speech class.

Another observation is that the evaluation for English, French and Greek, specifically in terms of the F1 score of the positive class, is significantly better than for the Spanish and German languages. This is potentially due to the fact that there are fewer positive samples in the related datasets. Our goal is to continue expanding the datasets and specifically address this issue for these two languages.

7 Conclusion

In this work, we try to tackle hate speech directed at journalists on social media. To accomplish this, we define hate speech in a way that takes into consideration the journalistic point of view and is simple enough to be used by non-specialists. Using this definition, we create labeled datasets in five different languages. During data annotation, a comprehensive annotation strategy is followed. The generated datasets are made publicly available to assist further research efforts.
Furthermore, we use these datasets to train various state-of-the-art deep learning architectures, and we propose an ensemble model that outperforms all individual models.

Another major contribution of this study is its annotation pipeline. To increase the number of positive annotations, we employ keyword-based sampling and also an approach that uses available datasets from other related works. Namely, the dataset-based sampling approach trains detection models using related datasets and samples tweets based on the output probability of the model. This approach is similar to transferring the definition of hate speech from other works and using it to sample tweets. This is particularly useful in cases where the definition of hate speech in these datasets has a wider scope than the proposed definition. These approaches introduce bias to the resulting datasets. Although bias is inevitable, we propose ways to mitigate it: lowering the sampling threshold of the detection models, combining multiple datasets and adding randomly sampled data.

Building on the initial annotation stage, we also present an active learning approach. The contribution of this stage is that, besides the high percentage of positive annotations in the resulting batches, these batches contain data that can contribute to the learning process of the models. Specifically, data sampled with this approach generally deviate from the already labeled data and consequently improve the generalization of the model.

We believe that the presented annotation process can be applied to other domains, beyond the scope of this work. Specifically, the two sampling stages can be very effective when annotating a large corpus of data. The only input required for these approaches is one or more relevant datasets or a curated list of keywords.

As future steps, we plan to keep expanding the datasets with new tweets.
To this end, we have developed an alert monitoring mechanism for journalists that supports further annotation of tweets. Using these tweets, we plan to retrain our models at frequent intervals. Additionally, we aim to investigate new active learning techniques in order to choose more informative tweets that improve the models' ability to learn and generalize. Another issue we will focus on is the imbalance between the positive and negative classes. To alleviate this, we will explore ways to fetch more hateful content in a more unbiased manner. Finally, we will experiment with state-of-the-art deep learning architectures for natural language processing such as BERT [55] or ULMFiT [56].

8 Acknowledgements

This work was supported by the Rights, Equality and Citizenship programme of the European Union (2014-2020) under grant agreement number 785679.

References

[1] V. Jourová, Code of conduct on countering illegal hate speech online: First results on implementation.
[2] UNESCO, World trends in freedom of expression and media development: global report 2017/2018, 2018.
[3] E. Wulczyn, N. Thain, L. Dixon, Ex Machina: Personal Attacks Seen at Scale, in: Proceedings of the 26th International Conference on World Wide Web, WWW ’17, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2017, pp. 1391–1399 (Perth, Australia). doi:10.1145/3038912.3052591. URL https://doi.org/10.1145/3038912.3052591
[4] A. M. Founta, C. Djouvas, D. Chatzakou, I. Leontiadis, J. Blackburn, G. Stringhini, A. Vakali, M. Sirivianos, N. Kourtellis, Large scale crowdsourcing and characterization of twitter abusive behavior, in: Twelfth International AAAI Conference on Web and Social Media, 2018.
[5] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated hate speech detection and the problem of offensive language, in: Eleventh International AAAI Conference on Web and Social Media, 2017.
[6] Z. Waseem, D.
Hovy, Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter, in: Proceedings of the NAACL Student Research Workshop, Association for Computational Linguistics, San Diego, California, 2016, pp. 88–93. doi:10.18653/v1/N16-2013. URL https://www.aclweb.org/anthology/N16-2013
[7] S. Sharma, S. Agrawal, M. Shrivastava, Degree based Classification of Harmful Speech using Twitter Data, in: Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 106–112. URL https://www.aclweb.org/anthology/W18-4413
[8] O. d. Gibert, N. Perez, A. G. Pablos, M. Cuadros, Hate Speech Dataset from a White Supremacy Forum, 2018, pp. 11–20. URL https://aclweb.org/anthology/papers/W/W18/W18-5102/
[9] M. ElSherief, S. Nilizadeh, D. Nguyen, G. Vigna, E. Belding, Peer to Peer Hate: Hate Speech Instigators and Their Targets (Apr. 2018). URL https://arxiv.org/abs/1804.04649v1
[10] I. Kwok, Y. Wang, Locate the Hate: Detecting Tweets against Blacks, in: AAAI, 2013.
[11] B. Ross, M. Rist, G. Carbonell, B. Cabrera, N. Kurowsky, M. Wojatzki, Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis (Jan. 2017). doi:10.17185/duepublico/42132. URL https://arxiv.org/abs/1701.08118v1
[12] M. Wiegand, M. Siegel, J. Ruppenhofer, Overview of the GermEval 2018 shared task on the identification of offensive language (2018).
[13] F. D. Vigna, A. Cimino, F. Dell’Orletta, M. Petrocchi, M. Tesconi, Hate me, hate me not: Hate speech detection on facebook, in: Proceedings of the First Italian Conference on Cybersecurity (ITASEC17), 2017.
[14] H. M. Saleem, K. P. Dillon, S. Benesch, D. Ruths, A Web of Hate: Tackling Hateful Speech in Online Social Spaces, arXiv:1709.10159 [cs] (Sep. 2017). URL http://arxiv.org/abs/1709.10159
[15] J. H. Park, P.
Fung, One-step and Two-step Classification for Abusive Language Detection on Twitter, 2017, pp. 41–45. doi:10.18653/v1/W17-3006. URL https://aclweb.org/anthology/papers/W/W17/W17-3006/
[16] C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, Y. Chang, Abusive Language Detection in Online User Content, in: Proceedings of the 25th International Conference on World Wide Web, WWW ’16, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2016, pp. 145–153 (Montréal, Québec, Canada). doi:10.1145/2872427.2883062. URL https://doi.org/10.1145/2872427.2883062
[17] E. A. Jane, “your a ugly, whorish, slut” understanding e-bile, Feminist Media Studies 14 (4) (2014) 531–546.
[18] W. Warner, J. Hirschberg, Detecting Hate Speech on the World Wide Web, in: Proceedings of the Second Workshop on Language in Social Media, Association for Computational Linguistics, Montréal, Canada, 2012, pp. 19–26. URL https://www.aclweb.org/anthology/W12-2103
[19] P. Burnap, M. L. Williams, Cyber Hate Speech on Twitter: An Application of Machine Classification and Statistical Modeling for Policy and Decision Making, Policy & Internet 7 (2) (2015) 223–242. doi:10.1002/poi3.85. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/poi3.85
[20] N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Radosavljevic, N. Bhamidipati, Hate Speech Detection with Comment Embeddings, in: Proceedings of the 24th International Conference on World Wide Web, WWW ’15 Companion, ACM, New York, NY, USA, 2015, pp. 29–30 (Florence, Italy). doi:10.1145/2740908. URL http://doi.acm.org/10.1145/2740908.2742760
[21] S. Assimakopoulos, F. H. Baider, S. Millar, Online hate speech in the European Union: a discourse-analytic perspective, Springer, 2017.
[22] Council of the European Union, Council Framework Decision 2008/913/JHA of 28 November 2008 on combating certain forms and expressions of racism and xenophobia by means of criminal law, Official Journal of the European Union L 328/55 (2008).
[23] OHCHR, Rabat plan of action on the prohibition of advocacy of national, racial or religious hatred that constitutes incitement to discrimination, hostility or violence, Office of the United Nations High Commissioner for Human Rights Report (2013).
[24] D. K. Citron, Hate crimes in cyberspace, Harvard University Press, 2014.
[25] Z. Waseem, Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter, in: Proceedings of the First Workshop on NLP and Computational Social Science, Association for Computational Linguistics, Austin, Texas, 2016, pp. 138–142. doi:10.18653/v1/W16-5618. URL https://www.aclweb.org/anthology/W16-5618
[26] Jigsaw, Toxic Comment Classification Challenge (2018). URL https://kaggle.com/c/jigsaw-toxic-comment-classification-challenge
[27] Quora, Quora Insincere Questions Classification (2019). URL https://kaggle.com/c/quora-insincere-questions-classification
[28] Impermium, Detecting Insults in Social Commentary (2012). URL https://kaggle.com/c/detecting-insults-in-social-commentary
[29] A. Schmidt, M. Wiegand, A Survey on Hate Speech Detection using Natural Language Processing, 2017, pp. 1–10. doi:10.18653/v1/W17-1101. URL https://aclweb.org/anthology/papers/W/W17/W17-1101/
[30] Y. Mehdad, J. Tetreault, Do Characters Abuse More Than Words?, 2016, pp. 299–303. doi:10.18653/v1/W16-3638. URL https://aclweb.org/anthology/papers/W/W16/W16-3638/
[31] G. Xiang, B. Fan, L. Wang, J. Hong, C. Rose, Detecting Offensive Tweets via Topical Feature Discovery over a Large Scale Twitter Corpus, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, ACM, New York, NY, USA, 2012, pp.
1980–1984, event-place: Maui, Hawaii, USA. doi:10.1145/2396761.2398556. URL http://doi.acm.org/10.1145/2396761.2398556
[32] L. Gao, R. Huang, Detecting Online Hate Speech Using Context Aware Models, 2017, pp. 260–266. doi:10.26615/978-954-452-049-6_036. URL https://aclweb.org/anthology/papers/R/R17/R17-1036/
[33] B. Gambäck, U. K. Sikdar, Using convolutional neural networks to classify hate-speech, in: Proceedings of the first workshop on abusive language online, 2017, pp. 85–90.
[34] P. Badjatiya, S. Gupta, M. Gupta, V. Varma, Deep Learning for Hate Speech Detection in Tweets, in: Proceedings of the 26th International Conference on World Wide Web Companion, WWW ’17 Companion, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2017, pp. 759–760, event-place: Perth, Australia. doi:10.1145/3041021.3054223. URL https://doi.org/10.1145/3041021.3054223
[35] Z. Zhang, L. Luo, Hate Speech Detection: A Solved Problem? The Challenging Case of Long Tail on Twitter, arXiv:1803.03662 [cs] (Feb. 2018). URL http://arxiv.org/abs/1803.03662
[36] A. M. Founta, D. Chatzakou, N. Kourtellis, J. Blackburn, A. Vakali, I. Leontiadis, A unified deep learning architecture for abuse detection, in: Proceedings of the 10th ACM Conference on Web Science, ACM, 2019, pp. 105–114.
[37] Z. Zhang, D. Robinson, J. Tepper, Detecting Hate Speech on Twitter Using a Convolution-GRU Based Deep Neural Network, in: A. Gangemi, R. Navigli, M.-E. Vidal, P. Hitzler, R. Troncy, L. Hollink, A. Tordai, M. Alam (Eds.), The Semantic Web, Lecture Notes in Computer Science, Springer International Publishing, 2018, pp. 745–760.
[38] L. Torrey, J. Shavlik, Transfer learning, in: Handbook of research on machine learning applications and trends: algorithms, methods, and techniques, IGI Global, 2010, pp. 242–264.
[39] M. Wiegand, J. Ruppenhofer, T.
Kleinbauer, Detection of abusive language: the problem of biased datasets, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 602–608.
[40] A. Khetan, Z. C. Lipton, A. Anandkumar, Learning from noisy singly-labeled data, arXiv preprint arXiv:1712.04577 (2017).
[41] D. D. Lewis, W. A. Gale, A sequential algorithm for training text classifiers, in: SIGIR’94, Springer, 1994, pp. 3–12.
[42] H. S. Seung, M. Opper, H. Sompolinsky, Query by committee, in: Proceedings of the fifth annual workshop on Computational learning theory, ACM, 1992, pp. 287–294.
[43] A. K. McCallum, K. Nigam, Employing EM and pool-based active learning for text classification, Citeseer.
[44] S. Kullback, R. A. Leibler, On information and sufficiency, The annals of mathematical statistics 22 (1) (1951) 79–86.
[45] C. Baziotis, N. Pelekis, C. Doulkeridis, DataStories at SemEval-2017 Task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 747–754.
[46] T. Luong, H. Pham, C. D. Manning, Effective Approaches to Attention-based Neural Machine Translation, 2015, pp. 1412–1421. doi:10.18653/v1/D15-1166. URL https://aclweb.org/anthology/papers/D/D15/D15-1166/
[47] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, B. Xu, Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification, 2016, pp. 207–212. doi:10.18653/v1/P16-2034. URL https://aclweb.org/anthology/papers/P/P16/P16-2034/
[48] B. v. Aken, J. Risch, R. Krestel, A. Löser, Challenges for Toxic Comment Classification: An In-Depth Error Analysis, 2018, pp. 33–42. URL https://aclweb.org/anthology/papers/W/W18/W18-5105/
[49] F.
Chollet, et al., Keras, https://keras.io (2015).
[50] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, software available from tensorflow.org (2015). URL http://tensorflow.org/
[51] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[52] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning word vectors for 157 languages, in: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[53] J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.
[54] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in neural information processing systems, 2013, pp. 3111–3119.
[55] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[56] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, arXiv preprint arXiv:1801.06146 (2018).

Journal

Statistics, arXiv (Cornell University)

Published: Dec 5, 2019
