Time of Your Hate: The Challenge of Time in Hate Speech Detection on Social Media
Round 1
Reviewer 1 Report
This work focuses on analysing the impact of language dynamics on the detection of hate speech in social media.
In general, the work is interesting and the manuscript is well written.
The first three sections are very good: the introduction clearly contextualises and motivates the research work, establishing the objectives and the research questions; the related work section presents an adequate set of recent publications related to the problem, making connections to the different aspects required to better understand the presented work; the description of the method allows for a general understanding of the choices of the authors, but it could include more details for adequate replicability (we know that AlBERTo is used, but in which way? what is the architecture of the classifier?).
My main concerns are the following:
- The originality of the work is moderate. There are several studies that focus on language dynamics (and the impact of language dynamics on several NLP tasks), including in social media (see, for example, https://arxiv.org/pdf/1609.02075.pdf). This is not a problem in itself, but
- sometimes, the manuscript leads the reader to a comparison between approaches for hate speech detection (SVM vs deep learning), which, as described in this work, is less interesting, as there is a lack of detail in the descriptions of the approaches and no reason is given for only exploring these approaches;
- the results do not provide strong evidence for the claims.
- The description of the data should be improved.
- It has two sections — training data and test data — but the datasets described in the test data subsection are also used for training, which might mislead the reader.
- In the training data subsection, it is mentioned that a selection of tweets from the TWITA dataset was made, based on a list of topic-based keywords ("and imposing the constraint of the tweets having a geotag in Italy"). More information on how this selection was done is important to understand how the dataset used in the experiments was created.
- It is said that the tweets were annotated by three independent contributors, but no information about the inter-annotator agreement is given. This is especially important, since the annotation was based on crowdsourcing.
- Information about the length of tweets and vocabulary diversity would also improve the characterisation of the dataset.
- Table 1 shows different distributions of non-HS vs HS in the different subsets. Why does this happen? More information concerning this aspect is important.
- Concerning the results, it is important to address several aspects:
- The design of the experiment is adequate.
- I think that having some tables with numbers would help to grasp the whole picture.
- I would say that it is not that relevant whether there is a statistically significant difference between the SVM and the deep learning approach. It is important to have statistical significance when comparing the results of Figure 1 to the results of Figure 2.
- It is of great importance to have a deeper analysis of the results. There are several aspects that should be commented on:
- in Figure 1, we can see that the recall is relatively stable across the several partitions and, compared to Figure 2, the results achieved by the AlBERTo-based approach seem, in general, better in Figure 1;
- the tendency of the F1-score is similar in both figures, as are the values;
- why do recall and F1 keep getting lower when advancing along the timeline in Figure 2?
- wouldn't it be important to use a strategy for unbalanced datasets, especially because of the last partitions?
- it is said that temporal proximity is more important than more data, but Figures 2.c) and d) and Figures 2.e) and f) do not seem to support that claim.
- A better discussion of the results, including these aspects is very important.
- The lexical analysis is an interesting idea, but several aspects should be improved:
- a reference and a brief description of the weirdness index is given, but it would be clearer to include an equation;
- this single(!) paragraph should be improved for better readability — the weirdness index is applied in different ways, a polarised weirdness index is mentioned (but no definition is provided), there are hints of a POS-based analysis, and in the end we have a chart (Figure 3) with a simpler frequency-based analysis (with relative frequencies reaching up to almost 400%, which is quite odd);
- in this helpful section, more visual information is important for better readability.
There are some minor English problems that should be addressed. For example,
- In line 64, "pre-train" should be "pre-training"
- In line 67, "es" should be "as"
- In line 136, "two different set" should be "two different sets"
- The sentence spanning lines 176 and 177 is not clear
- In line 197, "is" should be "are"
- In general, the text should be better proof-read from section 5 until the end
Author Response
We would like to deeply thank you for your detailed and in-depth feedback on our work, which helped us to improve the paper.
Comment 1
The originality of the work is moderate. There are several studies that focus on language dynamics (and the impact of language dynamics on several NLP tasks), including in social media (see, for example, https://arxiv.org/pdf/1609.02075.pdf). This is not a problem in itself, but sometimes the manuscript leads the reader to a comparison between approaches for hate speech detection (SVM vs deep learning), which, as described in this work, is less interesting, as there is a lack of detail in the descriptions of the approaches and no reason is given for only exploring these approaches; the results do not provide strong evidence for the claims.
Reply to Comment 1
We extended the related work in order to better highlight the novelty of our contribution with respect to the state of the art. In particular, we extended the discussion of related work in Sec. 2 (from line 128 onward) by pointing out the relation with studies about the dynamic changes of languages, like the one you suggest (Goel et al., 2016). Thanks for the suggestion. Such works focus on language change mainly from a linguistic analysis point of view, while we aim to use this hypothesis to study the robustness over time of machine learning approaches to hate speech detection.
The originality of our work consists, to our knowledge, in presenting the first attempt to tackle the issue of diachronic degradation of hate speech prediction systems. We propose to achieve this goal by means of an evaluation methodology for exploring the temporal robustness of hate speech detection algorithms.
More specifically, we focus on analysing how different time spans of the training data result in different trends for the F1 score and macro F1 score in the case of two different prediction algorithms: SVM and AlBERTo.
The rationale behind the two chosen algorithms, as explained in the introduction, is that SVM is a widely used classification algorithm for the hate speech detection task, while BERT is a recent transformer-based approach that has gained much attention in the NLP community in the past year due to both its outstanding performance in predicting hate speech and in classification tasks more generally, and its exploitation of unannotated data where labelled corpora are scarce, which is exactly the scenario in which our work is situated.
In order to make the models used in this contribution easier to understand than in the Old Version, we have extended Section 3 "Method". We have included a broader discussion of the SVM and AlBERTo/BERT models, and we have better described the classification strategy used in each of the two approaches. This new content can be found from line 117 to line 192.
We are sorry if we were not able to support our claims clearly in the Old Version. We also enriched Sec. 5.2 “Results” in order to better explain our observations. In particular, in Fig. 2, a clear drop in the F1 measure within 6 months can be noted for both models. Moreover, observing Fig. 3, it is possible to see that this phenomenon is less evident if we use a strategy based on incremental learning.
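For illustration only, the sketch below shows how an AlBERTo-style BERT checkpoint could be fine-tuned as a binary hate speech classifier with Hugging Face Transformers; the model identifier, hyperparameters and toy data are assumptions for the sake of the example, not necessarily the exact setup described in Sec. 3.

```python
# Minimal sketch (not the exact implementation in the paper) of fine-tuning
# a BERT-style checkpoint for binary hate speech classification.
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0"  # assumed identifier

texts = ["esempio di tweet uno", "esempio di tweet due"]   # toy data
labels = [0, 1]                                            # 0 = non-HS, 1 = HS

class TweetDataset(Dataset):
    """Wraps tokenised tweets and labels for the Trainer API."""
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.enc = tokenizer(texts, truncation=True,
                             padding="max_length", max_length=max_len)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

train_ds = TweetDataset(texts, labels, tokenizer)
args = TrainingArguments(output_dir="alberto-hs", num_train_epochs=3,
                         per_device_train_batch_size=32, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```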
Comment 2
The description of the data should be improved. It has two sections — training data and test data — but the datasets described in the test data subsection are also used for training, which might mislead the reader. In the training data subsection, it is mentioned that a selection of tweets from the TWITA dataset was made, based on a list of topic-based keywords ("and imposing the constraint of the tweets having a geotag in Italy"). More information on how this selection was done is important to understand how the dataset used in the experiments was created. It is said that the tweets were annotated by three independent contributors, but no information about the inter-annotator agreement is given. This is especially important, since the annotation was based on crowdsourcing.
Information about the length of tweets and vocabulary diversity would also improve the characterisation of the dataset.
Table 1 shows different distributions of non-HS vs HS in the different subsets. Why does this happen? More information concerning this aspect is important.
Reply to Comment 2
In Sec. 4 “Data” we added an introduction that we hope helps to clarify how we selected and used the data for our experiments. We also included a detailed description of the full process that was implemented to determine the keywords used to filter the tweets (lines 227-243). Concerning the inter-annotator agreement, the full annotation process is reported in [49] and [50], as now stated explicitly in the paper (lines 254-255).
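As a purely hypothetical sketch of the kind of selection described above (the actual keyword list and the TWITA access procedure are documented in the paper, not reproduced here; all names below are illustrative):

```python
# Illustrative filter: keep tweets that mention a topic keyword and are
# geotagged in Italy. Keywords and tweet fields are placeholder assumptions.
TOPIC_KEYWORDS = {"migranti", "rifugiati"}   # placeholder keywords

def keep_tweet(tweet):
    """Keep a tweet if it mentions a topic keyword and is geotagged in Italy."""
    text = tweet["text"].lower()
    has_keyword = any(kw in text for kw in TOPIC_KEYWORDS)
    geotagged_in_italy = tweet.get("geo_country") == "IT"
    return has_keyword and geotagged_in_italy

tweets = [{"text": "I migranti arrivano", "geo_country": "IT"},
          {"text": "Oggi piove", "geo_country": "IT"}]
selected = [t for t in tweets if keep_tweet(t)]
```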
We measure precision, recall and F1 independently for each class (HS, non-HS) because we believe that, in a hate speech detection task, it is important to see whether the model achieves good performance (in particular precision) in detecting the hate speech class. For example, it is more important for us that the model correctly classifies the hate speech sentence “You are ugly, kill yourself” than the sentence “Today is a good day” as not hate speech. We discuss this aspect in Sec. 5.1 “Experimental Design”.
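A minimal sketch of this per-class evaluation, using a linear SVM over TF-IDF features as a stand-in for the traditional SVM baseline; the feature settings and toy data are illustrative assumptions, not the exact configuration used in the paper.

```python
# Per-class precision, recall and F1 for a simple TF-IDF + linear SVM baseline.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

train_texts = ["you are ugly, kill yourself", "today is a good day",
               "go back to your country", "nice weather in rome"]
train_labels = [1, 0, 1, 0]            # 1 = HS, 0 = non-HS
test_texts = ["kill yourself now", "what a lovely day"]
test_labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_texts, train_labels)

# Precision, recall and F1 reported independently for HS and non-HS.
print(classification_report(test_labels, clf.predict(test_texts),
                            target_names=["non-HS", "HS"]))
```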
In the same section we also provided a more detailed description of the inter-annotator agreement involved in the creation of the training dataset (lines 243-248).
Comment 3
Concerning the results, it is important to address several aspects:
- The design of the experiment is adequate.
- I think that having some tables with numbers would help to grasp the whole picture.
Reply to Comment 3
In Section 5.2 “Results” we added Tables 4-7, which include all the results presented in Fig. 2 and Fig. 3. We also expanded the description of the experiments we conducted and the results we obtained.
Comment 4
I would say that it is not that relevant whether there is a statistically significant difference between the SVM and the deep learning approach. It is important to have statistical significance when comparing the results of Figure 1 to the results of Figure 2.
Reply to Comment 4
Thanks for the suggestion. We performed the requested statistical tests and reported the results in Section 5.2, Table 8.
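As an illustration only, a paired test over per-partition F1 scores could be computed as below; the values are made up and the specific test reported in Table 8 may differ.

```python
# Paired comparison of per-partition F1 scores between two settings
# (e.g. fixed training vs time-aware training). Numbers are toy values.
from scipy.stats import wilcoxon

f1_fixed_training = [0.62, 0.58, 0.55, 0.51, 0.47, 0.45]
f1_incremental    = [0.63, 0.60, 0.58, 0.57, 0.55, 0.54]

stat, p_value = wilcoxon(f1_fixed_training, f1_incremental)
print(f"Wilcoxon signed-rank: statistic={stat:.3f}, p={p_value:.4f}")
```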
Comment 5
It is of great importance to have a deeper analysis of the results. There are several aspects that should be commented on:
- in Figure 1, we can see that the recall is relatively stable across the several partitions and, compared to Figure 2, the results achieved by the AlBERTo-based approach seem, in general, better in Figure 1;
- the tendency of the F1-score is similar in both figures, as are the values;
- why do recall and F1 keep getting lower when advancing along the timeline in Figure 2?
Reply to Comment 5
While in Fig. 2 the training set is kept fixed and it is only the test set that changes, in Fig. 3 both training and test sets vary over time. In the second case, the recall is probably getting lower because we are using two training strategies that add new data to the training set. This operation probably changes the model in a way that makes recall decrease as more training data collected in different time spans is added. However, this is a hypothesis that deserves further investigation, so we have decided not to include it in the paper.
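Schematically, the two evaluation regimes contrasted above can be sketched as follows; train() and evaluate() are hypothetical stand-ins for the actual SVM / AlBERTo training and scoring routines, and the expanding-window variant is only one possible way of adding new data over time.

```python
# (a) fixed regime: train once on the first chronological partition and
#     test on every later one; (b) incremental regime: extend the training
#     data with all partitions seen so far before testing on the next one.
def run_fixed(partitions, train, evaluate):
    model = train(partitions[0])                          # training set kept fixed
    return [evaluate(model, p) for p in partitions[1:]]

def run_incremental(partitions, train, evaluate):
    scores = []
    for i, test_part in enumerate(partitions[1:], start=1):
        seen = [ex for p in partitions[:i] for ex in p]   # all data up to time i
        model = train(seen)                               # retrain on the growing set
        scores.append(evaluate(model, test_part))
    return scores
```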
Comment 6
- wouldn't it be important to use a strategy for unbalanced datasets, especially because of the last partitions?
Reply to Comment 6
We agree with your very interesting comment. We do not report the results obtained when undersampling non-hate examples, because they were very close to those we discuss in our contribution. This certainly opens interesting perspectives for future work. A paragraph about this aspect has been added in the “Conclusion” section.
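For reference, a simple random undersampling of the non-HS majority class could be sketched as follows; this is illustrative code under our own assumptions, not the exact procedure used in the unreported experiments.

```python
# Randomly drop majority-class (non-HS) examples until the classes are balanced.
import random

def undersample(examples, labels, majority_label=0, seed=42):
    """Return a class-balanced copy of (examples, labels)."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    kept = minority + rng.sample(majority, min(len(majority), len(minority)))
    rng.shuffle(kept)
    return [examples[i] for i in kept], [labels[i] for i in kept]

X = ["t1", "t2", "t3", "t4", "t5"]
y = [0, 0, 0, 1, 1]
X_bal, y_bal = undersample(X, y)
```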
Comment 7
It is said that temporal proximity is more important than more data, but Figures 2.c) and d) and Figures 2.e) and f) do not seem to support that claim. A better discussion of the results, including these aspects, is very important.
Reply to Comment 7
We added to Sec. 3 “Methods and Models” a more detailed description of the two classifiers and how we implemented them in our work, in order to hopefully make it easier to understand our results. We rephrased the discussion of our results in Sec. 5.2 by adding all the numerical results we obtained (Tables 3-9).
Comment 8
The lexical analysis is an interesting idea, but several aspects should be improved:
- a reference and a brief description of the weirdness index is given, but it would be clearer to include an equation;
- this single(!) paragraph should be improved for better readability — the weirdness index is applied in different ways, a polarised weirdness index is mentioned (but no definition is provided), there are hints of a POS-based analysis, and in the end we have a chart (Figure 3) with a simpler frequency-based analysis (with relative frequencies reaching up to almost 400%, which is quite odd); in this helpful section, more visual information is important for better readability.
Reply to Comment 8
We extended and improved Sec. 6 “Lexical Analysis” in order to address the reported issues and make the text easier to read and understand.
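For reference, the weirdness index is commonly defined as the ratio between a term's relative frequency in a specialised corpus and its relative frequency in a general reference corpus; the notation below is ours and may differ from the one now used in Sec. 6.

```latex
% Weirdness of a term t: its relative frequency in the specialised corpus
% (e.g. the hateful tweets) divided by its relative frequency in a general
% reference corpus. f = raw frequency, N = total number of tokens.
\begin{equation}
  W(t) = \frac{f_{\mathrm{spec}}(t) / N_{\mathrm{spec}}}
              {f_{\mathrm{gen}}(t) / N_{\mathrm{gen}}}
\end{equation}
```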
Reviewer 2 Report
This paper studies temporal robustness of prediction methods for hate speech detection. Experiments are conducted over AlBERTo and SVM under various conditions to study the impact of the size and temporal coverage of the training set on the temporal robustness.
The topic is timely and interesting and the experiment results look strong. Though it is not in my area of expertise, the paper is written well and easy to follow. However, this work lacks technical depth -- the paper uses existing methods without any further improvement, the way the problem is formulated is far from unexpected, and the derivation of the solutions is quite straightforward. It looks more like a "case study" report, not a scientific paper.
Author Response
The topic is timely and interesting and the experiment results look strong. Though it is not in my area of expertise, the paper is written well and easy to follow. However, this work lacks technical depth -- the paper uses existing methods without any further improvement, the way the problem is formulated is far from unexpected, and the derivation of the solutions is quite straightforward. It looks more like a "case study" report, not a scientific paper.
Reply to Review 2
Thank you for your feedback. We extended the related work in order to better highlight the novelty of our contribution with respect to the state of the art. In particular, we extended the discussion of related work in Sec. 2 (from line 128 onward) by pointing out the relation with studies about the dynamic changes of languages. Such works focus on language change mainly from a linguistic analysis point of view, while we aim to use this hypothesis to study the robustness over time of machine learning approaches to hate speech detection:
“Computational approaches to the diachronic analysis of language [24] have been gaining momentum over the last decade. An interesting analysis of the dynamics of language change is provided in [25]. The author describes what happens, from the language analysis point of view, to words that change their meaning over time. Most of them show a social contagion effect, where the meaning is changed by their common/wrong use on social media platforms. Clyne et al. [26] discuss how word meanings change under the influence of immigrant languages, while Lieberman et al. [27] try to quantify these changes in language. These studies support our idea about the possible difficulties of an automatic machine learning approach in classifying new sentences when they have been collected at a time distant enough from the one of the data used for training. We suppose that the language of hate speech is very volatile and influenced by events, and that word meanings change faster than usual, making it difficult to correctly classify new sentences even a month after training. This encourages us to investigate the robustness of some machine learning models over time. The recent availability of long-term and large-scale digital corpora, and the effectiveness of methods for representing words over time, played a crucial role in the recent advances in this field. However, only a few attempts focused on social media [28,29], and their goal is to analyse linguistic aspects rather than to understand how lexical semantic change can affect performance in sentiment analysis or hate speech detection. From this perspective, our work represents a novelty: for the first time, we propose to tackle the issue of diachronic degradation of the performance of hate speech prediction systems by exploring their temporal robustness. The closest works found in recent literature are [30], where the authors explore the diachronic aspect in the context of user profiling, and [31], which provides a broader view on diachronicity in word embeddings and corpora. However, this is the first work investigating the diachronic aspect in the specific context of hate speech detection, which is a crucial issue especially in application settings devoted to monitoring the spread of the hate speech phenomenon over time.”
The core novelty of our work consists in a methodology for exploring the temporal robustness of hate speech detection models and thus evaluating the diachronic degradation of hate speech prediction systems. The models we considered are not new, but this does not undermine the novelty of our results, as BERT has been a major breakthrough in NLP classification tasks in recent years and SVM is a widely used classifier for such tasks.
Reviewer 3 Report
The topic of the paper is obvious, but the presentation of methods and results is enigmatic.
The authors use the AlBERTo model, created by themselves and presented at several conferences. There is no description of the construction and functioning of AlBERTo in any publication. In their earlier paper [12] they define AlBERTo as an “Italian language understanding model based on social media writing style”. The title of their paper [14] contains the definition “…. AlBERTo Italian Language Understanding Model”. Therefore, the statements in line 166, “we decided to compare AlBERTo against a traditional SVM”, and line 192, “AlBERTo performs significantly much better than SVM”, are not justified. SVM is a widely used and highly valued method for creating classifiers, and AlBERTo is an “Italian language understanding model”. It is impossible to compare such very different things. Such a comparison and such conclusions suggest the commercial nature of the publication. In addition, the work requires editorial correction. For example, the acronym SVM should be defined at its first appearance, that is, in chapter “1. Introduction” and not in chapter “2. Related Work”. By the way, should it be “2. Related Works”? In line 126, the word “rate” appears twice.
Author Response
The authors use the AlBERTo model, created by themselves and presented at several conferences. There is no description of the construction and functioning of AlBERTo in any publication. In their earlier paper [12] they define AlBERTo as an “Italian language understanding model based on social media writing style”. The title of their paper [14] contains the definition “.... AlBERTo Italian Language Understanding Model”. Therefore, the statements in line 166, “we decided to compare AlBERTo against a traditional SVM”, and line 192, “AlBERTo performs significantly much better than SVM”, are not justified. SVM is a widely used and highly valued method for creating classifiers, and AlBERTo is an “Italian language understanding model”. It is impossible to compare such very different things. Such a comparison and such conclusions suggest the commercial nature of the publication. In addition, the work requires editorial correction. For example, the acronym SVM should be defined at its first appearance, that is, in chapter “1. Introduction” and not in chapter “2. Related Work”. By the way, should it be “2. Related Works”? In line 126, the word “rate” appears twice.
Reply to Review 3
We would like to deeply thank you for your feedback, that helped us to improve the paper.
Technically, you are right in pointing out that BERT is not per se a language model. We clarified this in Sec. 1 “Introduction” (lines 38-42) and Sec. 2 “Related Works” (lines 65-81). In our work, we used it as a classifier and thus its performance in predicting hate speech is compared to that of SVM.
Nevertheless, BERT is based on transformer language modelling and, by extension, it is sometimes referred to as a language model, despite not being one strictly speaking.
We hope this clarifies the confusion.
We extended Sec. 3 “Methods and Models” to provide a deeper description of AlBERTo and how it was implemented in our experiments (lines 152-177), so as to make the paper self-contained.
Typos and comments on style have all been addressed.
Round 2
Reviewer 1 Report
In general, my concerns were addressed by the authors.
Concerning content, I do not agree with the authors when they say that the most meaningful result is presented in Figure 3.(x) and that it consists of AlBERTo performing better than the SVM with incremental training:
- the authors improved the manuscript clearly stating that their goal is the diachronic focus.
- AlBERTo already performs better in Figure 2, without the diachronic training.
I think that the most meaningful result is that, comparing Figures 3.(e) and 3.(f), and 3.(g) and 3.(h), with Figures 2.(c) and 2.(d), the loss of performance is diminished, which shows the importance of the diachronic training.
The text clearly needs revising, as several typos, inconsistencies, grammatical problems, etc. are still present (e.g., lines 53, 134, 264-267 — OMW not defined —, 363, 429; Haspeede+ is typed in several different ways; etc.). References should also be revised, because they are presented in quite an inconsistent way.
Author Response
Dear Reviewer,
Thank you very much for this second round of insightful comments.
Following your advice we added a clarification on the importance of the results presented in Figure 2 and Figure 3 in Sec 5.2 (lines 350 - 358).
We also fixed all the typos you highlighted and revised all the references in order to make them consistent.
Reviewer 2 Report
The authors addressed my concerns in the previous report.
The language of this paper still needs to be polished.
For example,
"On the other hand, there are several actors, like institutions and ICT companies that need to comply..."
Is it "actor" or "factor"?
"which is only possible at a large scale"
on a large scale
"The field has been recently surveyed in [8] and [9]. The vast majority of the papers analysed in [8] describe approaches to hate speech detection based on supervised learning,"
the structure is awkward
"Google Colab has been chosen as a running environment"
a or the?
Author Response
Dear Reviewer,
Thank you for your detailed comment on our paper.
Following your advice we fixed all the typos and further revised the language to make it smoother and easier to follow.
Reviewer 3 Report
The work requires editorial correction, for example:
Line 19 Natural Language Processing <- for Natural Language Processing (NLP)
Line 41 counties <- countries
Line 98 Support Vector Machine (SVM) <- SVM
Line 166 Support Vector Machine (SVM) <- SVM
Line 171 Support vector machines (SVMs) <- SVMs
Line 177 Support Vector Machines <- SVMs
Line 193 BERT (Bidirectional Encoder Representations from Transformers) <- BERT
Line 231 The learning rate rate has been kept <- The learning rate has been kept
Figure 2. Date sequence on the horizontal axis
Line 363, Table 8 and line 378 haspeede+ <- Haspeede+
I suggest removing the sentence “Support Vector Machines are fast, and they perform well with a limited amount of data.” from lines 177-178, because there are no methods which “perform well with an unlimited amount of data”. The next sentence, “In order to better understand the way SVMs work, it can be possible to imagine elements of two classes plotted on a 2-d space.”, should be removed as well. This sentence, together with its context, suggests that SVMs can be used only with a small number of features. Many papers have shown that this statement is not true.
Author Response
Dear Reviewer,
Thank you for your detailed feedback.
Following your advice we fixed the typos, removed the highlighted sentence and generally revised the language of the paper to make it smoother and easier to follow.