Article

Hate Speech Detection by Using Rationales for Judging Sarcasm

by Maliha Binte Mamun 1,*, Takashi Tsunakawa 1, Masafumi Nishida 1 and Masafumi Nishimura 2
1 Graduate School of Science and Technology, Shizuoka University, 3-5-1 Johoku, Chuo-ku, Hamamatsu 432-8011, Japan
2 Department of Smart Design, Faculty of Architecture and Design, Aichi Sangyo University, 12-5 Harayama, Oka-machi, Okazaki 444-0005, Japan
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(11), 4898; https://doi.org/10.3390/app14114898
Submission received: 10 April 2024 / Revised: 23 May 2024 / Accepted: 4 June 2024 / Published: 5 June 2024

Abstract

The growing number of social media users has contributed to the rise in hate comments and posts. While extensive research in hate speech detection attempts to combat this phenomenon by developing new datasets and detection models, reconciling classification accuracy with broader decision-making metrics like plausibility and faithfulness remains challenging. As restrictions on social media tighten to stop the spread of hate and offensive content, users have adapted by finding new approaches, often camouflaged in the form of sarcasm. Therefore, dealing with new trends such as the increased use of emoticons (negative emoticons in positive sentences) and sarcastic comments is necessary. This paper introduces sarcasm-based rationale (emoticons or portions of text that indicate sarcasm) combined with hate/offensive rationale for better detection of hidden hate comments/posts. A dataset was created by labeling texts and selecting rationale based on sarcasm from the existing benchmark hate dataset, HateXplain. The newly formed dataset was then applied to an existing state-of-the-art model. The model’s F1-score increased by 0.01 when using sarcasm rationale together with hate/offensive rationale in a newly formed attention calculation proposed in the data preprocessing step. With the new data, a significant improvement was also observed in explainability metrics such as plausibility and faithfulness.

1. Introduction

Although half of the world’s population uses social media for networking, many have encountered hostility in some form or another [1], such as those categorized under crimes against minorities [2]. Victims of online hate suffer from psychological and emotional distress like depression, anxiety, and, in severe cases, physical harm, which potentially leads to suicide [3,4,5,6]. Over the years, continuous research has been conducted to develop hate speech detection methods that could effectively reduce online hate. In order to achieve such methods, understanding online hate and predicting hate crimes in their various manifestations are very important [7]. Therefore, to improve hate detection, hate speech datasets [8,9,10,11,12,13] have been developed over the years. These mostly focus on hate speech attributes closely related to hate expressions (e.g., abusive comments, hate speech, and offensive or disrespectful content), the target of hate (e.g., origin, gender, religion, and sexual orientation), and retweets (now termed posts) or the like, and are made available for the purpose of progressive research and development.
A distinction is recognized between hate speech and offensive comments in this study. Hate speech is described as statements that incite violence or prejudice against specific groups based on protected attributes such as race, religion, or sexual orientation. These expressions typically aim to harm and marginalize, reflecting deep-seated prejudice.
Offensive comments, while potentially hurtful or rude, are generally not intended to incite direct harm, nor are they always targeted at specific groups. These may include profanity or derogatory remarks made in a general context without the intention to incite violence or discrimination against a group.
In this paper, a clear distinction between these two types of content is maintained to tailor the model’s response appropriately. Not only is the presence of hate or offense identified, but also the context and the target, which are considered crucial for effective moderation on social platforms. This approach allows for more nuanced detection and response strategies, which are essential for maintaining the integrity and inclusivity of online communication spaces.
Some studies [14,15,16,17] have proposed models for improving automatic hate speech detection. However, these models often lack inductive reasoning due to an overestimation of state-of-the-art approaches for hate speech detection [18,19]. These advanced models typically rely on triggered vocabularies for predictions, leading to substantial evidence of racial bias in major published hate speech datasets [20,21]. Moreover, as hate speech detection models become more complex, it becomes increasingly challenging to clarify their decision-making process [22].
To address these concerns, a benchmark dataset named “HateXplain” [13] was introduced. This dataset is unique in that each data point is annotated with labels indicating whether the content is hate, offensive, or normal, along with details such as the mentioned target communities and text snippets known as rationales [23]. These rationales, which are identified by human annotators, provide context that is instrumental in teaching the model not only to recognize hate speech but also to understand the context that characterizes such language, enhancing both the robustness and interpretability of the model.
During the training phase, these rationales inform a specialized attention mechanism within a neural network architecture, focusing the model’s attention more on words and phrases within these rationales. This targeted learning approach helps the model to prioritize the most critical attributes that indicate offensive/hate speech, leading to improved accuracy and reliability in classifications.
Furthermore, the use of rationales extends into the operational phase of the model. Here, the trained model applies the patterns it has learned to new, unseen texts. When it identifies words or phrases that mirror the rationales from the training dataset, the model scrutinizes these pieces of text with heightened attention. This method not only improves the accuracy of classification but also ensures that each decision made by the model can be traced back to clear, understandable linguistic evidence, thereby enhancing the transparency and accountability of automated hate speech detection systems.
Thus, incorporating rationales into the training and operational phases of the model bridges the gap between traditional text analysis and a deep, contextual understanding of the language used in hate speech. This not only enhances the model’s performance but also ensures that each decision made by the model is transparent and explainable, which is crucial for applications in sensitive areas such as online content moderation.
While recent studies have highlighted flaws in previous hate speech detection methods, including issues with model performance and datasets, the evolving nature of language and the constant emergence of new ways social media users spread hate speech pose ongoing challenges. Platforms like Twitter or Facebook often inspire users to employ figurative language in conveying their opinions. This includes the concealment of hate comments under the guise of sarcasm [24], a common form of passive-aggressive speech.
Therefore, sarcasm represents a crucial aspect for enhancing hate speech detection. However, to the best of our knowledge, no previous research in this area has specifically addressed hate concealed under the pretext of sarcasm. As a solution to this problem, this research proposes a new dataset that also focuses on sarcasm alongside hate. For the study, 3199 data points were selected from the benchmark HateXplain dataset [13], and annotators were hired to label them based on sarcasm as well as to choose the accompanying rationale. LIME [25] was utilized for generating important tokens as rationale to assist in the classification task. The BERT model [26] gives significance to the HateXplain rationales provided by human annotators, but it still tends to misclassify such posts as offensive speech (OF); notably, with this study’s data, BERT assigned the correct label of hate speech (HS).
This study aimed to enhance hate speech detection by incorporating a sarcasm-based rationale in addition to the conventional hate/offensive rationale. The main objectives were to develop a novel methodology for integrating these rationales into the existing model frameworks and to assess the impact of this integration on model performance, specifically in terms of classification accuracy and explainability.
The growing prevalence of online hate speech necessitates robust detection mechanisms that can effectively handle the nuances of language, including sarcasm, which is often used to disguise hateful intents. By addressing these challenges, the research seeks to contribute to safer online environments and more effective moderation tools.
The contributions of this research are as follows:
1. A new dataset for hate speech detection is presented, comprising two kinds of labeled categories: (i) sarcastic/non-sarcastic; and (ii) hate speech/offensive/normal. Rationale selection is based on whether it fits into the categories of sarcastic and hate/offensive;
2. The sarcasm rationale is incorporated in cases where the data are both sarcastic and hate speech/offensive. This involves preprocessing steps before utilizing it in the model’s attention calculation;
3. The inclusion of sarcasm-based rationale alongside hate speech/offensive-based rationale improved both the classification and explainability metrics of the BERT model.

2. Related Works

The proliferation of online platforms has significantly increased incidents of hate speech, impacting vulnerable individuals and communities profoundly [3,5,6,27]. Comprehensive studies [4,28,29,30] have documented these effects, emphasizing the urgent need for effective detection mechanisms. Current research primarily focuses on developing computational models to detect explicit and implicit hate speech [11,12,13]. However, traditional methods often overlook nuanced expressions such as sarcasm and indirect speech that can mask hateful intentions [31,32,33], a critical gap highlighted in the literature. This study addresses this by enhancing model sensitivity to not only detect overt hate speech but also subtler, sarcastic expressions often overlooked by current methodologies [14,15,16,17,34,35].
Dataset: Previous datasets designed to detect hate speech primarily focus on expressions related to hate, encompassing abusive comments, hate speech, offensive language, and disrespectful content. A common issue in much of the preceding research is the tendency to conflate hate speech with abusive or offensive language [34]. Furthermore, some studies merged offensive and hate language into a single category, while only a limited number [34,36] made efforts to distinguish offensive from hate speech. In the published HateXplain dataset, subjectivity is recognized as a pivotal aspect, acknowledging that numerous messages may be offensive without qualifying as hate speech. For example, in the United States, the term “nigga” is commonly used in online language by the Black community in a manner that is not necessarily intended to be malevolent toward another individual [35]. Building on the categorization framework that separates hate from offensive language [34], the HateXplain dataset adopts three classes: hate, offensive, and normal. Moreover, HateXplain incorporates the concept of using rationales [23]. Since human annotators are required to highlight a span of text supporting their labeling decision, this enriches the rationale annotation process. As a result, HateXplain provides the first benchmark hate speech dataset with human-level explanations, enhancing interpretability and transparency.
Despite the progress made by HateXplain in addressing major issues related to hate speech, a contemporary challenge persists—the use of sarcasm as a disguise for spreading hate speech [24]. Given the non-physical and implicit nature of online violence, it often appears subtle. Social media platforms frequently host stereotypical, racial, and gender jokes, sometimes disguised as sarcasm. While seemingly harmless, the repetition of such jokes can lead to negative psychological effects, resembling a form of cyberbullying [37,38]. Despite existing research on sarcasm detection [39,40] and multi-modal detection [41], there is a noticeable gap—no research has explored the connection between hate and sarcasm for automatic hate speech detection. In our work, assuming hate is hidden under the guise of sarcasm, and rationale can provide human-level explanations, we collected data from HateXplain, which is already labeled with rationale based on hate/offensive/normal for new dataset creation. This new dataset is different from existing datasets in that it introduces two categories, and based on each category (sarcasm and hate/offensive), we provide rationale for human-level explanations.
Model: Despite numerous models claiming to achieve state-of-the-art performance on specific datasets, their ability to generalize is often questionable [18,19]. Challenges emerge when these models misclassify comments related to identities frequently under attack (e.g., gay, black, and/or Muslim) as toxic even when lacking any malicious intent [42]. The prevalence of biased predictions, stemming from an overemphasis on specific trigger vocabulary, can result in further discrimination against groups already targeted by online abuse [20,21]. Another pressing issue in current methodologies is the lack of decision-making transparency. As hate speech detection models grow in complexity, elucidating their decisions becomes increasingly challenging [22]. Therefore, it is important to focus on interpretable rather than performance-based models. HateXplain tackles the challenge of model explainability by concurrently learning target classification and the rationale behind human decisions. This dual learning approach aims to enhance both aspects. In this study, existing state-of-the-art models such as CNN-GRU [14], BERT [26] and BiRNN [43] were leveraged. Drawing insights from HateXplain, where BERT demonstrates slightly superior results in classification, bias, and a key explainability metric (faithfulness), as well as being the only model that can provide a vector representing the attention for each token, the study opted to employ BERT for a more comprehensive comparison with HateXplain.

3. Methods

This section explains the preprocessing steps and model employed in evaluating the dataset. The preprocessing steps encompass HateXplain dataset management, the data from the newly proposed dataset, and a novel preprocessing method designed for data exhibiting both sarcasm and hate/offensive characteristics. In the model details, the utilization of ground truth attention for training is clarified along with the model’s requirement to produce a vector representing the attention for each token.

3.1. Dataset Selection

Among the various available hate speech datasets, HateXplain was selected for this research due to its unique features. HateXplain provides annotated rationales for each labeled instance, which is critical for developing models that not only predict hate speech but also offer insights into the reasons behind these classifications. This transparency is vital for the interpretability of models in sensitive applications like content moderation. Additionally, HateXplain’s focus on both hate speech and offensive language, annotated in a real-world context, aligns well with the objectives of this study to handle nuanced expressions like sarcasm effectively.

3.2. Preprocessing

Conversion of text data: Essential preprocessing tasks are efficiently executed to ensure data uniformity and usability. Text normalization and cleaning involve converting all text to lowercase, removing HTML tags, and substituting URLs and other non-textual elements with placeholders to standardize data. The Ekphrasis tokenizer, designed for social media, segments and annotates text with emotional and stylistic cues critical for detecting sarcasm and hate speech. Advanced tokenization for the model input employs BertTokenizer to refine text into tokens suitable for BERT models, appending special tokens like ‘[CLS]’ and ‘[SEP]’ to facilitate proper sequence formatting, while also preparing sequence input IDs and attention masks. Integration with SpaCy enhances the grammatical and semantic handling of the text, supporting accurate model training and effectiveness. These comprehensive preprocessing steps meticulously prepare the dataset, addressing language complexities and nuances vital for the effective detection of sarcasm and hate speech, thus enhancing model accuracy and interpretability.
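As a rough illustration of these steps (omitting the Ekphrasis and SpaCy stages), the sketch below shows lowercasing, HTML/URL cleanup, and BERT tokenization with [CLS]/[SEP], padding, and attention masks. The helper names and the exact cleaning rules are assumptions, not the authors’ released pipeline.

```python
# Minimal preprocessing sketch (illustrative only; the paper's full pipeline also uses
# Ekphrasis and SpaCy, which are omitted here). Requires the Hugging Face `transformers` package.
import re
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def clean_post(text: str) -> str:
    """Lowercase, strip HTML tags, and substitute URLs with a placeholder token."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)           # remove HTML tags
    text = re.sub(r"https?://\S+", "<url>", text)  # replace URLs with a placeholder
    return text.strip()

def encode_post(text: str, max_len: int = 128):
    """Tokenize for BERT: adds [CLS]/[SEP], pads/truncates, returns input IDs and attention mask."""
    enc = tokenizer.encode_plus(
        clean_post(text),
        add_special_tokens=True,
        max_length=max_len,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        return_tensors="pt",
    )
    return enc["input_ids"], enc["attention_mask"]
```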
Conversion of rationale to ground truth attention: Ground truth attention is established for posts identified as toxic (hatespeech/offensive) using a baseline method for scenario (i). Furthermore, for scenario (ii), where posts are classified as hatespeech/offensive in the first category and sarcasm in the second category, ground truth attention is derived from the annotator’s rationales depicted in Figure 1. In this process, each rationale is transformed into an attention vector with values of 0 and 1. Tokens within the rationale are assigned a value of 1 in the attention vector, while the rest receive a 0. The average of these attention vectors is then calculated to obtain the ground truth attention for each post. Subsequently, a softmax function is applied to normalize the attention.
An inherent challenge in the ground truth attention vector arises from the potential proximity of values between rationale and non-rationale tokens. To address this, we utilize the temperature parameter ( τ ) within the softmax function. This strategic application of the temperature parameter facilitates an adjustment in the probability distribution, amplifying the emphasis on the rationales. The fine-tuning of this parameter is a meticulous process conducted through adjustments in the validation set.
In the conversion process of toxic (hatespeech/offensive) rationale, if at least two out of three annotators label the data as hatespeech/offensive, the rationale is considered. However, for sarcasm data, even if just one annotator labels the data as sarcasm, the rationale is converted. This distinction is made due to the subjective nature of understanding sarcasm in text data, which can vary significantly from person to person.
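A minimal sketch of this conversion is shown below, assuming binary rationale masks aligned with the post’s tokens. The function name and example values are illustrative, and the temperature τ would be tuned on the validation set as described above.

```python
# Sketch: convert annotator rationales into a ground truth attention vector by
# averaging the binary masks and applying a temperature-scaled softmax.
import numpy as np

def rationales_to_attention(rationale_masks, tau=1.0):
    """rationale_masks: list of 0/1 lists, one per annotator, aligned with the tokens."""
    avg = np.mean(np.asarray(rationale_masks, dtype=float), axis=0)  # average over annotators
    scaled = avg / tau                                               # temperature scaling
    exp = np.exp(scaled - scaled.max())                              # numerically stable softmax
    return exp / exp.sum()

# Example: three annotators, five tokens; two annotators marked tokens 2-3 as rationale.
masks = [[0, 0, 1, 1, 0],
         [0, 0, 1, 1, 0],
         [0, 1, 0, 1, 0]]
ground_truth_attention = rationales_to_attention(masks, tau=0.5)
```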
Proposed method for the conversion of rationale to ground truth attention: In this study, an innovative algorithm tailored for data exhibiting both hateful/offensive and sarcastic characteristics is introduced (see Algorithm 1). In this process, similar to the previous approach, each rationale is transformed into an attention vector with values of 0 and 1. Tokens within the rationale are assigned a value of 1 in the attention vector, while the rest receive a 0. Here, the attention vector for toxic (hate/offensive) content after conversion is denoted as H = (h_1, h_2, h_3, …, h_n) for the 1st category, and the attention vector for sarcasm, denoted as S = (s_1, s_2, s_3, …, s_n), corresponds to the 2nd category. This procedure is applied to a sentence represented by tokens t_1, t_2, t_3, …, t_n and its corresponding rationale after vector conversion, r_1, r_2, r_3, …, r_n.
Algorithm 1 Calculate ground truth attention vector
1: procedure CalculateGroundTruthAttention(sentence tokens t_1, t_2, …, t_n; rationale vector after conversion r_1, r_2, …, r_n; hate/offensive attention vector H = (h_1, h_2, …, h_n); sarcasm attention vector S = (s_1, s_2, …, s_n))
2:   for i ← 1 to n do
3:     calculate the combined attention score for token t_i:
4:     if s_i > h_i then
5:       r_i ← (s_i + h_i) / 2
6:     end if
7:     if s_i < h_i then
8:       r_i ← h_i
9:     end if
10:    if s_i = h_i then
11:      r_i ← (h_i + s_i + h_i) / 2
12:    end if
13:   end for
14:   return rationale attention vector r
15: end procedure
Drawing on the concept of an attention vector, a new attention vector is established for tokens within the rationale that simultaneously convey hateful/offensive and sarcastic elements. This new attention vector is utilized for the calculation of the ground truth attention. Specifically, if a sarcasm rationale element s_i surpasses the corresponding hate rationale element h_i, the newly assigned vector element is r_i = 0.5. When the sarcasm rationale element s_i is less than the hate rationale element h_i, the newly assigned vector element is r_i = 1. Lastly, when the sarcasm rationale element s_i equals the hate rationale element h_i and both are 1, the newly assigned vector element is r_i = 1.5, as shown in Figure 2.
This approach refines the interpretation of attention dynamics in the context of sentences exhibiting both hatespeech/offensive and sarcasm. Following this, the average of the attention vectors is calculated to obtain the ground truth attention for each post, as in the previous process shown in Figure 1. Once the updated ground truth attention is obtained, a softmax function is applied to distribute the concentration of rationale, mirroring the final ground truth attention calculation as previously demonstrated.
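A compact sketch of Algorithm 1 follows. The equal-value branch is written to reproduce the value of 1.5 reported above when both elements are 1; this is an interpretation of the algorithm from the text, not the authors’ exact code.

```python
# Sketch of Algorithm 1: combine the hate/offensive attention vector H and the
# sarcasm attention vector S into a single rationale vector R, which is then
# averaged and softmax-normalized as in the baseline procedure.
def combine_attention(H, S):
    R = []
    for h_i, s_i in zip(H, S):
        if s_i > h_i:                      # sarcasm-only token
            r_i = (s_i + h_i) / 2          # -> 0.5
        elif s_i < h_i:                    # hate/offensive-only token
            r_i = h_i                      # -> 1.0
        else:                              # marked in both rationales (or neither)
            r_i = (h_i + s_i + h_i) / 2    # -> 1.5 when h_i = s_i = 1, else 0
        R.append(r_i)
    return R

# Example: tokens 1-2 marked as hate rationale, tokens 2-3 marked as sarcasm rationale.
H = [0, 1, 1, 0]
S = [0, 0, 1, 1]
print(combine_attention(H, S))  # [0, 1.0, 1.5, 0.5]
```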

3.3. Model

In this section, the details of the BERT model are provided. The model was trained on both HateXplain and the newly proposed dataset. The calculation of the cross-entropy loss against the ground truth attention is also explained here.
  • Bidirectional Encoder Representations from Transformers (BERT)
In the BERT model, there are 12 layers, and each layer consists of 12 attention heads, resulting in a total of 144 attention heads. Typically, a subset of these heads, denoted as “supervised heads”, is utilized in the last layer of BERT for attention supervision (Figure 3a). HateXplain [13] employs this term to describe the selected attention heads. For each supervised head, the attention weights corresponding to [CLS] are utilized, and the cross-entropy loss is calculated against the ground truth attention vector. This process ensures that the final weighted output corresponding to [CLS] focuses on words aligned with the ground truth attention vector. The same procedure is repeated for all supervised heads. The resultant loss from attention supervision is the average of the cross-entropy losses from each supervised head, further multiplied by the regularization parameter λ. The utilization of attention weights associated with [CLS] in conjunction with the cross-entropy loss helps align the model’s attention with the ground truth attention vector.
Following the computation of the ground truth attention, the BERT model undergoes training using the preprocessed data and ground truth attention. Subsequently, a loss from attention supervision is computed, employing cross-entropy between the attention values and the ground truth attention as the loss function for attention (Figure 3b). The total loss is the sum of the prediction loss and the weighted attention loss: L_total = L_prediction + λ · L_attention. In this equation, λ acts as a weighting factor, regulating the influence of the attention loss on the overall loss.
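The sketch below illustrates this loss combination, assuming a Hugging Face-style BERT that returns per-layer attention maps (output_attentions=True). The tensor shapes, head-selection strategy, and helper name are assumptions rather than the authors’ released code.

```python
# Sketch of the attention-supervision loss: cross-entropy between the [CLS] attention
# of selected heads in the last layer and the ground truth attention distribution,
# added to the classification loss with weight lambda.
import torch
import torch.nn.functional as F

def total_loss(logits, labels, last_layer_attn, gt_attention, supervised_heads, lam=0.001):
    """
    logits:           [batch, num_classes] classification scores
    last_layer_attn:  [batch, num_heads, seq_len, seq_len] attention of BERT's last layer
    gt_attention:     [batch, seq_len] ground truth attention (softmax-normalized)
    supervised_heads: indices of the heads whose [CLS] attention is supervised
    """
    prediction_loss = F.cross_entropy(logits, labels)

    attn_losses = []
    for h in supervised_heads:
        cls_attn = last_layer_attn[:, h, 0, :]   # attention from [CLS] to all tokens
        # cross-entropy between the predicted attention and the ground truth distribution
        attn_losses.append(-(gt_attention * torch.log(cls_attn + 1e-12)).sum(dim=-1).mean())
    attention_loss = torch.stack(attn_losses).mean()

    return prediction_loss + lam * attention_loss  # L_total = L_prediction + lambda * L_attention
```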

4. Experimental Conditions

4.1. Dataset

This section is divided into two categories: Benchmark Dataset HateXplain [13] and New Dataset Creation.
Benchmark Dataset—HateXplain: This dataset was obtained from platforms like Twitter and Gab, which were previously explored in hate speech studies. The corpus, comprising tweets and Gab posts, was annotated with labels (hatespeech/offensive/normal), target communities, and text snippets marked by annotators supporting the labels. Each data point was evaluated by three independent annotators, and if there was a consensus among two or more annotators on a specific category, the data were officially categorized accordingly.
New Dataset Creation: From the benchmark dataset, HateXplain (which is publicly available online), 3199 data points were randomly selected and each underwent manual scrutiny to determine its sarcastic nature, as illustrated in Figure 4a. Three independent annotators meticulously labeled each data point, offering a nuanced perspective on the presence or absence of sarcasm, complemented by a detailed rationale for their labeling decisions, as illustrated in Figure 4b.
About 12 anonymous annotators, along with recruited volunteers, were engaged in the labeling task for this project. Exclusively provided with textual data (posts) from the selected entries in the HateXplain dataset, these annotators focused on identifying sarcasm. Drawing inspiration from the use of rationales in detecting hate/offensive comments, as observed in the benchmark dataset, our approach involved creating rationales [23] for sarcasm in comments. When an annotator labeled a data point as sarcasm, they highlighted a specific text span they believed conveyed the sarcastic nature of the comment/post. Given the subjective nature of sarcasm perception, even if just one annotator categorized a data point as sarcastic, it was officially placed into the sarcastic category.

4.2. Data Distribution

All 3199 data points were categorized into two groups: (i) toxic; and (ii) sarcasm. Within the first category, which comprises toxic (hatespeech/offensive) data, there were 1637 instances labeled as toxic, 1552 labeled as normal, and 10 labeled as undecided. The distribution of toxic data was visualized using a heatmap. Table 1 illustrates the distribution of toxic data, presenting two perspectives: (a) non-sarcasm and sarcasm data; and (b) other data and toxic data under sarcasm. In perspective (b), “other” encompasses sarcasm data not classified as toxic along with all non-sarcasm data.
Observing Figure 5, it is apparent that, while the total number of sarcasm data points was 1487, only 755 data points could be utilized under the proposed method as they fall under toxic data. Here, the last three data points from the undecided section could not be utilized as they do not conform to toxic data criteria. The heatmap depicted in the following figure illustrates the distribution of data from both perspectives.

4.3. Dataset Splitting

The set of 3199 data points was divided into three subsets for training, validation, and testing. As the dataset was imbalanced, a stratified split was performed to ensure that a minimum class balance was maintained across all categories and the best performance was achieved. Table 1 shows the total amount of toxic data (hatespeech/offensive), sarcasm data, and data that are both sarcastic and toxic (toxic under sarcasm), as well as how these data were distributed across the different sets.
Through both criteria—(i) non-sarcasm and sarcasm data; and (ii) other data and toxic data under sarcasm—the data distribution can be observed for all subsets.
Figure 6 above illustrates the distribution of data across all subsets: training, validation, and testing. Each set exhibits two distinct criteria. Specifically, only criterion (ii) data were utilized in accordance with the proposed method. Furthermore, the graphs reveal that sarcasm comprises nearly half of the data for all labels within the first category (hatespeech, offensive, normal).
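A hedged sketch of such a stratified split is given below; the 2599/300/300 sizes follow Table 1, while the joint stratification label and the random seed are assumptions, since the exact procedure is not specified.

```python
# Sketch: stratified train/validation/test split of the 3199 posts.
# `labels` is assumed to be a joint toxic/sarcasm label used for stratification.
from sklearn.model_selection import train_test_split

def stratified_split(posts, labels, seed=42):
    train_x, rest_x, train_y, rest_y = train_test_split(
        posts, labels, test_size=600, stratify=labels, random_state=seed)
    val_x, test_x, val_y, test_y = train_test_split(
        rest_x, rest_y, test_size=300, stratify=rest_y, random_state=seed)
    return (train_x, train_y), (val_x, val_y), (test_x, test_y)
```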

4.4. Hyperparameter Settings

In the hyperparameter tuning phase, our approach to training the BERT model involved the incorporation of specific settings. A dropout rate of 0.1 was applied after the fully connected layer, utilizing a batch size of 4 and adopting a learning rate of 2 × 10−5. For the calculation of the ground truth attention, we opted for a configuration with six supervised heads, accompanied by an attention lambda ( λ ) value of 0.001. Notably, in the selection of rationale tokens, we adhered to the practice of choosing the top five tokens, aligning with the average length of annotation spans within the dataset. This careful calibration of hyperparameters contributes to the robustness and effectiveness of our model during training and attention calculation.
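For reference, the settings above can be collected into a single configuration; the key names below are illustrative, not taken from the paper’s code.

```python
# Hyperparameter settings described in Section 4.4 (names are illustrative).
config = {
    "dropout": 0.1,              # applied after the fully connected layer
    "batch_size": 4,
    "learning_rate": 2e-5,
    "supervised_heads": 6,       # heads used for attention supervision
    "attention_lambda": 0.001,   # weight of the attention loss
    "num_rationale_tokens": 5,   # top tokens kept as rationale
}
```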

4.5. Metrics for Evaluation

In the quest to uncover hate speech masked within sarcasm, a range of metrics covering diverse dimensions of hate expression was considered. Taking cues from the HateXplain paper’s approach to hate speech classification, emphasis was placed on two primary types of metrics: performance-based and explainability-based.
  • Performance-based metrics: Priority was given to these metrics, with accuracy, macro F1-score, and AUROC score reported for both the baseline and proposed method in accordance with established standards. These metrics serve to assess the classifier’s performance in distinguishing among the three classes: hate speech, offensive speech, and normal speech;
  • Explainability-based metrics: We adhered to the framework outlined in the ERASER benchmark [45] to gauge the explainability aspect of the model. This assessment involved measuring plausibility and faithfulness. Plausibility denotes the persuasiveness of the interpretation to humans, while faithfulness pertains to the accuracy with which it reflects the genuine reasoning process of the model [44,45].
    1. Plausibility: To assess plausibility, metrics for both discrete and soft selection were employed. Specifically, the Intersection-Over-Union (IOU) F1-score and the token F1-score were reported for the discrete case, along with the Area Under the Precision-Recall Curve (AUPRC) score for soft token selection [45]. The IOU metric was utilized for credit assignment in cases of partial matches. This metric, defined at the token level, measures the overlap between two spans divided by the size of their union [45]. A prediction was considered a match if the overlap with any ground truth rationale exceeded 0.5. These partial matches were used to compute the IOU F1-score. Token-level precision and recall were also calculated and used to derive token-level F1 scores (token F1). Additionally, to assess the plausibility of soft token scoring, the AUPRC was reported by sweeping a threshold over the token scores;
    2. Faithfulness: To assess faithfulness, two metrics were reported: comprehensiveness and sufficiency [45]. A short code sketch of both computations is given after this list.
      - Comprehensiveness: This metric evaluates the extent to which the model’s predictions are affected by the removal of the predicted rationales. For each post x_i, a contrast example x̃_i was created by removing the predicted rationales r_i from x_i. Let m(x_i)_j be the original prediction probability provided by a model m for the predicted class j. Then, m(x_i \ r_i)_j represents the predicted probability of x̃_i (i.e., x_i without r_i) by the model m for class j. Comprehensiveness was computed as m(x_i)_j − m(x_i \ r_i)_j. A higher comprehensiveness value suggests that the rationales significantly influenced the prediction;
      - Sufficiency: This metric measures the adequacy of the extracted rationales for the model to make a prediction. Sufficiency was calculated as m(x_i)_j − m(r_i)_j, where m(r_i)_j represents the predicted probability of the rationales r_i by the model m for class j.
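The sketch below shows how comprehensiveness and sufficiency could be computed from these definitions; predict_proba is an assumed model interface (post text in, class probabilities out), and the whitespace-based token removal is a simplification.

```python
# Sketch of the two faithfulness metrics from the ERASER-style definitions above.
def remove_tokens(text, rationale_tokens):
    """Delete rationale tokens from the post (simple whitespace tokenization)."""
    rset = set(rationale_tokens)
    return " ".join(tok for tok in text.split() if tok not in rset)

def comprehensiveness(predict_proba, text, rationale_tokens, cls):
    # m(x_i)_j - m(x_i \ r_i)_j : drop in probability when the rationale is removed
    return predict_proba(text)[cls] - predict_proba(remove_tokens(text, rationale_tokens))[cls]

def sufficiency(predict_proba, text, rationale_tokens, cls):
    # m(x_i)_j - m(r_i)_j : how close the rationale alone gets to the full-post prediction
    return predict_proba(text)[cls] - predict_proba(" ".join(rationale_tokens))[cls]
```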

5. Results

In this section, the performance and explainability scores of the BERT model for both the baseline and the new method are shown. The tokens for the explainability calculation were chosen using the LIME method. Rationales comprised the top five tokens, consistent with the baseline methodology and reflecting the average length of annotation spans in the dataset.
Table 2 presents the outcomes derived from both the baseline and the proposed method. The proposed method was employed on the newly constructed dataset, while the baseline approach was implemented on the HateXplain dataset [13] portion of the proposed dataset. The performance metrics show an increase of 0.01 in the macro F1 score, 0.0186 in the AUROC, and a slight increase in accuracy. In the explainability metrics, a significant improvement was observed in two plausibility metrics (token F1 and AUPRC) and in sufficiency under faithfulness. The token F1 increased by 0.04 and the AUPRC by 0.18. For sufficiency, a lower value signifies better sufficiency, and a decrease of 0.08 was observed.

6. Discussion

The main results are reported in Table 2.
Performance: The results indicate a gradual enhancement in the performance metrics, with the macro F1 score exhibiting a marginal increment of 0.01 alongside improvements in other metrics when utilizing the proposed method in combination with sarcasm rationale. Figure 7 provides a detailed breakdown of the performance results via a confusion matrix.
Performance metrics from the confusion matrix are illustrated in Figure 7a,b for both the baseline and the proposed method using True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN) for each class.
  • Baseline Performance Metrics Calculation
    Precision: hatespeech P = 0.796; offensive P = 0.533; normal P = 0.811
    Recall: hatespeech R = 0.796; offensive R = 0.480; normal R = 0.795
    F1-score: hatespeech F1 = 0.796; offensive F1 = 0.506; normal F1 = 0.803
  • Proposed Method Performance Metrics Calculation
    Precision: hatespeech P = 0.815; offensive P = 0.520; normal P = 0.814
    Recall: hatespeech R = 0.778; offensive R = 0.500; normal R = 0.781
    F1-score: hatespeech F1 = 0.796; offensive F1 = 0.510; normal F1 = 0.797
The performance metrics from both the baseline and proposed method revealed a comparison in their effectiveness. The proposed method demonstrated improvements in precision for the hate speech and normal speech categories compared to the baseline. However, there was a slight decrease in precision for offensive language classification. The recall for all categories remained relatively similar between the baseline and proposed methods. Overall, the F1-scores for the hate speech and normal speech categories remained consistent, while there was a slight increase in the F1-score for offensive language classification with the proposed method.
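The per-class values above can be reproduced from a confusion matrix as follows; the function is a generic sketch, not the authors’ evaluation script.

```python
# Derive per-class precision, recall, and F1 from a 3x3 confusion matrix
# (rows = true classes, columns = predicted classes).
import numpy as np

def per_class_metrics(cm):
    """cm[i, j] = number of samples with true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp        # predicted as the class but belonging to another
    fn = cm.sum(axis=1) - tp        # belonging to the class but predicted as another
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```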
Explainability: A significant improvement was evident in three out of five explainability metrics, encompassing both plausibility and faithfulness. Within the realm of explainability metrics, notable advancements were observed in two metrics (token F1 and AUPRC) under plausibility and sufficiency under faithfulness. The HateXplain [13] paper demonstrates that using rationale compared to not using it, alone, resulted in an increase in only one out of five explainability metrics for the BERT model, specifically in comprehensiveness within faithfulness. In contrast, our dataset and proposed method enhanced three metrics. Overall, it is evident that relying solely on a model’s performance metric is inadequate. Models with slightly lower performance metrics but significantly higher scores for plausibility and faithfulness may be preferable depending on the specific task. Therefore, the presence of hate speech concealed within sarcasm could augment a model’s ability to provide more interpretable outcomes.
LIME: LIME [25], or Local Interpretable Model-agnostic Explanations, enhances the interpretability of complex machine learning models, particularly for text data. It operates by generating new samples around the instance being explained through perturbation—slightly modifying the original data point to create a dataset of “synthetic” samples that are similar to the original instance. LIME then monitors how the predictions change with these modifications. Subsequently, LIME constructs a simpler, interpretable model that approximates the original model’s behavior within a local region around the perturbed instance. This model identifies and highlights the words or phrases that significantly influence the model’s decisions. By demonstrating how predictions shift when key text elements—termed rationales—are included or excluded, LIME helps in assessing the model’s robustness and its reliance on specific features for decision-making. This methodology is crucial for applications requiring high interpretability, such as content moderation or hate speech detection, where understanding the basis of the model’s decisions is essential for user trust and the validation of the model’s effectiveness.
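A minimal LIME sketch for a text classifier is shown below, assuming a predict_proba wrapper around the fine-tuned model that maps a list of posts to an array of class probabilities; the class ordering is an assumption.

```python
# Sketch: explain a single post with LIME for a three-class hate speech classifier.
from lime.lime_text import LimeTextExplainer

class_names = ["hatespeech", "normal", "offensive"]  # assumed label order
explainer = LimeTextExplainer(class_names=class_names)

def explain_post(post, predict_proba, num_features=5):
    """Perturb the post, fit a local linear surrogate, and return the most influential tokens."""
    exp = explainer.explain_instance(post, predict_proba,
                                     num_features=num_features, top_labels=1)
    label = exp.available_labels()[0]   # class with the highest predicted probability
    return exp.as_list(label=label)     # [(token, weight), ...]
```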
Figure 8 shows how LIME selected tokens for both the baseline and the proposed method, along with the prediction probabilities for the toxic category, as evidenced by an example of hate speech data. For both methods, the token selection was diverse, which led to distinct label predictions. In Figure 8, the classification approaches of the baseline and proposed methods are compared using the BERT model, as illustrated by LIME. Figure 8a shows the baseline method utilizing BERT. Here, BERT marked tokens via LIME for a text example, demonstrating how each token influenced the classification, with the offensive category predicted with the highest probability. This highlights the baseline’s sensitivity to specific trigger words, which it typically classifies as offensive without deeper contextual analysis.
Conversely, Figure 8b depicts the proposed method, also applying BERT with LIME to mark tokens. This method, however, assigned the highest prediction probability to the hate speech category, aligning more closely with the contextual use of the terms within the dialogue, hence providing a more accurate representation of the original label, which was hate speech. This contrast between the two figures illustrates the proposed method’s enhanced ability to contextualize and accurately classify complex expressions of hate speech, leveraging the nuanced understanding capabilities of the BERT model.
The details of Figure 8 are explained in Table 3. The first two rows depict the “rationale” chosen by the annotators of the HateXplain paper [13] and by our annotators, which they deemed essential for the classification task. The last two rows showcase significant tokens identified using LIME for the BERT model, encompassing both the baseline and proposed method. Table 3 shows that, for the baseline ground truth attention calculation, even though the BERT model marked the same “rationale” as the HateXplain human annotators, it assigned the wrong label (offensive speech—OF). However, for the proposed ground truth attention calculation, the BERT model marked almost the same “rationale” as our annotators and assigned the correct label (hate speech—HS).

7. Limitations

Several limitations are inherent in our work. Firstly, the absence of external context poses a significant challenge. Relying solely on the published dataset resulted in a lack of contextual information for each post, thereby complicating the annotation process for annotators. Providing additional context could facilitate a more accurate assignment of posts as sarcastic or non-sarcastic. Additionally, our work is limited to the English language, neglecting potential insights from other languages.

8. Conclusions and Future Work

This study introduces a novel dataset derived from the established HateXplain dataset, comprising 3199 posts extracted from Gab and Twitter. In the original HateXplain dataset, each post is annotated with labels such as normal, hate, or offensive, along with target communities (e.g., black, Muslim, LGBTQ+) and rationales marked by annotators. The data from HateXplain were re-labeled based on sarcasm by our annotators, with rationales selected based on their identification of sarcastic content. Additionally, a new methodology is proposed for calculating ground truth attention when the data exhibit both toxic (hate/offensive) content and sarcasm, aiming to enhance the decision-making process of models. This approach was tested on the BERT model, which incorporates sarcasm along with hate/offensive data to improve hate speech detection, particularly for toxic comments disguised within sarcasm. The results demonstrate that utilizing sarcasm to identify concealed toxic content led to a slight improvement in the performance metrics, while also providing plausible and faithful rationales that enhanced the model’s decision-making capabilities.
Looking ahead, the plan is to explore the application of other Large Language Models (LLMs) to our methodology. Given the complexity and resource requirements of LLMs, this extension was not feasible within the current study’s time frame. Future work will focus on comparing the efficacy and generalizability of our approach using a range of LLMs. This step is essential for verifying our model’s robustness across various architectures and settings, providing deeper insights into its adaptability and performance. We also aim to incorporate context information into data annotation and utilize it in our model training process.

Author Contributions

Conceptualization, M.B.M.; methodology, M.B.M.; software, M.B.M.; validation, M.B.M.; formal analysis, M.B.M.; investigation, M.B.M.; resources, M.B.M.; data curation, M.B.M.; writing—original draft preparation, M.B.M., with suggestions from T.T., M.N. (Masafumi Nishida) and M.N. (Masafumi Nishimura); writing—review and editing, M.B.M., T.T., M.N. (Masafumi Nishida) and M.N. (Masafumi Nishimura); visualization, M.B.M.; supervision, T.T., M.N. (Masafumi Nishida) and M.N. (Masafumi Nishimura); project administration, M.B.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are not publicly available due to privacy or ethical restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Examples of LIME

Figure A1 presents additional examples of how LIME highlights influential tokens for both the baseline and proposed methods, along with their prediction probabilities for hate speech data. Subfigures (a) and (b) show LIME’s analysis for the same data point using the baseline and proposed methods, respectively. Similarly, subfigures (c) and (d) provide another comparative analysis for a different data point, following the same methodological contrast. These examples clearly demonstrate how token selection varies between the baseline and proposed methods, ultimately leading to different label predictions for each approach. This comparative visualization helps in understanding the impact of methodological changes on the model’s interpretative outputs and decision-making process.
Figure A1. LIME-marked tokens that were deemed important for classification. (a) Baseline [Example 1]. (b) Proposed method [Example 1]. (c) Baseline [Example 2]. (d) Proposed method [Example 2].

References

  1. Bozhidarova, M.; Chang, J.; Ale-rasool, A.; Liu, Y.; Ma, C.; Bertozzi, A.L.; Brantingham, P.J.; Lin, J.; Krishnagopal, S. Hate speech and hate crimes: A data-driven study of evolving discourse around marginalized groups. arXiv 2023, arXiv:2311.11163. [Google Scholar] [CrossRef]
  2. Williams, M.L.; Burnap, P.; Javed, A.; Liu, H.; Ozalp, S. Hate in the machine: Anti-Black and anti-Muslim social media posts as predictors of offline racially and religiously aggravated crime. Br. J. Criminol. 2020, 60, 93–117. [Google Scholar] [CrossRef]
  3. Gámez-Guadix, M.; Wachs, S.; Wright, M. “Haters back off!” psychometric properties of the coping with cyberhate questionnaire and relationship with well-being in Spanish adolescents. Psicothema 2020, 32, 567–574. [Google Scholar] [CrossRef] [PubMed]
  4. Wachs, S.; Krause, N.; Wright, M.F.; Gámez-Guadix, M. Effects of the Prevention Program “HateLess. Together against Hatred” on Adolescents’ Empathy, Self-efficacy, and Countering Hate Speech. J. Youth Adolesc. 2023, 52, 1115–1128. [Google Scholar] [CrossRef] [PubMed]
  5. Saha, K.; Weber, I.; De Choudhury, M. A Social Media Based Examination of the Effects of Counseling Recommendations After Student Deaths on College Campuses. In Proceedings of the International AAAI Conference on Web and Social Media, Palo Alto, CA, USA, 25–28 June 2018; pp. 320–329. [Google Scholar]
  6. Saha, K.; Chandrasekharan, E.; De Choudhury, M. Prevalence and Psychological Effects of Hateful Speech in Online College Communities. In Proceedings of the 10th ACM Conference on Web Science, Boston, MA, USA, 30 June–3 July 2019; pp. 255–264. [Google Scholar] [CrossRef]
  7. Cahill, M.; Migacheva, K.; Taylor, J.; Williams, M.; Burnap, P.; Javed, A.; Liu, H.; Lu, H.; Sutherland, A. Understanding Online Hate Speech as a Motivator and Predictor of Hate Crime, Los Angeles, California, 2017–2018; ICPSR: Ann Arbor, MI, USA, 2021. [Google Scholar] [CrossRef]
  8. de Gibert, O.; Perez, N.; García-Pablos, A.; Cuadros, M. Hate Speech Dataset from a White Supremacy Forum. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2): Association for Computational Linguistics, Brussels, Belgium, 31 October 2018; pp. 11–20. [Google Scholar]
  9. Sanguinetti, M.; Poletto, F.; Bosco, C.; Patti, V.; Stranisci, M. An Italian Twitter Corpus of Hate Speech against Immigrants. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018; European Language Resources Association (ELRA): Reykjavik, Iceland, 2018. Available online: https://aclanthology.org/L18-1443 (accessed on 7 April 2024).
  10. Qian, J.; ElSherief, M.; Belding, E.M.; Wang, W.Y. Learning to Decipher Hate Symbols. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 4 June 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3006–3015. [Google Scholar]
  11. Ousidhoum, N.; Lin, Z.; Zhang, H.; Song, Y.; Yeung, D.-Y. Multilingual and Multi-Aspect Hate Speech Analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4667–4676. [Google Scholar]
  12. Albanyan, A.; Blanco, E. Pinpointing Fine-Grained Relationships between Hateful Tweets and Replies. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 28 February–1 March 2022; Volume 36, pp. 10418–10426. [Google Scholar] [CrossRef]
  13. Mathew, B.; Saha, P.; Yimam, S.M.; Biemann, C.; Mukherjee, A. HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 14867–14875. [Google Scholar] [CrossRef]
  14. Zhang, Z.; Robinson, D.; Tepper, J.A. Detecting Hate Speech on Twitter Using a Convolution-GRU Based Deep Neural Network. In Proceedings of the Semantic Web—15th International Conference, Heraklion, Crete, Greece, 3–7 June 2018; pp. 745–760. [Google Scholar]
  15. Mishra, P.; Del Tredici, M.; Yannakoudakis, H.; Shutova, E. Author profiling for abuse detection. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 1088–1098. [Google Scholar]
  16. Qian, J.; ElSherief, M.; Belding, E.M.; Wang, W.Y. Hierarchical CVAE for Fine-Grained Hate Speech Classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 3550–3559. [Google Scholar]
  17. Qian, J.; ElSherief, M.; Belding, E.M.; Wang, W.Y. Leveraging Intra-User and Inter-User Representation Learning for Automated Hate Speech Detection. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 118–123. [Google Scholar]
  18. Gröndahl, T.; Pajola, L.; Juuti, M.; Conti, M.; Asokan, N. All You Need is “Love”: Evading Hate Speech Detection. In Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security, Toronto, ON, Canada, 15–19 October 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 2–12. [Google Scholar]
  19. Arango, A.; Pérez, J.; Poblete, B. Hate Speech Detection is Not as Easy as You May Think: A Closer Look at Model Validation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 45–54. [Google Scholar]
  20. Sap, M.; Card, D.; Gabriel, S.; Choi, Y.; Smith, N.A. The Risk of Racial Bias in Hate Speech Detection. In Proceedings of the 57th Conference of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 1668–1678. [Google Scholar]
  21. Davidson, T.; Bhattacharya, D.; Weber, I. Racial Bias in Hate Speech and Abusive Language Detection Datasets. In Proceedings of the Third Workshop on Abusive Language Online, Florence, Italy, 1 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 25–35. [Google Scholar]
  22. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  23. Zaidan, O.; Eisner, J.; Piatko, C. Using “Annotator Rationales” to Improve Machine Learning for Text Categorization; NAACL: Rochester, NY, USA, 2007; pp. 260–267. [Google Scholar]
  24. Pasa, T.A.; Nuriadi; Lail, H. An Analysis of Sarcasm on Hate Speech Utterances on Just Jared Instagram Account. J. Engl. Educ. Forum (JEEF) 2021, 1, 10–19. [Google Scholar]
  25. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd KDD, New York, NY, USA, 11 April 2016; pp. 1135–1144. [Google Scholar]
  26. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers), pp. 4171–4186. [Google Scholar]
  27. Wachs, S.; Castellanos, M.; Wettstein, A.; Bilz, L.; Gámez-Guadix, M. Associations Between Classroom Climate, Empathy, Self-Efficacy, and Countering Hate Speech Among Adolescents: A Multilevel Mediation Analysis. J. Interpers. Violence 2022, 38, 5067–5091. [Google Scholar] [CrossRef] [PubMed]
  28. Bronfenbrenner, U. The Ecology of Human Development: Experiments by Nature and Design; Harvard University Press: Cambridge, MA, USA, 1979. [Google Scholar]
  29. Bandura, A. Social Learning Theory; General Learning Press: New York, NY, USA, 1977. [Google Scholar]
  30. Kansok-Dusche, J.; Ballaschk, C.; Krause, N.; Zeißig, A.; Seemann-Herz, L.; Wachs, S.; Bilz, L. A systematic review on hate speech among children and adolescents: Definitions, prevalence, and overlap with related phenomena. Trauma Violence Abus. 2022, 24, 2598–2615. [Google Scholar] [CrossRef] [PubMed]
  31. Ajzen, I. The theory of planned behavior. Organ. Behav. Hum. Decis. Process. 1991, 50, 179–211. [Google Scholar] [CrossRef]
  32. Bandura, A.; Barbaranelli, C.; Caprara, G.V.; Pastorelli, C. Mechanisms of moral disengagement in the exercise of moral agency. J. Personal. Soc. Psychol. 1996, 71, 364–374. [Google Scholar] [CrossRef]
  33. Olteanu, A.; Castillo, C.; Boy, J.; Varshney, K.R. The Effect of Extremist Violence on Hateful Speech Online. In Proceedings of the 12th ICWSM, Stanford, CA, USA, 25–28 June 2018; pp. 221–230. [Google Scholar]
  34. Davidson, T.; Warmsley, D.; Macy, M.W.; Weber, I. Automated Hate Speech Detection and the Problem of Offensive Language. In Proceedings of the Eleventh International Conference on Web and Social Media, Montréal, QC, Canada, 15–18 May 2017; AAAI Press: Menlo Park, CA, USA, 2017; pp. 512–515. [Google Scholar]
  35. Vigna, F.D.; Cimino, A.; Dell’Orletta, F.; Petrocchi, M.; Tesconi, M. Hate Me, Hate Me Not: Hate Speech Detection on Facebook. In Proceedings of the First Italian Conference on Cybersecurity, Venice, Italy, 17–20 January 2017; Volume 1816, pp. 86–95. [Google Scholar]
  36. Founta, A.; Djouvas, C.; Chatzakou, D.; Leontiadis, I.; Blackburn, J.; Stringhini, G.; Vakali, A.; Sirivianos, M.; Kourtellis, N. Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior. In Proceedings of the Twelfth International Conference on Web and Social Media, Stanford, CA, USA, 25–28 June 2018; AAAI Press: Menlo Park, CA, USA, 2018; pp. 491–500. [Google Scholar]
  37. Douglass, S.; Mirpuri, S.; English, D.; Yip, T. “They were just making jokes”: Ethnic/racial teasing and discrimination among adolescents. Cult. Divers. Ethn. Minor. Psychol. 2016, 22, 69–82. [Google Scholar] [CrossRef] [PubMed]
  38. Hosseinmardi, H.; Mattson, S.A.; Rafiq, R.I.; Han, R.O.; Lv, Q.; Mishra, S. Detection of Cyberbullying Incidents on the Instagram Social Network. arXiv 2015, arXiv:1503.03909. [Google Scholar] [CrossRef]
  39. Razali, M.S.; Halin, A.A.; Ye, L.; Doraisamy, S.; Norowi, N.M. Sarcasm Detection Using Deep Learning with Contextual Features. IEEE Access 2021, 9, 68609–68618. [Google Scholar] [CrossRef]
  40. Ali, R.; Farhat, T.; Abdullah, S.; Akram, S.; Alhajlah, M.; Mahmood, A.; Iqbal, M.A. Deep Learning for Sarcasm Identification in News Headlines. Appl. Sci. 2023, 13, 5586. [Google Scholar] [CrossRef]
  41. Bharti, S.K.; Gupta, R.K.; Shukla, P.K.; Hatamleh, W.A.; Tarazi, H.; Nuagah, S.J. Multimodal Sarcasm Detection: A Deep Learning Approach. Wirel. Commun. Mob. Comput. 2022, 2022, 1653696. [Google Scholar] [CrossRef]
  42. Dixon, L.; Li, J.; Sorensen, J.; Thain, N.; Vasserman, L. Measuring and Mitigating Unintended Bias in Text Classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, New Orleans, LA, USA, 2–3 February 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 67–73. [Google Scholar]
  43. Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681. [Google Scholar] [CrossRef]
  44. Jacovi, A.; Goldberg, Y. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 4198–4205. [Google Scholar]
  45. DeYoung, J.; Jain, S.; Rajani, N.F.; Lehman, E.; Xiong, C.; Socher, R.; Wallace, B.C. ERASER: A Benchmark to Evaluate Rationalized NLP Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 4443–4458. [Google Scholar]
Figure 1. Ground truth attention algorithm is applied to (i) toxic data (hatespeech/offensive) and to data that is both (ii) sarcastic and toxic.
Figure 2. Attention vector calculation before the average of ground truth attention.
Figure 3. (a) Representation of the HateXplain [13] paper’s model, showing the attention supervision for a particular head in the nth layer of the BERT model. (b) Representation of the BERT model showing how the attention of the model is trained using the ground truth attention and the total loss calculation.
Figure 4. (a) Description of how the data were selected and labeled. (b) Description of a certain data point that is labeled as both hate speech and sarcasm, the selected rationale of the benchmark dataset’s annotators, and our annotators’ rationale for the data point.
Figure 5. (a) Heatmap chart of toxic and sarcasm data distribution. (b) Heatmap chart of toxic data and sarcasm data distribution, which effectively camouflaged and concealed the toxic data.
Figure 6. (a) Training data distribution for sarcasm data. (b) Training data distribution for toxic disguised under sarcasm data. (c) Validation data distribution for sarcasm data. (d) Validation data distribution for toxic disguised under sarcasm data. (e) Testing data distribution for sarcasm data. (f) Testing data distribution for toxic disguised under sarcasm data.
Figure 7. (a) Confusion matrix for baseline method’s result. (b) Confusion matrix for proposed method’s result.
Figure 8. LIME-marked tokens that were deemed important for classification. (a) Baseline. (b) Proposed method.
Table 1. Splitting of data into different sets.
Data Categories | Total Amount | Training | Validation | Testing
All category | 3199 | 2599 | 300 | 300
Toxic (Hate + Offensive) | 1637 | 1320 | 164 | 153
Sarcasm | 1487 | 1198 | 145 | 144
Toxic under Sarcasm | 755 | 608 | 79 | 68
Here, the table shows that data in both categories were not balanced and how they were distributed in different sets. Toxic under sarcasm shows which sarcasm data were used in the new method.
Table 2. Model’s performance and explainability results.
Method | Acc. ↑ | Macro F1 ↑ | AUROC ↑ | IOU F1 ↑ | Token F1 ↑ | AUPRC ↑ | Comp. ↑ | Suff. ↓
Baseline | 0.7425 | 0.6883 | 0.8264 | 0.0577 | 0.2799 | 0.3884 | 0.1789 | 0.1363
Proposed Method | 0.7492 | 0.6983 | 0.8450 | 0.0429 | 0.3232 | 0.5667 | −0.0655 | 0.0546
Acc., Macro F1, and AUROC are performance metrics; IOU F1, Token F1, and AUPRC measure plausibility; Comp. and Suff. measure faithfulness.
The symbol ↑ denotes improvement in metric values when there is an increase, whereas the symbol ↓ signifies enhancement when there is a decrease in metric values.
Table 3. Example of the rationales predicted by baseline and proposed method compared to human annotators.
Model | Text | Label
HateXplain Annotator | user he is infected with jihadi virus he will spread it to others | HS
Our Annotator | user he is infected with jihadi virus he will spread it to others | SARC
Baseline | user he is infected with jihadi virus he will spread it to others | OF
Proposed Method | user he is infected with jihadi virus he will spread it to others | HS
The orange highlight marks what the HateXplain [13] annotators found important for the prediction. The green highlight marks what our annotators found important for the prediction. The underlined words/tokens were identified by the model as important using LIME [25], and the lime highlight marks the highest probability tokens from this set for both the baseline and proposed method. (OF—offensive speech; HS—hatespeech; SARC—sarcasm).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

