Article

SOD: A Corpus for Saudi Offensive Language Detection Classification

Afefa Asiri * and Mostafa Saleh *
Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
* Authors to whom correspondence should be addressed.
Computers 2024, 13(8), 211; https://doi.org/10.3390/computers13080211 (registering DOI)
Submission received: 9 July 2024 / Revised: 13 August 2024 / Accepted: 19 August 2024 / Published: 20 August 2024
(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

Abstract

Social media platforms such as X (formerly Twitter) are integral to modern communication, enabling users to share news, emotions, and ideas. However, they also facilitate the spread of harmful content, and manual moderation at this scale is impractical. Automated moderation tools, developed predominantly for English, are insufficient for detecting offensive language in Arabic, a language that is rich in dialects and is used informally on social media. This gap underscores the need for dedicated, dialect-specific resources. This study introduces the Saudi Offensive Dialectal dataset (SOD), which consists of over 24,000 tweets annotated at three hierarchical levels. At the first level, each tweet is labeled as offensive or non-offensive; at the second, offensive tweets are categorized as general insults, hate speech, or sarcasm; at the third, hate speech is divided into subtypes related to sports, religion, politics, race, and violence. A comprehensive descriptive analysis of the SOD is also provided to characterize its composition. Using machine learning, traditional deep learning, and transformer-based deep learning models, particularly AraBERT, our research achieves an F1-score of 87% in identifying offensive language. This score improves to 91% with data augmentation techniques that address imbalances in the dataset. These results, which surpass those of many existing studies, demonstrate that a specialized dialectal dataset enhances detection efficacy compared to mixed-language datasets.
Keywords: natural language processing (NLP); Saudi dialect; offensive detection; Arabic language processing; machine learning; deep learning; computational linguistics; dialect analysis; hate speech detection; text classification; data annotation; data augmentation
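
The three annotation levels described in the abstract can be pictured as a small hierarchical label record. The following Python sketch is illustrative only: the class name `TweetLabel`, the identifier spellings, and the rule that every offensive tweet carries a type and every hate-speech tweet carries exactly one subtype are assumptions made for this example, not details of the released SOD files.

```python
from dataclasses import dataclass
from typing import Optional

# Category names follow the abstract; the identifier spellings are assumptions.
OFFENSIVE_TYPES = {"general_insult", "hate_speech", "sarcasm"}
HATE_SUBTYPES = {"sports", "religion", "politics", "race", "violence"}


@dataclass(frozen=True)
class TweetLabel:
    """Hypothetical container for one tweet's three-level annotation."""

    offensive: bool                       # level 1: offensive or not
    offensive_type: Optional[str] = None  # level 2: only for offensive tweets
    hate_subtype: Optional[str] = None    # level 3: only for hate speech

    def __post_init__(self) -> None:
        # Non-offensive tweets stop at level 1.
        if not self.offensive:
            if self.offensive_type is not None or self.hate_subtype is not None:
                raise ValueError("non-offensive tweets carry no deeper labels")
            return
        # Assumption: every offensive tweet has exactly one level-2 type.
        if self.offensive_type not in OFFENSIVE_TYPES:
            raise ValueError(f"unknown offensive type: {self.offensive_type!r}")
        if self.offensive_type == "hate_speech":
            # Assumption: every hate-speech tweet has exactly one subtype.
            if self.hate_subtype not in HATE_SUBTYPES:
                raise ValueError(f"unknown hate subtype: {self.hate_subtype!r}")
        elif self.hate_subtype is not None:
            raise ValueError("only hate speech has a level-3 subtype")


if __name__ == "__main__":
    print(TweetLabel(offensive=False))
    print(TweetLabel(True, "sarcasm"))
    print(TweetLabel(True, "hate_speech", "religion"))
```

Validating labels at construction time keeps inconsistent combinations, such as a sarcasm tweet with a hate-speech subtype, from entering a training set unnoticed.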

Asiri, Afefa, and Mostafa Saleh. 2024. "SOD: A Corpus for Saudi Offensive Language Detection Classification" Computers 13, no. 8: 211. https://doi.org/10.3390/computers13080211
