Article

Sentiment Analysis on Algerian Dialect with Transformers

by Zakaria Benmounah 1,2,*,†, Abdennour Boulesnane 3,†, Abdeladim Fadheli 2 and Mustapha Khial 2

1 LISIA Laboratory, Abdelhamid Mehri University Constantine 02, Constantine 25001, Algeria
2 NTIC Faculty, Department of Fundamental Informatics and Its Application, Constantine 2 Abdelhamid Mehri University, Constantine 25001, Algeria
3 BIOSTIM Laboratory, Medicine Faculty, Salah Boubnider University Constantine 03, Constantine 25001, Algeria
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2023, 13(20), 11157; https://doi.org/10.3390/app132011157
Submission received: 6 September 2023 / Revised: 26 September 2023 / Accepted: 5 October 2023 / Published: 11 October 2023

Abstract:
The task of extracting sentiment from text has been widely studied in the field of natural language processing. However, little work has been conducted specifically on the Arabic language with the Algerian dialect. In this research, we aim to make a significant contribution to the field of sentiment analysis on the Algerian dialect by creating a custom and relatively large dataset with a tailored deep learning model. The dataset was extracted from Algerian YouTube channels and manually annotated by the research team. We then utilize this dataset to train a state-of-the-art deep learning model for natural language processing called BERT, which is a type of Transformer model. Using this model, we were able to achieve an F1-score of 78.38% and an accuracy of 81.74% on the testing set. This demonstrates the effectiveness of our approach and the potential of using BERT for sentiment analysis on the Algerian dialect. Our model can be used to infer sentiment from any Algerian text, thus providing a valuable tool for understanding the opinions and emotions of the population. This research highlights the importance of studying the Algerian dialect and the potential of using state-of-the-art deep learning models for natural language processing in this area.

1. Introduction

Sentiment analysis, the task of determining the emotional tone of a piece of text, has been widely studied in the field of natural language processing [1,2,3]. However, there has been limited research specifically on the Arabic language with the Algerian dialect [4]. The Algerian dialect, spoken by over 40 million people, is an integral part of the country’s cultural identity [5]. Despite its widespread use in social media, online assistance, and education, local dialects like the Algerian one suffer from a lack of available data. This lack of data hinders the development of accurate sentiment analysis models and limits their potential applications.
In this research, we aim to make a significant contribution to the field of sentiment analysis on the Algerian dialect by creating a custom and relatively large dataset. The dataset was extracted from Algerian YouTube channels and manually annotated by the research team. We then utilize this dataset to train a state-of-the-art deep learning model known as BERT (Bidirectional Encoder Representations from Transformer) [6]. BERT is a type of Transformer model [7] that has recently demonstrated exceptional efficiency and accuracy in the field of Arabic sentiment analysis, highlighting its significance and influence [8,9,10,11,12,13]. Our goal is to use this model to infer sentiment from any Algerian text, providing a valuable tool for understanding the opinions and emotions of the population.
The dataset we created for this research can be used for sentiment analysis and other natural language processing tasks such as text summarization, machine translation, and more [14]. By creating a dataset specific to the Algerian dialect, we can improve the performance of the models trained on it and make them more accurate for the dialect.
Using BERT, we were able to achieve an F1-score of 78.38% and an accuracy of 81.74% on the testing set, demonstrating the effectiveness of our approach and the potential of using BERT for sentiment analysis on the Algerian dialect. This model can be used in various applications such as social media monitoring, customer service, and opinion mining. Furthermore, understanding the sentiment expressed in the Algerian dialect can provide valuable insights into the culture, the people, and their values.
This research highlights the importance of studying the Algerian dialect and the potential of using state-of-the-art deep learning models for natural language processing in this area. Additionally, our research is a step towards addressing the lack of available data on local dialects and making the most of the opportunities that technology can offer for underdeveloped countries like Algeria. This can be beneficial for businesses and marketers, as it can help them better understand their target audience and improve their products and services. Furthermore, it can be a valuable tool for decision making and policy making.
Overall, this research provides valuable insights and contributes to the advancement of natural language processing and the understanding of language and culture. It also opens the door for further studies on this topic and can be used as a foundation for future research in this field.
The structure of the remaining sections in this paper is as follows: The problem statement and research objectives are outlined in Section 2. Section 3 outlines the data collection and analysis process, delving into both our gathered corpus and the pre-existing datasets. A comprehensive exposition of the proposed transformers-based approach is provided in Section 4. The outcomes of our conducted experiments are thoroughly examined and discussed in Section 5. Concluding remarks and prospective directions for future work are encapsulated in Section 6.

2. Problem Statement and Research Objectives

Sentiment analysis has long been a formidable challenge, particularly when striving for a profound comprehension of nuanced behaviors [15,16]. Historically, sentiment analysis models struggled to surpass the 80% accuracy threshold. However, in contemporary times, these models have made remarkable strides, achieving accuracy rates upwards of 95% in tasks such as IMDB review classification [17].
However, this progress has not extended to sentiment analysis in Arabic, despite its usage by over 400 million people. Most Arabic sentiment analysis models continue to fall short of the 90% accuracy mark, even on relatively simple tasks, as evidenced by survey studies conducted in [18] and detailed in [19]. These studies revealed that only 20% and 30% of Arabic sentiment analysis systems, respectively, have managed to surpass this accuracy threshold. Furthermore, these systems struggle to achieve 88% accuracy when confronted with more complex tasks like mental health analysis, as proven in [20].
A further challenge is that discerning sentiment and extracting the fundamental message from Arabic content demands substantially greater effort. This complexity is exacerbated by the multitude of dialects within the Arabic language, with variants like the Algerian dialect adding further layers of intricacy to the challenges at hand.
Based on the presented challenges, this research aims to accomplish the following broad objectives:
  • Creating an extensive and specialized corpus of Algerian Arabic text samples to enhance Arabic language resources and strengthen natural language processing capabilities for this specific dialect.
  • Investigate and implement pre-processing techniques that are specifically optimized for the Algerian Arabic dialect, considering the unique linguistic features and nuances of this dialect.
  • Emphasize the potential of state-of-the-art deep learning models, like BERT, for natural language processing in this specific linguistic and cultural context.
  • Ensure that the developed model can be applied to infer sentiment from any Algerian text, making it a practical tool for understanding the opinions and emotions of the Algerian population.

3. Data Collection and Analysis

In this section, we first discuss existing sentiment analysis datasets for Arabic and the Algerian dialect and compare them to ours. Next, we describe how we collected and annotated our Algerian dialect Arabic sentiment dataset. After that, we share some analysis we performed on the dataset. Finally, we explain the preprocessing applied before feeding the data to our deep learning model.

3.1. Existing Datasets

Arabic is a semitic language of the Arabs, spoken by more than 400 million people throughout the world, mainly in North Africa and the Middle East [21]. A lot of dialects have emerged over the centuries, including the Maghreb dialect (mainly spoken in Algeria, Tunisia, Libya, Morocco, Western Sahara and Mauritania), the Levantine dialect (spoken in Palestine, Jordan, Syria, and Lebanon), and the Egyptian and Iraqi-gulf dialects [22].
Among these dialect families, many datasets have appeared over the years. Here are a few:
  • ASAD [23]: A public dataset extracted from Twitter intended to accelerate research in Arabic NLP in general and Arabic sentiment classification in specific.
  • ATSAD [24]: A multi-dialect dataset that was automatically annotated using emojis.
  • TSAC [25]: A Tunisian dialect dataset that was extracted from Facebook comments.
  • SIACC [26]: Sentiment Polarity Identification on Arabic Algerian Newspaper Comments.
  • Ref. [27]: 1000 reviews collected from Algerian press.
  • Ref. [28]: Posts and comments collected from Algerian Facebook pages.
  • TWIFIL [29]: Annotated tweets collected between 2015 and 2019 were obtained from different geo-locations in Algeria with the help of 26 annotators.
  • Ref. [20]: 21,885 posts were collected from Algerian public groups on Facebook.
  • SentiALG [30]: An automatically constructed Algerian sentiment corpus containing 8000 messages.
Table 1 summarizes these datasets along with their size, available classes, annotation approach, and dialect. Our dataset appears in the last row of Table 1.
A lot of relatively large datasets in multi-dialect Arabic have appeared, and as far as we know, our dataset is among the few Algerian datasets that are sufficient for a deep learning approach in sentiment analysis. Our objective is to develop a model that is versatile and capable of performing sentiment analysis across different domains, thereby providing a more holistic and generalized insight. We believe that focusing on a singular topic, while valuable, could potentially limit the model’s adaptability and overall understanding of diverse sentiments, possibly leading to a biased and narrow perspective.

3.2. Collected Dataset’s Statistics and Properties

For our dataset, we have used the YouTube API to extract comments sorted by relevance from more than 30 Algerian press channels. The dataset consists of 45,000 annotated comments, with three classes of sentiments, namely negative, neutral, and positive; we have split the dataset into training (70%), validation (10%), and testing (20%) sets. Table 2 presents the basic statistics of the total dataset, as well as splitting parts (training, validation, and testing) along with the different classes.
Emojis play a key role in sentiment analysis, as they directly affect a sentence’s sentiment. Our dataset includes 6832 samples containing emojis, i.e., 15.18% of the total samples.
The Algerian dialect on social media features a lot of code-switching: some Algerians prefer Latin characters, while most use Arabic letters. Table 3 shows the number of sentences written in each script in our dataset; this makes the classification task even harder.
Most of the comments used Arabic letters (84.68%), followed by 12.66% using Latin characters, while 2.65% used Arabic and Latin interchangeably.
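The script statistics above can be tallied with a simple Unicode-range check. The sketch below is illustrative only: the chosen ranges (the basic Arabic block and ASCII letters) and the handling of digits and punctuation are our assumptions, not the paper’s exact procedure.

```python
import re

ARABIC = re.compile(r'[\u0600-\u06FF]')   # basic Arabic Unicode block
LATIN = re.compile(r'[A-Za-z]')           # ASCII letters only

def detect_script(comment: str) -> str:
    """Classify a comment as 'arabic', 'latin', or 'mixed' (code-switched)."""
    has_arabic = bool(ARABIC.search(comment))
    has_latin = bool(LATIN.search(comment))
    if has_arabic and has_latin:
        return "mixed"
    if has_arabic:
        return "arabic"
    if has_latin:
        return "latin"
    return "other"   # e.g. emoji-only or numeric comments
```

Counting `detect_script` over all 45,000 comments would reproduce the three percentages reported above.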

3.3. N-Gram Analysis

We analyzed our dataset for the most common unigrams and bigrams [31]. Table 4 reveals the top 10 most common unigrams for each class, as well as for the whole dataset; note that both Arabic and Algerian dialect stop words were removed. Table 5 shows the top 10 most common bigrams in each class and in the whole dataset.
Before concluding the top n-grams in the above tables, we performed some text preprocessing, including stop word removal; basic stemming, such as removing cumulative redundant words; and cleaning punctuation.
Tables 4 and 5 contain predictable words that are very common in Algerian text, especially in social media comments. Interestingly, the negative and positive class columns of both tables contain some negative and positive words, respectively.
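The n-gram counts behind Tables 4 and 5 can be reproduced with a few lines using `collections.Counter`. This is a sketch under simplifying assumptions: tokenization here is plain whitespace splitting, and the stop-word list is assumed to be supplied by the caller.

```python
from collections import Counter

def top_ngrams(texts, n=1, k=10, stop_words=frozenset()):
    """Return the k most common n-grams across texts after stop-word removal."""
    counts = Counter()
    for text in texts:
        tokens = [t for t in text.split() if t not in stop_words]
        # Slide a window of size n over the token list.
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)
```

Calling `top_ngrams(comments, n=1)` and `top_ngrams(comments, n=2)` per class yields tables of the shape shown above.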

3.4. Data Preprocessing

The preprocessing step plays a crucial role in sentiment analysis, as it involves removing any irrelevant or noisy data before performing sentiment classification [32].
During model training and fine-tuning, we have experimented with various preprocessing tasks and have come to conclude that the following tasks improve the classification performance:
  • Removing HTML tags: When extracting comments from YouTube, some comments come with tags such as </br> and more; removing them improves our primary metrics.
  • Replacing URLs, email addresses, and phone numbers: some comments include these, such as spam and promotional comments; instead of removing them, we have replaced them with a special token.
  • Inserting spaces between emojis: Emojis are useful for detecting sentiment in sentences, so we did not remove them. Instead, we insert white space around each emoji so that each one is treated as an individual token.
  • Normalizing special Arabic characters: In this task, we have replaced several Arabic characters with other more common characters, such as converting آ, أ, and إ to ا.
As mentioned, only the above four tasks proved useful. We also tried many other preprocessing tasks, including removing emojis, stripping Arabic tashkeel and tatweel, converting Latin characters to Arabic characters and vice versa, and removing redundant punctuation. Surprisingly, none of these improved the performance, and some made it even worse; therefore, we did not use them during model training and evaluation. We will see more on this in the experimental results section.
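The four retained tasks can be sketched as a small pipeline. The regexes, the `[LINK]`/`[EMAIL]`/`[PHONE]` token names, and the emoji ranges below are our assumptions for illustration, not the exact implementation used in this work.

```python
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')
EMAIL_RE = re.compile(r'\S+@\S+\.\S+')
PHONE_RE = re.compile(r'\+?\d[\d\s\-]{7,}\d')
HTML_RE = re.compile(r'<[^>]+>')
# A few common emoji blocks; a full implementation would cover more ranges.
EMOJI_RE = re.compile(r'([\U0001F300-\U0001FAFF\u2600-\u27BF])')

def preprocess(text: str) -> str:
    text = HTML_RE.sub(' ', text)            # 1. strip HTML tags such as </br>
    text = URL_RE.sub(' [LINK] ', text)      # 2. replace URLs/emails/phones
    text = EMAIL_RE.sub(' [EMAIL] ', text)   #    with special tokens
    text = PHONE_RE.sub(' [PHONE] ', text)
    text = EMOJI_RE.sub(r' \1 ', text)       # 3. isolate each emoji as a token
    for ch in 'آأإ':                          # 4. normalize alef variants to ا
        text = text.replace(ch, 'ا')
    return ' '.join(text.split())            # collapse the extra whitespace
```

The function is order-sensitive by design: HTML is stripped before tokens are substituted, and whitespace is normalized last.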

4. Classification of Algerian YouTube Comments Using Transformers

Figure 1 shows the complete workflow for our work in a flowchart.
In total, there are six stages in our proposed work, and in this section, we will explain each of these:
  • Extracting comments from Algerian YouTube channels using the YouTube API: In this stage, we used the YouTube API to extract a large number of comments into a PostgreSQL database with the help of the Django ORM.
  • Annotating the comments on the annotation GUI site: Here, we used our GUI to manually annotate the comments into one of the three classes: negative, neutral, or positive.
  • Performing preprocessing on the collected dataset: After gathering enough data, preprocessing was the next phase, where we tried to clean our text as much as we can to be understandable by our model.
  • Training a deep learning model on the dataset: In this phase, we trained a deep learning model on the dataset; more detail is given in the Model Performance section.
  • Evaluating the model: After we trained the model, we needed to calculate our evaluation metrics (which we will see in the next section) on the testing set to determine whether we can make improvements. If so, we can change the preprocessing technique and the way we train the model, such as changing the model architecture, weights, or training parameters, and then evaluate it once again. If we cannot improve the model further, we go on to the next stage, which is model deployment.
  • Deploying our model for inference: In this stage, we deploy our model in such a way that it is ready for real-world use in a production environment; more detail on this is provided in the Model Deployment and Model Inference sections.
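In stage one, each page returned by the YouTube Data API v3 `commentThreads.list` endpoint (requested with `part="snippet"` and `order="relevance"`) nests the comment text a few levels deep. A minimal helper for flattening one response page might look like the following sketch; the nesting follows the public API response format, and the storage side (PostgreSQL via the Django ORM) is omitted.

```python
def extract_comments(api_response: dict) -> list[str]:
    """Pull top-level comment texts out of one commentThreads.list response page."""
    return [
        item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for item in api_response.get("items", [])
    ]
```

Pagination works the same way: each response carries a `nextPageToken` that is passed to the next request until it is absent.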

4.1. Evaluation Metrics

The primary metric we used during evaluation and experimentation is the macro-averaged F1-score across the three classes (negative, neutral, and positive), given by the equation [33]:

$$\mathrm{AvgF1}_{Macro} = \frac{1}{3}\left(F1_{Neg} + F1_{Neu} + F1_{Pos}\right)$$

where $F1_{Neg}$, $F1_{Neu}$, and $F1_{Pos}$ are the F1-scores for the negative, neutral, and positive classes, respectively, each being the harmonic mean of the precision and recall metrics:

$$F1_{Neg} = \frac{2 \cdot P_{Neg} \cdot R_{Neg}}{P_{Neg} + R_{Neg}}, \qquad F1_{Neu} = \frac{2 \cdot P_{Neu} \cdot R_{Neu}}{P_{Neu} + R_{Neu}}, \qquad F1_{Pos} = \frac{2 \cdot P_{Pos} \cdot R_{Pos}}{P_{Pos} + R_{Pos}}$$

Here, $P_{Neg}$, $P_{Neu}$, and $P_{Pos}$ are the precision scores, and $R_{Neg}$, $R_{Neu}$, and $R_{Pos}$ are the recall scores for the negative, neutral, and positive classes, respectively. For a given class, precision and recall are defined as:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

where $TP$ is the number of true positives of that particular class, $FP$ is the number of false positives, and $FN$ is the number of false negatives. After simplification, each per-class F1-score can also be written as:

$$F1 = \frac{TP}{TP + \frac{1}{2}\left(FP + FN\right)}$$

Accuracy also appears in this work; it is given by:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TP + TN}{\text{Total number of samples}}$$

where $TN$ is the total number of true negatives.
Since our dataset is not balanced, as shown in the data analysis section, we take the macro-averaged F1-score as our primary metric. We also report accuracy scores occasionally, but all decisions are made based on the macro-averaged F1-score.
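The macro-averaged F1 computation above can be sketched in a few lines. The per-class confusion counts in the usage example below are made-up illustrative numbers, not results from this work.

```python
def f1_per_class(tp: int, fp: int, fn: int) -> float:
    """F1 = 2PR / (P + R), computed from one class's confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(counts: dict) -> float:
    """Unweighted mean of per-class F1 over the (tp, fp, fn) tuples in counts."""
    return sum(f1_per_class(*c) for c in counts.values()) / len(counts)
```

For example, `macro_f1({"neg": (8, 2, 2), "neu": (5, 5, 5), "pos": (9, 1, 1)})` averages per-class F1-scores of 0.8, 0.5, and 0.9.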

4.2. Model Performance

In this section, we will share the final model metrics, including F1-score, accuracy, and the confusion matrix.
The training was based on the Arabic pre-trained BERT model [34]. After the extensive experiments presented in the experimental results section, we arrived at a reliable model that correctly classifies 81.74% of the sentences in the test set (accuracy), while its F1-score reached 78.38% on the same set.
Figure 2a shows the F1-score of the validation set during training, whereas Figure 2b shows the accuracy score.
As shown in the F1-score and accuracy figures, the F1-score reached its highest value at the last step, after more than 5500 training steps, while accuracy peaked after 2000 training steps. Since the F1-score is our primary metric, we loaded the training weights from step 5500 and then calculated the metrics on the testing set.
Figure 3 shows the confusion matrix of the optimal model in the testing set.
As you can see, out of all predicted positives, 84.78% were classified correctly; out of all predicted negatives, 81.88% were classified correctly; and out of all predicted neutrals, 72.98% were classified correctly.
The model did very well on the positive and negative classes but struggled comparatively on the neutral class, for several reasons, including the imbalance of the dataset: only around 22% of the total samples are neutral, including controversial sentences that were mistakenly annotated as neutral by our annotators.

4.3. Model Deployment

After we successfully trained and evaluated our model, we deployed our best model into Huggingface (https://huggingface.co, accessed on 17 February 2023) model hosting. As a result, we were able to use the inference API without the need to install heavy libraries or acquire high-cost machines for inference.
Huggingface hosts any model for free, and its free tier of the inference API allows 30,000 characters per month. Thus, we were able to run many examples (including challenging and hard ones) through our model for inference. In the next section, we share some interesting examples we observed.
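For text classification, the hosted inference endpoint returns, per input, a list of label/score dicts. A small helper can pick the winning class from one such response; the exact label strings below are our assumption based on this work’s three classes.

```python
def top_sentiment(api_output) -> tuple[str, float]:
    """Pick the highest-scoring label from a text-classification response.

    The response is a list per input, each a list of {"label", "score"} dicts,
    e.g. [[{"label": "positive", "score": 0.996}, {"label": "neutral", ...}]].
    """
    scores = api_output[0]                      # first (and only) input
    best = max(scores, key=lambda d: d["score"])
    return best["label"], best["score"]
```

The payload itself is obtained with an authenticated HTTP POST of `{"inputs": text}` to the model’s inference endpoint.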

4.4. Model Inference

For the model inference, we made a Colab (https://colab.research.google.com, accessed on 17 September 2023) notebook to use the inference API offered by Huggingface on our trained model. Now, let us start off with some examples of positive samples:
الجزائري المطبخ واخرجت العالم كل في شرفتنا الرئعة القنات هذه على سميرة السيدة نشكر نحب انا المولد بمناسبة بخير والجزائر عام كل العالم في واحد كل رقم هم الجزائرية والحلويات الطبخ العالمية الى الشريف النبوي (I would like to thank Mrs. Samira for this wonderful channel that honored us all over the world and brought Algerian cuisine to the world. Algerian cooking and sweets are number one in the world. Happy new year to Algeria on the occasion of the birth of the Prophet)
Figure 4a shows the result of the inference in a bar plot.
The model is 99.6% confident that the sentence is a positive sentence despite the typos. Taking a second positive example:
لوكان الطريق بعيدة هي كم لنواكشط تلحقو ويقتاش تتسناو رانا لكن روعة تݒ شيء كل يجارك الله نهاركم في لحقتو راكم رباعية سيارة جات (God bless everything, tooooop, it’s amazing, but we can wait and wait to catch up with Nouakchott. How far is the road?)
This sentence includes both Latin and Arabic letters; Figure 4b shows the result of the model prediction.
Even though most of the sentence is neutral, a few words made it a positive-sentiment sentence, and the model successfully captured that. Next, let us look at some neutral comments: الاولى  الحلقة مل فيه نتفرج راني الكورونا تع فلعطلة الفيد قيت ل اما (When I’m free in the holiday with Corona virus, I watched it as the first episode).
The result of the inference is given in Figure 5a.
Here is a second neutral sentence: هادو حلويات جزائرية (these are Algerian sweets).
Figure 5b shows the prediction results.
Another common type of neutral sentences are questions and inquiries: تحط لم لماذا انجليزي يعرف لا اغلب ترجمة مع فيدو (Why you haven’t added the translation most people don’t know English).
Figure 5c shows the prediction results.
The following is another sentence that is a question: !؟وين بلاصة إكزاكت (Which place exactly!).
Figure 5d shows the prediction results.
Next, we will show some negative sentences: وانت حاجة روحك حاسبة نتي حاجة نقولك زيرو (I tell you something, you think you are something, but you are zero).
Figure 6a shows the prediction results.
As expected, the sentence is clearly negative. Here is another example of the negative class: بينا بخصتو عيب والله  (shame on you, you embarrassed us).
Figure 6b shows the inference results.
Now, let us make it harder by trying challenging sentences. The sentence below is negative but does not include any obvious negative words:
اللي المعدل نجيب تخيلت وجامي طاقتي فوق وقريت سطاش، معدل على عالم اللي وربي خدمت الوكيل ونعم الله حسبنا بالنقاط تلاعبو الا عليه تحصلت (I have worked to get the mark of 16 and God knows that, I over-studied and I have never imagined that I will get this mark, if they have messed up with our marks, Allah suffices me, for He is the best disposer of affairs).
Figure 6c shows the inference results.
In this sentence, the model is not entirely certain that the writer’s sentiment is negative: it is 70.2% confident the sentence is negative and 28% confident it is neutral.
During the classification of various sentences that the model never saw, it is quite interesting to see such hard sentences being either correctly or incorrectly classified. Here are some examples: اووو خسارة الحلقة الاخييييييييييييييييرة يون (Oh, the last episode, such a pity).
Figure 7a shows the inference results.
The comment’s author is definitely upset that this is the last episode of their favorite show, yet the model classified it as positive with 53.3% confidence because the writer certainly likes the show.
Here is another challenging sentence: العزيز خويا يا نتا فيك تجي اي لخر  مع (at last, it will happen to you my brother).
Figure 7b shows the inference results.
Even though the sentence does not contain any negative words, and even contains the positive word العزيز (Dear), it is still classified as negative with 84.4% confidence.

5. Experimental Results and Discussions

Setting up an environment for training a deep learning model and conducting experiments on parameters with custom data requires a range of skills and tools. In this section, we describe the tools and working environment used for this work. Next, we discuss the settings and fixed parameters used during the experiments. After that, we share our model’s performance results under various parameter tweaks and fine-tuning.

5.1. Experiment Setting

During experimentation and hyperparameter fine-tuning, there are many parameters to tweak in search of the combination that achieves the highest possible F1-score. In this section, we discuss the training and preprocessing parameters that were held fixed during the experiments reported in the next section.
  • Sequence length: The maximum length of a sequence in the input text. We set it to 180 for our dataset, which means we truncate longer sentences into 180 tokens and pad shorter sentences with zeros on the left. More than 93% of the sentences in our dataset have at most 180 tokens. In the next section, we will experiment with this parameter too.
  • Learning rate: The configurable hyperparameter used to train neural networks during the backpropagation phase. For large models like Transformers, it is often between $10^{-6}$ and $10^{-4}$, and may increase or decrease depending on the model and data sizes. We ran the Optuna library for more than 100 trials to find the best learning rate, and $1.07 \times 10^{-5}$ was chosen.
  • Training epochs: An epoch trains the whole training dataset for one cycle. We chose three as our number of training epochs.
  • Batch size: A hyperparameter that controls the number of samples to feed into the neural network in a single training iteration. Again, optuna chose 16.
  • Warmup steps: The number of steps at the beginning of the training, where we use a very low learning rate to avoid early overfitting and also to slowly start fine-tuning the weights in the transformers. We have over 5900 total training iterations, and 800 was set as the number of warmup steps.
Since we are only fine-tuning BERT, there is not much we can change about the model architecture, as the model was pre-trained on a large amount of text, and we just fine-tuned it on our dataset.
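The training schedule implied by the fixed parameters above can be reproduced with a few lines of arithmetic. This is a sketch; the exact step count may differ slightly depending on how the data loader handles the final partial batch.

```python
import math

# 70% of the 45,000 comments form the training split.
train_samples = int(45_000 * 0.70)   # 31,500 comments
batch_size = 16                      # chosen by Optuna
epochs = 3

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs   # matches "over 5900 training iterations"
warmup_steps = 800
warmup_fraction = warmup_steps / total_steps   # share of training spent warming up
```

With these values, warmup covers roughly the first 13–14% of training.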

5.2. Results and Analysis

As a first experiment, we tried different weights and architectures, where we trained and evaluated an LSTM model, BERT base, BERT large [6], BERT Arabic mini, BERT Arabic medium, BERT Arabic base, and BERT Arabic large [34]. Table 6 shows the different model versions and their corresponding number of parameters, training time, and evaluation on F1 and accuracy scores on the test set.
The large version of Arabic BERT outperforms all other models, with a 0.7838 F1-score. The base version comes in second place, and medium in third. LSTM obtained an F1-score of 0.7399, being fourth in the ranking.
Even though the large BERT Arabic model is optimal in terms of F1-score and accuracy, it is significantly larger and takes about two hours to train on the high-performance GPU offered by Google Colab.
The original BERT weights obviously underperformed because the model was pre-trained on English text instead of Arabic, and since 45,000 samples are not enough for language understanding, even using the large BERT did not increase the F1-score.
The LSTM model has 2 LSTM layers of 128 units, a sequence length set to 180, and an embedding size set to 100. Bi-LSTM uses the same parameters but with bidirectional LSTM cells.
The second experiment tweaks the sequence length parameter explained in the previous subsection. Table 7 shows the training of large Arabic BERT using different sequence length values.
As shown in Table 7, a sequence length of 180 is optimal. Increasing the parameter slows down training as expected, but what is not expected is that the F1-score decreases beyond 180. This is because longer comments in our dataset tend to have mixed sentiments or be neutral, which can be quite hard for the model to determine.
Our last experiment is trying to determine the best combination of preprocessing tasks to increase the model performance further. In Table 8, the first row shows the F1 and accuracy scores of the model in the testing set without any preprocessing. Each row after that corresponds to one of the following preprocessing tasks:
  • Removing elongation: This task collapses redundant letters; words like “braaaavooooo” are transformed into “braavoo”, keeping at most two consecutive repeated characters.
  • Replacing URLs, phone numbers, and emails: This task is straightforward; since URLs, phone numbers, and emails do not contribute to the sentiment of the sentence, they are simply replaced by a special token.
  • Removing HTML: This one removes HTML tags from all sentences.
  • Removing Emojis: Removing all types of emojis from text.
  • Inserting spaces between emojis: A lot of comments come with multiple emojis that are attached to each other. This task inserts spaces between them to help the WordPiece tokenizer [35] learn these emojis as individual tokens.
  • Converting Latin to Arabic: As mentioned earlier, many Algerians use Latin characters to express their opinions on social media. As a result, we constructed a simple algorithm to convert Arabic words written in Latin characters into Arabic characters. Here are some examples:
    (a)
    Rbi y7fdhk → ربي يحفظك (May God protect you)
    (b)
    Ro7 khlih → روح خليه (Let him)
  • Removing redundant punctuation: Some comments have multiple punctuation marks between two words. In this task, we simply allow only one punctuation mark between two words.
  • Removing stop words: Arabic and some Algerian stop words were removed entirely in this task.
  • Removing redundant words: Some comments are basically spam, repeating the word dozens of times. This task only allows two redundant words as a maximum.
  • Balancing the dataset: As discussed in Section 3.2, 22.4% of the total labeled samples are neutral sentiments, 35.5% are negative, and 42.1% are positive; this task takes the lowest percentage to apply to all classes and discards the others so that we end up with a balanced dataset.
  • Normalizing special Arabic characters: Some Arabic characters need to be normalized, such as replacing أ, آ, and إ with ا.
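Elongation removal (task 1 above) reduces to a single regular expression: collapse any run of three or more identical characters down to exactly two. A minimal sketch:

```python
import re

def remove_elongation(text: str) -> str:
    """Collapse runs of 3+ identical characters to exactly 2 (task 1 above)."""
    return re.sub(r'(.)\1{2,}', r'\1\1', text)
```

The backreference `\1` makes this work for Arabic and Latin characters alike, turning “braaaavooooo” into “braavoo”.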
Table 8. Evaluation metrics using various different preprocessing tasks.
Task ID | Preprocessing Task                        | F1-Score | Accuracy
/       | No preprocessing                          | 0.7764   | 0.8078
1       | Removing elongation                       | 0.7781   | 0.8086
2       | Replacing URLs, phone numbers, and emails | 0.7789   | 0.8094
3       | Removing HTML                             | 0.7812   | 0.8112
4       | Removing emojis                           | 0.7763   | 0.8074
5       | Inserting spaces between emojis           | 0.7804   | 0.8107
6       | Converting Latin characters to Arabic     | 0.7788   | 0.8076
7       | Removing redundant punctuation            | 0.7786   | 0.8087
8       | Removing stop words                       | 0.6440   | 0.6881
9       | Removing redundant words                  | 0.7758   | 0.8072
10      | Balancing the dataset                     | 0.7653   | 0.7680
11      | Normalizing special Arabic characters     | 0.7776   | 0.8094
Removing emojis reduced the F1-score a little bit. Generally, emojis help determine the sentiment of the sentence, and task 5 proves this, where we add spaces between emojis to help the tokenizer, improving the F1-score by 0.004.
Removing stop words reduced the performance significantly. A loss of about 0.13 is seen in the F1-score, and that is expected, as stop words include negation words and other similar words that strongly influence the sentiment. For instance, the word “ليس” (not) is considered a stop word, so a sentence like “هذا ليس جميل إطلاقا” (This is not beautiful at all) would be reduced to “جميل” (beautiful/nice), since all the other words are stop words, and the meaning is totally reversed. Therefore, stop word removal clearly does not work for the sentiment analysis task.
Balancing the dataset also reduced the performance, because many samples were discarded. As expected, with balanced classes the accuracy becomes a reliable metric and is very close to the F1-score.
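Task 10 amounts to downsampling every class to the size of the smallest one; a sketch under that assumption (the helper name is ours, not from the paper's code):

```python
import random

def downsample_to_smallest(samples, seed=42):
    """samples: list of (text, label) pairs. Keep only as many examples per
    class as the smallest class has; surplus examples are discarded at random."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)
    n = min(len(texts) for texts in by_label.values())  # size of smallest class
    balanced = []
    for label, texts in by_label.items():
        rng.shuffle(texts)
        balanced.extend((t, label) for t in texts[:n])
    return balanced
```

Applied to the splits in Table 2, every class would be cut down to the neutral-class count, which explains the loss of training data noted above.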
After experimenting with the individual preprocessing tasks, only tasks 1, 2, 3, 5, 6, 7, and 11 (the highlighted ones in Table 8) were retained as candidates for combination. Table 9 shows the highest-performing combinations.
As highlighted in Table 9, the combination of preprocessing tasks 2, 3, 5, and 11 performed the best with scores of 0.7838 and 0.8174 for F1-score and accuracy, respectively.
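Assuming each task is a standalone string transform, the winning chain (tasks 2, 3, 5, and 11) could be composed as follows. All helper names and the placeholder regexes are ours, restated here so the snippet is self-contained, and are not taken from the paper's code:

```python
import re

def replace_urls_phones_emails(text):                    # task 2
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"\S+@\S+\.\S+", " ", text)            # e-mail addresses
    return re.sub(r"\+?\d[\d\s-]{7,}\d", " ", text)      # phone-like digit runs

def remove_html(text):                                   # task 3
    return re.sub(r"<[^>]+>", " ", text)

def space_out_emojis(text):                              # task 5
    return re.sub("([\U0001F300-\U0001FAFF\u2600-\u27BF])", r" \1 ", text)

def normalize_arabic(text):                              # task 11
    return re.sub("[أآإ]", "ا", text)

def preprocess(text):
    """Apply the best-performing combination from Table 9, in order."""
    for step in (replace_urls_phones_emails, remove_html,
                 space_out_emojis, normalize_arabic):
        text = step(text)
    return re.sub(r"\s+", " ", text).strip()
```

The output of `preprocess` would then be fed to the tokenizer before fine-tuning.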

6. Conclusions and Future Work

In conclusion, this research aimed to make a significant contribution to the field of sentiment analysis on the Algerian dialect. By creating a custom and relatively large dataset extracted from Algerian YouTube channels and manually annotated by the research team, we were able to train BERT, a state-of-the-art deep learning model for natural language processing. Our results showed that BERT achieved an F1-score of 78.38% and an accuracy of 81.74% on the testing set, demonstrating the effectiveness of our approach and the potential of using BERT for sentiment analysis on the Algerian dialect. The ability to infer sentiment from any Algerian text provides a valuable tool for understanding the opinions and emotions of the population. This research highlights the importance of studying the Algerian dialect and the potential of state-of-the-art deep learning models for natural language processing in this area. It also shows that more research is needed on sentiment analysis for Arabic dialects, specifically the Algerian dialect, and it can serve as a foundation for future studies in this field. In future work, there is substantial scope to enhance the model’s robustness and generalizability by incorporating a more diverse and extensive dataset representing various Algerian dialects and by exploring more advanced models to ascertain their efficacy relative to BERT in this specific context.

Author Contributions

Conceptualization, Z.B.; methodology, Z.B., A.B. and A.F.; software, A.F. and M.K.; validation, A.B. and Z.B.; investigation, A.F. and M.K.; data curation, A.F. and M.K.; writing—original draft preparation, A.F., M.K., Z.B. and A.B.; writing—review and editing, Z.B. and A.B.; visualization, Z.B. and A.F.; supervision, A.B. and Z.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available for download from: https://data.mendeley.com/datasets/zzwg3nnhsz/1.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Medhat, W.; Hassan, A.; Korashy, H. Sentiment analysis algorithms and applications: A survey. Ain Shams Eng. J. 2014, 5, 1093–1113. [Google Scholar] [CrossRef]
  2. Meena, G.; Mohbey, K.K.; Indian, A.; Khan, M.Z.; Kumar, S. Identifying emotions from facial expressions using a deep convolutional neural network-based approach. Multimed. Tools Appl. 2023, 1–22. [Google Scholar] [CrossRef]
  3. Mohbey, K.K.; Meena, G.; Kumar, S.; Lokesh, K. A CNN-LSTM-Based Hybrid Deep Learning Approach for Sentiment Analysis on Monkeypox Tweets. New Gener. Comput. 2023, 1–19. [Google Scholar] [CrossRef]
  4. Boulesnane, A.; Saidi, Y.; Kamel, O.; Bouhamed, M.M.; Mennour, R. DZchatbot: A Medical Assistant Chatbot in the Algerian Arabic Dialect using Seq2Seq Model. In Proceedings of the 2022 4th International Conference on Pattern Analysis and Intelligent Systems (PAIS), Oum El Bouaghi, Algeria, 12–13 October 2022. [Google Scholar] [CrossRef]
  5. Mansouri, A. Algeria between Tradition and Modernity: The Question of Language; State University of New York: Albany, NY, USA, 1991. [Google Scholar]
  6. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar] [CrossRef]
  7. Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A survey of transformers. AI Open 2022, 3, 111–132. [Google Scholar] [CrossRef]
  8. Aftan, S.; Shah, H. Using the AraBERT Model for Customer Satisfaction Classification of Telecom Sectors in Saudi Arabia. Brain Sci. 2023, 13, 147. [Google Scholar] [CrossRef] [PubMed]
  9. Alshehri, W.; Al-Twairesh, N.; Alothaim, A. Affect Analysis in Arabic Text: Further Pre-Training Language Models for Sentiment and Emotion. Appl. Sci. 2023, 13, 5609. [Google Scholar] [CrossRef]
  10. Alruily, M.; Fazal, A.M.; Mostafa, A.M.; Ezz, M. Automated Arabic Long-Tweet Classification Using Transfer Learning with BERT. Appl. Sci. 2023, 13, 3482. [Google Scholar] [CrossRef]
  11. Almaliki, M.; Almars, A.M.; Gad, I.; Atlam, E.S. ABMM: Arabic BERT-Mini Model for Hate-Speech Detection on Social Media. Electronics 2023, 12, 1048. [Google Scholar] [CrossRef]
  12. Sabbeh, S.F.; Fasihuddin, H.A. A Comparative Analysis of Word Embedding and Deep Learning for Arabic Sentiment Classification. Electronics 2023, 12, 1425. [Google Scholar] [CrossRef]
  13. Al Shamsi, A.A.; Abdallah, S. Ensemble Stacking Model for Sentiment Analysis of Emirati and Arabic Dialects. J. King Saud Univ. -Comput. Inf. Sci. 2023, 35, 101691. [Google Scholar] [CrossRef]
  14. Khurana, D.; Koli, A.; Khatter, K.; Singh, S. Natural language processing: State of the art, current trends and challenges. Multimed. Tools Appl. 2022, 82, 3713–3744. [Google Scholar] [CrossRef] [PubMed]
  15. Stine, R.A. Sentiment Analysis. Annu. Rev. Stat. Its Appl. 2019, 6, 287–308. [Google Scholar] [CrossRef]
  16. Dang, N.C.; Moreno-García, M.N.; la Prieta, F.D. Sentiment Analysis Based on Deep Learning: A Comparative Study. Electronics 2020, 9, 483. [Google Scholar] [CrossRef]
  17. Yasen, M.; Tedmori, S. Movies Reviews Sentiment Analysis and Classification. In Proceedings of the 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), Amman, Jordan, 9–11 April 2019. [Google Scholar] [CrossRef]
  18. Oueslati, O.; Cambria, E.; HajHmida, M.B.; Ounelli, H. A review of sentiment analysis research in Arabic language. Future Gener. Comput. Syst. 2020, 112, 408–430. [Google Scholar] [CrossRef]
  19. Boudad, N.; Faizi, R.; Thami, R.O.H.; Chiheb, R. Sentiment analysis in Arabic: A review of the literature. Ain Shams Eng. J. 2018, 9, 2479–2490. [Google Scholar] [CrossRef]
  20. Boulesnane, A.; Meshoul, S.; Aouissi, K. Influenza-like Illness Detection from Arabic Facebook Posts Based on Sentiment Analysis and 1D Convolutional Neural Network. Mathematics 2022, 10, 4089. [Google Scholar] [CrossRef]
  21. Darwish, K. Arabic Information Retrieval. Found. Trends Inf. Retr. 2014, 7, 239–342. [Google Scholar] [CrossRef]
  22. Al-Wer, E.; Jong, R. Dialects of Arabic. In The Handbook of Dialectology; Wiley-Blackwell: Hoboken, NJ, USA, 2017. [Google Scholar] [CrossRef]
  23. Alharbi, B.; Alamro, H.; Alshehri, M.; Khayyat, Z.; Kalkatawi, M.; Jaber, I.I.; Zhang, X. ASAD: A Twitter-based Benchmark Arabic Sentiment Analysis Dataset. arXiv 2020, arXiv:2011.00578. [Google Scholar] [CrossRef]
  24. Kwaik, K.A.; Chatzikyriakidis, S.; Dobnik, S.; Saad, M.; Johansson, R. An Arabic Tweets Sentiment Analysis Dataset (ATSAD) using Distant Supervision and Self Training. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, Marseille, France, 12 May 2020; pp. 1–8. [Google Scholar]
  25. Mdhaffar, S.; Bougares, F.; Estève, Y.; Hadrich-Belguith, L. Sentiment Analysis of Tunisian Dialects: Linguistic Ressources and Experiments. In Proceedings of the Third Arabic Natural Language Processing Workshop (WANLP), Valence, Spain, 3 April 2017; pp. 55–61. [Google Scholar] [CrossRef]
  26. Rahab, H.; Zitouni, A.; Djoudi, M. SIAAC: Sentiment Polarity Identification on Arabic Algerian Newspaper Comments. In Applied Computational Intelligence and Mathematical Methods; Springer International Publishing: Cham, Switzerland, 2017; pp. 139–149. [Google Scholar] [CrossRef]
  27. Ziani, A.; Azizi, N.; Zenakhra, D.; Cheriguene, S.; Aldwairi, M. Combining RSS-SVM with genetic algorithm for Arabic opinions analysis. Int. J. Intell. Syst. Technol. Appl. 2019, 18, 152. [Google Scholar] [CrossRef]
  28. Mataoui, M.; Zelmati, O.; Boumechache, M. A proposed lexicon-based sentiment analysis approach for the vernacular Algerian Arabic. Res. Comput. Sci. 2016, 110, 55–70. [Google Scholar] [CrossRef]
  29. Moudjari, L.; Akli-Astouati, K.; Benamara, F. An Algerian Corpus and an Annotation Platform for Opinion and Emotion Analysis. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 19 May 2020; pp. 1202–1210. [Google Scholar]
  30. Guellil, I.; Adeel, A.; Azouaou, F.; Hussain, A. SentiALG: Automated Corpus Annotation for Algerian Sentiment Analysis. In Advances in Brain Inspired Cognitive Systems; Springer International Publishing: Cham, Switzerland, 2018; pp. 557–567. [Google Scholar] [CrossRef]
  31. Ahmed, H.; Traore, I.; Saad, S. Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques. In Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2017; pp. 127–138. [Google Scholar] [CrossRef]
  32. Symeonidis, S.; Effrosynidis, D.; Arampatzis, A. A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis. Expert Syst. Appl. 2018, 110, 298–310. [Google Scholar] [CrossRef]
  33. Grandini, M.; Bagli, E.; Visani, G. Metrics for Multi-Class Classification: An Overview. arXiv 2020, arXiv:2008.05756. [Google Scholar] [CrossRef]
  34. Safaya, A.; Abdullatif, M.; Yuret, D. KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media. arXiv 2020, arXiv:2007.13184. [Google Scholar] [CrossRef]
  35. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144. [Google Scholar] [CrossRef]
Figure 1. Workflow diagram of our work.
Figure 2. F1-score and the accuracy of the validation set during training: (a) F1-score; (b) accuracy.
Figure 3. Confusion matrix on the testing set.
Figure 4. Example inference on a positive sentence: (a) Prediction results of example 1; (b) Prediction results of example 2.
Figure 5. Example inference on a neutral sentence: (a) Prediction results of neutral comment 1; (b) Prediction results of neutral comment 2; (c) Prediction results of neutral comment 3; (d) Prediction results of neutral comment 4.
Figure 6. Example inference on a negative sentence: (a) Prediction results of negative comment 1; (b) Prediction results of negative comment 2; (c) Prediction results of negative comment 3.
Figure 7. Example inference on a challenging sentence: (a) Prediction results of comment 1; (b) Prediction results of comment 2.
Table 1. Summary of related Arabic sentiment analysis datasets.
Dataset | Size | Classes | Annotation Approach | Dialect
ASAD [23] | 100,000 | Negative, Neutral and Positive | Manual | Multiple
ATSAD [24] | 36,000 | Negative and Positive | From emojis | Multiple
TSAC [25] | 10,000 | Negative and Positive | Manual | Tunisian
SIAAC [26] | 92 | Negative and Positive | Manual | Algerian
[27] | 1000 | / | Manual | Algerian
[28] | 5039 | Negative and Positive | Manual | Algerian
TWIFIL [29] | 9000 | Negative, Neutral and Positive | Manual | Algerian
[20] | 21,885 | Negative-related, Unrelated and Positive | Manual | Algerian
SentiALG [30] | 8000 | Negative and Positive | From sentiment lexicons | Algerian
Our dataset | 45,000 | Negative, Neutral and Positive | Manual | Algerian
Table 2. Statistics of the dataset with different splits and classes.
Class | Training No. (%) | Validation No. (%) | Test No. (%) | All No. (%)
Negative | 11,175 (35.5) | 1608 (35.7) | 3174 (35.3) | 15,957 (35.5)
Neutral | 7040 (22.3) | 1030 (22.9) | 2023 (22.5) | 10,093 (22.4)
Positive | 13,285 (42.2) | 1862 (41.4) | 3803 (42.2) | 18,950 (42.1)
Total | 31,500 (100) | 4500 (100) | 9000 (100) | 45,000 (100)
Table 3. The number of samples by type of letters used.
Letter Type | Total Samples | Percentage (%)
Pure Arabic letters | 38,108 | 84.684
Pure Latin letters | 5699 | 12.664
Mixed | 1193 | 2.652
Total | 45,000 | 100.0
Table 4. Top most common unigrams by class.
Count | Negative Class | Count | Neutral Class | Count | Positive Class | Count | Entire Dataset
3469 | الله (God) | 1506 | الله (God) | 8843 | الله (God) | 13,818 | الله (God)
980 | ربي (My God) | 505 | ربي (My God) | 4264 | ربي (My God) | 5749 | ربي (My God)
962 | الجزائر (Algeria) | 361 | الجزائر (Algeria) | 1024 | الجزائر (Algeria) | 2347 | الجزائر (Algeria)
945 | الشعب (The people) | 209 | خويا (My brother) | 789 | يبارك (Bless) | 1294 | الشعب (The people)
455 | فرنسا (France) | 172 | الشعب (The people) | 751 | شكرا (Thank you) | 973 | خويا (My brother)
353 | الناس (The people) | 169 | اللهم (Oh God) | 695 | خويا (My brother) | 897 | اللهم (Oh God)
346 | الوكيل (The agent) | 150 | استاذ (Professor) | 676 | يارب (Oh, Lord) | 874 | يارب (Oh, Lord)
314 | تبون (Tebboune) | 124 | سلام (Peace) | 615 | اللهم (Oh God) | 871 | شكرا (Thank you)
301 | الجزائري (The Algerian) | 111 | محمد (Mohammed) | 591 | يحفظك (God keep you safe) | 836 | يبارك (Bless)
299 | البلاد (The country) | 109 | فضلك (Please) | 583 | الصحة (Health) | 656 | الصحة (Health)
Table 5. Top most common bigrams by class.
Negative Class | Neutral Class | Positive Class | Entire Dataset
ونعم وكيل (The best disposer of affairs) | شاء الله (If God wants) | شاء الله (God willing) | حمد لله (Thanks God)
شعب جزائري (Algerian people) | حمد لله (Thanks God) | حمد لله (Thanks God) | شاء الله (God willing)
حسبنا الله (God suffices us) | شعب جزائري (Algerian people) | تحيا جزائر (Long live Algeria) | يعطيك صحة (God gives you health)
مزبلة تاريخ (History dustbin) | يارب عالمين (Lord of the Worlds) | بارك الله (God bless) | ونعم وكيل (The best disposer of affairs)
جزائر جديدة (New Algeria) | تحيا جزائر (Long live Algeria) | يارب عالمين (Lord of the Worlds) | شعب جزائري (Algerian people)
حمد لله (Thanks God) | لغة عربية (Arabic language) | ربي يحفظك (God protect you) | تحيا جزائر (Long live Algeria)
وكيل فيكم (May God entrust you) | نجيب باك (I will get the baccalaureate) | يعطيك صحة (God gives you health) | يارب عالمين (Lord of the Worlds)
Table 6. Different model weights and architectures along with training time and evaluation metrics.
Model Version | No. of Parameters | Training Time | F1-Score | Accuracy
LSTM | ~4 M | 3 min | 0.7399 | 0.7445
Bi-LSTM | ~4.3 M | 6 min 35 s | 0.7380 | 0.7437
BERT Base | ~109.5 M | 33 min 20 s | 0.6979 | 0.7500
BERT Large | ~335.1 M | 1 h 50 min | 0.6976 | 0.7484
BERT Arabic Mini | ~11.6 M | 2 min 40 s | 0.7057 | 0.7527
BERT Arabic Medium | ~42.1 M | 11 min 25 s | 0.7521 | 0.7860
BERT Arabic Base | ~110.6 M | 34 min 19 s | 0.7688 | 0.8002
BERT Arabic Large | ~336.7 M | 1 h 53 min | 0.7838 | 0.8174
Table 7. Training time and evaluation metrics when adjusting the sequence length parameter.
Sequence Length | Training Time | F1-Score | Accuracy
140 | 1 h 40 min | 0.7744 | 0.8056
160 | 1 h 46 min | 0.7809 | 0.8114
180 | 1 h 53 min | 0.7838 | 0.8174
200 | 2 h | 0.7829 | 0.8143
220 | 2 h 8 min | 0.7799 | 0.8084
Table 9. Evaluation metrics when using the best combinations of preprocessing tasks.
Task IDs | F1-Score | Accuracy
1, 2, 3, 5, 7 | 0.7804 | 0.8100
2, 3, 5 | 0.7803 | 0.8114
2, 3, 5, 11 | 0.7838 | 0.8174
1, 2, 3, 5, 11 | 0.7816 | 0.8120
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
