Article

Applying a Character-Level Model to a Short Arabic Dialect Sentence: A Saudi Dialect as a Case Study

College of Computer and Information Systems, Information System Department, Umm Al-Qura University, Makkah 24382, Saudi Arabia
Appl. Sci. 2022, 12(23), 12435; https://doi.org/10.3390/app122312435
Submission received: 14 October 2022 / Revised: 28 November 2022 / Accepted: 1 December 2022 / Published: 5 December 2022
(This article belongs to the Special Issue Recent Trends in Natural Language Processing and Its Applications)

Abstract

Arabic dialect identification (ADI) has recently drawn considerable interest among researchers in the language recognition and natural language processing fields. This study investigated the use of a character-level model, which is effectively unrestricted in its vocabulary, to identify fine-grained Arabic dialects in the form of short written text. The Saudi dialects, particularly the four main dialects spoken across the country, were considered in this study. The proposed ADI approach consists of five main phases, namely dialect data collection, data preprocessing and labelling, character-based feature extraction, deep learning/classical machine learning character-based modelling, and model performance evaluation. Several classical machine learning methods, including logistic regression, stochastic gradient descent, variations of the naive Bayes models, and support vector classification, were applied to the dataset. For deep learning, a character-level convolutional neural network (CNN) was adapted with a bidirectional long short-term memory approach. The collected data were tested under various classification tasks, including two-, three- and four-way ADI tasks. The results revealed that the classical machine learning algorithms outperformed the CNN approach. Moreover, the use of term frequency–inverse document frequency features, combined with character n-grams ranging from unigrams to four-grams, achieved the best performance among the tested parameters.

1. Introduction

The Arabic language is spoken by more than 400 million people in 22 countries, including the Kingdom of Saudi Arabia [1], and is the fourth most frequently used language on the World Wide Web [1]. An estimated 58% of the Saudi Arabian population, approximately 19.84 million users [2], are reported to have access to the Internet. Generally, the Arabic language is classified into three main types: classical Arabic, modern standard Arabic (MSA) and Arabic dialect (AD). ADs are defined as the spoken or written variations of the Arabic language used by different Arab regions. Recently, it was noticed that ADs are more frequently used for informal written communication on the Web, owing to the wide range of available social media applications.
Language identification, and more recently dialect identification and discrimination, has become a compelling task in language recognition and natural language processing (NLP). It is the task of automatically classifying the language vocabulary used by a specific community into the geographical region to which the native speaker belongs [3]. Dialect identification requires a more fine-grained level of analysis and is the most challenging language identification task [4].
The knowledge obtained from dialect identification is nevertheless very helpful for many applications, such as document retrieval, where documents are classified based on their dialect and according to user preferences [5]; complementing the language models of existing recognition systems; and building natural language generation systems using the generated dialectal mappings [6].
Arabic dialect identification (ADI) has been receiving increasingly significant attention in recent years. Early works on ADI focused on either distinguishing the dialects from MSA or among dialect countries [5,7].
According to a recent review [8], fine-grained dialects have not been thoroughly investigated in the ADI literature. Recently, Salameh et al. [9] created a fine-grained dialect dataset covering 25 cities from a number of different Arab countries. Their results were one of the main motivations for this study: to investigate the problem of fine-grained Arabic dialect identification restricted to dialects from very close regions within one country, using short sentences of only a few words and without any notion of words.
Character-level machine learning models, whether classical or deep, have in fact shown better applicability in the field of natural language processing (NLP) than word-level models [10,11]. This is because the word-level model has a number of shortcomings. First, it represents each word as a separate token, so words that share a common root but differ in a prefix or suffix are treated as distinct words, which makes the word-level model statistically inefficient.
Second, the vocabulary of a word-level model is fixed by the training corpus, and when the model is tested on unseen words, it usually fails to handle them. It therefore cannot handle small changes in words, such as the small character differences between fine-grained dialects.
Character-level models, on the other hand, are effectively unrestricted in their vocabulary. A number of researchers have found that character-level models can overcome these word-level issues in text classification problems, as long as the text is represented as a sequence of one-hot vectors, and without changing the machine learning models themselves. Moreover, character-level models minimise the preprocessing steps required compared to word-level models, which is a clear advantage for a language as challenging as Arabic.
Based on this intuition, the character-level model was investigated in this study to solve the ADI problem, more precisely for fine-grained dialects of regions within the same country. The Saudi dialects were chosen as the case study, as each dialect has unique phrases or words that can be highly informative. Because the order and meaning of these phrases or words can be disregarded, the CNN approach, which has been successfully applied to computer vision tasks, suits this situation well.
The Saudi dialects share the same Arabic characters as other dialects spoken in nearby geographical regions. They differ from MSA at all levels of linguistic representation and, unlike MSA, do not follow any grammar rules.
However, only two studies were found (i.e., [12,13]) that focus specifically on building Saudi dialect corpora. Other studies consider Saudi dialects as part of the Gulf dialects and classify them against MSA or other Arabic regional dialects, such as the Maghrebi, Iraqi, Egyptian, and Levantine dialects.
In this study, the main aim was to investigate the use of a character-level model for solving the problem of automatically identifying Saudi dialects, considering a number of supervised machine learning approaches.
Therefore, the main objectives of this study are as follows:
1. To collect a fine-grained Saudi dialect corpus covering the four main Saudi regional dialects and consisting of short dialect sentences.
2. To train and test several classical machine learning models on character-level input features, without any notion of words.
3. To investigate character-level deep learning models for the ADI problem.
The rest of the paper is organised as follows. Section 2 explains the related works on ADI. Section 3 describes our proposed Saudi dialect identification approach. Section 4 discusses our experimental results. Finally, Section 5 presents the concluding remarks and suggests future research lines.

2. Related Work

ADI has been receiving significant attention in recent years. The methodologies used to tackle this intricate problem can be divided into four main categories (see [14] for a survey of the literature): first, nonautomated manual methods that depend on lexicons and linguistic rules; second, language models that estimate the probability of different linguistic units belonging to a particular dialect; third, classical machine learning models; and fourth, deep learning models.
As this study focuses on solving the ADI problem using classical machine learning and deep learning approaches based on character-level input features, this section will only review the studies that covered these approaches.
More recently, a number of works on ADI have involved the use of machine learning (ML) approaches with feature engineering, where the performance of different models using different features is assessed and compared [14].
Sadat et al. [15] compiled a dataset from social media outlets encompassing 18 different dialects, in an attempt to provide a framework for the multiclassification of ADI tasks. They implemented a Markov language model and a naïve Bayes (NB) classifier using unigram, bigram, and trigram character-level features. The best performance was recorded for the character-level bigram NB, with an F1 score of 80%.
Salameh et al. [9] performed a fine-grained ADI task covering 25 Arabic dialects plus MSA. They used two datasets: the first consisted of 2000 sentences translated into the dialects of 25 cities plus MSA (Corpus-26); the second added 10,000 sentences translated into the dialects of five cities plus MSA (Corpus-6). A number of different feature combinations were used to train the models, including word n-grams, character n-grams, and language model probability scores. Two machine learning algorithms were applied, a linear support vector machine (SVM) and multinomial naïve Bayes (MNB), with the latter producing the best-performing models. The best-performing feature set combined character-level uni-, bi-, and trigrams, word unigrams, and the probability score of a five-gram character language model.
A study by Adouane et al. [16] used an SVM to distinguish between Arabicised Berber and seven ADs at the country level (i.e., Algerian, Egyptian, Gulf, Levantine, Mesopotamian, Moroccan, and Tunisian) plus MSA. The dataset was a manually annotated corpus of blogs and newspapers that they had collected. The best feature set combined character-level five-grams and six-grams, and a lexicon that they constructed, weighted using the term frequency–inverse document frequency (TF-IDF). The SVM using the features mentioned resulted in an F1 score of 92.94%.
Malmasi et al. [17] put forward an ADI task to distinguish between the transcriptions of the conversational speech of four Arabic dialects (i.e., Egyptian, Gulf, Levantine, and North African) and MSA. Adouane et al. [18], in an attempt to solve the ADI task posed by [17], also used a linear SVM. The best performance was achieved using character-level five-gram and six-gram features (F1 score = 49.5%). Using the same data as in [17], Eldesouki [19] reported the best-performing model to be an SVM with character-level bi-, tri-, four-, and five-grams weighted using TF-IDF (accuracy = 51.36%). The second best-performing model was a logistic regression with the same features (accuracy = 50.82%).
Another study by Malmasi et al. [5] conducted a six-way classification ADI task using a linear SVM. They used the dataset compiled by [20] containing a collection of 2000 parallel sentences in five ADs plus MSA for training the model. The results showed that the character n-grams were the best features compared to other features such as word n-grams. The best performance was achieved using a combination of character uni-, bi-, and trigrams with an accuracy of 66.48%.
In the ADI task stated in [17], to discriminate between four Arabic dialects and MSA, deep learning models showed encouraging results compared to the ML techniques that were mentioned earlier for the same task [18,19]. For example, Guggilla [21] used an adaptation of a CNN that included four layers: input, convolution, max-pooling and a fully connected softmax layer. He used randomly generated embeddings as features that kept updating during training. The model achieved an F1 score of 43.29%. Belinkov and Glass [22] employed a character-level CNN with seven layers: embedding, dropout, multiple parallel convolutions, max-pooling, fully connected, and softmax layers. The CNN using character embeddings as features reached an F1 score of 48.34%, outperforming the model in [21].
For the same ADI task mentioned earlier in [23] with data provided by [24], Ali [25] employed a character-level CNN combined with dialect embedding vectors and a representation extracted from linguistic features. He experimented with three CNN architectures that differed in the input layer before the convolution layer. The first CNN used a one-hot character representation for the input layer. The second used an embedding layer before the convolution layer and the third used a GRU recurrent layer before the convolution layer. The three architectures scored accuracies of 57.11%, 56.97% and 57.59%, respectively.
All of the studies mentioned above mainly involved the problem of distinguishing the Arabic dialects spoken in different countries and MSA. However, there has been an increasing level of interest in studying dialects within the same country in the past two years. Differentiating the dialects of different provinces of the same country is an even more complex task, as capturing linguistic differences becomes more intricate and convoluted.
In an attempt to explore the problem of ADI on a province level, Abdul-Mageed et al. [26] recently shared a Twitter-based dataset covering a total of 100 provinces from 21 Arab countries.
Using the same data as in [26], Nayel et al. [27] built an ensemble of five models to classify province-level dialects: complement naïve Bayes (CNB), decision tree (DT), logistic regression (LR), random forest (RF), and support vector machine (SVM). In addition, they used TF-IDF with unigram features to train the system. The SVM outperformed the other models on the training data, with an F1 score of 4.73%. The final ensemble classifier reached an accuracy of 4.8% and an F1 score of 4.55%, outperforming transformer-based techniques used for the same ADI task [26,28].
In summary, solving the problem of ADI for province-level dialects is still in its early stages, and more work needs to be performed. As seen for country-level ADI, models utilising character n-gram features showed higher accuracies than those using word n-gram features. Therefore, character n-gram features were chosen in this study to be investigated further for solving the ADI problem on province-level dialects.

3. Saudi Dialect Identification Approach

The main task of automatic dialect identification is to build a model that can predict in which dialect a term or word w is written. This process requires a more fine-grained level of identification, as it is the most challenging language identification task [4].
The implemented ADI approach, as shown in Figure 1, consists of five main phases: dialect data collection, data preprocessing and labelling, character-based feature extraction, CNN/classical machine learning character-based modelling, and model performance evaluation. The following subsections explain these phases in more detail.

3.1. Saudi Dialect Data Collection Phase

Previous studies have shown that isolated words and individual phonemes can be successfully used for dialect identification [29,30]. Accordingly, a dialect corpus that mostly consisted of words or very short sentences was built for the purpose of this study.
In fact, the Saudi regional dialect is divided into four main categories [31]; these are:
  • Hijazi: The dialect spoken by native speakers in the west of Saudi Arabia, which includes the Makkah and Al-Madinah regions.
  • Najdi: The dialect spoken by native speakers in central Saudi Arabia, which includes the Riyadh and Al-Qasim regions.
  • Janobi: The dialect spoken by native speakers in the south of Saudi Arabia, which includes the Aseer region, Najran city and Jazan city.
  • Hasawi: The dialect spoken by native speakers in the east of Saudi Arabia, which includes the Al-Hasa region and Al-Dammam city.
Therefore, the four Saudi regional dialects were collected from Arabic social media data, such as blogs, discussion forums, and reader commentaries, given that the language of such social media is typically dialectal. Data were also collected from Twitter based on regional dialect hashtags over the period from January 2020 to May 2020, including #اللهجةـالحجازية, #اللهجةـالنجدية, #اللهجةـالجنوبية, and #اللهجةـالحساوية.
Table 1 shows the statistics for the collected dataset. As displayed, there was a total of 3768 dialect sentences.

3.2. Data Preprocessing and Labelling

After the data were harvested, some data preprocessing steps were applied, including the two main steps, data cleaning and normalisation. For data cleaning, the decision was made to minimise its application to the data as much as possible, due to the fact that applying text-cleaning techniques to such a fine-grained dialect language might affect the contextual meaning.
However, punctuation marks, extra spaces, and all diacritics and elongations were deleted. Some duplicate words were found in more than one dialect language. Therefore, to validate these data, a human annotator was asked to check these duplicates and to validate the collected data in general.
For the normalisation step, the following rules were applied to normalise the Arabic letters (a code sketch follows the list):
  • أ, إ, and آ were replaced with ا (here the different forms of the letter alif were replaced with the standard alif form);
  • ئ was replaced with ا, (here the letter ya with hamza above was replaced with the standard alif form);
  • ى was replaced with ي, (here the letter ya was replaced with the standard ya form);
  • ة was replaced with ه, (here the letter ta marbuta was replaced with the letter ha);
  • ؤ was replaced with و, (here the letter waw with hamza above was replaced with the standard waw form);
  • كـ was replaced with ك, (here the letter kaf with the initial shape was replaced with the isolated kaf shape);
  • Arabic stop words were removed, such as من, في, على, and الى.
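To make the cleaning and normalisation steps concrete, the following is a minimal Python sketch of the rules above; the regular expressions, the Unicode ranges, and the stop-word list are illustrative assumptions rather than the exact code used in this study.

```python
import re

# Illustrative subset of the Arabic stop words removed above.
STOP_WORDS = {"من", "في", "على", "الى"}

def normalise(text: str) -> str:
    text = re.sub(r"[\u064B-\u0652]", "", text)  # strip diacritics (tashkeel)
    text = text.replace("\u0640", "")            # remove elongation (tatweel), e.g., كـ -> ك
    text = re.sub("[أإآ]", "ا", text)             # unify alif variants with the standard alif
    text = text.replace("ئ", "ا")                 # ya with hamza above -> standard alif
    text = text.replace("ى", "ي")                 # alif maqsura -> standard ya
    text = text.replace("ة", "ه")                 # ta marbuta -> ha
    text = text.replace("ؤ", "و")                 # waw with hamza above -> standard waw
    text = re.sub(r"[^\w\s]", " ", text)          # delete punctuation marks
    text = re.sub(r"\s+", " ", text).strip()      # collapse extra spaces
    return " ".join(t for t in text.split() if t not in STOP_WORDS)
```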
The last stage of this phase was the labelling process. Four annotators were asked to label and validate the collected data. Annotators were required to have lived in at least two regions, to ensure that they could distinguish between at least one pair of dialects. Each annotator then validated a pair of dialects, distinguishing between overlapping texts and highlighting those whose dialect of origin they could not determine. For these words, a second annotator was consulted, and if the first two could not agree, a third annotator was asked to determine the dialect of origin. The three annotators were unable to agree on the dialect of origin for a small number of words, so the decision was made to delete these words. In total, six annotators were assigned to finish the whole process of labelling and validating the collected data.

3.3. Character-Based Feature Extraction

In this phase, the main aim was to convert each sentence in our dialect dataset into character-based features. As two different techniques were used for building the predictive models, each had a different feature extraction method.
For the deep learning model, each sentence was represented as a sequence of character vectors of length L, where L is the maximum sentence length in our dataset (55). First, each character in the alphabet set was encoded as a one-hot vector of size 1 × c, where c is the total number of characters in the alphabet set. Then, the sequence of characters in each sentence was transformed into a sequence of such vectors.
Two different alphabet sets were used in the experiments: the first set consisted of 30 characters (28 Arabic alphabet characters plus alhamza and the space), while the second set consisted of 37 characters, the same 30 in set one plus the 7 diacritics in the Arabic language, which are Tashdid, Fatha, Damma, Tanwin Damm, Kasra, Tanwin Kasr, and Sukun.
The experiment was run for each character set: each sentence and its corresponding dialect label were encoded into a sequence of vectors over the fixed character set, yielding a sequence of 55 vectors as the input layer.
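As an illustration, a one-hot encoder for the 30-character alphabet set could look as follows; the character ordering and the handling of out-of-alphabet characters are assumptions.

```python
import numpy as np

# 28 Arabic letters plus alhamza and the space (assumed ordering).
ALPHABET = list("ابتثجحخدذرزسشصضطظعغفقكلمنهوي") + ["ء", " "]
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
L = 55  # maximum sentence length in the dataset

def encode_sentence(sentence: str) -> np.ndarray:
    """Encode a sentence as an L x c matrix of one-hot character vectors."""
    matrix = np.zeros((L, len(ALPHABET)), dtype=np.float32)
    for pos, ch in enumerate(sentence[:L]):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:  # characters outside the alphabet stay as zero vectors
            matrix[pos, idx] = 1.0
    return matrix
```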
On the other hand, for classical machine learning, TF–IDF was used to construct the character-based feature set from the dataset.
TF–IDF combines two scores: the term frequency (TF), which calculates the frequency of character n-grams in each sentence, and the inverse document frequency (IDF), which reduces the weights of character n-grams that appear very frequently and increases the weights of character n-grams that appear very rarely. TF–IDF is therefore defined as follows:
TF–IDF(g, s, D) = TF(g, s) × IDF(g, D)
Here, TF(g, s) calculates the number of times the n-gram g appears in sentence s, and the IDF is defined as shown in Equation (2), where |D| is the total number of sentences in the dataset D and df(g) is the number of sentences in D in which the character n-gram g appears:
IDF(g, D) = log(|D| / df(g))
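With scikit-learn, this character-level TF–IDF extraction can be sketched as follows; the variable `sentences` is assumed to hold the preprocessed dialect sentences, and the parameter values shown are among those tuned later by grid search.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Character n-grams from unigrams to four-grams, weighted by TF-IDF.
vectoriser = TfidfVectorizer(analyzer="char", ngram_range=(1, 4),
                             max_df=0.75, min_df=1)
X = vectoriser.fit_transform(sentences)  # sparse matrix of character n-gram TF-IDF weights
```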

3.4. Deep Learning Character-Based Model

Our classification problem is a multiclass problem, where each dialect represents a separate class. Therefore, given s and its label l, the task is to predict the Saudi dialect language that s is written in, using only its character sequence.
The CNN architecture described in [10] was adapted. The architecture consists of two main steps, as shown in Figure 2: in the first, the CNN layers act as feature extractors; in the second, the convolution output feeds directly into a long short-term memory (LSTM) layer to capture long-term dependencies.
The input to the convolution layer was the sequence of vectors produced in the previous step. In the convolution layer, two 1D convolutions were applied in parallel to the character input layer to map the input sequence x into a hidden sequence h, using two different filter counts (64 and 100), each with a kernel size of 3 and a pool size of 2.
After each convolution operation, a nonlinear rectified linear unit (ReLU) activation [32] was applied. Then, a temporal max-pooling layer was applied, which selects the most strongly activated nodes in the sequence. Next, a dense layer of size 128 with a ReLU activation function was applied. The two parallel 1D convolution outputs were concatenated into one sequence and passed to a dropout layer, in which 40% of the input units were dropped to reduce overfitting. Afterwards, a TimeDistributed layer was applied to make the sequence suitable for the bidirectional LSTM (BiLSTM) layer.
Then, a BiLSTM was used, producing two sequences in the forward and backward directions with hidden sizes of 128 each. The final outputs of the BiLSTM layer in both directions were concatenated to yield 256-dimensional hidden units, and a fully connected (dense) layer of size 128 with a ReLU activation function was applied. To reduce overfitting, two dropout layers with a drop rate of 0.4 were applied before and after this dense layer. Finally, the resulting vectors were passed to the softmax layer to produce the final probability distribution over the k dialect classes.
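The following Keras sketch reproduces the stack described above under some assumptions: the optimiser and loss are not specified in the text, the alphabet size is taken as 37, and the TimeDistributed wrapper is omitted because the concatenated convolution output is already a sequence.

```python
from tensorflow.keras import layers, models

L, c, k = 55, 37, 4  # max sentence length, alphabet size, number of dialect classes

inputs = layers.Input(shape=(L, c))
branches = []
for n_filters in (64, 100):  # two parallel 1D convolutions
    x = layers.Conv1D(n_filters, kernel_size=3, activation="relu")(inputs)
    x = layers.MaxPooling1D(pool_size=2)(x)  # temporal max-pooling
    x = layers.Dense(128, activation="relu")(x)
    branches.append(x)

x = layers.Concatenate()(branches)             # concatenate the two branches
x = layers.Dropout(0.4)(x)
x = layers.Bidirectional(layers.LSTM(128))(x)  # 2 x 128 -> 256-dimensional output
x = layers.Dropout(0.4)(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.4)(x)
outputs = layers.Dense(k, activation="softmax")(x)  # distribution over dialect classes

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```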

3.5. Classical Machine Learning Character-Based Model

As the main aim of this project was to predict in which dialect a word or term is written, a variety of popular and powerful supervised classification algorithms were applied to the collected dataset, such as logistic regression (LR), stochastic gradient descent classifier (SGDC), variations of the naive Bayes (NB) models, and support vector classification (SVC).
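A sketch of this classifier set, instantiated with scikit-learn (default hyperparameters are shown here; the actual values were selected by the grid search described in Section 4):

```python
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import SVC, LinearSVC, NuSVC

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "SGDC": SGDClassifier(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "SVC": SVC(),
    "LinearSVC": LinearSVC(),
    "NuSVC": NuSVC(),
}
```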

3.6. Model Performance Evaluation

The performance of the classical machine learning and CNN algorithms was evaluated using the following widely used metrics: accuracy, recall, precision, and F-measure. To calculate them, a confusion matrix must be built containing the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts.
The accuracy is the ratio of correctly predicted dialects to all predictions, and it can be calculated as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The recall is the ratio of the correctly predicted dialects to the total number of dialects in the actual class, and it can be calculated as follows:
Recall = TP / (TP + FN)
The precision is the ratio of correctly predicted dialects to the total number of predicted dialects. It is calculated as follows:
Precision = TP / (TP + FP)
The F-measure represents the weighted harmonic mean of precision and recall, and it is calculated as follows:
F-measure = (2 × Precision × Recall) / (Precision + Recall)
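For the multiclass setting, these metrics can be computed per class from the confusion matrix and then averaged; the sketch below assumes macro-averaging, which the text does not state explicitly.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def evaluate(y_true, y_pred):
    """Macro-averaged precision, recall, and F-measure, plus overall accuracy."""
    cm = confusion_matrix(y_true, y_pred)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp  # predicted as the class but belonging to another
    fn = cm.sum(axis=1) - tp  # belonging to the class but predicted as another
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return precision.mean(), recall.mean(), f_measure.mean(), accuracy
```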

4. Experiments and Results

All the experiments were run on an Apple Macintosh computer with a 2 GHz Quad-Core Intel Core i5 and 16 GB of memory, and the implementation was carried out in Python (version 3.7.6). For the classical machine learning classification algorithms, the Natural Language Toolkit (NLTK) [33] and the scikit-learn Python library [34] were used. For the CNN approach, the Keras library with TensorFlow as the back end was used [35].
Two different experiments were run: the first as a multiway ADI problem, including four-way and three-way, and the second as a two-way ADI problem. For both experiments, the considered CNN and classical machine learning algorithms were trained on 80% of our dataset and tested on the rest of the dataset.
For classical machine learning in both experiments, the models were trained using the popular approach of five-fold cross-validation, in which the data were trained on four folds, and the remaining fold was used as the validation set. This procedure was repeated five times, and the average result was recorded. For both experiments, a pipeline was also developed to determine the best parameters using a grid-search approach over the following tested parameters (a code sketch follows the list):
  • Different values for the maximum document frequency threshold (max_df), which is required when calculating TF–IDF: 0.5, 0.75, and 1.0.
  • Different values for the minimum document frequency threshold (min_df): 1, 5, and 10.
  • Different n-gram combinations for the character model, as shown in Table 2.
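A sketch of such a pipeline with five-fold cross-validation, shown here for logistic regression; the variable names `train_sentences` and `train_labels` are placeholders for the 80% training split, and the scoring metric is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char")),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {
    "tfidf__max_df": [0.5, 0.75, 1.0],
    "tfidf__min_df": [1, 5, 10],
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3), (1, 4), (2, 2),
                           (2, 3), (2, 4), (3, 3), (3, 4), (4, 4)],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(train_sentences, train_labels)  # five-fold CV over the training split
print(search.best_params_, search.best_score_)
```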
For the CNN, the approach was tested both with the data normalised by removing tashkeel from the text (Models 2 and 4) and without this normalisation (Models 1 and 3). For training, the addition of two drop layers was also tested (Models 3 and 4), versus no drop layers (Models 1 and 2). Table 3 shows the combinations of these tested parameters.
Table 4, Table 5 and Table 6 represent the results of the considered algorithms for the four-way, three-way, and two-way ADI problems, respectively. The following subsections discuss the results along with their experiments in more detail.

4.1. Results of the Multiway ADI Problem

For the multiway ADI problem, a four-way ADI problem was run including all of the four classes in the collected data, and a three-way ADI problem, which included the Hijazi, Najdi, and Hasawi dialects.
Table 4 shows the results of the four-way ADI problem for all of our considered algorithms, including the CNN approach (Models 1, 2, 3, and 4) and the classical machine learning algorithms.
For the CNN approach, the best result was achieved by Model 1, which reached an accuracy of 35.4%. In this model, the tashkeel was treated as characters (resulting in a 67-character list), and there were no drop layers.
When the tashkeel was removed while keeping the drop layers (Model 4), the performance of the CNN model increased slightly, by 0.8 percentage points, compared with the same configuration retaining the tashkeel (Model 3).
On the other hand, when the tashkeel was included in the character list without the drop layers (Model 1), the performance increased by 1.7 percentage points compared with the same case but with the drop layers (Model 3). The results therefore indicated that this slight improvement might come from the combined effect of two factors, the inclusion of tashkeel in the character list and the absence of drop layers in the CNN architecture, rather than from the tashkeel alone. Thus, the decision was made to normalise the dataset by removing the tashkeel from the Arabic sentences before running the considered classical machine learning algorithms for the three- and two-way ADI problems.
For the classical machine learning algorithms, our grid search clearly indicated that most of them achieved the highest performance using TF-IDF with max_df equal to 0.75, min_df equal to 1, and ngram_range equal to (1, 4). In the four-way ADI problem, all the considered classical machine learning algorithms outperformed the CNN models; the best performance was achieved by LR and NuSVC, each reaching an accuracy of 40.9%.
Table 5 shows the results of the three-way ADI problem. In this problem, the worst-predicted class, Janobi, was deleted, while the other classes were kept to test the performance of the considered algorithms in that case. The best performance was achieved by the CNN approach (Model 6), in which the tashkeel was removed and no drop layer was added, reaching an accuracy of 47.0%, followed closely by the LR algorithm at 46.8%. The classical machine learning algorithms also outperformed the other CNN models in the three-way ADI problem. It should be mentioned that the overall increase in performance for all the considered algorithms in the three-way ADI problem compared with the four-way one was due to the deletion of the worst-predicted class, Janobi.

4.2. Results of the Two-Way ADI Problem

For the two-way ADI problem, only the Hijazi and Hasawi dialects were included in the dataset. These two classes were considered because they originate from the two most distant regions of the Kingdom of Saudi Arabia (Hijazi is the dialect of the west, while Hasawi is the dialect of the east). Therefore, it was assumed that the difference between these dialects might be more easily distinguished by the considered machine learning algorithms than other dialect pairs.
As Model 1 had achieved the best performance among the other CNN models in the multiway ADI problem, it was chosen to carry out our investigation for the two-way ADI problem.
Table 6 shows the results of the CNN approach (Model 1) and the classical machine learning algorithms in terms of precision, recall, F-measure, and accuracy for the two-way ADI problem.
The results revealed that all the considered algorithms increased in performance. Model 1 achieved the best performance, with an accuracy of 64.5% and an F-measure of 63.7%, compared with the classical machine learning algorithms. Among the latter, the MultinomialNB algorithm, whose parameters were determined through a grid search, achieved 63.4% accuracy, followed by SVC, which achieved 61.4%.

5. Discussion

In this study, the main aim was to investigate the use of a character-based model with classical and CNN machine learning algorithms to solve the problem of identifying fine-grained Arabic dialects in the form of short written text and without any notion of words.
In general, the results of all the considered algorithms demonstrated low performance for the four-way task, indicating how difficult the problem is for the existing approach. The same result was reported in the literature for such a very fine-grained ADI problem, particularly when the dialects were from very close regions (i.e., at a province level) because the degree of similarity between them was very high.
However, the results of the four-way ADI problem revealed that the classical machine learning algorithms based on a character model outperformed the CNN approach that was also based on a character model. This outcome indicates a need for further development of the CNN architecture to deal with such fine-grained dialects in a multiway ADI problem. By contrast, the CNN approach outperformed the considered classical machine learning algorithms in the two-way ADI problem.
The results of the classical machine learning algorithms revealed that the best parameters for the TF-IDF were a combination of character n-grams ranging from unigrams to four-grams, with a max_df equal to 0.75 and a min_df equal to 1. Among the considered algorithms, LR achieved the highest performance in the four-way and three-way ADI problems.
The reason why most of the considered algorithms performed poorly in identifying the Janobi class was the high degree of similarity between the Janobi class and the other classes, particularly the Najdi and Hijazi classes. Most of the disagreements between our annotators also occurred in that class. This situation might be due to the fact that many people from the Janobi region have lived or grown up in the Hijaz or Najd regions of Saudi Arabia at some point in their lives, which suggests that the Janobi dialect text found on the Internet has mingled with other Saudi dialects.
Overall, it is not surprising that the accuracy of all the considered algorithms ranged from 33.2% to 40.9% for the four-way problem, and from 43.1% to 47.0% for the three-way identification problem.

6. Conclusions

This study investigated the use of a character-level model to solve the ADI problem applied to short Arabic sentences. The study focused on the Saudi dialects and tested two-, three-, and four-way identification tasks. The adopted approach consisted of five phases: dialect data collection, data preprocessing and labelling, character-based feature extraction, classical machine learning/deep learning character-based modelling, and model performance evaluation. In the first phase, 3768 short dialect texts were collected from the Internet in four main Saudi dialects: Hijazi, Najdi, Janobi, and Hasawi. The MultinomialNB, BernoulliNB, LogisticRegression, SGDClassifier, SVC, LinearSVC, NuSVC, and CNN approaches were then used in the learning phase, and their performance was evaluated and compared. The results showed that the best-performing algorithms in the four-way task were LR and NuSVC, reaching 40.9% accuracy. In the three-way task, the CNN approach (Model 6), in which the tashkeel was removed and no drop layers were added, outperformed the other models, reaching 47.0% accuracy. Moreover, using TF-IDF with a combination of character n-grams ranging from unigrams to four-grams achieved the best performance for the considered classical machine learning algorithms.
In future research on ADI, the plan is to improve the CNN approach further, especially for the multiway problem. Different character-based feature construction techniques could also be considered, such as representing short subword units rather than single characters.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available upon request from the author. The data are not publicly available, as this research is ongoing.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. United Nations Educational, Scientific and Cultural Organization (UNESCO). World Arabic Language Day. 2020. Available online: https://en.unesco.org/commemorations/worldarabiclanguageday (accessed on 13 June 2020).
  2. General Authority for Statistics, Kingdom of Saudi Arabia. Saudi Census. 2020. Available online: https://www.stats.gov.sa/en (accessed on 13 June 2020).
  3. Guellil, I.; Saâdane, H.; Azouaou, F.; Gueni, B.; Nouvel, D. Arabic natural language processing: An overview. J. King Saud Univ.-Comput. Inf. Sci. 2019, 33, 497–507.
  4. Jauhiainen, T.; Lui, M.; Zampieri, M.; Baldwin, T.; Lindén, K. Automatic language identification in texts: A survey. J. Artif. Intell. Res. 2019, 65, 675–782.
  5. Malmasi, S.; Refaee, E.; Dras, M. Arabic dialect identification using a parallel multidialectal corpus. In Proceedings of the Conference of the Pacific Association for Computational Linguistics, Bali, Indonesia, 19–21 May 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 35–53.
  6. Stede, M. Lexical choice criteria in language generation. In Proceedings of the Sixth Conference of the European Chapter of the Association for Computational Linguistics, Utrecht, The Netherlands, 21–23 April 1993.
  7. Darwish, K.; Sajjad, H.; Mubarak, H. Verifiably effective Arabic dialect identification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1465–1468.
  8. Elnagar, A.; Yagi, S.M.; Nassif, A.B.; Shahin, I.; Salloum, S.A. Systematic literature review of dialectal Arabic: Identification and detection. IEEE Access 2021, 9, 31010–31042.
  9. Salameh, M.; Bouamor, H.; Habash, N. Fine-grained Arabic dialect identification. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 21–25 August 2018; pp. 1332–1344.
  10. Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. Adv. Neural Inf. Process. Syst. 2015, 28.
  11. Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics: Doha, Qatar, 2014; pp. 1746–1751.
  12. Al-Twairesh, N.; Al-Matham, R.; Madi, N.; Almugren, N.; Al-Aljmi, A.H.; Alshalan, S.; Alshalan, R.; Alrumayyan, N.; Al-Manea, S.; Bawazeer, S.; et al. SUAR: Towards building a corpus for the Saudi dialect. Procedia Comput. Sci. 2018, 142, 72–82.
  13. Al-Twairesh, N.; Al-Khalifa, H.; Al-Salman, A.; Al-Ohali, Y. AraSenTi-Tweet: A corpus for Arabic sentiment analysis of Saudi tweets. Procedia Comput. Sci. 2017, 117, 63–72.
  14. Althobaiti, M.J. Automatic Arabic dialect identification systems for written texts: A survey. arXiv 2020, arXiv:2009.12622.
  15. Sadat, F.; Kazemi, F.; Farzindar, A. Automatic identification of Arabic dialects in social media. In Proceedings of the First International Workshop on Social Media Retrieval and Analysis, Gold Coast, QLD, Australia, 11 July 2014; Association for Computing Machinery: New York, NY, USA, 2014; pp. 35–40.
  16. Adouane, W.; Semmar, N.; Johansson, R.; Bobicev, V. Automatic detection of Arabicized Berber and Arabic varieties. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan, 11–16 December 2016; pp. 63–72.
  17. Malmasi, S.; Zampieri, M.; Ljubešić, N.; Nakov, P.; Ali, A.; Tiedemann, J. Discriminating between similar languages and Arabic dialect identification: A report on the third DSL shared task. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan, 11–16 December 2016; pp. 1–14.
  18. Adouane, W.; Semmar, N.; Johansson, R. ASIREM participation at the Discriminating Similar Languages shared task 2016. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan, 11–16 December 2016; pp. 163–169.
  19. Eldesouki, M.; Dalvi, F.; Sajjad, H.; Darwish, K. QCRI @ DSL 2016: Spoken Arabic dialect identification using textual features. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan, 11–16 December 2016; pp. 221–226.
  20. Bouamor, H.; Habash, N.; Oflazer, K. A multidialectal parallel corpus of Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, 26–31 May 2014; European Language Resources Association (ELRA): Reykjavik, Iceland, 2014; pp. 1240–1245.
  21. Guggilla, C. Discrimination between similar languages, varieties and dialects using CNN- and LSTM-based deep neural networks. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Osaka, Japan, 12 December 2016; The COLING 2016 Organizing Committee: Osaka, Japan, 2016; pp. 185–194.
  22. Belinkov, Y.; Glass, J.R. A character-level convolutional neural network for distinguishing similar languages and dialects. arXiv 2016, arXiv:1609.07568.
  23. Dinu, L.P.; Ciobanu, A.M.; Zampieri, M.; Malmasi, S. Classifier ensembles for dialect and language variety identification. arXiv 2018, arXiv:1808.04800.
  24. Zampieri, M.; Malmasi, S.; Nakov, P.; Ali, A.; Shon, S.; Glass, J.; Scherrer, Y.; Samardžić, T.; Ljubešić, N.; Tiedemann, J.; et al. Language identification and morphosyntactic tagging: The second VarDial evaluation campaign. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), Santa Fe, NM, USA, 20 August 2018; Association for Computational Linguistics: Santa Fe, NM, USA, 2018; pp. 1–17.
  25. Ali, M. Character level convolutional neural network for Arabic dialect identification. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), Santa Fe, NM, USA, 20 August 2018; Association for Computational Linguistics: Santa Fe, NM, USA, 2018; pp. 122–127.
  26. Abdul-Mageed, M.; Zhang, C.; Elmadany, A.A.; Bouamor, H.; Habash, N. NADI 2021: The second Nuanced Arabic Dialect Identification shared task. arXiv 2021, arXiv:2103.08466.
  27. Nayel, H.; Hassan, A.; Sobhi, M.; El-Sawy, A. Machine learning-based approach for Arabic dialect identification. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19 April 2021; Association for Computational Linguistics: Kyiv, Ukraine, 2021; pp. 287–290.
  28. Wadhawan, A. Dialect identification in nuanced Arabic tweets using Farasa segmentation and AraBERT. arXiv 2021, arXiv:2102.09749.
  29. Yanguas, L.R.; O'Leary, G.C.; Zissman, M.A. Incorporating linguistic knowledge into automatic dialect identification of Spanish. In Proceedings of the Fifth International Conference on Spoken Language Processing, Sydney, NSW, Australia, 30 November–4 December 1998.
  30. Arslan, L.M.; Hansen, J.H. Selective training for hidden Markov models with applications to speech classification. IEEE Trans. Speech Audio Process. 1999, 7, 46–54.
  31. Al-Darsouni, S. A Dictionary of Spoken Dialects in Saudi Arabia: The Vocabulary of Dialects of Tribes and Regions; King Fahad National Library: Riyadh, Saudi Arabia, 2012.
  32. Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, 21–24 June 2010.
  33. Hardeniya, N. NLTK Essentials; Packt Publishing Ltd.: Birmingham, UK, 2015.
  34. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  35. Keras Team. Keras. 2015. Available online: https://github.com/fchollet/keras (accessed on 3 July 2022).
Figure 1. The adaptive ADI approach, including five main stages.
Figure 2. The architecture of the CNN approach.
Table 1. Statistics for the collected dataset.

Dialect Class | Translated to English | Total Number | Percentage
اللهجة الحجازية | Hijazi Dialect | 1003 | 27.4%
اللهجة النجدية | Najdi Dialect | 1007 | 26.7%
اللهجة الجنوبية | Janobi Dialect | 808 | 21.4%
اللهجة الحساوية | Hasawi Dialect | 920 | 24.4%
Total | | 3768 | 100%
Table 2. The different n-gram combinations that were tested in our experiments.

Character n-grams Model | Type of Character n-grams
(1, 1) | Unigrams
(1, 2) | Unigrams + bigrams
(1, 3) | Unigrams + bigrams + trigrams
(1, 4) | Unigrams + bigrams + trigrams + four-grams
(2, 2) | Bigrams
(2, 3) | Bigrams + trigrams
(2, 4) | Bigrams + trigrams + four-grams
(3, 3) | Trigrams
(3, 4) | Trigrams + four-grams
(4, 4) | Four-grams
Table 3. The different tested models for the CNN approach.

ADI Problem | With Tashkeel (#characters = 67) | Without Tashkeel (#characters = 55)
Without drop layers, 4-way | Model 1 | Model 2
Without drop layers, 3-way | Model 5 | Model 6
With drop layers, 4-way | Model 3 | Model 4
With drop layers, 3-way | Model 7 | Model 8
Table 4. Results of the four-way ADI problem.

Classifier | Precision | Recall | F-Measure | Accuracy
Model 1 | 41.7% | 33.9% | 28.2% | 35.4%
Model 2 | 25.0% | 32.2% | 28.0% | 33.2%
Model 3 | 26.1% | 33.5% | 28.7% | 33.7%
Model 4 | 41.5% | 33.3% | 27.5% | 34.5%
MultinomialNB (max_df = 0.75, min_df = 1, ngram_range = (1, 2)) | 34.83% | 34.42% | 34.25% | 35.8%
BernoulliNB (max_df = 0.75, min_df = 1, ngram_range = (1, 2)) | 38.68% | 36.0% | 36.2% | 36.3%
SGDC (max_df = 0.5, min_df = 1, ngram_range = (1, 4)) | 38.7% | 38.8% | 38.7% | 38.9%
SVC (max_df = 0.75, min_df = 1, ngram_range = (1, 4)) | 40.5% | 39.6% | 39.5% | 40.6%
Logistic Regression (max_df = 0.75, min_df = 1, ngram_range = (1, 4)) | 40.6% | 40.1% | 40.2% | 40.9%
LinearSVC (max_df = 0.75, min_df = 1, ngram_range = (1, 4)) | 39.9% | 39.7% | 39.8% | 40.2%
NuSVC (max_df = 0.75, min_df = 1, ngram_range = (1, 4)) | 40.6% | 40.4% | 40.4% | 40.9%
Table 5. Results of the three-way ADI problem (Hijazi, Najdi, and Hasawi).

Classifier | Precision | Recall | F-Measure | Accuracy
Model 5 | 43.0% | 43.7% | 41.1% | 43.1%
Model 6 | 65.4% | 46.1% | 42.4% | 47.0%
Model 7 | 44.78% | 44.0% | 43.7% | 43.8%
Model 8 | 61.7% | 42.1% | 38.1% | 43.2%
MultinomialNB (max_df = 0.75, min_df = 1, ngram_range = (1, 2)) | 44.5% | 44.4% | 43.9% | 44.3%
BernoulliNB (max_df = 0.75, min_df = 1, ngram_range = (1, 1)) | 45.8% | 44.1% | 43.9% | 43.6%
SGDC (max_df = 0.75, min_df = 1, ngram_range = (1, 4)) | 44.1% | 44.3% | 43.8% | 43.9%
SVC (max_df = 0.75, min_df = 1, ngram_range = (1, 3)) | 45.6% | 45.6% | 45.5% | 45.8%
Logistic Regression (max_df = 0.75, min_df = 1, ngram_range = (1, 4)) | 47.1% | 47.1% | 46.7% | 46.8%
LinearSVC (max_df = 0.75, min_df = 1, ngram_range = (1, 4)) | 46.0% | 46.1% | 45.7% | 45.8%
NuSVC (max_df = 0.75, min_df = 1, ngram_range = (1, 4)) | 45.9% | 46.0% | 45.7% | 45.8%
Table 6. Results of the two-way ADI problem (Hijazi and Hasawi).

Classifier | Precision | Recall | F-Measure | Accuracy
Model 1 | 65.6% | 64.1% | 63.4% | 64.5%
MultinomialNB (max_df = 0.1, min_df = 1, ngram_range = (1, 4)) | 63.5% | 62.9% | 62.7% | 63.4%
BernoulliNB (max_df = 0.75, min_df = 5, ngram_range = (1, 1)) | 60.3% | 59.9% | 59.1% | 59.3%
SGDC (max_df = 0.1, min_df = 1, ngram_range = (1, 4)) | 59.9% | 59.9% | 59.9% | 60.1%
SVC (max_df = 0.75, min_df = 1, ngram_range = (1, 3)) | 61.2% | 61.2% | 61.2% | 61.4%
Logistic Regression (max_df = 0.75, min_df = 1, ngram_range = (1, 4)) | 59.1% | 59.0% | 58.9% | 59.3%
LinearSVC (max_df = 0.75, min_df = 1, ngram_range = (1, 4)) | 59.7% | 59.6% | 59.6% | 59.9%
NuSVC (max_df = 0.75, min_df = 1, ngram_range = (1, 4)) | 60.4% | 60.3% | 60.3% | 60.6%