A range of empirical assessments was carried out to determine the efficacy of the SSA-TC-RPE model in classifying Arabic vernaculars. The capability of the proposed SSA-TC-RPE model in categorizing ADs was comprehensively scrutinized.
4.1. Data
The Training for the proposed model was executed using three key datasets. The primary dataset was HARD [
27], consisting of reviews from a variety of booking sites, sorted into five distinct classes. This was followed by employing the BRAD [
26] and LABR [
22] datasets for further training. This study made use of datasets at the review level, incorporating BRAD, HARD, and LABR. Specifically, BRAD’s reviews, sourced from the Goodreads website, were classified into five distinct scales. The class distributions for HARD, BRAD, and LABR are outlined in
Table 2,
Table 3 and
Table 4, respectively. It should be noted that the datasets in this study were used in their original, raw form, which might have implications for the accuracy of the model. Furthermore, preprocessing was performed on all sentences, involving sentence segmentation to break down the reviews into discrete sentences. This procedure also included the elimination of Latin characters, non-Arabic elements, diacritics, hashtags, punctuation, and URLs from the ADs texts. The Arabic dialect texts underwent orthographic normalization to ensure consistency and standardization [
2]. Emoticons within the data were converted into corresponding textual descriptions, and adjustments were made for words that were artificially lengthened. To prevent the risk of the model becoming over-fitted, an early stopping mechanism was implemented, setting the patience parameter at three epochs. For assessing the SSA-TC-RPE model’s effectiveness, which incorporates ITL in classifying Arabic dialect texts, a checkpoint system was utilized to save the most optimal weights of the model. The dataset was divided, with 80% allocated for training and 20% for testing. Additionally, a K-fold cross-validation technique with k = 2 was employed to establish a split between training and testing for the model’s evaluation [
53]. The choice of k = 2 for K-fold cross-validation in our model’s evaluation was primarily driven by the aim to optimize the balance between comprehensive model testing and computational efficiency. Given the extensive nature of our datasets and the computational resources at our disposal, k = 2 allowed us to rigorously validate our model while ensuring that the computational time remained manageable. This approach enabled a thorough assessment of the model’s performance under different data splits, providing valuable insights into its stability and generalizability across varying training and testing configurations.
Examining the HARD, BRAD, and LABR datasets revealed the sentiment distribution within these samples. The HARD dataset contained 409,562 entries, categorized into 5 sentiment types. Allocating 80% of this dataset (327,649 samples) for training and the remaining 20% (81,912 samples) for testing allowed for a comprehensive understanding of sentiment variation. In a similar manner, the BRAD dataset with 510,598 entries was divided, with 80% (408,478 samples) used for training and 20% (101,019 samples) for testing. The smaller LABR dataset, consisting of 63,257 entries, also maintained the same 80–20 division for training (50,606 samples) and testing (12,651 samples). These splits ensured that all five sentiment categories were well-represented in both the training and testing stages, aiding the models in grasping sentiment subtleties and applying this understanding to new data. Biases in text classification models can significantly affect their accuracy. The datasets BRAD, HARD, and LABR were chosen for their relevance in analyzing Arabic dialects through text classification tasks. These datasets encompass reviews from various sources, each offering unique insights into different aspects of Arabic language use, sentiment distribution, and dialectal variation. The BRAD dataset includes book reviews, providing a rich source of dialect Arabic literary content. The HARD dataset comprises hotel reviews, offering a wide range of informal Arabic expressions. Lastly, the LABR dataset contains movie reviews, which present both formal and informal language use. Employing these datasets allows for a comprehensive evaluation of the model’s performance across diverse domains, ensuring its applicability and robustness in real-world Arabic text classification scenarios.
If training data contain biases, they might distort the results. To mitigate this issue and ascertain the optimal data selection for our SSA-TC-RPE text classification model tailored for Arabic dialects, we executed four specific steps:
Ensured that the training dataset included a diverse range of sources, covering various demographic groups, geographic areas, and social environments. This method was aimed at reducing biases, leading to a dataset that was not just comprehensive but also balanced in its representation.
Verified that the sentiment labels within the training dataset were uniformly allocated across all demographic groups and perspectives.
Set established clear guidelines for labeling, instructing human annotators to maintain neutrality and avoid infusing their own biases into the sentiment labels. This strategy helped ensure consistency and minimize the likelihood of bias in the dataset.
Undertook a thorough review of the training data to identify any underlying biases. This involved examining aspects such as demographic imbalances, stereotype reinforcement, and potentially underrepresented groups. Once these biases were detected, corrective actions were taken. These included using methods like data augmentation, increasing the representation of underrepresented groups through oversampling, and implementing various preprocessing techniques.
4.5. Results
A series of experimental tests were conducted using the proposed SSA-TC-RPE model for Arabic dialects. We trained the SSA-TC-RPE system with various configurations, including different numbers of attention heads (AHs) in the MHA sub-layer and varying quantities of encoders to determine the most effective structure. The system also underwent training with different word embedding sizes for each token. This study evaluated the impact of employing two inductive transfer learning approaches–concurrent and alternating–on the system’s performance. The effectiveness of the SSA-TC-RPE system in text classification was measured using an automated accuracy metric. This part of the research presents an evaluation of the SSA-TC-RPE system’s performance in five-polarity classification tasks for Arabic dialects. The outcomes of these empirical tests on the HARD, BRAD, and LABR datasets are presented in
Table 5,
Table 6 and
Table 7. As detailed in
Table 5 and
Table 8, the SSA-TC-RPE system demonstrated high performance on the HARD imbalanced dataset, achieving an accuracy of 87.20%, an F-score of 86.91%, and a precision of 86.50%. This was accomplished with a configuration of 2 AH, 80 tokens, 10 experts, a batch size of 65, a filter size of 30, a dropout rate of 0.20, and an embedding dimension of 25 for each token. The system’s notable accuracy is attributed to the effective integration of the ITL framework, MoE mechanism, and MHA approach, especially for right-to-left scripts like ADs. The MoE mechanism uses a series of expert networks that analyze different aspects of the input data and combine their outputs through a gating network. This feature allows the model to dynamically choose from a range of parameters (expert modules) based on the input, enhancing its ability to accurately discern sentiments. Comparing the SSA-TC-RPE model’s results to the top-performing system on the HARD dataset, it outperformed the LR [
61] model by 11.1% in accuracy. Moreover, the developed model excelled beyond AraBERT [
59], achieving an accuracy that was 6.35% higher, and also outdid the T-TC-INT model [
45] by a margin of 5.37% in accuracy. Also, the proposed model outperformed the ST-SA model [
47] by 3.18% in accuracy. This superior performance can be attributed to the simultaneous processing of related learning tasks, which broadened the data spectrum and minimized the likelihood of overfitting [
62]. The system demonstrated adeptness in recognizing both syntactic and semantic elements, enabling precise detection of sentiments expressed in AD sentences.
Moreover, the suggested SSA-TC-RPE system demonstrated remarkable results on the imbalanced BRAD dataset. As depicted in
Table 6, the model attained an accuracy of 72.17%, an F-score of 71.89%, and a precision of 71.72%. These outcomes were achieved with configurations of 4 AH, 22 tokens, 18 experts, a batch size of 50, a filter size of 35, a dropout rate of 0.25, and an embedding dimension of 55 for each token. As indicated in
Table 9, the SSA-TC-RPE system significantly outperformed the LR method [
26], showing an accuracy improvement of 24.47%. It also exceeded the AraBERT model [
60] by 11.32% and the T-TC-INT [
44] system by 10.44%. It also exceeded the SA-ST model [
48] by 3.36%. Furthermore, the use of a switching self-attention-based shared encoder, one for each classification task, enabled the model to effectively represent the context before, after, and around any position within a sentence, leading to a more nuanced and comprehensive understanding.
Additionally, as outlined in
Table 7, the proposed SSA-TC-RPE exhibited remarkable results on the complex and imbalanced LABR dataset. In this research, the model impressively achieved an accuracy rate of 86.89%, an F-score of 85.91%, and a precision of 85.39%, outperforming other methods. It is notable that this high level of performance was attained with specific configurations, including 2 AHs, a filter size of 30, 90 tokens, 11 experts, a batch size of 80, a dropout rate of 0.20, and an embedding dimension of 65 for each token. These results underscore the efficacy of the SSA-TC-RPE model in tackling the intricacies of text classification in an imbalanced dataset, demonstrating its robust capabilities.
Table 10 showcases the superior performance of the proposed SSA-TC-REP when compared to a range of alternative methodologies. The SSA-TC-RPE model notably exceeded several other models by considerable margins. For instance, it surpassed the SVM [
23] model with an impressive accuracy increase of 36.59%, outdid the MNP [
22] model by 41.89% in accuracy, exceeded the HC(KNN) [
24] model by 29.09%, and achieved an accuracy that was 27.93% higher than AraBERT [
59]. Furthermore, it outstripped the HC(KNN) [
25] model by 14.25% in accuracy. The model also performed better than the T-TC-INT model [
44], showing an accuracy improvement of 8.76%. Furthermore, the proposed model also performed better than the SSA-TC-RPE model [
47], showing an accuracy improvement of 2.98%. The criteria for assessing model performance accuracy include empirical evaluations across several datasets (HARD, BRAD, and LABR), with accuracy rates of 87.20% for HARD, 72.17% for BRAD, and 86.89% for LABR. The significance of achieving these accuracy rates for all the datasets (HARD, BRAD, and LABR) lies in the model’s ability to handle complex linguistic features of Arabic dialects effectively, especially given the challenges associated with smaller datasets and under-resourced languages. The high accuracy rates demonstrate the model’s enhanced adaptability and improved sentence representation accuracy, highlighting its potential for applications in natural language processing tasks that require detailed understanding and classification of right-to-left texts such as Arabic Dialects.
Within the realm of deep learning, joint training involves training a single neural network on several interconnected tasks at the same time. Rather than developing individual models for each task, this strategy enables the model to recognize and utilize shared characteristics across all tasks, thereby enhancing its versatility and operational efficiency. This method typically leads to improved outcomes for each task, as the model benefits from the synergistic relationships between the tasks. In contrast, imbalanced data refers to datasets in which the distribution of classes (or categories) is uneven. This often means that one or more classes are underrepresented compared to others, presenting potential difficulties in training the model and assessing its performance. This scenario poses challenges for deep learning models, as they may develop a tendency to favor the more prevalent class, leading to inferior performancein the less represented classes. The evaluation results indicate that the SSA-TC-RPE system, when applied to both joint and alternative learning methods, shows remarkable effectiveness. Alternative training outperformed joint training, as evidenced by higher accuracies of 87.20% and 81.79% on the imbalanced HARD dataset, and 72.17% and 68.21% on BRAD, respectively, as shown in
Table 11. In contrast to conventional approaches, alternative training in a five-tier classification model seems to better capture subtle feature variations in text sequences compared to single-task learning. These results suggest that alternative learning is more suitable for complex TC tasks, enabling the development of more intricate and detailed latent representations for ADs TC tasks. The distinct performance difference between the two methodologies can be ascribed to how alternative training utilizes the diverse data volumes available in the datasets of each task.
Shared layers often contain a greater amount of information for tasks with larger datasets. However, joint learning may exhibit a bias towards tasks associated with significantly larger datasets. Therefore, alternative training methods are generally considered more appropriate for tasks like text classification of Arabic dialects. This is especially the case when dealing with two separate datasets for different tasks, such as in machine translation scenarios where the transition is from ADs to MSA and then to English [
2]. By alternating the network’s focus between tasks, the efficiency of each task is enhanced without the need for additional training data [
1]. Additionally, harnessing the synergistic relationship between related tasks can improve the efficacy of five-point classification systems. The marked improvements in our model’s performance can be ascribed to several factors. Outperforming established models like AraBERT, renowned for its proficiency in Arabic language tasks, is a notable accomplishment. Our model’s ability to exceed AraBERT’s performance on the same datasets demonstrates its enhanced precision in processing Arabic dialects. Even small improvements in accuracy are valuable, as they contribute to the overall development of models tailored for Arabic dialect processing. These advancements can have practical applications, such as in more accurate text classification, better information retrieval, and other natural language processing tasks designed for Arabic dialects.
The SSA-TC-RPE system notably did not show significant improvements in the BRAD dataset when compared to existing models. This lack of enhanced performance might stem from the system’s limited understanding of the distinct characteristics, idiomatic expressions, and linguistic subtleties unique to the BRAD Arabic dataset. A lack of adequate domain-specific adaptation could lead to a disconnect between the features the model learns and the unique elements of the BRAD dataset, resulting in suboptimal performance. To boost the model’s effectiveness in text classification on the BRAD dataset, implementing advanced deep learning techniques, particularly domain adaptation, is essential. For instance, the use of transformers, especially BERT, has been transformative in NLP due to their proficiency in contextualizing text. Optimizing a pre-trained BERT model specifically for the BRAD dataset could substantially improve its text classification capabilities.
In evaluating the practicality, while pre-trained models are easily accessible, fine-tuning them demands significant computational power and NLP expertise. This process is viable with the availability of these resources. Adversarial training is another crucial method, where the model is trained to withstand manipulative adversarial examples. For text classification involving five-polarity Arabic dialects, this approach can enhance the model’s ability to deal with subtle and varied expressions of sentiment. Although the implementation of adversarial training can be intricate and resource-intensive, it is achievable with sufficient deep-learning resources and know-how. Domain-adaptive fine-tuning stands out as an effective technique, particularly for text classification on the BRAD dataset. It involves incrementally fine-tuning a pre-trained model using a combination of data from both the source and target domains, with an increasing emphasis on the target domain. This method facilitates the model’s adaptation to the unique linguistic features and sentiment expressions specific to the BRAD dataset.
Moreover, domain-adaptive fine-tuning is a viable option when there is an ample supply of data from both the source and target domains. It requires fewer resources than building a model from the ground up. Meta-learning, another approach, trains a model across a variety of tasks, enabling it to quickly adapt to new tasks or domains. This method proves beneficial in five-polarity text classification for Arabic dialects ADs, as it can accommodate a wide range of expressions and contexts. However, meta-learning needs diverse training datasets and substantial computational resources, making it suitable for environments with ample resources. In cases where the BRAD dataset encompasses multilingual data, cross-lingual models like multilingual BERT become effective. These models, trained in several languages, are adept at conducting text classification across different linguistic scenarios. Pre-trained versions of these models, akin to BERT, are available. Fine-tuning them for the specific languages in the BRAD dataset is essential and can be executed with the right computational infrastructure.