An Optimized Arabic Multilabel Text Classification Approach Using Genetic Algorithm and Ensemble Learning

Alzanin, Samah M.; Gumaei, Abdu; Haque, Md Azimul; Muaad, Abdullah Y.

doi:10.3390/app131810264

¹ Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia

² Department of Commerce, Utkal University, Bhubaneswar 751004, India

³ Department of Studies in Computer Science, University of Mysore, Mysore 570006, India

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(18), 10264; https://doi.org/10.3390/app131810264

Submission received: 18 July 2023 / Revised: 7 September 2023 / Accepted: 8 September 2023 / Published: 13 September 2023

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Multilabel classification of Arabic text is an important task for understanding and analyzing social media content. It enables the categorization and monitoring of social media posts, the detection of important events, the identification of trending topics, and insights into public opinion and sentiment. However, multilabel classification of Arabic content is challenging due to the high dimensionality of the representation and the unique characteristics of the Arabic language. In this paper, an effective approach is proposed for Arabic multilabel classification using a metaheuristic Genetic Algorithm (GA) and ensemble learning. The approach explores the effect of Arabic text representation on classification performance using both Bag of Words (BOW) and Term Frequency-Inverse Document Frequency (TF-IDF) methods. Moreover, it compares the performance of ensemble learning methods such as the Extra Trees Classifier (ETC) and Random Forest Classifier (RFC) against a Logistic Regression Classifier (LRC) as a single and ensemble classifier. We evaluate the approach on a new public dataset, namely, the MAWQIF dataset. MAWQIF is the first multilabel Arabic dataset for target-specific stance detection. The experimental results demonstrate that the proposed approach outperforms the related work on the same dataset, achieving 80.88% for sentiment classification and 68.76% for multilabel tasks in terms of the F1-score metric. In addition, data augmentation with feature selection improves the F1-score result of the ETC from 65.62% to 68.80%. The study shows the ability of GA-based feature selection with ensemble learning to improve the classification of multilabel Arabic text.

Keywords: Arabic language; genetic algorithm; ensemble learning; multi-label; text classification

1. Introduction

The process of assigning a label to a text is known as Text Categorization (TC), or Text Classification [1]. When the same comment belongs to several classes (multilabel), the task becomes more challenging, especially when datasets are imbalanced [2,3]. Multilabel Arabic Text Classification (ML-ATC) is a crucial issue for the Arabic language because multilabel classification is widely employed in many different fields, including sentiment analysis, bioinformatics, image classification, and scene classification [4].

There are different transformation methods to handle this task. The first approach is to binarize the target variable into a multilabel indicator matrix and then use the one-vs-rest strategy with any classification algorithm, such as logistic regression or a support vector classifier. The second approach is domain adaptation for multilabel classification tasks. In addition, much work has been done on binary and multi-class classification, but multilabel classification remains extremely difficult, particularly for the Arabic language; for instance, a single comment may belong to the categories “sarcasm,” “hate speech,” and “fake news” at the same time [5].
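The first transformation approach can be illustrated with a minimal sketch. It assumes scikit-learn's `MultiLabelBinarizer` and `OneVsRestClassifier`; the toy English comments and label names are hypothetical stand-ins for Arabic data and are not taken from any dataset used in this paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy comments and label sets (illustration only).
texts = [
    "great news about the vaccine",
    "oh sure, what a brilliant plan",
    "this rumor is false and hateful",
    "brilliant plan, totally false rumor",
    "great vaccine news today",
    "hateful and false claims again",
]
labels = [
    {"positive"},
    {"sarcasm"},
    {"fake_news", "hate_speech"},
    {"sarcasm", "fake_news"},
    {"positive"},
    {"fake_news", "hate_speech"},
]

# Step 1: binarize the label sets into an indicator matrix,
# one column per label (columns are sorted alphabetically).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# Step 2: fit one independent binary classifier per label column.
X = TfidfVectorizer().fit_transform(texts)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

pred = clf.predict(X)
print(list(mlb.classes_))
print(mlb.inverse_transform(pred))
```

Each column of `Y` is treated as its own binary problem, so a comment can receive any subset of the labels.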

Social media is a widely used platform that produces a massive amount of user-generated content daily. This content is used to learn personal information about users and to fight misinformation, such as rumors, yielding crucial information that can aid decision-making [6]. Multilabel learning is currently one of the hotspots in text mining research and has been widely applied in different domains, including image processing and natural language processing. However, multilabel classification of Arabic text is a challenging task due to the complexity and diversity of the Arabic language.

On the other hand, the overwhelming volume of text data available today makes text processing a challenging task, especially in high dimensions. Selecting appropriate features is crucial and, in many cases, necessary for machine learning algorithms to perform well. Dimensionality reduction in the context of text categorization has been successfully achieved using a variety of optimization approaches, including the Genetic Algorithm (GA). However, the performance of these techniques in the multilabel classification of Arabic text is still relatively under-explored [7].

Several well-known methods are available in the literature for selecting features in text classification, such as chi-square [8], mutual information [9], information gain [10], document frequency, and proportional difference [11,12]. All these techniques have been used to reduce the dimensionality and remove words that contribute little to classifier performance. However, in many cases, the performance of the classification task is still poor. Therefore, in this study, a GA-based metaheuristic optimization algorithm is used for feature selection, in line with the advanced binary metaheuristic models proposed in [13,14,15,16].

This article proposes a new model that analyzes and compares the performance of optimization-based feature selection techniques, hybrid representation methods, and ensemble classification techniques for the multilabel classification of Arabic text. Specifically, we evaluate the effectiveness of the Bag-of-Words (BOW) and Term Frequency-Inverse Document Frequency (TF-IDF) representations. Further, we discuss the performance of different classification techniques, such as the Logistic Regression Classifier (LRC), Random Forest Classifier (RFC), and Extra Trees Classifier (ETC) [17,18], when the optimal features are chosen using a GA. The ensemble feature selection method helps us to find the optimal feature subset from several feature subsets through an integration method [19].

The contributions of the research work are summarized as follows:

Designing an effective Arabic multilabel classification approach using the GA-based metaheuristics feature selection method and ensemble machine learning models.
Applying unigram features with two different representation vectors to boost the classification performance of Arabic multilabel text.
Evaluating the proposed approach on a new benchmark Arabic text dataset, namely, MAWQIF.
Conducting a comprehensive analysis by comparing the experimental results to prove the efficiency of the proposed model.

The rest of the paper is organized as follows: Section 2 gives the associated related work. Section 3 explains the research methodology with more detail about the proposed approach algorithm. Section 4 presents the experimental results with a comprehensive analysis of the dataset used. Section 5 gives a detailed discussion of the results and the findings of the research study. Section 6 offers the conclusion and future directions of the research work.

2. Related Works

This section focuses on current models for multilabel classification of Arabic text on social media platforms such as Twitter, Instagram, and Facebook. For classification problems, ML models have been used in conjunction with different Feature Selection (FS) approaches and ensemble models. The section provides a background review of the pertinent data, models, and approaches, and summarizes the literature. Several works have addressed multi-class Arabic text classification [20,21]. However, works on multilabel classification are still relatively rare, and we review some of the existing ones as follows.

Seghir et al. presented a model to classify Arabic fake news and hate speech with multiple labels. Their work evaluated 10,828 Arabic tweets, which contained 10 classes, and they used them to train and evaluate various models [22]. Alsalimi et al. aimed to promote multilabel Arabic text classification by creating a new dataset called “RTAnews” that can be used for the classification of Arabic news articles [23]. Taha et al. addressed the missing label problem in an ML-ATC task. They proposed two methods: GB-AS and UG-MLP [24]. The goal of another work was to tag a news article automatically using its vocabulary features [25]. Omar et al. created a multilabel dataset using two methods for annotation: manual and semi-supervised. The data can be used for different tasks, including multilabel Arabic text classification [26]. Ameur et al. proposed and released a new dataset, AraCOVID19-MFH: Coronavirus Disease 2019 (COVID-19) Arabic Multilabel Fake News & Hate Speech Detection Data [22]. Abuqran et al. proposed Arabic multi-topic labeling using BiLSTM. They used the Mowjaz Multi-Topic dataset [27].

3. Research Methodology

In this section, we describe the research methodology steps of the Arabic multilabel classification approach. We use the feature selection method to select the best text features for training machine learning (ML) models and improving their performance in classifying the topics’ classes and multilabels of the Arabic language. A grid search with document frequency is used to reduce the vector size of the ML models, which makes the models computationally faster and less complex. In the ensemble learning phase, the GA-based metaheuristic is utilized to select the best features and identify the smallest possible combination of ensemble models that achieves the highest performance. Figure 1 illustrates the proposed model, whose stages are explained as follows.

3.1. Preprocessing

Preprocessing for multilabel text classification involves a series of steps to prepare the text data before it can be used for classification. The steps, as listed in Algorithm 1, include tokenization, stop-word removal, and stemming.

3.2. Representation

Representation in multilabel text classification refers to the process of converting text data into a numerical format that can be used for classification. This involves extracting relevant features from the text data and encoding them in a way that the classification algorithm can understand. Several representation techniques can be used for multilabel text classification, including:

Bag-of-Words (BOW): This technique represents the text data as a collection of words and their frequencies in a document. Each document is represented by a vector of word frequencies. This technique is simple and efficient but does not capture the context or meaning of the words.
Term Frequency-Inverse Document Frequency (TF-IDF): This technique calculates the importance of a word in a document based on its frequency in the document and its frequency in the corpus. Words that occur frequently in a document but rarely in the corpus are considered more important. This technique addresses the limitations of BOW by taking into account the importance of words in the corpus.
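The difference between the two representations can be seen on a tiny corpus. The following is a minimal sketch assuming scikit-learn's `CountVectorizer` and `TfidfVectorizer` with default settings; the three English documents are hypothetical and only serve to show how a corpus-wide word is down-weighted by TF-IDF.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: "the" occurs in every document,
# "great" occurs twice but only in the first document.
docs = [
    "the match was great great",
    "the election was close",
    "the match and the election",
]

bow = CountVectorizer()
X_bow = bow.fit_transform(docs)      # raw term counts per document

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)  # counts reweighted by inverse document frequency

great_bow = X_bow[0, bow.vocabulary_["great"]]
the_bow = X_bow[0, bow.vocabulary_["the"]]
great_tfidf = X_tfidf[0, tfidf.vocabulary_["great"]]
the_tfidf = X_tfidf[0, tfidf.vocabulary_["the"]]

print("BOW    great/the:", great_bow, the_bow)
print("TF-IDF great/the:", round(great_tfidf, 3), round(the_tfidf, 3))
```

In the first document, BOW records only that "great" appears twice and "the" once, whereas TF-IDF widens the gap because "the" appears in every document of the corpus.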

3.3. Genetic Algorithm (GA)

A metaheuristic is a higher-level optimization technique that provides a general framework for solving complex optimization problems. The GA is a specific type of metaheuristic that falls under the broader umbrella of evolutionary algorithms. It is a search and optimization technique inspired by natural selection and genetics, and it offers a versatile and effective way to solve complex optimization problems by mimicking the principles of genetics, selection, and reproduction. In this work, we use it to find approximate solutions to the feature selection problem, which is difficult to solve through traditional methods.

Here is an overview of how a GA works:

Initialization.
Evaluation.
Selection.
Reproduction (crossover and mutation).
New generation.
Termination.
Solution extraction.

The success of the GA depends on parameters such as the population size, the crossover and mutation rates, and the structure of the solution space, as documented in our online library: https://pypi.org/project/MetaHeuristicsFS/ (accessed on 1 June 2023).
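The GA steps listed above can be sketched on binary feature masks as follows. This is an illustrative sketch only, not the MetaHeuristicsFS implementation: the tournament selection, one-point crossover, elitism, and the toy fitness function are assumptions chosen for brevity, since the text does not specify the operators.

```python
import random

def genetic_search(fitness, n_bits, pop_size=20, generations=30,
                   p_crossover=0.9, p_mutation=0.1, seed=1):
    """Maximize `fitness` over binary masks of length `n_bits`."""
    rng = random.Random(seed)
    # Initialization: random binary chromosomes (1 = feature kept).
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):  # Termination: fixed generation budget.
        # Evaluation: score every chromosome once per generation.
        scores = [fitness(ind) for ind in population]

        def select():
            # Selection: binary tournament between two random individuals.
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return population[a] if scores[a] >= scores[b] else population[b]

        children = [best[:]]  # Elitism (assumption): keep the best mask.
        while len(children) < pop_size:
            c1, c2 = select()[:], select()[:]
            if rng.random() < p_crossover:  # Reproduction: one-point crossover.
                cut = rng.randrange(1, n_bits)
                c1, c2 = c1[:cut] + c2[cut:], c2[:cut] + c1[cut:]
            for child in (c1, c2):
                for i in range(n_bits):  # Mutation: independent bit flips.
                    if rng.random() < p_mutation:
                        child[i] ^= 1
                children.append(child)
        population = children[:pop_size]  # New generation.
        best = max(population + [best], key=fitness)
    return best  # Solution extraction.

# Hypothetical fitness: reward three "informative" features and
# penalize the total number of selected features.
INFORMATIVE = {0, 3, 7}

def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)

best_mask = genetic_search(toy_fitness, n_bits=12)
print(best_mask, toy_fitness(best_mask))
```

In a real feature selection setting, the toy fitness would be replaced by a cross-validated score of a classifier trained on the features selected by the mask.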

3.4. Classification

The purpose of multilabel text classification is to classify text data into multiple categories or labels using ML models. These models are broadly categorized into two types: supervised and unsupervised. In this phase of the approach, we have divided the sentiment, sarcasm, and stance class dataset into 80% for training and 20% for testing.

We trained the proposed approach using the one-vs-rest strategy for every class individually. We also trained a multilabel classification model that simultaneously considers sentiment, sarcasm, and stance. Different feature types, such as unigrams, bi-grams, and tri-grams, are used to enhance the performance of our proposed model. The document frequency of each term is used to perform feature selection for unigrams, bi-grams, and tri-grams with lower and upper thresholds. For rare words, the lower threshold is a count between 1 and 5 document occurrences. For frequent words, the upper threshold is a percentage between 20% and 65% of the documents.

The best combination of lower counts and upper percentages is found through the grid search method. Words below the lower threshold and above the upper threshold are removed through the “min_df” and “max_df” options available in TfidfVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html (accessed on 1 June 2023) and CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html (accessed on 1 June 2023). These two vectorizers are from the popular scikit-learn Python machine-learning library, and the two options filter out low- and high-frequency words, respectively [28].
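A grid search over the two document-frequency thresholds can be sketched as follows. The sketch assumes a scikit-learn `Pipeline` with `GridSearchCV`; the synthetic corpus, the grid values (picked from the ranges stated above), the classifier, and the 3-fold setting are illustrative assumptions rather than the configuration used in the experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic corpus (hypothetical): "common" occurs in every document,
# class words occur in about 10% of documents, "rare*" words occur once.
texts, y = [], []
for i in range(45):
    texts.append(f"common pos{i % 5} rare{i}")
    y.append(1)
for i in range(45):
    texts.append(f"common neg{i % 5} rareneg{i}")
    y.append(0)

pipe = Pipeline([
    ("vec", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# min_df is an absolute document count; max_df is a fraction of documents.
grid = {
    "vec__min_df": [1, 3, 5],
    "vec__max_df": [0.2, 0.4, 0.65],
}

search = GridSearchCV(pipe, grid, scoring="f1_micro", cv=3)
search.fit(texts, y)

print(search.best_params_, round(search.best_score_, 3))
print("vocabulary size:", len(search.best_estimator_.named_steps["vec"].vocabulary_))
```

Because `min_df` and `max_df` prune the vocabulary before the classifier is trained, tuning them together with the model keeps the feature vector small without a separate selection pass.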

The feature vectorizer is fitted on the training data of the text corpus, and the fitted vectorizer is then used to generate feature vectors for both the training and validation data. ETC, LRC, and RFC are used as base models. From each base model, class probabilities are obtained for each class on the training and validation data. The meta-model is trained on these probabilities using logistic regression. Feature selection for the meta-model is performed using a metaheuristic algorithm. Finally, the actual and predicted labels of the validation data are compared using the F1-score.
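The stacking arrangement described above can be sketched as follows. This is a simplified, single-label illustration under stated assumptions: synthetic numeric features stand in for the vectorized text, and out-of-fold probabilities (via `cross_val_predict`) are used for the training data to limit leakage, which is a design choice of this sketch and is not stated in the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in for vectorized text (hypothetical data).
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

base_models = {
    "ETC": ExtraTreesClassifier(n_estimators=50, random_state=1),
    "LRC": LogisticRegression(max_iter=1000),
    "RFC": RandomForestClassifier(n_estimators=50, random_state=1),
}

meta_tr, meta_va = [], []
for name, model in base_models.items():
    # Class probabilities for the training data (out-of-fold).
    meta_tr.append(cross_val_predict(model, X_tr, y_tr, cv=3,
                                     method="predict_proba"))
    # Class probabilities for the validation data from the fully fitted model.
    meta_va.append(model.fit(X_tr, y_tr).predict_proba(X_va))

# One column per (base model, class) pair.
meta_tr = np.hstack(meta_tr)
meta_va = np.hstack(meta_va)

# Logistic regression meta-model trained on the stacked probabilities.
meta_model = LogisticRegression(max_iter=1000).fit(meta_tr, y_tr)
score = f1_score(y_va, meta_model.predict(meta_va), average="micro")
print("meta-features:", meta_tr.shape, "validation F1:", round(score, 3))
```

Each meta-feature column corresponds to one base model and one class, which is what allows a later feature selection step to drop base models that contribute nothing.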

For the metaheuristic, the hyper-parameters are set as follows: 15 generations, a population size of 150, a crossover probability of 0.9, and a mutation probability of 0.1. Each individual feature represents a feature type, a feature vector type, a modeling technique, and the class for which the class probability is predicted. The final feature subset is used to identify base models that make no contribution to the solution, and these are eliminated from the list of base models. The text feature selection with an ensemble module is used [29]. The steps of the proposed feature selection-based ML models for multilabel Arabic text classification are given in Algorithm 1.

Algorithm 1: Metaheuristics feature selection for multilabel Arabic text classification
Inputs:	D = {D₁, D₂, …, D_n};
	where n is the number of Arabic documents;
	D_i is the selected document;
	where ∀ (D_i) ∃ C_j (C is the name of the class) and j indexes the classes in the dataset;
Output:	Assign D_i (test document) to the correct classes C_j;
Begin	Read all collections of documents in the dataset
	For i = 1 to n
	Do preprocessing for D[i]
	D[i] ← Tokenization(D[i]);
	D[i] ← StopwordRemoval(D[i]);
	D[i] ← Stemming(D[i]);
	D[i] ← TFIDF(D[i]) or D[i] ← BoW(D[i]);
	D[i]Train = 80%; D[i]Test = 20%;
“Training phase”	W = Input matrix with weights Train (TFIDF/BoW; Document);
	Weight (W) for each document and add Labels (L) for each document;
	ExtraTreesClassifier ← (W + L₁, L₂, …, L_n), where W refers to the text and L to the labels;
	LogisticRegression ← (W + L), where W refers to the text and L to the labels;
	RandomForestClassifier ← (W + L₁, L₂, …, L_n), where W refers to the text and L to the labels;
	Apply Feature Selection Metaheuristics with properties from: “https://pypi.org/project/MetaHeuristicsFS/ (accessed on 1 June 2023)”
“Testing phase”	W = Input matrix with weights Test (BoW/TFIDF; Document);
	ExtraTreesClassifier ← (W + L₁, L₂, …, L_n), where W refers to the text and L to the labels;
	LogisticRegression ← (W + L), where W refers to the text and L to the labels;
	RandomForestClassifier ← (W + L₁, L₂, …, L_n), where W refers to the text and L to the labels;
	Apply Feature Selection Metaheuristics with properties from: “https://pypi.org/project/MetaHeuristicsFS/ (accessed on 1 June 2023)”
	End for
	Push the vector values without their corresponding labels to the classification algorithm, then let the algorithm predict L₁, L₂, …, L_n;
End

Because most datasets are imbalanced, the F1-score is the most significant measure for unbiased evaluation of the classifiers and models. Accuracy and other metrics might be affected by the class-imbalance problem. Thus, we use the F1-score as the single main evaluation metric of the approach. It is an appropriate metric for both balanced and imbalanced datasets. Precision and recall are already used for computing the F1-score, so it is sufficient to compare the performance based on this single measure. Classification performance is also compared between the ensemble models and the metaheuristic algorithm for candidate selection.

4. Experimental Analysis and Results

We analyze the performance of our proposed approach in this section. The datasets, baseline models, and evaluation metrics used in the evaluation process are described, along with a comparison of the outcomes. The section also describes the recommended settings for the model’s optimized hyper-parameters.

4.1. Experimental Setup

The development environment is Windows 10 on an Intel(R) Core (TM) i5-1005G1 CPU running at 1.20 GHz to 1.19 GHz with 8 GB of RAM. The research models are developed in the Python programming language (version 3.11.1; https://www.python.org, accessed on 1 June 2023) in a Jupyter notebook (https://jupyter.org/, accessed on 9 July 2023). Data preparation is carried out using the Python package Natural Language Toolkit, NLTK (https://www.nltk.org, accessed on 15 July 2023). The text has been vectorized by Keras (https://keras.io/, accessed on 15 July 2023). Data processing and analysis are done using Pandas 1.16.0 (https://pandas.pydata.org, accessed on 15 July 2023) and NumPy 1.24.1 (https://numpy.org, accessed on 15 July 2023), respectively. The dataset of the study is collected from [30] and divided into training and test sets. An exploratory data analysis of the training and test sets is carried out to understand the relationships of the classes. Because the dataset used for evaluating the proposed approach is imbalanced, the F1-score, which is the most significant measure for unbiased evaluation of a classifier trained on class-imbalanced data, is obtained on the test set and reported in all experiments as the single main evaluation metric. Precision and recall are already used for computing the F1-score, so it is sufficient to give the F1-score as a single evaluator for comparing the models.

4.2. Dataset

The first multilabel Arabic dataset for target-specific stance detection, namely MAWQIF, is used. The publicly accessible dataset contains 4121 tweets on three different topics in different dialects of Arabic. MAWQIF offers three tasks: stance, sentiment, and sarcasm. Hence, MAWQIF can act as a new multilabel benchmark scenario. The dataset is available in ref. [30]. The distribution of the dataset is given in Table 1, where the data is divided into two parts: 80% for training and 20% for testing. One example instance of the dataset is given in Table 2, and we visualize the distribution of these classes.

Based on Figure 2, we see that a large number of Arabic tweets in the dataset have positive and non-sarcastic labels about the topic of the tweets. If ML models are trained on this dataset, they can accurately predict the positive and non-sarcasm classes. However, the few tweets of the negative and sarcasm classes might be overwhelmed by the majority classes. Therefore, it is sensible to generate or collect more instances of the minority classes so that they are well represented and the ML models can classify their instances accurately.

The details of the multilabel MAWQIF dataset are presented in this subsection. Preprocessing converts every label of a text sample into a binary classification problem. Therefore, every label of a sample is either no or yes, meaning zero or one. After that, we apply different types of representation with different classification models. Finally, we discuss the results of the classifiers. Table 2 shows one example with its translation into English.

4.3. Evaluation

In this work, we use only the F1-score to evaluate our model, since it takes both recall and precision into account.

$$F1\text{-}score = \frac{1}{n} \sum_{i=1}^{n} \frac{2\,|y^{(i)} \land \hat{y}^{(i)}|}{|y^{(i)}| + |\hat{y}^{(i)}|} \qquad (1)$$

where

$n$: number of training examples
$y^{(i)}$: true labels for the $i$th training example
$\hat{y}^{(i)}$: predicted labels for the $i$th training example
$\land$: logical AND operator
$\lor$: logical OR operator
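Equation (1) can be checked numerically on binary label-indicator matrices. The following is a minimal NumPy sketch; the function name `example_based_f1`, the toy matrices, and the convention of scoring an example with no true and no predicted labels as zero are assumptions of this illustration. To our understanding, the quantity corresponds to the sample-averaged F1 (`average="samples"`) in scikit-learn.

```python
import numpy as np

def example_based_f1(y_true, y_pred):
    """Mean over examples of 2|y AND y_hat| / (|y| + |y_hat|)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    overlap = np.logical_and(y_true, y_pred).sum(axis=1)  # |y AND y_hat|
    sizes = y_true.sum(axis=1) + y_pred.sum(axis=1)       # |y| + |y_hat|
    # Assumption: an example with no true and no predicted labels scores 0.
    per_example = np.divide(2.0 * overlap, sizes,
                            out=np.zeros(len(sizes)), where=sizes > 0)
    return per_example.mean()

# Two examples, three labels (hypothetical values).
y_true = [[1, 0, 1],
          [0, 1, 0]]
y_pred = [[1, 0, 0],
          [0, 1, 1]]

# Example 1: 2*1 / (2 + 1) = 2/3; example 2: 2*1 / (1 + 2) = 2/3.
value = example_based_f1(y_true, y_pred)
print(round(value, 4))
```

Both examples share one label with their prediction and have three labels in total across truth and prediction, so each contributes 2/3 to the average.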

4.4. Hyper-Parameters Initialization

This subsection explores the hyper-parameters initialization process of metaheuristic-based feature selection and classification modules. The feature selection with a cross-validation module is used to identify a combination of features that gives the best result using the GA technique. The feature selection using the GA-based metaheuristic search selects the features of text after its representation, and the values of its hyper-parameters can control the feature selection process. Moreover, the classification task of multilabel classes is performed by the classification module. These hyper-parameters are variables that determine how the feature selection process and classification task can work. For the classification module, the hyper-parameters of classifier models are initialized with their default values. On the other hand, for initializing the hyper-parameters of metaheuristic-based feature selection with a cross-validation module, the ‘Trial and Error’ method is applied for tuning with appropriate values and used as default values for the module. The workflow of this method starts after building the classification models. Many possible hyper-parameter values are tried based on our experience and the analysis of previously evaluated experimental results. This method requires an adequate amount of experience and prior knowledge to select the optimal or near-optimal values for the hyper-parameters with restricted time. Cross-validation and feature selection are used to improve the classification performance and avoid overfitting. Table 3 shows the hyper-parameters of feature selection with a cross-validation module. For more details, you can refer to our library at the following link: https://pypi.org/project/TextFeatureSelection/ (accessed on 1 June 2023).

Moreover, the selected default values of metaheuristic hyper-parameters are given in Table 4. For more details, you can see our library at the following link: https://pypi.org/project/TextFeatureSelection (accessed on 1 June 2023).

For the genetic algorithm, Table 5 gives the selected default values of its hyper-parameters. The code of the optimization process is available in our library, and for more details, refer to this link: https://pypi.org/project/MetaHeuristicsFS/ (accessed on 1 June 2023).

4.5. Implementation of Metaheuristic GA-Based Feature Selection

In this subsection, we explain how the GA is implemented to select the features of multilabel Arabic text. We describe the fitness function, the feature representation, and the genetic operators of the GA. Because the GA adopts a binary representation, we represent the processed text features as binary string vectors using both CountVectorizer and TfidfVectorizer. The f1_score cost function mentioned in Table 4 is the fitness function for the metaheuristic GA method. In addition, the classification models are trained on the population and children of the GA using the cross-validation technique and evaluated based on the cost function, the type of cost function, and the cost function improvement. Because the task is classification, the cost function improvement is set to “increase”, so a higher value of the cost function indicates a better fitness value. The micro-averaged aggregation of the F1-score is used for the type of cost function. The initial values of the other hyper-parameters in Table 4 and Table 5 are used for implementing the metaheuristic GA-based feature selection module. Algorithm 2 presents the pseudocode of the GA operations and the steps to select the best features and best model.

Algorithm 2: Metaheuristic GA-based feature selection pseudocode

Hyper-parameters Initialization
classification_models = {“ETC”, “RFC”, “LRC”}; //The approach’s classification models
cost_function = “f1_score”; //The fitness function of the GA
cost_function_type = “micro”; //The f1_score is micro-averaged
cost_function_improvement = “increase”; //The cost function is maximized
number_of_generations = 100; //Max number of iterations as a stopping criterion
number_of_population = 150; //Max population size for the GA
probability_of_crossover = 0.9; //The probability of crossover
probability_of_mutation = 0.1; //The probability of mutation
run_time = 120; //A time limit as another stopping criterion
n_jobs = -1; //The number of parallel jobs for training
random_state = 1; //The seed for random number generation
Input
dataset_features; //The dataset with its features represented as binary string vectors using Count Vectorizer and TFIDF Vectorizer
Begin
population_of_GA ← Generate_Initial_Population(dataset_features, number_of_population);
best_model, best_dataset_features ← Cross_Validation(population_of_GA, cost_function, classification_models, cost_function_improvement, cost_function_type, n_jobs, random_state);
while neither number_of_generations nor run_time is reached do
    parents_of_GA ← Selection(population_of_GA);
    children_of_GA ← Crossover(parents_of_GA, probability_of_crossover);
    children_of_GA ← Mutation(children_of_GA, probability_of_mutation);
    best_model, best_dataset_features ← Cross_Validation(children_of_GA, cost_function, classification_models, cost_function_improvement, cost_function_type, n_jobs, random_state);
    population_of_GA ← {population_of_GA} ∪ {children_of_GA};
End while
End
Output
best_model, best_dataset_features;
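The Cross_Validation step of Algorithm 2 scores a candidate feature subset with a micro-averaged F1. A minimal sketch of such a fitness evaluation is given below, assuming scikit-learn's `cross_val_score` with `scoring="f1_micro"`; the helper name `mask_fitness`, the synthetic data, and the 3-fold setting are hypothetical and do not reproduce the internals of the MetaHeuristicsFS library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

def mask_fitness(mask, X, y, model, cv=3):
    """Cross-validated micro-F1 of `model` on the columns selected by `mask`."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return 0.0  # Assumption: an empty feature subset gets the worst score.
    scores = cross_val_score(model, X[:, mask], y, cv=cv, scoring="f1_micro")
    return float(scores.mean())

# Synthetic data (hypothetical): with shuffle=False, the first four
# columns are the informative ones and the remaining eight are noise.
X, y = make_classification(n_samples=200, n_features=12, n_informative=4,
                           n_redundant=0, shuffle=False, random_state=1)
model = ExtraTreesClassifier(n_estimators=50, random_state=1)

all_features = mask_fitness([1] * 12, X, y, model)
informative_only = mask_fitness([1] * 4 + [0] * 8, X, y, model)
no_features = mask_fitness([0] * 12, X, y, model)

print(round(all_features, 3), round(informative_only, 3), no_features)
```

A GA would call such a function once per chromosome and per generation, which is why the number of generations, the population size, and the run-time limit dominate the cost of the search.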

4.6. Results

This subsection gives an exploratory analysis of the MAWQIF dataset and demonstrates the results of the Arabic multilabel classification task using the GA and stacked ensemble models. It evaluates the results based on the dataset and its augmented data. The experimental results of the study are shown in Table 6, Table 7, Table 8, Table 9 and Table 10. The F1-score results of each task scenario are given individually, and the F1-score results of the multilabel task scenario are then obtained. The F1-score metric is used for evaluation because it is a suitable metric for class-imbalanced datasets, such as the MAWQIF dataset.

We apply two types of representation, namely CountVectorizer and TfidfVectorizer. The experiments compare the performance of single-task label classification with and without feature selection and the grid search optimization algorithm. We explore the experimental results and compare our method with existing work in the discussion section.

4.6.1. Exploratory Analysis of Training and Test Sets

After reading and cleaning the texts of the training and test sets, the results of their exploratory analysis are presented to understand meaningful relationships between their classes. It is very important to gain domain knowledge together with identifying whether there is bias in the labels of the dataset. Below are several figures and charts with detailed explanations. The following chart is one attempt to give an intuition of potentially different lengths of text. Figure 3 shows the distribution of text length in the training and test sets. Text length limits the density and richness of conveyed information used for better classification.

Figure 3 shows that the distribution of text length in the training and test sets is approximately normal and that both sets have approximately the same distribution. Moreover, it shows that most tweets have 50–250 words, which indicates the richness of the conveyed information. However, the density of text length for each class label of the training and test sets is important for analyzing whether the distributions of text length differ from each other. Figure 4 shows the density distribution of text length for each topic’s class label of the training and test sets.

Because the distributions of text length in Figure 4 are approximately normal, the ANOVA test is applied to the targets of the training and test sets for the three topics to check whether the distributions differ from each other. The results of the ANOVA test are listed in Table 6 and Table 7.
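A one-way ANOVA of this kind can be sketched with SciPy's `f_oneway`. The text lengths below are synthetic and the label names are hypothetical; they are not MAWQIF values. Note that the test compares group means, so it is used here as a proxy for whether the length distributions differ.

```python
import random
from scipy.stats import f_oneway

rng = random.Random(1)

# Hypothetical text lengths per class label (synthetic, not MAWQIF data).
lengths = {
    "label_a": [rng.gauss(120, 30) for _ in range(200)],
    "label_b": [rng.gauss(150, 30) for _ in range(200)],
    "label_c": [rng.gauss(180, 30) for _ in range(200)],
}

# Null hypothesis: all groups share the same mean text length.
f_stat, p_value = f_oneway(*lengths.values())

decision = "reject H0" if p_value < 0.05 else "fail to reject H0"
print(f"F = {f_stat:.2f}, p = {p_value:.3g}, {decision}")
```

With a p-value below 0.05 the null hypothesis of equal means is rejected, and otherwise it is not rejected.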

From the results in Table 6, we can see that the p-value < 0.05 for the training set of all three topics, so we reject the hypothesis that the distributions of text length do not differ from each other. According to the results of the ANOVA test for the test set of the three topics in Table 7, the p-value > 0.05 for the sentiment target, so we fail to reject the hypothesis that the distributions of text length do not differ from each other, whereas the p-value < 0.05 for the sarcasm and stance targets, so we reject this hypothesis for them. To get an intuition of how the features of the dataset are organized in higher dimensions, a t-distributed Stochastic Neighbor Embedding (t-SNE) is applied to project the data into two dimensions for a better understanding of the underlying patterns and relationships in the dataset features and labels. Figure 5 gives the t-SNE charts of the training and test sets, and Figure 6, Figure 7 and Figure 8 show the density of the word counts and unique word counts for both the training and test sets, as well as the density of the training set based on its labels for each topic.

The charts in Figure 5 are obtained by passing the Arabic text features extracted using TfidfVectorizer and reduced using the Singular Value Decomposition (SVD) method to the TSNE Visualizer. The SVD method is a matrix factorization that generalizes the eigendecomposition of an $n \times n$ square matrix to any $n \times m$ matrix. From Figure 5, we can see that there are significant overlaps between the classes of the sarcasm topic in both the training and test sets. Together with the minority status of the sarcasm class, this may decrease the results of the sarcasm task as well as the multilabel classification task. For further analysis of the Arabic texts in the dataset, Figure 6, Figure 7 and Figure 8 show content quality, which is an essential measure of how rich and diverse the content vocabulary is.
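The TF-IDF, SVD, and t-SNE chain described above can be sketched directly with scikit-learn. This is an assumption-laden illustration: the synthetic documents, the 10 SVD components, and the perplexity of 10 are arbitrary choices for a small example and are not the settings behind Figure 5.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Synthetic documents (hypothetical) with overlapping vocabulary.
docs = [f"topic{i % 3} word{i % 7} token{i % 11} extra{i}" for i in range(60)]

# Step 1: sparse TF-IDF features.
X = TfidfVectorizer().fit_transform(docs)

# Step 2: truncated SVD reduces the sparse matrix to a few dense components.
X_svd = TruncatedSVD(n_components=10, random_state=1).fit_transform(X)

# Step 3: t-SNE projects the reduced features to two dimensions for plotting.
embedding = TSNE(n_components=2, perplexity=10, init="random",
                 random_state=1).fit_transform(X_svd)

print(X.shape, X_svd.shape, embedding.shape)
```

Reducing with SVD first keeps t-SNE tractable on high-dimensional sparse text features; the resulting two columns can be scattered and colored by class label to inspect overlaps.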

4.6.2. Results of the Arabic Sentiment

The first scenario studies the sentiment of users’ opinions about the COVID-19 topic, in which only two cases (positive or negative) are considered, as reported in Table 8. In addition, we extended the experiments by augmenting the data using back translation with a transfer learning model and applied the proposed model to the augmented sentiment data, as seen in Table 9. We can notice that the classifiers of the proposed approach achieve better results with augmented data.

Here, several models are trained to serve as sentiment classifiers. The evaluation outcome for the trained models is summarized in Table 8 for the original data and Table 9 for the augmented data. The results in bold font represent the highest achieved F1-scores. We implemented the LRC, RFC, and ETC with two feature vectors, with and without the genetic algorithm for feature selection. Additionally, we used unigram features to improve the outcome. According to the obtained results, the ETC is the most successful model on the original data, whereas the RFC is the best on the augmented data. On the original data, the ETC with CountVectorizer achieves the best F1-score of 80.88% with feature selection and 79.17% without feature selection; with TfidfVectorizer, it achieves 77.38% and 77.12%, respectively. On the augmented data, the RFC with CountVectorizer achieves 86.51%. The F1-score of the proposed GA-based ensemble ETC model for original data is better than the F1-score of the LRC and RFC by 2.28% and 0.29%, respectively.

4.6.3. Results on the Arabic Sarcasm

The second scenario performs experiments on the sarcasm of users’ tweets about the digital transformation topic, in which we consider only two cases: sarcasm and not sarcasm. Table 10 gives the results with and without feature selection. Furthermore, we applied the proposed models to the augmented sarcasm data, as shown in Table 11.

The adopted classifiers are also trained on the training set of sarcasm classes. The assessment outcome for the trained models is summarized in Table 10. The numbers in bold font represent the highest F1-score results for each data split. Using a GA with different feature vectors, we classify the sarcasm task scenario combined with feature selection. Ensemble classifiers have also been employed with the unigram feature to boost the outcome. The ensemble ETC is the most successful, according to the results. The best F1-score for the sarcasm scenario using the ETC with CountVectorizer is 95.38% with feature selection and 95.00% without feature selection. The best F1-score of the proposed GA-based ensemble models for the augmented sarcasm data is 96.52%, obtained by the ETC model with feature selection.

4.6.4. Results on the Arabic Stance about Women’s Empowerment

This subsection explores the results for the users’ stance on women’s empowerment. This class has two values: positive and negative. During the experiments of the stance scenario, we trained the adopted ML models to classify stance labels. The outcomes of the trained models are presented in Table 12 for the original dataset and in Table 13 for the augmented dataset. The best F1-score results are 94.11% for the original dataset and 95.50% for the augmented data using the ETC model with feature selection. The results in bold font represent the highest F1-score for each classification model. Using a GA with different feature vectors, we optimize the learning procedure and feature selection method.

From Table 12 and Table 13, we can see that the ensemble classifiers improve the obtained results compared to the single LRC model. The ETC is the best model, achieving an F1-score of 90.73% for the stance scenario without feature selection, 94.11% with feature selection on the original data, and 95.50% on the augmented dataset. On the original data, the ETC with feature selection outperforms the other models by roughly 3.7 to 5.9 percentage points (Table 12); on the augmented data, the margin is below one percentage point (Table 13).

4.6.5. Results of Multilabel Task Scenario

This scenario is interesting because we study users’ opinions and stances on the three topics jointly, whereas their labels are individually classified and reported in Table 8, Table 9, Table 10, Table 11, Table 12 and Table 13. We train the ML models as multilabel classifiers. The assessment outcome of the trained models is listed in Table 14. The results highlighted in bold font represent the highest F1-score for each model. Using the GA with different feature vectors, we perform the learning technique on the selected features. The ETC with unigram features improves the classification performance by 4.42% and 0.29% compared with the LRC and RFC. According to the experimental results on the test set, it achieves an F1-score of 65.62% for the multilabel scenario without feature selection and 68.76% with feature selection.
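To make the multilabel score concrete, the following minimal sketch computes a micro-averaged F1-score (the averaging type listed in Table 4), assuming each tweet's labels are encoded as one binary indicator row. The rows and the helper name `micro_f1` are hypothetical; the encoding actually used in the experiments is not described in this section.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (sample, label) pairs of binary indicator rows."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical indicator rows: columns are (sentiment, sarcasm, stance).
y_true = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]]
y_pred = [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
score = micro_f1(y_true, y_pred)
print(f"micro-F1 = {score:.4f}")
```

Micro-averaging pools true positives, false positives, and false negatives over all labels before computing precision and recall, so frequent labels weigh more than rare ones.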

During the optimization using the GA, we notice that a change in one hyper-parameter value can affect the best values of the other hyper-parameters; the optimized value of each hyper-parameter therefore depends on the values selected for the others. The GA is used as the metaheuristic method to search for the best values. The main advantage of this research is the use of the GA’s capabilities for optimizing the hyper-parameters and improving the classifier models of Arabic multilabel tasks. The F1-score results of the multilabel scenario with and without feature selection and optimization are given in Table 14. In addition, Table 15 reports the results for the augmented data, which show good performance.
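The GA-based search can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: tournament selection, single-point crossover, per-gene bit-flip mutation, and elitism are common choices assumed here, and `toy_fitness` with its `INFORMATIVE` index set is a hypothetical stand-in for the cross-validated F1-score of a classifier trained on the selected features. The crossover and mutation probabilities follow Table 5, while the number of generations and the population size are reduced to keep the example fast.

```python
import random

def genetic_feature_selection(n_features, fitness, generations=20,
                              population_size=30, p_crossover=0.9,
                              p_mutation=0.1, seed=1):
    """Evolve binary feature masks; returns (best_mask, best_fitness_history)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        next_population = [ranked[0][:]]  # elitism: keep the best mask unchanged
        while len(next_population) < population_size:
            # Tournament selection of two parents (tournament size 3).
            parent_a = max(rng.sample(population, 3), key=fitness)
            parent_b = max(rng.sample(population, 3), key=fitness)
            child = parent_a[:]
            if rng.random() < p_crossover:
                cut = rng.randint(1, n_features - 1)  # single-point crossover
                child = parent_a[:cut] + parent_b[cut:]
            # Bit-flip mutation applied independently to each gene.
            child = [1 - g if rng.random() < p_mutation else g for g in child]
            next_population.append(child)
        population = next_population
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history

INFORMATIVE = {0, 3, 4, 7}  # hypothetical indices of useful unigram features

def toy_fitness(mask):
    """Reward selected informative features, penalize selected noise features."""
    hits = sum(mask[i] for i in INFORMATIVE)
    noise = sum(mask) - hits
    return hits - 0.25 * noise

best_mask, history = genetic_feature_selection(n_features=10, fitness=toy_fitness)
print(best_mask, history[-1])
```

Because the best mask of each generation is carried over unchanged, the best fitness recorded in `history` never decreases from one generation to the next.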

Table 16 shows the highest F1-score results of the proposed approach. The results confirm the ability of the model in all scenarios, including the multilabel one. Comparing the proposed approach with feature selection and data augmentation against the results without feature selection, the absolute improvements in F1-score are 7.34% for the sentiment class, 1.14% for sarcasm, 4.77% for stance classification, and 3.18% for the multilabel scenario.
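The improvements can be recomputed directly from the Table 16 entries as the absolute difference between the score with feature selection and data augmentation and the score without feature selection. Computed this way from the table, the sarcasm gain is 1.14 percentage points. The dictionary name `table_16` is only an illustrative container for the published values.

```python
# F1-scores (%) copied from Table 16:
# (without feature selection, with feature selection, with feature selection + augmentation)
table_16 = {
    "Sentiment": (79.17, 80.88, 86.51),
    "Sarcasm": (95.38, 95.38, 96.52),
    "Stance": (90.73, 94.11, 95.50),
    "Multilabel": (65.62, 68.76, 68.80),
}

# Absolute gain in percentage points of the augmented, feature-selected score
# over the score without feature selection.
improvements = {
    task: round(augmented - baseline, 2)
    for task, (baseline, _with_fs, augmented) in table_16.items()
}
for task, gain in improvements.items():
    print(f"{task}: +{gain:.2f} percentage points")
```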

5. Discussion

We conducted experiments on a new benchmark Arabic text dataset, namely, MAWQIF, which was collected in December 2022 [30]. The MAWQIF contains three topics: sentiment, sarcasm, and stance, as well as the multilabels of the three topics. The proposed approach is evaluated on the dataset for each topic label and then for multilabel classes. Performing feature selection using GA and metaheuristics with ML base models and TF-IDF and count vectorizers of unigram features can impact the classification performance in two ways. Firstly, it decreases the vector size during the feature representation and reduces the model complexity and size. Secondly, it can also reduce noise and improve the performance of ML models.
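The first effect, a smaller feature vector, can be illustrated with a minimal pure-Python sketch of unigram count vectors followed by a binary feature mask. The documents are English placeholders standing in for preprocessed tweets, the whitespace tokenizer is a simplifying assumption, and the mask is hypothetical (in the proposed approach it would come from the GA search); none of the helper names below belong to a library.

```python
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of unique whitespace-separated unigrams."""
    return sorted({token for doc in docs for token in doc.split()})

def count_vectors(docs, vocabulary):
    """Bag-of-words count vector for each document."""
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())
        vectors.append([counts[term] for term in vocabulary])
    return vectors

def apply_mask(vectors, mask):
    """Keep only the columns whose mask bit is 1."""
    return [[v for v, keep in zip(row, mask) if keep] for row in vectors]

# Hypothetical toy documents.
docs = ["vaccine is safe", "vaccine is not safe", "digital services are great"]
vocab = build_vocabulary(docs)
X = count_vectors(docs, vocab)

# Hypothetical mask: drop the uninformative terms "is" and "are".
mask = [0 if term in {"is", "are"} else 1 for term in vocab]
X_selected = apply_mask(X, mask)
print(len(vocab), "->", len(X_selected[0]))
```

Every dropped column shrinks the input of each base classifier, which is where the reduction in model size and complexity comes from.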

We used three classifiers, namely the ETC, LRC, and RFC, with both CountVectorizer and TfidfVectorizer. In all settings, unigram features are used. The F1-score is selected as the evaluation metric because the dataset is imbalanced, which makes accuracy unsuitable. Furthermore, precision and recall are not reported separately because they are already used for computing the F1-score. The ensemble learning technique is adopted due to its ability to combine the power of multiple models to make better decisions than individual models. Regarding the sentiment topic, all five configuration settings enhanced the performance, as seen in Table 8. Notably, the ETC with CountVectorizer has the highest result, improving the F1-score from 79.17% to 80.88%, followed by the RFC with CountVectorizer, which boosts the F1-score from 75.81% to 77.26% and also performs better than the LRC.
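The argument against accuracy on imbalanced data can be seen in a small, self-contained example. The class counts below are hypothetical and do not reflect the MAWQIF label distribution, and the per-class F1-score of the minority class is used purely for illustration; the averaging used in the experiments is the one listed in Table 4.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, positive):
    """F1-score treating `positive` as the class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced test set: 95 "not_sarcasm" tweets and 5 "sarcasm" tweets.
y_true = ["not_sarcasm"] * 95 + ["sarcasm"] * 5
# A trivial classifier that always predicts the majority class.
y_pred = ["not_sarcasm"] * 100

print(accuracy(y_true, y_pred))                 # 0.95
print(f1_for_class(y_true, y_pred, "sarcasm"))  # 0.0
```

The trivial classifier looks strong by accuracy but never detects the minority class, which the F1-score of that class exposes.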

For the classification results of the sarcasm topic shown in Table 10, the feature selection increases the F1-score of the LRC from 93.52% to 94.21%, indicating the effectiveness of the feature selection for improving the classification performance of Arabic sarcasm labels. The classification performance of the other models is slightly increased or unchanged, and they achieve a higher F1-score than the LRC.

In the third scenario, stance classes are classified with and without feature selection, as presented in Table 12. The results showed only a slight increase, which we attribute to the imbalanced class problem. In the last experiment, Table 14 and Table 15 address the main goal of this study, namely how GA-based feature selection with ensemble learning can improve multilabel classification: the F1-score of the ETC rises from 65.62% to 68.76% on the original dataset with feature selection. In addition, data augmentation with feature selection improves the F1-score to 68.80% using the same classifier.

Finally, due to a lack of studies on Arabic multilabel classification and datasets, the obtained results are compared with the available current work [26] and with our base model. Table 17 presents a comparison of the proposed approach with the existing work on the same available dataset. It shows that the developed model outperforms the existing method by a clear margin.

In the last part of this discussion, we mention some implications of this work as follows:

The proposed multilabel model provides a new approach to solving the problem of text classification by allowing multiple labels to be assigned to a single document that represents real-life scenarios.
It enhances the performance and granularity of text classification by understanding Arabic text when various classes are assigned for a single document.
The model builds on existing techniques such as ML, DL, and Natural Language Processing (NLP), which are key areas of research in machine learning and artificial intelligence.
It could open up new avenues for research in the field of NLP and text classification, especially in areas such as sentiment analysis, topic modeling, and opinion mining.

6. Conclusions and Future Work

Compared to many other languages, such as English, there are fewer experimental studies on multilabel Arabic text classification. In recent years, it has been shown that integrating the results of many models helps lower the generalization error and handle the high variance of individual classifiers. Additionally, the ensemble can be used to choose the best features in collaboration with the genetic algorithm. As a result, building a prediction model by merging many models into an ensemble is a sophisticated way of coping with the variance of individual classifiers while reducing the overall error and selecting the best features. In this study, a framework for multilabel text classification is proposed. The proposed model makes use of a new representation model and ensemble feature selection via metaheuristics using a genetic algorithm. In addition, we implement a unique ensemble classification model. The type of representation, feature selection, and classification algorithm play important roles in enhancing performance. Based on our results, the proposed model is an effective technique for enhancing performance in different scenarios, specifically the multilabel one.

The findings of the study indicate that multilabel classification can be challenging for any language, including Arabic. However, Arabic text presents some unique issues, which pose additional challenges and opportunities for future work. Some of these issues are morphological complexity and the variety of dialects; consequently, designing a model that addresses these challenges in a multilabel scenario remains an open problem. In addition, there are other challenges, such as a lack of large, balanced, labeled datasets and of Arabic multilabel pre-trained models. Overall, Arabic multilabel text classification is a rapidly evolving field, and there are numerous opportunities for future work to improve the performance, efficiency, and applicability of these models. In future work, we plan to address the limitations of minority classes and label uncertainty and to reduce the number of labeled texts needed to train the models by using hyper-parameter optimization, active learning, and zero-shot classification.

Author Contributions

Conceptualization, S.M.A. and A.Y.M.; methodology, S.M.A. and A.Y.M.; software, S.M.A., A.G., M.A.H. and A.Y.M.; validation, S.M.A., A.G., M.A.H. and A.Y.M.; formal analysis, S.M.A., A.G., M.A.H. and A.Y.M.; investigation, S.M.A., A.G., M.A.H. and A.Y.M.; resources, S.M.A., A.G., M.A.H. and A.Y.M.; data curation, S.M.A., A.G., M.A.H. and A.Y.M.; writing—original draft preparation, S.M.A., A.G., M.A.H. and A.Y.M.; writing—review and editing, S.M.A., A.G. and A.Y.M.; visualization, S.M.A., A.G. and A.Y.M.; supervision S.M.A., A.G. and A.Y.M.; project administration, S.M.A. and A.G.; funding acquisition, S.M.A. and A.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study is available in reference [30].

Acknowledgments

The authors extend their appreciation to the Deputyship for Research & Innovation Ministry of Education in Saudi Arabia for funding this research work through the project number (IF2/PSAU/2022/01/23208).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

ML	Machine Learning
ML-ATC	Multilabel Arabic Text Classification
TC	Text Classification
GA	Genetic Algorithm
TF-IDF	Term Frequency-Inverse Document Frequency
BOW	Bag-of-Words
PCA	Principal Component Analysis
SVD	Singular Value Decomposition
COVID-19	Coronavirus Disease 2019

References

1. Lee, J.; Yu, I.; Park, J.; Kim, D.W. Memetic feature selection for multilabel text categorization using label frequency difference. Inf. Sci. 2019, 485, 263–280.
2. Zhu, X.; Li, J.; Ren, J.; Wang, J.; Wang, G. Dynamic ensemble learning for multi-label classification. Inf. Sci. 2023, 623, 94–111.
3. Suhail, M. Representation and Classification of Text Data. Ph.D. Thesis, University of Mysore, Mysur, India, 2019.
4. Zhao, D.; Gao, Q.; Lu, Y.; Sun, D. Non-Aligned Multi-View Multi-Label Classification Via Learning View-Specific Labels. IEEE Trans. Multimed. 2022; early access.
5. Almuzaini, H.A.; Azmi, A.M. An unsupervised annotation of Arabic texts using multi-label topic modeling and genetic algorithm. Expert Syst. Appl. 2022, 203, 117384.
6. Bhowmick, R.S.; Ganguli, I.; Sil, J. Character-level inclusive transformer architecture for information gain in low resource code-mixed language. Neural Comput. Appl. 2022, 2, 1–19.
7. Zhao, D.; Gao, Q.; Lu, Y.; Sun, D. Learning multi-label label-specific features via global and local label correlations. Soft Comput. 2022, 26, 2225–2239.
8. Alhaj, Y.A.; Xiang, J.; Zhao, D.; Al-Qaness, M.A.A.; Abd Elaziz, M.; Dahou, A. A Study of the Effects of Stemming Strategies on Arabic Document Classification. IEEE Access 2019, 7, 32664–32671.
9. Ali, L.; Bukhari, S.A.C. An Approach Based on Mutually Informed Neural Networks to Optimize the Generalization Capabilities of Decision Support Systems Developed for Heart Failure Prediction. IRBM 2021, 42, 345–352.
10. Liu, Q.; Chen, C.; Zhang, Y.; Hu, Z. Feature selection for support vector machines with RBF kernel. Artif. Intell. Rev. 2011, 36, 99–115.
11. Muaad, A.Y.; Davanagere, H.J.; Guru, D.S.; Benifa, J.V.B.; Chola, C.; AlSalman, H.; Gumaei, A.H.; Al-antari, M.A. Arabic Document Classification: Performance Investigation of Preprocessing and Representation Techniques. Math. Probl. Eng. 2022, 2022, 3720358.
12. Masadeh, M.; Davanager, H.J.; Muaad, A.Y. A Novel Machine Learning-based Framework for Detecting Religious Arabic Hatred Speech in Social Networks. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 767–776.
13. Zhu, Y.; Li, W.; Li, T. A hybrid Artificial Immune optimization for high-dimensional feature selection. Knowl.-Based Syst. 2023, 260, 110111.
14. Xue, Y.; Zhu, H.; Liang, J.; Słowik, A. Adaptive crossover operator based multi-objective binary genetic algorithm for feature selection in classification. Knowl.-Based Syst. 2021, 227, 107218.
15. Santucci, V.; Baioletti, M.; Di Bari, G. An improved Memetic Algebraic Differential Evolution for solving the Multidimensional Two-Way Number Partitioning Problem. Expert Syst. Appl. 2021, 178, 114938.
16. Simeon, M.; Hilderman, R. Categorical Proportional Difference: A Feature Selection Method for Text Categorization. Available online: https://www.researchgate.net/publication/221337966_Categorical_Proportional_Difference_A_Feature_Selection_Method_for_Text_Categorization (accessed on 3 April 2023).
17. Muaad, A.Y.; Hanumanthappa, J.; Prakash, S.P.S.; Al-Sarem, M.; Ghabban, F.; Bibal Benifa, J.V.; Chola, C. Arabic Hate Speech Detection Using Different Machine Learning Approach; Springer: Cham, Switzerland, 2023; pp. 429–438.
18. Al-Salemi, B.; Ab Aziz, M.J.; Mohd Noah, S.A. BoWT: A hybrid text representation model for improving text categorization based on Adaboost.MH. In Multi-Disciplinary Trends in Artificial Intelligence; MIWAI 2016. Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2016; Volume 10053, pp. 3–11.
19. Saeys, Y.; Abeel, T.; Van De Peer, Y. Robust feature selection using ensemble feature selection techniques. In Machine Learning and Knowledge Discovery in Databases; ECML PKDD 2008. Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2008; Volume 5212, pp. 313–325.
20. Muaad, A.Y.; Davanagere, H.J.; Al-antari, M.A.; Benifa, J.V.B.; Chola, C. AI-Based Misogyny Detection from Arabic Levantine Twitter Tweets. Comput. Sci. Math. Forum 2022, 2, 15.
21. Muaad, A.Y.; Davanagere, H.J.; Benifa, J.V.B.; Alabrah, A.; Ahmed, M.; Saif, N.; Pushpa, D.; Al-antari, M.A.; Alfakih, T.M. Artificial Intelligence-Based Approach for Misogyny and Sarcasm Detection from Arabic Texts. Comput. Intell. Neurosci. 2022, 2022, 7937667.
22. Hadj Ameur, M.S.; Aliane, H. AraCOVID19-MFH: Arabic COVID-19 Multi-label Fake News & Hate Speech Detection Dataset. Procedia Comput. Sci. 2021, 189, 232–241.
23. Al-Salemi, B.; Ayob, M.; Kendall, G.; Noah, S.A.M. Multi-label Arabic text categorization: A benchmark and baseline comparison of multi-label learning algorithms. Inf. Process. Manag. 2019, 56, 212–227.
24. Taha, A.Y.; Tiun, S.; Rahman, A.H.A.; Ayob, M.; Abdulameer, A.S. Unified Graph-Based Missing Label Propagation Method for Multilabel Text Classification. Symmetry 2022, 14, 286.
25. El Rifai, H.; Al Qadi, L.; Elnagar, A. Arabic text classification: The need for multi-labeling systems. Neural Comput. Appl. 2022, 34, 1135–1159.
26. Omar, A.; Mahmoud, T.M.; Abd-El-Hafeez, T.; Mahfouz, A. Multi-label Arabic text classification in Online Social Networks. Inf. Syst. 2021, 100, 101785.
27. Abuqran, S. Arabic Multi-Topic Labelling using Bidirectional Long Short-Term Memory. In Proceedings of the 2021 12th International Conference on Information and Communication Systems (ICICS), Valencia, Spain, 24–26 May 2021; pp. 492–494.
28. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
29. Haque, M.A. Feature Engineering & Selection for Explainable Models: A Second Course for Data Scientists. 27 December 2022. Available online: https://www.amazon.com/Feature-Engineering-Selection-Explainable-Models/dp/1387371312/ref=monarch_sidesheet (accessed on 7 September 2023).
30. Alturayeif, N.S.; Luqman, H.A.; Ahmed, M.A.K. Mawqif: A Multi-label Arabic Dataset for Target-specific Stance Detection. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), Abu Dhabi, United Arab Emirates, 8 December 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 174–184.

Figure 1. Architecture model for multilabel classification of Arabic text.

Figure 2. Distribution of classes in the training and test sets, respectively: (a,b) are the distribution of sentiment classes; (c,d) are the distribution of sarcasm classes; and (e,f) are the distribution of stance classes.

Figure 3. Distribution of text length: (a) the distribution of text length in the training set and (b) the distribution of text length in the test set.

Figure 4. Density distribution of text length for each topic’s class label of the training and test sets, respectively: (a,b) are the density distribution of sentiment classes; (c,d) are the density distribution of sarcasm classes; and (e,f) are the density distribution of stance classes.

Figure 5. t-SNE charts of the training and test sets, respectively: (a,b) are the t-SNE charts of the sentiment classes; (c,d) are the t-SNE charts of the sarcasm classes; and (e,f) are the t-SNE charts of the stance classes.

Figure 6. Density distribution of the word counts and unique word counts of the sentiment data texts: (a,c) are the density distribution of the word counts and unique word counts of the training set regarding the sentiment classes; (b,d) are the density distribution of the word counts and unique word counts of the training and test sets.

Figure 7. Density distribution of the word counts and unique word counts of the sarcasm data texts: (a,c) are the density distribution of the word counts and unique word counts of the training set regarding the sarcasm classes; (b,d) are the density distribution of the word counts and unique word counts of the training and test sets.

Figure 8. Density distribution of the word counts and unique word counts of the stance data texts: (a,c) are the density distribution of the word counts and unique word counts of the training set regarding the stance classes; (b,d) are the density distribution of the word counts and unique word counts of the training and test sets.

Table 1. Distribution of the MAWQIF dataset.

No.	Topic of Tweets	موضوع التغريدات	No. of Arabic Tweets	Training (80%)	Testing (20%)
1.	COVID-19 Vaccine	تحصين كويد -19	1373	1167	206
2.	Digital Transformation	موقف التحول الرقمي	1348	1145	203
3.	Women Empowerment	موقف من النساء	1400	1190	210
Total			4121	3502	619

Table 2. Examples of multilabel Arabic text by the MAWQIF dataset.

No.	Original Arabic Text	Translated To English Text *	Sentiment	Sarcasm	Stance
1	كما أشكر خادم الحرمين الشريفين وصاحب السمو ولي العهد على اهتمامهما بصحة المواطن، كما أشكر وزير الصحة على تنظيمه الرائع واستقباله الجيد.	I also thank the Custodian of the Two Holy Mosques and His Highness the Crown Prince for their concern for the health of the citizen, and I also thank the Minister of Health for his distinguished organization and good reception.	Negative/Positive	Sarcasm/Not Sarcasm	Negative/Positive

* This column is a translation of the original text from Arabic to English language.

Table 3. The hyper-parameters of feature selection with cross-validation module.

Hyper-Parameter	Description
doc_list	Python list with text documents
use_class_weight = True	Python list with Y labels
save_data = True	Boolean value representing if you want to apply class weight before training classifiers
label_list	The list of labels
use_class_weight = True	The use of class weight
n_crossvalidation = total_cross_val validation_done	The number of cross-validation
n_crossvalidation = total_cross	How many cross-validation samples
stop_words	Stop words for count and TF-IDF vectors
pickle_path	Path where base model
base_model_list	List of machine learning algorithms to be trained
vector_list	Type of text vectors from sklearn to be used
feature_list	Type of features to be used for ensembling
method	Which method you want to specify for metaheuristics feature selection

Table 4. Selected default values of metaheuristic hyper-parameters.

Hyper-Parameter	Value
classification_models	ETC, RFC, and LRC
n_jobs	−1
random_state	1
cost_function	f1_score
cost_function_type	micro-averaged
cost_function_improvement	increase

Table 5. Selected default values of genetic algorithm hyper-parameters.

Hyper-Parameter	Value
number of generations	100
number of population	150
probability of crossover	0.9
probability of mutation	0.1
run_time	120 min

Table 6. ANOVA test results of training set targets for the three topics.

Target	Sum_sq	Df	F	PR (>F)
Sentiment	679.471926	1.0	4.376833	0.036502
Sarcasm	5243.281393	1.0	34.060803	5.827878 × 10⁻⁹
Stance	7626.842878	1.0	49.764778	2.079751 × 10⁻¹²

Table 7. ANOVA test results of test set targets for the three topics.

Target	Sum_sq	Df	F	PR (>F)
Sentiment	130.569338	1.0	0.88228	0.347946
Sarcasm	896.219770	1.0	6.107121	0.013733
Stance	1221.586594	1.0	8.354293	0.003983

Table 8. The results of the five configuration settings for the sentiment task scenario with and without the feature selection and optimization method.

Classifier Model	Feature	Vector	F1-Score without Feature Selection (%)	F1-Score with Feature Selection (%)
ETC	Unigram	CountVectorizer	79.17	80.88
ETC	Unigram	TfidfVectorizer	77.12	77.38
LRC	Unigram	CountVectorizer	75.01	75.32
RFC	Unigram	CountVectorizer	75.81	77.26
RFC	Unigram	TfidfVectorizer	75.86	76.86

Table 9. The results of the five configuration settings for the augmented sentiment task scenario with and without the feature selection and optimization method.

Classifier Model	Feature	Vector	F1-Score with Feature Selection (%)
ETC	Unigram	CountVectorizer	86.12
ETC	Unigram	TfidfVectorizer	86.26
LRC	Unigram	CountVectorizer	82.21
RFC	Unigram	CountVectorizer	86.51
RFC	Unigram	TfidfVectorizer	86.22

Table 10. The results of the five configuration settings for the sarcasm task scenarios with and without the feature selection and optimization method.

Classifier Model	Feature	Vector	F1-Score without Feature Selection (%)	F1-Score with Feature Selection (%)
ETC	Unigram	CountVectorizer	95.00	95.38
ETC	Unigram	TfidfVectorizer	95.20	95.38
LRC	Unigram	CountVectorizer	93.52	94.21
RFC	Unigram	CountVectorizer	95.38	95.38
RFC	Unigram	TfidfVectorizer	95.32	95.38

Table 11. The results of the five configuration settings for the augmented sarcasm task scenario with and without the feature selection and optimization method.

Classifier Model	Feature	Vector	F1-Score with Feature Selection (%)
ETC	Unigram	CountVectorizer	96.52
ETC	Unigram	TfidfVectorizer	96.22
LRC	Unigram	CountVectorizer	94.21
RFC	Unigram	CountVectorizer	96.22
RFC	Unigram	TfidfVectorizer	96.12

Table 12. The results of the five configuration settings for the stance task scenario with and without the feature selection and optimization method.

Classifier Model	Feature	Vector	F1-Score without Feature Selection (%)	F1-Score with Feature Selection (%)
ETC	Unigram	CountVectorizer	90.73	94.11
ETC	Unigram	TfidfVectorizer	90.41	90.41
LRC	Unigram	CountVectorizer	87.42	88.25
RFC	Unigram	CountVectorizer	89.79	89.79
RFC	Unigram	TfidfVectorizer	90.44	90.44

Table 13. The results of the five configuration settings for the augmented stance task scenario with and without the feature selection and optimization method.

Classifier Model	Feature	Vector	F1-Score with Feature Selection (%)
ETC	Unigram	CountVectorizer	95.50
ETC	Unigram	TfidfVectorizer	95.10
LRC	Unigram	CountVectorizer	94.54
RFC	Unigram	CountVectorizer	95.23
RFC	Unigram	TfidfVectorizer	95.44

Table 14. The results of the five configuration settings for the multilabel scenario with and without the feature selection and optimization method.

Classifier Model	Feature	Vector	F1-Score without Feature Selection (%)	F1-Score with Feature Selection (%)
ETC	Unigram	CountVectorizer	65.62	68.76
ETC	Unigram	TfidfVectorizer	64.79	64.79
LRC	Unigram	CountVectorizer	62.03	63.71
RFC	Unigram	CountVectorizer	64.02	64.54
RFC	Unigram	TfidfVectorizer	64.91	64.99

Table 15. The results of the five configuration settings for the augmented multilabel scenario with and without the feature selection and optimization method.

Classifier Model	Feature	Vector	F1-Score with Feature Selection (%)
ETC	Unigram	CountVectorizer	68.80
ETC	Unigram	TfidfVectorizer	66.54
LRC	Unigram	CountVectorizer	65.43
RFC	Unigram	CountVectorizer	65.00
RFC	Unigram	TfidfVectorizer	66.78

Table 16. The highest F1-score results of each topic and multilabel classes with and without feature selection and optimization method.

Task Scenario	F1-Score without Feature Selection (%)	F1-Score with Feature Selection (%)	F1-Score with Feature Selection and Data Augmentation (%)
Sentiment	79.17	80.88	86.51
Sarcasm	95.38	95.38	96.52
Stance	90.73	94.11	95.50
Multilabel	65.62	68.76	68.80

Table 17. Comparison with current related work on the new dataset.

Author and Ref.	Year	Model	Multilabel	No. of Topics	F1-Score (%)
Alturayeif et al. [30]	2022	Single Model	No	3	61.90
This work	2023	Single Model	Yes		65.62
This work	2023	Ensemble Model	Yes		68.76

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

