Article

Ad-RuLer: A Novel Rule-Driven Data Synthesis Technique for Imbalanced Classification

Soft Computing Research Group at Intelligent Data Science, Artificial Intelligence Research Center, Universitat Politècnica de Catalunya, 08003 Barcelona, Spain
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(23), 12636; https://doi.org/10.3390/app132312636
Submission received: 5 October 2023 / Revised: 7 November 2023 / Accepted: 21 November 2023 / Published: 23 November 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract
When classifiers face imbalanced class distributions, they often misclassify minority class samples, consequently diminishing the predictive performance of machine learning models. Existing oversampling techniques predominantly rely on the selection of neighboring data via interpolation, with less emphasis on uncovering the intrinsic patterns and relationships within the data. In this research, we demonstrate the usefulness of an algorithm named RuLer for the problem of classification with imbalanced data. RuLer is a learning algorithm initially designed to recognize new sound patterns within the context of the performative artistic practice known as live coding. This paper demonstrates that this algorithm, once adapted (Ad-RuLer), has great potential to address the problem of oversampling imbalanced data. An extensive comparison with other mainstream resampling algorithms (SMOTE, ADASYN, Tomek-links, Borderline-SMOTE, and KMeansSMOTE), using different classifiers (logistic regression, random forest, and XGBoost), is performed on several real-world datasets with different degrees of data imbalance. The experimental results indicate that Ad-RuLer serves as an effective oversampling technique with extensive applicability.

1. Introduction

In recent years, the explosive growth of big data, amplified by significant strides in artificial intelligence, has led industries, from finance and healthcare to online platforms, to produce unprecedented amounts of data. As we navigate this data-rich environment, the challenge of imbalanced class distribution has emerged prominently. This issue, characterized by certain classes being notably underrepresented compared to others, negatively impacts the accuracy of predictive models.
Traditional machine learning approaches were conceived, assuming a roughly balanced data distribution. However, real-world datasets frequently deviate from this assumption. In some domains, achieving precise classification of minority classes is critical, given the steep repercussions associated with misclassification [1,2]. For instance, in the healthcare sector, the number of patients diagnosed with specific diseases (positive samples) is often significantly lower than those without such conditions (negative samples). Consequently, standard classifiers, which tend to favor the majority class, are prone to misclassification. Such inaccuracies may lead to critical delays in medical treatments, jeopardizing patient health. Similarly, in finance, only a small percentage of customers might have poor credit ratings, but misclassifying them can expose institutions to significant risks. Beyond the algorithmic challenges, these imbalances have tangible adverse effects in areas like financial fraud detection, intrusion detection, and product quality assessments.
To address class imbalances, researchers have developed data resampling techniques, including oversampling [3,4,5], undersampling [6,7,8], and hybrid sampling [9,10,11]. In recent years, the oversampling technique has attracted widespread attention within the academic community due to its ability to amplify underrepresented classes, enhancing the classifier’s sensitivity towards minority instances. This technique proves particularly advantageous in scenarios where each instance of the minority class is crucial, such as rare disease diagnosis or financial fraud detection.
Current oversampling methods primarily rely on interpolation techniques, aiming to achieve class balance by synthesizing samples for the minority class. However, these approaches also have some limitations, including over-constraint, low-efficiency expansion, and over-generalization [12]. Despite these challenges, there is still a relative dearth of research on alternative oversampling strategies, with rule-based oversampling algorithms, in particular, being less explored.
Recent studies have demonstrated the advantages inherent in rule-based oversampling methods, which are characterized by their robust adaptability, consistent performance, and superior interpretability [13]. Such attributes indicate that rule-based strategies have substantial potential for addressing issues of data imbalance. An in-depth exploration of these methods stands to not only enrich the existing research of oversampling techniques, but also potentially offer new insights into fields such as imbalanced data classification and predictive modeling.
In this study, we present Ad-RuLer, an innovative adaptive rule-based oversampling approach. Distinct from the prevalent interpolation-based data resampling methods, Ad-RuLer employs an iterative comparison mechanism to rapidly extract intrinsic rules from datasets, which are then used to synthesize new instances for the minority class. This algorithm builds on the principles of RuLer, an algorithm originally designed for detecting novel sound patterns in the field of live-coding performance art [14]. In this study, we apply RuLer for the first time to the problem of data oversampling, addressing the challenge of imbalanced data classification. Ad-RuLer has been benchmarked against several mainstream data resampling techniques on ten diverse imbalanced datasets. Our findings highlight its superior performance and notable application potential. These findings not only validate the effectiveness of the Ad-RuLer method in addressing issues of data imbalance, but also contribute a novel perspective to the methodologies for handling imbalanced data.
To summarize, the main contributions of this research are as follows:
  • To tackle the challenge of data imbalance, we propose Ad-RuLer, an innovative adaptive rule-based oversampling algorithm. This method extracts intrinsic rules from minority class samples and, based on these rules, synthesizes additional minority class instances.
  • To ensure that the synthesized data maintains complete balance, a random sampling step is incorporated into the rule-based synthesizer to keep the number of minority class samples generated via rule extraction aligned with the majority class samples in the original dataset.
  • Ad-RuLer is extensively evaluated using various machine learning models and real-world datasets, demonstrating its effectiveness and good performance in comparison to existing mainstream data resampling methods. Our research findings suggest that Ad-RuLer may serve as an alternative option to conventional oversampling techniques.
The remainder of this paper is structured as follows: Section 2 introduces the relevant literature; Section 3 delves into the intricacies of Ad-RuLer; Section 4 presents the chosen datasets and our comparative analysis; and in Section 5, we discuss the experimental results and outline future work, while in Section 6, we draw our conclusions.

2. Related Work

In many application scenarios, the goal of imbalance learning is to optimize the classification performance of minority class samples and the overall classification performance. Research to address this class imbalance problem can be summarized into three methods: data-based [15,16,17], model-based [18,19,20], and hybrid approaches [21]. Model-based methods focus on algorithmic improvements to the imbalance problem, with cost-sensitive learning being a prime example. This strategy assigns different weights to the samples within a dataset, particularly in situations of class imbalance where minority class samples are prone to misclassification. Therefore, by constructing an appropriate cost-sensitive matrix to adjust the misclassification cost of samples from different classes, it provides an alternative way to balance the dataset. On the other hand, data-based methods primarily construct a new balanced dataset by altering the original data distribution. These methods can be further categorized into oversampling, undersampling, and hybrid sampling. Oversampling aims to increase the minority class samples in the original dataset by creating additional new samples. Undersampling, however, reduces the sample size of the original dataset by eliminating redundant majority class samples. Hybrid sampling combines the strategies of undersampling and oversampling to achieve a balanced treatment of the original dataset.
In the realm of undersampling techniques, the random undersampling (RUS) algorithm emerges as a foundational approach. It aims to achieve class balance by randomly discarding a portion of the majority class samples. However, the stochastic nature of RUS can inadvertently omit crucial characteristics embedded within the samples. Tomek-link pairs were introduced to address this limitation, as highlighted by [22]. This method advocates the removal of majority class samples present within Tomek-link pairs, building on the premise that such samples might represent noise or belong to boundary regions. Yet, the presence of Tomek-link pairs in datasets can sometimes limit the improvements offered by this technique in terms of classification accuracy. Branching into more advanced strategies, Yen and Lee [23] proposed an undersampling technique rooted in the principles of clustering and inter-sample distances, while factoring in the disparity in class distribution. With clustering as its cornerstone, it guides the removal of select majority class samples to balance the dataset. In a parallel vein, Lin et al. [24] employed clustering to identify majority class samples, subsequently culling their quantity by opting for cluster centroids. Nevertheless, while undersampling approaches are adept at countering class imbalance, the removal of majority class samples might inadvertently lead to the loss of critical information from the dataset. This issue becomes more pronounced in scenarios with already limited sample sizes.
Oversampling methods, as the mainstream solution for imbalanced classification, are versatile and compatible with various machine learning classifiers. Chawla et al. [25] introduced the Synthetic Minority Oversampling Technique (SMOTE). SMOTE synthesizes minority class samples via interpolation between a data point and its neighbors, achieving sample balance. It overcomes the issue of duplicating sample data in random oversampling (ROS) algorithms, enabling the creation of diverse new samples based on linear interpolation. However, SMOTE’s random selection and generation of sample points do not fully consider the internal data structure, potentially leading to misclassification. To address this, Han et al. [26] proposed the Borderline-SMOTE algorithm, a boundary-based method that combines SMOTE with boundary information. To identify boundary samples more accurately, Barua et al. [27] introduced an optimized oversampling method, the Majority Weighted Minority Over-sampling Technique (MWMOTE). This algorithm takes into account the neighboring information of both majority and minority class samples, bolstering boundary identification via reciprocal validation of neighbor conditions. Subsequently, the identified boundary samples of minority classes are clustered, and interpolation is performed within each cluster to generate new minority class samples. Expanding on this concept, Zhang et al. [28] proposed the Certainty Guided Minority OverSampling (CGMOS) algorithm, which places a particular emphasis on minority class samples. The algorithm determines the weight of each minority class sample based on relative certainty. These weights then inform the probability of selecting a seed sample for new sample synthesis. The next step involves generating new minority class samples via interpolation, guided by the calculated probabilities. Douzas et al. [29] suggested an oversampling method combining k-means clustering with SMOTE to perform space division, effectively avoiding noise generation and addressing both inter-class and intra-class imbalances. He et al. [30] proposed the ADASYN algorithm, an adaptive synthetic sampling method that dynamically determines the number of synthetic samples based on the sample distribution, synthesizing more minority class samples in hard-to-classify regions to achieve data balance.
In recent years, some new approaches to imbalanced learning have emerged in the literature [31,32]. A novel hybrid sampling method was proposed by Mirzaei et al. [33]. This method employs k-means clustering to categorize samples, followed by calculating the density of each sub-cluster. By selectively oversampling within less populated minority class sub-clusters and simultaneously undersampling from denser majority class segments, this method adeptly enhances the representation of minority samples while trimming superfluous majority class data, optimizing the overall classification results. Ai-jun and Peng [34] presented a method that integrates ROS, k-means clustering, and a support vector machine hybrid model. This approach first oversamples the minority class samples randomly and then uses k-means clustering to identify the majority class samples closer to the boundary. The advantage of this strategy is that it decreases the risk of accidentally deleting important samples from the majority class, which enhances the efficacy of sample selection.
Deep learning has increasingly become a focal point in research on imbalanced data classification in the past few years. Inspired by the random covering problem, Cui et al. [35] modeled the data sampling process by associating each sample with a small neighboring region, dynamically guiding the sampling through the computation of an effective sample size. Mullick et al. [36] utilized generative models to create new samples for minority classes using convex combinations of existing examples, while Kim et al. [37] sought to synthesize minority class samples by introducing learnable noise to majority class samples. Based on contrastive learning, some research aims to enhance the precision of models in classifying minority classes by improving the capability of representation learning [38,39]. In the realm of graph deep learning for imbalanced classification, Shi et al. [40] proposed a Dual-Regularized Graph Convolutional Network (DRGCN), which employs conditional adversarial training and distribution alignment training to differentiate nodes of various classes and balance the learning between majority and minority categories. The GraphSMOTE framework, introduced by Zhao et al. [41], encodes the similarity between nodes within an embedding space and synthesizes new nodes and edges to create a balanced graph for node classification. Qu et al. [42] introduced the ImGAGN model, generating nodes to emulate the attribute distribution and network topology of minority classes, which then facilitates the training of a Graph Convolutional Network (GCN) discriminator on a synthetic balanced network to distinguish between real and synthetic nodes, as well as between minority and majority nodes.
In the context of semi-supervised learning with imbalanced data, a novel approach for semi-supervised medical image classification, named Adaptive Blended Consistency Loss (ABCL), has been proposed, effectively addressing class imbalance by adaptively blending the target class distribution according to the class frequency obtained by Huynh et al. [43]. Hyun et al. [44] tackled the challenge of class imbalance in semi-supervised learning (SSL) by introducing Class-Imbalanced Semi-Supervised Learning (CISSL) and proposing a new Suppressed Consistency Loss (SCL) that is robust to class imbalance in both labeled and unlabeled data.
Currently, there is sparse research on rule-based oversampling algorithms, and the gap in this research field deserves further exploration. Among the few existing studies, Liu et al. [45] proposed an oversampling algorithm based on fuzzy rules to address the issue of class imbalance. This method generates fuzzy rules and synthesizes new minority class instances to tackle the class imbalance problem. Moreover, it can handle imbalanced data with missing values. However, this method is limited to imbalanced data with numerical variables and cannot handle categorical variables. Additionally, it requires manual selection of the number of fuzzy partitions, lacking the ability for adaptive selection. In summary, harnessing data relationships and rules for synthetic data generation is a promising strategy that warrants deeper investigation. Exploring this approach could offer novel alternatives to current oversampling techniques.

3. Method

In this section, we elaborate on the proposed Ad-RuLer method, which is based on the RuLer inductive rule-learning algorithm [14,46]. This algorithm takes labeled minority class data as the input and generates corresponding IF-THEN rules. It iteratively compares the similarity of input minority class data, extracting rule entries that satisfy a user-defined threshold. New rules are then created by taking the union of the corresponding sets of extracted rules and eliminating redundant ones. Finally, the newly created rules are utilized to guide the synthesis of minority class data. The following are the main implementation details of the Ad-RuLer method.

3.1. Overview of Existing Resampling Methods

Before delving into the details of the Ad-RuLer method, we provide an overview of several prevalent resampling techniques. These techniques will later serve as benchmarks for a comparative performance assessment with Ad-RuLer in the subsequent experimental sections. These methods include SMOTE, ADASYN, Tomek-links, Borderline-SMOTE, and KMeansSMOTE.
  • SMOTE: This oversampling technique generates new synthetic samples by interpolating between minority class instances, aiming to increase the representation of the underrepresented class. While widely applicable, SMOTE can introduce suboptimal samples in the presence of noise or significant class overlap.
  • ADASYN: This algorithm is an extension of SMOTE. ADASYN adjusts the number of synthetic samples according to the learning difficulty of individual minority class samples. It generates more synthetic data for those samples that are harder to learn, potentially leading to an improved classifier performance.
  • Tomek-links: This algorithm enhances class boundaries by identifying and removing Tomek-links, which are pairs of the nearest neighboring samples that belong to different classes. This method refines the dataset rather than augmenting it with new samples, potentially improving classifier boundaries without changing class distributions.
  • Borderline-SMOTE: This technique focuses on the border regions around minority class samples, oversampling those likely to be misclassified. Borderline-SMOTE attempts to directly enhance the quality of the classification boundary.
  • KMeansSMOTE: This method integrates K-Means clustering with SMOTE to perform oversampling within each identified cluster of the minority class, thereby preserving the inherent clustering structure of the data while synthesizing new samples.
For a more detailed introduction to these resampling methods, please refer to [22,25,26,29,30]. Distinct from the above-mentioned methods, the Ad-RuLer approach focuses on the rule-based structure inherent within the dataset. It employs an iterative process to pairwise compare various rules, ultimately generating new rules that satisfy the predefined conditions. These rules are then transformed into synthetic minority class samples, thus providing a new perspective and novel solutions to addressing the challenges associated with imbalanced datasets.

3.2. Ad-RuLer Oversampling Approach

Figure 1 depicts the entire framework of the Ad-RuLer approach. As can be seen from the figure, the rule extraction process constitutes the core of the system, incorporating two essential steps: New Rule Generation and Dissimilarity Measure. These steps form an interactive process that proceeds until the generation of all candidate rules is complete. Initially, Rule Extraction involves transforming minority class samples into IF-THEN rules. Subsequently, through pairwise comparisons, new rules are continually generated in an iterative process that continues until no further candidate rules emerge. During the Dissimilarity Measure procedure, the Hamming distance is employed to quantify the dissimilarity between rules. The process iteratively constructs new candidate rules by calculating the union of the corresponding sets of extracted rules, generating all candidates that meet a predefined dissimilarity threshold d, while the parameter ratio controls the quantity of synthesized rules. These synthesized rules are then reconverted into the minority class samples. If the majority and minority class samples within the data are not yet fully balanced, random sampling techniques are applied to achieve balance among the classes of the synthesized data. The Ad-RuLer method comprises several key procedures: Rule Extraction, Dissimilarity Measure, New Rule Generation, and Balanced Resampling.
Below, we provide a more detailed description of each procedure. The pseudocode for the data synthesis process is illustrated in Algorithm 1.
Rule extraction: Each data instance is treated as an IF-THEN rule, represented as an array of size N. The first N−1 entries are the rule antecedents, and the Nth entry is the rule consequent, assigned to the label [14]. The rule extraction process iteratively compares different rules to identify patterns. For instance, the rule r = [{1}, {3}, positive] indicates “if the first attribute is 1 and the second attribute is 3, then the assigned label is positive”. Rules are stored in a list and can be accessed via their indices. The rule r = [{1, 2, 3}, {5}, …, {7}, positive] means IF r[1] = 1 or 2 or 3 AND r[2] = 5 AND … AND r[N−1] = 7, THEN the label = positive.
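The rule encoding just described can be sketched in a few lines of Python (the helper name `make_rule` is ours for illustration, not from the RuLer implementation):

```python
# Sketch: encoding an instance as an IF-THEN rule, following the paper's
# description: a list of N entries, the first N-1 being antecedent value sets
# and the last the class label.
def make_rule(instance, label):
    """Turn a raw data row into a rule: each attribute value becomes a singleton set."""
    return [{v} for v in instance] + [label]

rule = make_rule([1, 3], "positive")
# rule == [{1}, {3}, "positive"], read as:
# IF attr1 = 1 AND attr2 = 3 THEN label = positive
print(rule)
```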
Dissimilarity measure: The dissimilarity measure function takes a pair of rules, r1 and r2, of the same class, and if the calculated dissimilarity is less than a user-defined threshold d, it creates a new rule by taking the union of the corresponding sets of the two rules. For instance, if r1 = [{1}, {3, 5, 7}, positive] and r2 = [{1, 3}, {7, 11}, positive], then r1,2 = [{1, 3}, {3, 5, 7, 11}, positive]. This procedure proceeds via pairwise comparison and iteration until no new rules can be created, returning all candidate rules that meet the conditions. The dissimilarity(r1, r2) is calculated by counting the number of empty intersections between the corresponding sets of the two rules. For example, if r1 = [{1}, {4, 5}, positive] and r2 = [{1, 2}, {5, 6}, positive], then dissimilarity(r1, r2) = 0. If r1 = [{1}, {4, 5}, positive] and r2 = [{1, 2}, {6}, positive], then dissimilarity(r1, r2) = 1.
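A minimal sketch of this step, assuming the set-based rule encoding described above (the function names `dissimilarity` and `merge_rules` are illustrative, not from the authors' code):

```python
# Sketch of the dissimilarity measure: count empty intersections between the
# corresponding antecedent sets of two same-class rules.
def dissimilarity(r1, r2):
    """Number of antecedent positions whose value sets do not overlap."""
    return sum(1 for s1, s2 in zip(r1[:-1], r2[:-1]) if not (s1 & s2))

def merge_rules(r1, r2):
    """Union of corresponding sets; applied when dissimilarity(r1, r2) < d."""
    return [s1 | s2 for s1, s2 in zip(r1[:-1], r2[:-1])] + [r1[-1]]

r1 = [{1}, {4, 5}, "positive"]
r2 = [{1, 2}, {5, 6}, "positive"]
print(dissimilarity(r1, r2))  # 0: both antecedent positions overlap
print(merge_rules(r1, r2))    # [{1, 2}, {4, 5, 6}, 'positive']
```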
New rule generation: During the rule generation process, candidate rules are first checked for contradictions, eliminating those with identical parameter values but different labels. In the rule generation function, ratio is a user-defined parameter, with ratio ∈ [0, 1], which controls the quantity of synthesized rules: the proportion of the data generable by a candidate rule that is present in the original data must be greater than or equal to this value. A ratio = 1 signifies that 100% of the instances covered by the candidate rule must be present in the input data to accept the rule. A ratio = 0.5 indicates that at least 50% of those instances must be present in the input data to accept the candidate rule. Take, for instance, the potential rule r1,2 = [{1, 3}, {5, 6}, positive]. The complete set of data that could be generated by r1,2 encompasses [{1}, {5}, positive], [{1}, {6}, positive], [{3}, {5}, positive], and [{3}, {6}, positive]. If we set the ratio to 0.5, half of these data instances must be present in the input data; if this condition is not met, the potential rule is discarded.
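The ratio-based acceptance test can be sketched as follows (helper names are illustrative; the actual implementation may enumerate instances differently):

```python
# Sketch of the ratio-based acceptance check: a candidate rule is kept only if
# the fraction of instances it can generate that already appear in the input
# data is at least `ratio`.
from itertools import product

def covered_fraction(rule, data):
    """Fraction of the rule's generable instances present in the input data."""
    generable = [list(combo) for combo in product(*[sorted(s) for s in rule[:-1]])]
    present = sum(1 for inst in generable if inst in data)
    return present / len(generable)

def accept_rule(rule, data, ratio):
    return covered_fraction(rule, data) >= ratio

candidate = [{1, 3}, {5, 6}, "positive"]
data = [[1, 5], [3, 6]]                   # 2 of the 4 generable instances
print(covered_fraction(candidate, data))  # 0.5
print(accept_rule(candidate, data, 0.5))  # True
```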
Balanced resampling: In the RuLer system, the synthesis of all potential instances of the minority class is achieved using the extracted patterns. By using the ratio parameter, the volume of the data generated can be adjusted to approximate the desired quantity as closely as possible. However, exact control over the volume of produced data is not feasible. To compensate for this variation, Ad-RuLer introduces a resampling phase. The initial step involves the calculation of n, the difference between the number of synthesized samples X_r and the target sample size S. If n > 0, RUS is applied: n instances are randomly removed from the synthesized data, thereby reducing the volume to the balanced level. Conversely, if n < 0, ROS is implemented: |n| instances are randomly replicated from the synthesized data, effectively increasing the minority class volume to meet the targeted level. If n equals zero, the synthesized instances are in perfect alignment with the targeted count, meaning the dataset is already balanced and requires no further adjustment.
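A minimal sketch of this balancing step (the helper is illustrative, built on Python's random module):

```python
# Sketch of the balanced-resampling step: align the synthesized minority sample
# list with the target size S via random under/oversampling.
import random

def balance(X_r, S, seed=0):
    rng = random.Random(seed)
    n = len(X_r) - S
    if n > 0:                        # too many synthesized samples: RUS
        return rng.sample(X_r, S)
    if n < 0:                        # too few: ROS, replicate |n| instances
        return X_r + rng.choices(X_r, k=-n)
    return X_r                       # already balanced, no adjustment

synthesized = [[1, 5], [1, 6], [3, 5]]
print(len(balance(synthesized, 5)))  # 5
print(len(balance(synthesized, 2)))  # 2
```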
Algorithm 1 Rule-based Data Synthesizer
Require: X: input data; d ∈ ℕ: user-defined dissimilarity threshold; ratio ∈ [0, 1]: ratio for creating new rules; S: required sample size
Ensure: oversampled data X_r
 1: rules ← RulesCreation(X)
 2: newRules ← []
 3: for i = 0 to size of rules do
 4:     r1 ← rules[i]
 5:     for j = i + 1 to size of rules do
 6:         r2 ← rules[j]
 7:         pattern ← dissimilarity(r1, r2, d)
 8:         if pattern then
 9:             rule ← createRule(r1, r2, ratio, rules)
10:             if rule then
11:                 newRules.append(rule)
12:             end if
13:         end if
14:     end for
15: end for
16: rules.append(newRules)
17: rules ← deleteRedundant(rules)
18: X_r ← samplesCreation(rules)
19: n ← NumberDifference(X_r, S)
20: if n > 0 then
21:     X_r ← randomUndersampling(X_r, n)
22: else if n < 0 then
23:     X_r ← randomOversampling(X_r, abs(n))
24: else
25:     X_r ← X_r
26: end if
27: return X_r

4. Experiment

In this section, we present the results of experiments conducted on ten distinct real-world imbalanced datasets to evaluate the effectiveness of Ad-RuLer. The performance of Ad-RuLer is benchmarked against five other resampling methods. The statistical significance of the performance differences between Ad-RuLer and the other resampling algorithms across various metrics was validated using the Wilcoxon signed-rank test. Furthermore, the Friedman test, followed by Nemenyi post hoc analysis, was employed to conduct a rank-based comparison of the resampling methods across different metrics.
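The two significance tests named above are available in SciPy; the sketch below runs them on illustrative, made-up per-dataset AUC series (the Nemenyi post hoc step requires an additional package such as scikit-posthocs and is not shown):

```python
# Sketch: Wilcoxon signed-rank test between two paired metric series, and the
# Friedman test across several methods. The AUC values are fabricated examples.
from scipy.stats import friedmanchisquare, wilcoxon

auc_adruler = [0.91, 0.88, 0.93, 0.85, 0.90, 0.87, 0.92, 0.89]
auc_smote   = [0.89, 0.86, 0.90, 0.84, 0.88, 0.86, 0.90, 0.88]
auc_adasyn  = [0.88, 0.87, 0.89, 0.83, 0.87, 0.85, 0.89, 0.87]

stat, p = wilcoxon(auc_adruler, auc_smote)
print("Wilcoxon p =", round(p, 4))   # reject the null hypothesis if p < 0.05

stat, p = friedmanchisquare(auc_adruler, auc_smote, auc_adasyn)
print("Friedman p =", round(p, 4))
```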

4.1. Datasets

Among all the datasets used in this study, the Heart Disease and Breast Cancer datasets were sourced from the Kaggle repository, while the remaining datasets were acquired from the KEEL repository. The Heart Disease dataset, originally a multi-class dataset, was transformed into a binary classification problem by selecting samples with "num" as 0 (indicating no heart disease) and "num" as 1 (indicating stage-1 heart disease). The rest of the datasets are binary classification datasets. The statistical characteristics of these datasets, including the imbalance ratio (IR), are detailed in Table 1.
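The binarization of the Heart Disease dataset described above amounts to a simple filter; a sketch with pandas on stand-in data (the column name "num" follows the paper, while the values shown are fabricated for illustration):

```python
# Sketch: keep only samples with num == 0 (no disease) or num == 1 (stage-1),
# turning the multi-class target into a binary one.
import pandas as pd

df = pd.DataFrame({"age": [63, 54, 61, 48], "num": [0, 1, 3, 1]})  # stand-in data
binary = df[df["num"].isin([0, 1])].reset_index(drop=True)
print(binary["num"].tolist())  # [0, 1, 1]
```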

4.2. Classifiers and Resampling Approaches

For a comprehensive evaluation and comparison of Ad-RuLer with other oversampling algorithms, we implemented five widely used oversampling algorithms: SMOTE, ADASYN, Tomek-links, Borderline-SMOTE, and KMeansSMOTE. We employed three classifiers: logistic regression, random forest, and XGBoost, and combined them with Ad-RuLer and the five resampling algorithms, respectively, across the ten datasets. The optimal hyperparameters for each model were determined using five-fold cross-validation on the training set. Each dataset was partitioned into 75 % for training and 25 % for testing, where each method was executed for 50 iterations to ensure robustness in performance calculation.
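A sketch of this protocol with scikit-learn, using an illustrative toy dataset and hyperparameter grid (the paper's actual grids are given in its Appendix B):

```python
# Sketch of the evaluation protocol: 75/25 split, 5-fold CV for hyperparameter
# selection on the training portion. A resampler (Ad-RuLer or a baseline) would
# be applied to the training data before fitting.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# (Resampling of X_tr, y_tr would happen here.)
grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.score(X_te, y_te), 3))
```

In the paper's setup this procedure is repeated for 50 iterations per method to stabilize the reported performance.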

4.3. Evaluation Metrics

Handling imbalanced datasets poses unique challenges, especially when distinguishing between the majority (often labeled as the negative class) and minority classes (labeled as the positive class). While accuracy is a frequently used metric, it is not always appropriate for imbalanced classification scenarios. Instead, four specific metrics, namely G-mean, F1 score, AUC, and Precision, offer a more comprehensive assessment for such cases.
The AUC calculates the area under the ROC curve, using the True Positive Rate (TPR) and the False Positive Rate (FPR)—making it a widely accepted metric for classification performance assessment. Precision, shown in Equation (1), defines the proportion of actual positive instances among all the predicted positive cases. A high Precision underscores the model’s reliability in its positive predictions. G-mean, elucidated in Equation (2), computes the geometric mean of Sensitivity and Specificity. This measurement provides a balanced perspective on the performance across both classes. The F1 score, given in Equation (5), stands as the harmonic mean of Precision and Sensitivity, encapsulating the model’s overall classification prowess.
To ensure rigorous comparisons, we employ the Wilcoxon signed-rank test and the Friedman test. The former tests for performance similarities between two algorithms, while the latter discerns significant performance variations across multiple algorithms. If the Friedman test detects significant differences, the Nemenyi post hoc analysis is used to locate these disparities precisely.
Precision = TP / (TP + FP)        (1)
where TP and FP represent true positives and false positives, respectively.
G-mean = √(Specificity × Sensitivity)        (2)
where Sensitivity and Specificity are computed according to Equations (3) and (4), respectively.
Sensitivity = TP / (TP + FN)        (3)
Specificity = TN / (TN + FP)        (4)
F1 = (2 × Precision × Sensitivity) / (Precision + Sensitivity)        (5)
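These four metrics can be computed directly from predictions; a sketch using scikit-learn for Precision, F1, and AUC, with G-mean derived from the confusion matrix per Equations (2)-(4) (the labels and scores below are illustrative):

```python
# Computing the four evaluation metrics from predictions.
from math import sqrt

from sklearn.metrics import confusion_matrix, f1_score, precision_score, roc_auc_score

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.2, 0.6, 0.3, 0.8, 0.9, 0.4, 0.7]  # positive-class probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)          # Eq. (3)
specificity = tn / (tn + fp)          # Eq. (4)
g_mean = sqrt(specificity * sensitivity)

print(round(precision_score(y_true, y_pred), 3))  # Eq. (1): 0.75
print(round(g_mean, 3))                           # Eq. (2): 0.75
print(round(f1_score(y_true, y_pred), 3))         # Eq. (5): 0.75
print(round(roc_auc_score(y_true, y_score), 3))   # 0.938
```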

4.4. Experimental Results

We conducted a comprehensive comparison of various resampling algorithms applied to ten datasets, evaluating their performance with logistic regression, random forest, and XGBoost as classifiers. The results using logistic regression are presented in Table 2. For the performance comparison results using random forest and XGBoost classifiers, please refer to Table A1 and Table A2 in Appendix A. The hyperparameters for the resampling approaches and classifiers employed in this study are shown in Table A3 in Appendix B. The experimental results highlight the superior performance of Ad-RuLer in most cases, compared to other resampling algorithms. For instance, when using logistic regression as the classifier, Ad-RuLer outperformed other resampling algorithms on all four evaluation metrics in the Breast Cancer and Car-good datasets. In the Heart Disease, Vehicle0, and Abalone9-18 datasets, apart from the Precision, the other three metrics all exceeded those of other resampling algorithms. When adopting random forest as the classifier, Ad-RuLer outperformed other resampling algorithms on all four metrics in the Heart Disease dataset. In the Vehicle0, Yeast3, Yeast-0-2-5-7-9vs3-6-8, Abalone9-18, and Car-good datasets, at least two metrics exceeded those of other resampling algorithms. In Figure 2, the synthesized minority class data samples obtained via Ad-RuLer for the Yeast3 and Heart Disease datasets were visualized together with the original data samples using the t-SNE technique [47], to give an idea of the distribution of the new data generated via Ad-RuLer.
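The t-SNE projection used for Figure 2 can be reproduced in outline with scikit-learn (the data below are synthetic stand-ins, not the paper's datasets):

```python
# Sketch: embed original and synthesized samples together with t-SNE and
# project them into 2-D for visual comparison.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
original = rng.normal(0.0, 1.0, size=(60, 8))     # stand-in original minority data
synthesized = rng.normal(0.5, 1.0, size=(40, 8))  # stand-in synthesized data

combined = np.vstack([original, synthesized])
embedding = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(combined)
print(embedding.shape)  # (100, 2)

# With matplotlib, one would scatter embedding[:60] and embedding[60:] in two colors.
```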
Figure 3 and Figure 4 illustrate the impact of varying the parameters d and ratio on the performance metrics, using the Heart Disease and Yeast3 datasets as illustrative cases. We employed logistic regression as the classifier and, holding all other parameters constant, assessed the effect of d and ratio, respectively, on each performance metric. Figure 3 reveals that on the Yeast3 dataset, both the F1 score and Precision first increase and then decrease as d is varied, while the AUC remains largely invariant. On the Heart Disease dataset, despite the relatively minor fluctuations of the performance metrics in response to changes in d, Precision peaks at d = 5 and declines as d increases further. With respect to the ratio parameter, the results on the Yeast3 dataset show that Precision, G-mean, and F1 score all reach their maximum at a ratio of 0.1, then gradually diminish and stabilize as the ratio increases. In addition, the AUC is more stable under changes in ratio than the other performance metrics. In summary, examining the trends of the performance metrics under changes in d and ratio, it can be concluded that the performance of the Ad-RuLer algorithm does not fluctuate significantly with the adjustment of these two parameters, demonstrating Ad-RuLer's robustness in terms of overall performance.
Figure 5 displays the average rankings of each algorithm across all datasets and classifiers. As expected, Ad-RuLer leads the other algorithms in the average rankings of G-mean, F1 score, and AUC, with average rankings of 2.43, 2.33, and 2.10, respectively. In the average ranking of Precision, Ad-RuLer ranks second, behind only the KmeansSMOTE algorithm, with an average ranking of 3.07. Table 3 presents the results of the Wilcoxon signed-rank test over all datasets, with Ad-RuLer as the reference algorithm and a significance level of α = 0.05. In this test, the null hypothesis states that no significant difference exists between the performance metrics of Ad-RuLer and those of the specific comparative algorithm. The results in Table 3 indicate that, with the logistic regression classifier, the p-values for the AUC metric obtained against the five other resampling algorithms are all below 0.05. This statistically significant finding leads us to reject the null hypothesis and conclude that Ad-RuLer differs significantly in AUC from all other algorithms. Similarly, we observed that Ad-RuLer exhibits statistically significant differences in G-mean with SMOTE, Tomek-links, and KMeansSMOTE and, in terms of Precision, significant differences with ADASYN and Borderline-SMOTE. With random forest as the classifier, Ad-RuLer exhibits significant differences in G-mean and F1 score compared to Borderline-SMOTE and KMeansSMOTE. With XGBoost as the classifier, Ad-RuLer shows a significant difference in G-mean relative to Borderline-SMOTE and KMeansSMOTE.
In addition to the Wilcoxon signed-rank test, we conducted the Friedman test on the average rankings of all resampling algorithms across all classifiers and datasets. The null hypothesis is that there are no significant differences in the average rankings across the different performance metrics. Our analysis revealed p-values of 0.0002 for the G-mean and 0.0012 for the F1 score, both markedly below 0.05. This led us to reject the null hypothesis for these metrics, confirming statistically significant differences among the resampling algorithms on the G-mean and F1 score. Regarding the AUC, the p-value was 0.1462, exceeding the 0.05 threshold; we therefore lack sufficient statistical evidence to claim significant differences in average rankings among the resampling algorithms for AUC, and the null hypothesis for the AUC is retained. For Precision, although Ad-RuLer ranked second among all algorithms, closely following KmeansSMOTE, the computed p-value of 0.0005, substantially below the 0.05 significance level, compels us to reject the null hypothesis, indicating that at least one algorithm deviates significantly from the others on Precision.
However, the Friedman test does not indicate which specific algorithms exhibit significant differences. Therefore, we employed the Nemenyi post hoc test for multiple pairwise comparisons. In the Nemenyi post hoc test, as shown in Figure 6, the algorithms are listed on the vertical axis, with ranking values on the horizontal axis. The average ranking of each algorithm is represented by a point extended to the left and right into a line segment representing a 95% confidence interval. Two algorithms are considered statistically significantly different when their line segments (i.e., 95% confidence intervals) do not overlap. The figure shows that Ad-RuLer exhibits a significant difference from KmeansSMOTE in terms of G-mean. Although Ad-RuLer's average ranking in G-mean, F1 score, and AUC is better than that of the other resampling algorithms, further significant differences cannot be determined from the current analysis. Given the conservative nature of the Nemenyi post hoc test, the significance of Ad-RuLer's performance relative to the other resampling algorithms remains to be conclusively established; future experiments on a broader range of datasets are therefore necessary to robustly test these relationships.
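The Nemenyi intervals in Figure 6 derive from a critical difference CD = q_α · sqrt(k(k+1)/(6N)): two algorithms differ significantly when their average ranks differ by more than CD. A small sketch follows; the q_α value used in the example is the standard α = 0.05 critical value for six compared algorithms from Demšar's tables, stated here as an assumption rather than taken from this paper:

```python
import math

def nemenyi_cd(q_alpha, k, n):
    """Critical difference for the Nemenyi post hoc test, where k is
    the number of compared algorithms and n is the number of
    dataset/classifier combinations they were ranked on."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# Example: k = 6 resampling algorithms compared over n = 30
# dataset/classifier combinations, with q_0.05 = 2.850 for k = 6.
cd = nemenyi_cd(2.850, 6, 30)
```

Because CD shrinks only with the square root of N, the test is conservative on small benchmark collections, which is consistent with the inconclusive pairwise comparisons reported above.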

5. Discussion

This study introduces a novel oversampling technique named Ad-RuLer. This method, grounded in the extraction of rules from datasets, aims to generate new data instances to enhance the classification performance of minority class instances in imbalanced datasets. Comprehensive experimental results reveal that, in comparison to prevailing oversampling algorithms, Ad-RuLer excels across various evaluation metrics, highlighting its potential application value and offering a novel approach to addressing data imbalance challenges.
To delve deeper into Ad-RuLer's performance under varied dataset imbalances, the datasets were categorized by their imbalance ratio (IR) into two groups: high imbalance (IR > 8) and low imbalance (IR ≤ 8), each comprising five datasets. Figure 7 and Figure 8 depict the average ranking results for these two groups, respectively. Ad-RuLer leads in the AUC within the low imbalance group, ranks second in both Precision and G-mean, and third in F1 score. In the high imbalance group, Ad-RuLer dominates in G-mean, AUC, and F1 score, and ranks second in Precision. These findings underscore Ad-RuLer's robust performance even on datasets with pronounced imbalances, with no deterioration as the imbalance ratio increases, further accentuating its broad applicability.
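The grouping rule can be reproduced from the sample counts in Table 1 with a one-line computation (illustrative Python; the function name is our own, and the example counts are taken from the Yeast3 and Pima rows of Table 1):

```python
def imbalance_group(n_majority, n_minority, threshold=8.0):
    """Compute the imbalance ratio IR = majority / minority counts and
    assign the dataset to the 'high' (IR > threshold) or 'low' group."""
    ir = n_majority / n_minority
    return ir, ("high" if ir > threshold else "low")

# Yeast3: 1321 majority vs. 163 minority samples -> high imbalance group.
# Pima:    500 majority vs. 268 minority samples -> low imbalance group.
```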
Additionally, when integrated with various classifiers, Ad-RuLer consistently demonstrates an exceptional predictive performance, emphasizing its reliability and adaptability. Based on the results of the Wilcoxon Signed-rank test, we observed that under the logistic regression classifier, Ad-RuLer exhibits a more significant difference compared to other algorithms, particularly in the AUC, where Ad-RuLer significantly outperforms all other methods.
Nevertheless, this research has its limitations. Ad-RuLer's performance in some specific scenarios, such as high-dimensional datasets or datasets with prominent class overlap, has not been thoroughly explored; future research will delve into these scenarios. Moreover, this study evaluated Ad-RuLer on ten datasets. To provide a more holistic assessment of its capabilities, subsequent research aims to test it across a wider array of datasets. Furthermore, there is an intent to deploy Ad-RuLer in specialized application domains, such as addressing the severe data imbalance problem in the medical domain, and to consider integrating it with other techniques, such as cost-sensitive learning, to further enhance its efficacy.

6. Conclusions

In this study, it has been demonstrated that Ad-RuLer, an approach based on inductive rule learning, can be used as an alternative option to conventional methodologies for classifying imbalanced data. As an efficient oversampling method, Ad-RuLer substantially enhances the predictive performance of the classifier when confronted with imbalanced datasets. Comparative experiments with other data resampling methods on ten real-world datasets with different imbalance ratios demonstrate the effectiveness and potential of our approach. Future research will focus on assessing the efficacy of Ad-RuLer in high-dimensional datasets and scenarios with significant class overlap, expanding the scope to address practical problems in specialized domains, such as healthcare and financial fraud, where data imbalance issues are more pronounced. Furthermore, the potential enhancement of Ad-RuLer in complex data settings will be investigated via its integration with cost-sensitive learning techniques and dimensionality reduction techniques, such as PCA and t-SNE.

Author Contributions

Conceptualization, X.Z., À.N., I.P., F.M. and E.R.; methodology, X.Z. and À.N.; software, X.Z. and I.P.; validation, X.Z. and À.N.; formal analysis, X.Z. and À.N.; investigation, X.Z., I.P. and À.N.; resources, À.N., F.M. and E.R.; data curation, X.Z.; writing—original draft preparation, X.Z.; writing—review and editing, X.Z. and À.N.; supervision, À.N., F.M. and E.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are available at https://sci2s.ugr.es/keel/imbalanced.php (accessed on 18 September 2023) and https://www.kaggle.com/datasets (accessed on 18 September 2023).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SMOTE: Synthetic Minority Oversampling Technique
ADASYN: Adaptive Synthetic Oversampling
RUS: Random Undersampling
ROS: Random Oversampling
TPR: True Positive Rate
FPR: False Positive Rate

Appendix A

Table A1. Performance comparison between different data resampling approaches using random forest.
Datasets | Algorithms | G-Mean | F1 Score | AUC | Precision
--- | --- | --- | --- | --- | ---
Heart Disease | Ad-RuLer | 0.766 | 0.717 | 0.851 | 0.714
 | SMOTE | 0.750 | 0.696 | 0.832 | 0.706
 | ADASYN | 0.750 | 0.696 | 0.830 | 0.699
 | Tomek-links | 0.746 | 0.691 | 0.829 | 0.700
 | Borderline-SMOTE | 0.745 | 0.691 | 0.831 | 0.699
 | KMeansSMOTE | 0.742 | 0.687 | 0.832 | 0.704
Breast Cancer | Ad-RuLer | 0.946 | 0.934 | 0.985 | 0.942
 | SMOTE | 0.952 | 0.941 | 0.988 | 0.947
 | ADASYN | 0.956 | 0.943 | 0.988 | 0.938
 | Tomek-links | 0.951 | 0.938 | 0.987 | 0.938
 | Borderline-SMOTE | 0.952 | 0.940 | 0.988 | 0.941
 | KMeansSMOTE | 0.948 | 0.937 | 0.987 | 0.950
Pima | Ad-RuLer | 0.745 | 0.672 | 0.830 | 0.628
 | SMOTE | 0.744 | 0.671 | 0.829 | 0.653
 | ADASYN | 0.737 | 0.661 | 0.822 | 0.638
 | Tomek-links | 0.749 | 0.676 | 0.829 | 0.654
 | Borderline-SMOTE | 0.737 | 0.662 | 0.818 | 0.633
 | KMeansSMOTE | 0.720 | 0.644 | 0.829 | 0.677
Vehicle0 | Ad-RuLer | 0.963 | 0.935 | 0.996 | 0.918
 | SMOTE | 0.963 | 0.927 | 0.994 | 0.898
 | ADASYN | 0.968 | 0.933 | 0.995 | 0.898
 | Tomek-links | 0.963 | 0.927 | 0.994 | 0.897
 | Borderline-SMOTE | 0.965 | 0.936 | 0.996 | 0.915
 | KMeansSMOTE | 0.948 | 0.915 | 0.992 | 0.904
Ecoli1 | Ad-RuLer | 0.877 | 0.789 | 0.950 | 0.746
 | SMOTE | 0.886 | 0.784 | 0.953 | 0.717
 | ADASYN | 0.895 | 0.785 | 0.950 | 0.704
 | Tomek-links | 0.890 | 0.786 | 0.954 | 0.716
 | Borderline-SMOTE | 0.889 | 0.782 | 0.944 | 0.705
 | KMeansSMOTE | 0.863 | 0.773 | 0.954 | 0.747
Yeast3 | Ad-RuLer | 0.911 | 0.782 | 0.973 | 0.713
 | SMOTE | 0.874 | 0.785 | 0.966 | 0.788
 | ADASYN | 0.883 | 0.774 | 0.964 | 0.746
 | Tomek-links | 0.875 | 0.781 | 0.966 | 0.778
 | Borderline-SMOTE | 0.884 | 0.781 | 0.962 | 0.759
 | KMeansSMOTE | 0.829 | 0.754 | 0.964 | 0.828
Yeast-0-2-5-7-9vs3-6-8 | Ad-RuLer | 0.887 | 0.784 | 0.957 | 0.760
 | SMOTE | 0.877 | 0.781 | 0.951 | 0.778
 | ADASYN | 0.884 | 0.773 | 0.947 | 0.747
 | Tomek-links | 0.880 | 0.784 | 0.949 | 0.779
 | Borderline-SMOTE | 0.864 | 0.770 | 0.944 | 0.783
 | KMeansSMOTE | 0.856 | 0.802 | 0.940 | 0.878
Abalone9-18 | Ad-RuLer | 0.606 | 0.399 | 0.811 | 0.419
 | SMOTE | 0.636 | 0.350 | 0.842 | 0.303
 | ADASYN | 0.645 | 0.350 | 0.841 | 0.296
 | Tomek-links | 0.620 | 0.331 | 0.846 | 0.287
 | Borderline-SMOTE | 0.582 | 0.335 | 0.848 | 0.335
 | KMeansSMOTE | 0.397 | 0.225 | 0.802 | 0.401
Car-good | Ad-RuLer | 0.969 | 0.652 | 0.992 | 0.490
 | SMOTE | 0.725 | 0.594 | 0.987 | 0.706
 | ADASYN | 0.736 | 0.576 | 0.985 | 0.633
 | Tomek-links | 0.732 | 0.583 | 0.987 | 0.653
 | Borderline-SMOTE | 0.732 | 0.581 | 0.986 | 0.639
 | KMeansSMOTE | 0.663 | 0.543 | 0.986 | 0.707
winequality-red-4 | Ad-RuLer | 0.216 | 0.086 | 0.747 | 0.307
 | SMOTE | 0.305 | 0.118 | 0.742 | 0.125
 | ADASYN | 0.346 | 0.141 | 0.745 | 0.151
 | Tomek-links | 0.292 | 0.120 | 0.741 | 0.129
 | Borderline-SMOTE | 0.107 | 0.049 | 0.745 | 0.091
 | KMeansSMOTE | 0.208 | 0.082 | 0.755 | 0.109
Table A2. Performance comparison between different data resampling approaches using XGBoost.
Datasets | Algorithms | G-Mean | F1 Score | AUC | Precision
--- | --- | --- | --- | --- | ---
Heart Disease | Ad-RuLer | 0.743 | 0.690 | 0.823 | 0.691
 | SMOTE | 0.729 | 0.672 | 0.810 | 0.673
 | ADASYN | 0.731 | 0.674 | 0.812 | 0.680
 | Tomek-links | 0.734 | 0.677 | 0.815 | 0.689
 | Borderline-SMOTE | 0.734 | 0.677 | 0.810 | 0.681
 | KMeansSMOTE | 0.735 | 0.678 | 0.814 | 0.687
Breast Cancer | Ad-RuLer | 0.947 | 0.936 | 0.988 | 0.946
 | SMOTE | 0.959 | 0.950 | 0.992 | 0.954
 | ADASYN | 0.959 | 0.946 | 0.992 | 0.935
 | Tomek-links | 0.954 | 0.943 | 0.991 | 0.944
 | Borderline-SMOTE | 0.952 | 0.940 | 0.991 | 0.942
 | KMeansSMOTE | 0.957 | 0.948 | 0.991 | 0.957
Pima | Ad-RuLer | 0.726 | 0.649 | 0.809 | 0.626
 | SMOTE | 0.720 | 0.641 | 0.804 | 0.623
 | ADASYN | 0.717 | 0.637 | 0.800 | 0.611
 | Tomek-links | 0.722 | 0.643 | 0.808 | 0.630
 | Borderline-SMOTE | 0.706 | 0.624 | 0.796 | 0.602
 | KMeansSMOTE | 0.715 | 0.636 | 0.808 | 0.647
Vehicle0 | Ad-RuLer | 0.955 | 0.933 | 0.993 | 0.909
 | SMOTE | 0.965 | 0.935 | 0.992 | 0.913
 | ADASYN | 0.968 | 0.935 | 0.992 | 0.904
 | Tomek-links | 0.964 | 0.934 | 0.991 | 0.913
 | Borderline-SMOTE | 0.962 | 0.933 | 0.993 | 0.914
 | KMeansSMOTE | 0.961 | 0.930 | 0.992 | 0.911
Ecoli1 | Ad-RuLer | 0.868 | 0.784 | 0.952 | 0.754
 | SMOTE | 0.876 | 0.780 | 0.958 | 0.733
 | ADASYN | 0.876 | 0.773 | 0.955 | 0.714
 | Tomek-links | 0.870 | 0.773 | 0.958 | 0.729
 | Borderline-SMOTE | 0.877 | 0.773 | 0.953 | 0.713
 | KMeansSMOTE | 0.866 | 0.786 | 0.961 | 0.781
Yeast3 | Ad-RuLer | 0.884 | 0.756 | 0.969 | 0.708
 | SMOTE | 0.853 | 0.755 | 0.960 | 0.763
 | ADASYN | 0.857 | 0.748 | 0.958 | 0.740
 | Tomek-links | 0.854 | 0.751 | 0.960 | 0.752
 | Borderline-SMOTE | 0.855 | 0.756 | 0.959 | 0.762
 | KMeansSMOTE | 0.832 | 0.743 | 0.960 | 0.783
Yeast-0-2-5-7-9vs3-6-8 | Ad-RuLer | 0.888 | 0.790 | 0.939 | 0.774
 | SMOTE | 0.884 | 0.776 | 0.938 | 0.753
 | ADASYN | 0.865 | 0.733 | 0.939 | 0.701
 | Tomek-links | 0.883 | 0.775 | 0.936 | 0.753
 | Borderline-SMOTE | 0.873 | 0.764 | 0.936 | 0.753
 | KMeansSMOTE | 0.880 | 0.806 | 0.926 | 0.826
Abalone9-18 | Ad-RuLer | 0.683 | 0.370 | 0.799 | 0.392
 | SMOTE | 0.702 | 0.393 | 0.838 | 0.322
 | ADASYN | 0.701 | 0.394 | 0.830 | 0.320
 | Tomek-links | 0.692 | 0.388 | 0.835 | 0.324
 | Borderline-SMOTE | 0.657 | 0.385 | 0.844 | 0.350
 | KMeansSMOTE | 0.562 | 0.357 | 0.801 | 0.453
Car-good | Ad-RuLer | 0.981 | 0.754 | 0.995 | 0.615
 | SMOTE | 0.874 | 0.764 | 0.993 | 0.776
 | ADASYN | 0.856 | 0.738 | 0.992 | 0.748
 | Tomek-links | 0.874 | 0.764 | 0.993 | 0.776
 | Borderline-SMOTE | 0.871 | 0.755 | 0.993 | 0.760
 | KMeansSMOTE | 0.865 | 0.732 | 0.991 | 0.725
winequality-red-4 | Ad-RuLer | 0.283 | 0.127 | 0.710 | 0.156
 | SMOTE | 0.203 | 0.071 | 0.687 | 0.071
 | ADASYN | 0.246 | 0.085 | 0.672 | 0.088
 | Tomek-links | 0.272 | 0.095 | 0.665 | 0.097
 | Borderline-SMOTE | 0.087 | 0.032 | 0.695 | 0.046
 | KMeansSMOTE | 0.189 | 0.067 | 0.700 | 0.078

Appendix B

Table A3. Hyperparameter for resampling approach and classifiers used in this study.
Heart. Resampling: Ad-RuLer (d = 3, ratio = 0.2); SMOTE (k_n = 7); ADASYN (n_n = 5); Tomek-links (none); B-SMOTE (k_n = 5, m_n = 15); KMeansSMOTE (k_n = 3, k_e = 2). Classifiers: LR (penalty = 'l2', C = 2); RF (n_estimators = 600, max_depth = 5, min_samples_split = 5, min_samples_leaf = 5); XGB (max_depth = 4, learning_rate = 0.05, n_estimators = 500, reg_lambda = 0.5).
Breast. Resampling: Ad-RuLer (d = 1, ratio = 1); SMOTE (k_n = 3); ADASYN (n_n = 8); Tomek-links (none); B-SMOTE (k_n = 10, m_n = 5); KMeansSMOTE (k_n = 3, k_e = 4). Classifiers: LR (penalty = 'l2', C = 1); RF (n_estimators = 500, max_depth = 8, min_samples_split = 5, min_samples_leaf = 3); XGB (max_depth = 15, learning_rate = 0.1, n_estimators = 500, reg_lambda = 0.1).
Pima. Resampling: Ad-RuLer (d = 3, ratio = 0.1); SMOTE (k_n = 5); ADASYN (n_n = 10); Tomek-links (none); B-SMOTE (k_n = 5, m_n = 5); KMeansSMOTE (k_n = 3, k_e = 4). Classifiers: LR (penalty = 'l2', C = 1.5); RF (n_estimators = 600, max_depth = 5, min_samples_split = 10, min_samples_leaf = 3); XGB (max_depth = 15, learning_rate = 0.05, n_estimators = 500, reg_lambda = 1).
Vehicle0. Resampling: Ad-RuLer (d = 12, ratio = 0.1); SMOTE (k_n = 5); ADASYN (n_n = 8); Tomek-links (none); B-SMOTE (k_n = 5, m_n = 5); KMeansSMOTE (k_n = 4, k_e = 2). Classifiers: LR (penalty = 'l2', C = 1.5); RF (n_estimators = 500, max_depth = 15, min_samples_split = 15, min_samples_leaf = 3); XGB (max_depth = 20, learning_rate = 0.1, n_estimators = 600, reg_lambda = 2).
Ecoli1. Resampling: Ad-RuLer (d = 4, ratio = 0.2); SMOTE (k_n = 8); ADASYN (n_n = 6); Tomek-links (none); B-SMOTE (k_n = 4, m_n = 10); KMeansSMOTE (k_n = 6, k_e = 3). Classifiers: LR (penalty = 'l2', C = 2); RF (n_estimators = 600, max_depth = 15, min_samples_split = 20, min_samples_leaf = 10); XGB (max_depth = 10, learning_rate = 0.1, n_estimators = 300, reg_lambda = 0.5).
Yeast3. Resampling: Ad-RuLer (d = 4, ratio = 0.1); SMOTE (k_n = 3); ADASYN (n_n = 6); Tomek-links (none); B-SMOTE (k_n = 3, m_n = 10); KMeansSMOTE (k_n = 6, k_e = 2). Classifiers: LR (penalty = 'l2', C = 3); RF (n_estimators = 600, max_depth = 20, min_samples_split = 15, min_samples_leaf = 3); XGB (max_depth = 20, learning_rate = 0.1, n_estimators = 300, reg_lambda = 0.1).
Yeast-0-2-5-7-9. Resampling: Ad-RuLer (d = 4, ratio = 0.2); SMOTE (k_n = 7); ADASYN (n_n = 10); Tomek-links (none); B-SMOTE (k_n = 3, m_n = 8); KMeansSMOTE (k_n = 10, k_e = 3). Classifiers: LR (penalty = 'l2', C = 2); RF (n_estimators = 600, max_depth = 25, min_samples_split = 10, min_samples_leaf = 3); XGB (max_depth = 10, learning_rate = 0.1, n_estimators = 300, reg_lambda = 2).
Abalone9. Resampling: Ad-RuLer (d = 6, ratio = 0.1); SMOTE (k_n = 5); ADASYN (n_n = 5); Tomek-links (none); B-SMOTE (k_n = 5, m_n = 8); KMeansSMOTE (k_n = 10, k_e = 4). Classifiers: LR (penalty = 'l2', C = 3); RF (n_estimators = 500, max_depth = 30, min_samples_split = 5, min_samples_leaf = 3); XGB (max_depth = 20, learning_rate = 0.1, n_estimators = 200, reg_lambda = 0.5).
Car-good. Resampling: Ad-RuLer (d = 3, ratio = 0.3); SMOTE (k_n = 10); ADASYN (n_n = 5); Tomek-links (none); B-SMOTE (k_n = 8, m_n = 3); KMeansSMOTE (k_n = 12, k_e = 4). Classifiers: LR (penalty = 'l2', C = 2); RF (n_estimators = 500, max_depth = 40, min_samples_split = 20, min_samples_leaf = 10); XGB (max_depth = 5, learning_rate = 0.01, n_estimators = 200, reg_lambda = 3).
Winequality. Resampling: Ad-RuLer (d = 6, ratio = 0.1); SMOTE (k_n = 3); ADASYN (n_n = 5); Tomek-links (none); B-SMOTE (k_n = 8, m_n = 5); KMeansSMOTE (k_n = 8, k_e = 3). Classifiers: LR (penalty = 'l2', C = 0.5); RF (n_estimators = 500, max_depth = 20, min_samples_split = 10, min_samples_leaf = 3); XGB (max_depth = 5, learning_rate = 0.01, n_estimators = 100, reg_lambda = 0.5).
Abbreviations for datasets: Abalone9, Abalone9-18; Winequality, Winequality-red-4. Abbreviations for classifier: LR, Logistic Regression; RF, Random Forest; XGB, XGBoost. Abbreviations for parameter: k_n, k_neighbors; n_n, n_neighbors; m_n, m_neighbors; k_e, kmeans_estimator.

References

  1. Gupta, A.; Lohani, M.; Manchanda, M. Financial fraud detection using naive bayes algorithm in highly imbalance data set. J. Discret. Math. Sci. Cryptogr. 2021, 24, 1559–1572. [Google Scholar] [CrossRef]
  2. Gu, Q.; Cai, Z.; Zhu, L.; Huang, B. Data mining on imbalanced data sets. In Proceedings of the 2008 International Conference on Advanced Computer Theory and Engineering, Phuket, Thailand, 20–22 December 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 1020–1024. [Google Scholar]
  3. Jiang, Z.; Pan, T.; Zhang, C.; Yang, J. A new oversampling method based on the classification contribution degree. Symmetry 2021, 13, 194. [Google Scholar] [CrossRef]
  4. Gonzalez-Cuautle, D.; Hernandez-Suarez, A.; Sanchez-Perez, G.; Toscano-Medina, L.K.; Portillo-Portillo, J.; Olivares-Mercado, J.; Perez-Meana, H.M.; Sandoval-Orozco, A.L. Synthetic minority oversampling technique for optimizing classification tasks in botnet and intrusion-detection-system datasets. Appl. Sci. 2020, 10, 794. [Google Scholar] [CrossRef]
  5. Liu, J.; Gao, Y.; Hu, F. A fast network intrusion detection system using adaptive synthetic oversampling and LightGBM. Comput. Secur. 2021, 106, 102289. [Google Scholar] [CrossRef]
  6. Guzmán-Ponce, A.; Valdovinos, R.M.; Sánchez, J.S.; Marcial-Romero, J.R. A new under-sampling method to face class overlap and imbalance. Appl. Sci. 2020, 10, 5164. [Google Scholar] [CrossRef]
  7. Dai, Q.; Liu, J.w.; Liu, Y. Multi-granularity relabeled under-sampling algorithm for imbalanced data. Appl. Soft Comput. 2022, 124, 109083. [Google Scholar] [CrossRef]
  8. Aridas, C.K.; Karlos, S.; Kanas, V.G.; Fazakis, N.; Kotsiantis, S.B. Uncertainty based under-sampling for learning naive bayes classifiers under imbalanced data sets. IEEE Access 2019, 8, 2122–2133. [Google Scholar] [CrossRef]
  9. Jiang, K.; Wang, W.; Wang, A.; Wu, H. Network intrusion detection combined hybrid sampling with deep hierarchical network. IEEE Access 2020, 8, 32464–32476. [Google Scholar] [CrossRef]
  10. Xu, Z.; Shen, D.; Nie, T.; Kou, Y. A hybrid sampling algorithm combining M-SMOTE and ENN based on Random forest for medical imbalanced data. J. Biomed. Informatics 2020, 107, 103465. [Google Scholar] [CrossRef]
  11. Sowah, R.A.; Kuditchar, B.; Mills, G.A.; Acakpovi, A.; Twum, R.A.; Buah, G.; Agboyi, R. HCBST: An efficient hybrid sampling technique for class imbalance problems. ACM Trans. Knowl. Discov. Data (TKDD) 2021, 16, 1–37. [Google Scholar] [CrossRef]
  12. Zhu, T.; Lin, Y.; Liu, Y. Improving interpolation-based oversampling for imbalanced data learning. Knowl.-Based Syst. 2020, 187, 104826. [Google Scholar] [CrossRef]
  13. Alkan, O.; Wei, D.; Mattetti, M.; Nair, R.; Daly, E.; Saha, D. FROTE: Feedback rule-driven oversampling for editing models. Proc. Mach. Learn. Syst. 2022, 4, 276–301. [Google Scholar]
  14. Paz, I. On-the-Fly Synthesizer Programming with Rule Learning. Ph.D. Thesis, Universitat Politècnica de Catalunya—BarcelonaTech, Catalonia, Spain, 2021. [Google Scholar]
  15. Islam, A.; Belhaouari, S.B.; Rehman, A.U.; Bensmail, H. KNNOR: An oversampling technique for imbalanced datasets. Appl. Soft Comput. 2022, 115, 108288. [Google Scholar] [CrossRef]
  16. Zhu, T.; Lin, Y.; Liu, Y. Synthetic minority oversampling technique for multiclass imbalance problems. Pattern Recognit. 2017, 72, 327–340. [Google Scholar] [CrossRef]
  17. Vuttipittayamongkol, P.; Elyan, E. Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inf. Sci. 2020, 509, 47–70. [Google Scholar] [CrossRef]
  18. Li, F.; Zhang, X.; Zhang, X.; Du, C.; Xu, Y.; Tian, Y.C. Cost-sensitive and hybrid-attribute measure multi-decision tree over imbalanced data sets. Inf. Sci. 2018, 422, 242–256. [Google Scholar] [CrossRef]
  19. Zhang, C.; Tan, K.C.; Li, H.; Hong, G.S. A cost-sensitive deep belief network for imbalanced classification. IEEE Trans. Neural Networks Learn. Syst. 2018, 30, 109–122. [Google Scholar] [CrossRef]
  20. Peng, P.; Zhang, W.; Zhang, Y.; Xu, Y.; Wang, H.; Zhang, H. Cost sensitive active learning using bidirectional gated recurrent neural networks for imbalanced fault diagnosis. Neurocomputing 2020, 407, 232–245. [Google Scholar] [CrossRef]
  21. Leevy, J.L.; Khoshgoftaar, T.M.; Bauder, R.A.; Seliya, N. A survey on addressing high-class imbalance in big data. J. Big Data 2018, 5, 1–30. [Google Scholar] [CrossRef]
  22. Kubat, M.; Matwin, S. Addressing the curse of imbalanced training sets: One-sided selection. Proc. ICML Citeseer 1997, 97, 179. [Google Scholar]
  23. Yen, S.J.; Lee, Y.S. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst. Appl. 2009, 36, 5718–5727. [Google Scholar] [CrossRef]
  24. Lin, W.C.; Tsai, C.F.; Hu, Y.H.; Jhang, J.S. Clustering-based undersampling in class-imbalanced data. Inf. Sci. 2017, 409, 17–26. [Google Scholar] [CrossRef]
  25. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  26. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the Advances in Intelligent Computing: International Conference on Intelligent Computing, ICIC 2005, Hefei, China, 23–26 August 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887. [Google Scholar]
  27. Barua, S.; Islam, M.M.; Yao, X.; Murase, K. MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 2012, 26, 405–425. [Google Scholar] [CrossRef]
  28. Zhang, X.; Ma, D.; Gan, L.; Jiang, S.; Agam, G. Cgmos: Certainty guided minority oversampling. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, 24–28 October 2016; pp. 1623–1631. [Google Scholar]
  29. Douzas, G.; Bacao, F.; Last, F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf. Sci. 2018, 465, 1–20. [Google Scholar] [CrossRef]
  30. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 1322–1328. [Google Scholar]
  31. Maldonado, S.; López, J.; Vairetti, C. An alternative SMOTE oversampling strategy for high-dimensional datasets. Appl. Soft Comput. 2019, 76, 380–389. [Google Scholar] [CrossRef]
  32. Li, M.; Xiong, A.; Wang, L.; Deng, S.; Ye, J. ACO Resampling: Enhancing the performance of oversampling methods for class imbalance classification. Knowl.-Based Syst. 2020, 196, 105818. [Google Scholar] [CrossRef]
  33. Mirzaei, B.; Nikpour, B.; Nezamabadi-pour, H. CDBH: A clustering and density-based hybrid approach for imbalanced data classification. Expert Syst. Appl. 2021, 164, 114035. [Google Scholar] [CrossRef]
  34. Ai-jun, L.; Peng, Z. Research on Unbalanced Data Processing Algorithm Base Tomeklinks-Smote. In Proceedings of the 2020 3rd International Conference on Artificial Intelligence and Pattern Recognition, Xiamen China, 26–28 June 2020; pp. 13–17. [Google Scholar]
  35. Cui, Y.; Jia, M.; Lin, T.Y.; Song, Y.; Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9268–9277. [Google Scholar]
  36. Mullick, S.S.; Datta, S.; Das, S. Generative adversarial minority oversampling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1695–1704. [Google Scholar]
  37. Kim, J.; Jeong, J.; Shin, J. M2m: Imbalanced classification via major-to-minor translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13896–13905. [Google Scholar]
  38. Cui, J.; Zhong, Z.; Liu, S.; Yu, B.; Jia, J. Parametric contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 715–724. [Google Scholar]
  39. Wang, P.; Han, K.; Wei, X.S.; Zhang, L.; Wang, L. Contrastive learning based hybrid networks for long-tailed image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 943–952. [Google Scholar]
  40. Shi, M.; Tang, Y.; Zhu, X.; Wilson, D.; Liu, J. Multi-class imbalanced graph convolutional network learning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), Yokohama, Japan, 11–17 July 2020. [Google Scholar]
  41. Zhao, T.; Zhang, X.; Wang, S. Graphsmote: Imbalanced node classification on graphs with graph neural networks. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Virtual, 8–12 March 2021; pp. 833–841. [Google Scholar]
  42. Qu, L.; Zhu, H.; Zheng, R.; Shi, Y.; Yin, H. Imgagn: Imbalanced network embedding via generative adversarial graph networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual, 14–18 August 2021; pp. 1390–1398. [Google Scholar]
  43. Huynh, T.; Nibali, A.; He, Z. Semi-supervised learning for medical image classification using imbalanced training data. Comput. Methods Programs Biomed. 2022, 216, 106628. [Google Scholar] [CrossRef]
  44. Hyun, M.; Jeong, J.; Kwak, N. Class-imbalanced semi-supervised learning. arXiv 2020, arXiv:2002.06815. [Google Scholar]
  45. Liu, G.; Yang, Y.; Li, B. Fuzzy rule-based oversampling technique for imbalanced and incomplete data learning. Knowl.-Based Syst. 2018, 158, 154–174. [Google Scholar] [CrossRef]
  46. Paz, I.; Nebot, À.; Mugica, F.; Romero, E. On-The-Fly Syntheziser Programming with Fuzzy Rule Learning. Entropy 2020, 22, 969. [Google Scholar] [CrossRef] [PubMed]
  47. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. The flowchart of Ad-RuLer for imbalanced data processing.
Figure 2. The synthetic data generated via Ad-RuLer for the Yeast3 dataset (a) and Heart Disease dataset (b).
Figure 3. Impact of variations in the d parameter on the performance metrics of the Ad-RuLer algorithm.
Figure 4. Impact of variations in the ratio parameter on the performance metrics of the Ad-RuLer algorithm.
Figure 5. Average ranking of resampling algorithms for all three classifiers on all datasets. (a) Average ranking for G-mean. (b) Average ranking for F1 score. (c) Average ranking for AUC. (d) Average ranking for Precision.
Figure 6. Critical differences between different resampling algorithms on all datasets using Nemenyi post hoc test. (a) Critical differences calculation for G-mean. (b) Critical differences calculation for F1 score. (c) Critical differences calculation for AUC. (d) Critical differences calculation for Precision.
Figure 7. Average rankings of resampling methods in low IR dataset group.
Figure 8. Average rankings of resampling methods in high IR dataset group.
Table 1. Details of the dataset used in the study.
| Num | Dataset     | S    | S_min | S_maj | F  | IR    |
|-----|-------------|------|-------|-------|----|-------|
| 1   | Heart       | 676  | 265   | 411   | 15 | 1.55  |
| 2   | Breast      | 569  | 212   | 357   | 31 | 1.68  |
| 3   | Pima        | 768  | 268   | 500   | 8  | 1.87  |
| 4   | Vehicle0    | 846  | 199   | 647   | 18 | 3.25  |
| 5   | Ecoli1      | 336  | 77    | 259   | 7  | 3.36  |
| 6   | Yeast3      | 1484 | 163   | 1321  | 8  | 8.10  |
| 7   | Yeast-0-2   | 1004 | 99    | 905   | 8  | 9.14  |
| 8   | Abalone9    | 731  | 42    | 689   | 8  | 16.40 |
| 9   | Car-good    | 1728 | 69    | 1659  | 6  | 24.04 |
| 10  | Winequality | 1599 | 53    | 1546  | 11 | 29.17 |
Title Abbreviations: S, Total samples; S min , Minority class samples; S maj , Majority class samples; F, Number of features; IR, Imbalance Ratio. Abbreviations for datasets: Heart, Heart Disease; Breast, Breast Cancer; Yeast-0-2, Yeast-0-2-5-7-9vs3-6-8; Abalone9, Abalone9-18; Winequality, winequality-red-4.
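The imbalance ratio in Table 1 is simply the majority-to-minority sample count ratio; for example, Abalone9-18 has 689 majority and 42 minority samples, giving IR = 689/42 ≈ 16.40. A one-line helper for reference:

```python
def imbalance_ratio(s_min, s_maj):
    """IR = majority class size / minority class size, as in Table 1."""
    return s_maj / s_min

# Abalone9-18 from Table 1: 42 minority vs. 689 majority samples
ir = imbalance_ratio(42, 689)  # ~16.40
```

The datasets in Table 1 span IR values from 1.55 (Heart) to 29.17 (winequality-red-4), which is what allows the paper to compare resampling methods across low- and high-imbalance regimes.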
Table 2. Performance comparison between different data resampling approaches using logistic regression.
| Dataset                | Algorithm        | G-Mean | F1 Score | AUC   | Precision |
|------------------------|------------------|--------|----------|-------|-----------|
| Heart Disease          | Ad-RuLer         | 0.769  | 0.721    | 0.841 | 0.708     |
|                        | SMOTE            | 0.756  | 0.703    | 0.824 | 0.693     |
|                        | ADASYN           | 0.761  | 0.709    | 0.823 | 0.690     |
|                        | Tomek-links      | 0.753  | 0.700    | 0.823 | 0.693     |
|                        | Borderline-SMOTE | 0.758  | 0.706    | 0.821 | 0.688     |
|                        | KMeansSMOTE      | 0.748  | 0.694    | 0.821 | 0.719     |
| Breast Cancer          | Ad-RuLer         | 0.970  | 0.963    | 0.993 | 0.968     |
|                        | SMOTE            | 0.936  | 0.919    | 0.988 | 0.918     |
|                        | ADASYN           | 0.936  | 0.913    | 0.986 | 0.879     |
|                        | Tomek-links      | 0.932  | 0.913    | 0.985 | 0.909     |
|                        | Borderline-SMOTE | 0.933  | 0.910    | 0.985 | 0.874     |
|                        | KMeansSMOTE      | 0.931  | 0.915    | 0.986 | 0.925     |
| Pima                   | Ad-RuLer         | 0.733  | 0.659    | 0.827 | 0.632     |
|                        | SMOTE            | 0.729  | 0.652    | 0.817 | 0.605     |
|                        | ADASYN           | 0.726  | 0.651    | 0.813 | 0.586     |
|                        | Tomek-links      | 0.736  | 0.660    | 0.818 | 0.612     |
|                        | Borderline-SMOTE | 0.733  | 0.660    | 0.812 | 0.585     |
|                        | KMeansSMOTE      | 0.719  | 0.642    | 0.826 | 0.662     |
| Vehicle0               | Ad-RuLer         | 0.977  | 0.941    | 0.996 | 0.897     |
|                        | SMOTE            | 0.960  | 0.931    | 0.993 | 0.915     |
|                        | ADASYN           | 0.958  | 0.922    | 0.994 | 0.894     |
|                        | Tomek-links      | 0.959  | 0.929    | 0.993 | 0.913     |
|                        | Borderline-SMOTE | 0.959  | 0.927    | 0.994 | 0.905     |
|                        | KMeansSMOTE      | 0.959  | 0.929    | 0.993 | 0.913     |
| Ecoli1                 | Ad-RuLer         | 0.864  | 0.749    | 0.946 | 0.667     |
|                        | SMOTE            | 0.860  | 0.731    | 0.946 | 0.638     |
|                        | ADASYN           | 0.877  | 0.739    | 0.943 | 0.622     |
|                        | Tomek-links      | 0.861  | 0.732    | 0.946 | 0.637     |
|                        | Borderline-SMOTE | 0.891  | 0.759    | 0.940 | 0.642     |
|                        | KMeansSMOTE      | 0.846  | 0.728    | 0.944 | 0.669     |
| Yeast3                 | Ad-RuLer         | 0.890  | 0.685    | 0.963 | 0.570     |
|                        | SMOTE            | 0.885  | 0.685    | 0.964 | 0.575     |
|                        | ADASYN           | 0.907  | 0.655    | 0.965 | 0.509     |
|                        | Tomek-links      | 0.885  | 0.685    | 0.964 | 0.576     |
|                        | Borderline-SMOTE | 0.906  | 0.650    | 0.964 | 0.502     |
|                        | KMeansSMOTE      | 0.860  | 0.717    | 0.952 | 0.669     |
| Yeast-0-2-5-7-9vs3-6-8 | Ad-RuLer         | 0.882  | 0.631    | 0.929 | 0.500     |
|                        | SMOTE            | 0.882  | 0.661    | 0.923 | 0.535     |
|                        | ADASYN           | 0.855  | 0.512    | 0.922 | 0.365     |
|                        | Tomek-links      | 0.876  | 0.663    | 0.923 | 0.538     |
|                        | Borderline-SMOTE | 0.864  | 0.562    | 0.916 | 0.430     |
|                        | KMeansSMOTE      | 0.596  | 0.500    | 0.920 | 0.878     |
| Abalone9-18            | Ad-RuLer         | 0.834  | 0.425    | 0.937 | 0.291     |
|                        | SMOTE            | 0.792  | 0.355    | 0.884 | 0.238     |
|                        | ADASYN           | 0.794  | 0.348    | 0.884 | 0.231     |
|                        | Tomek-links      | 0.791  | 0.352    | 0.883 | 0.235     |
|                        | Borderline-SMOTE | 0.723  | 0.309    | 0.851 | 0.211     |
|                        | KMeansSMOTE      | 0.562  | 0.357    | 0.801 | 0.453     |
| Car-good               | Ad-RuLer         | 0.970  | 0.581    | 0.978 | 0.410     |
|                        | SMOTE            | 0.968  | 0.561    | 0.976 | 0.393     |
|                        | ADASYN           | 0.967  | 0.556    | 0.977 | 0.389     |
|                        | Tomek-links      | 0.968  | 0.561    | 0.974 | 0.393     |
|                        | Borderline-SMOTE | 0.968  | 0.559    | 0.978 | 0.391     |
|                        | KMeansSMOTE      | 0.941  | 0.558    | 0.974 | 0.410     |
| winequality-red-4      | Ad-RuLer         | 0.646  | 0.121    | 0.705 | 0.068     |
|                        | SMOTE            | 0.656  | 0.125    | 0.695 | 0.070     |
|                        | ADASYN           | 0.658  | 0.124    | 0.697 | 0.070     |
|                        | Tomek-links      | 0.651  | 0.124    | 0.705 | 0.069     |
|                        | Borderline-SMOTE | 0.646  | 0.158    | 0.711 | 0.094     |
|                        | KMeansSMOTE      | 0.642  | 0.136    | 0.703 | 0.078     |
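All four metrics in Table 2 can be derived from a binary confusion matrix with the minority class treated as positive (AUC additionally requires the predicted scores, so it is omitted here). The sketch below uses the standard textbook definitions; in practice, libraries such as scikit-learn provide equivalent functions:

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """G-mean, F1, and precision from a binary confusion matrix,
    with the minority class as the positive class."""
    sensitivity = tp / (tp + fn)   # recall on the minority class
    specificity = tn / (tn + fp)   # recall on the majority class
    precision = tp / (tp + fp)
    g_mean = math.sqrt(sensitivity * specificity)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return g_mean, f1, precision

# Hypothetical confusion matrix: 40 TP, 10 FN, 20 FP, 130 TN
g, f1, prec = imbalance_metrics(tp=40, fn=10, fp=20, tn=130)
```

G-mean is the preferred headline metric here because it penalizes a classifier that sacrifices minority recall for majority accuracy, which is exactly the failure mode imbalanced data induces.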
Table 3. Results of the Wilcoxon signed-rank test performed on all datasets, with Ad-RuLer serving as the control algorithm. A p-value below 0.05 for a particular method indicates a significant performance difference between that method and Ad-RuLer.
| Classifier | Metric | SMO    | ADA    | T-L    | B-SMO  | K-SMO  |
|------------|--------|--------|--------|--------|--------|--------|
| LR         | GM     | 0.0380 | 0.2324 | 0.0371 | 0.2070 | 0.0020 |
|            | F1     | 0.0856 | 0.0039 | 0.1386 | 0.0840 | 0.0644 |
|            | AUC    | 0.0108 | 0.0010 | 0.0173 | 0.0328 | 0.0020 |
|            | Prec   | 0.2754 | 0.0039 | 0.2324 | 0.0273 | 0.0506 |
| RF         | GM     | 0.6784 | 0.9219 | 0.7671 | 0.0371 | 0.0039 |
|            | F1     | 0.2754 | 0.1601 | 0.3139 | 0.0195 | 0.0195 |
|            | AUC    | 0.3223 | 0.2135 | 0.3223 | 0.1727 | 0.1309 |
|            | Prec   | 0.9219 | 0.3223 | 0.7695 | 0.7695 | 0.5566 |
| XGB        | GM     | 0.4316 | 0.2754 | 0.3222 | 0.0371 | 0.0371 |
|            | F1     | 0.6250 | 0.1309 | 0.4316 | 0.1614 | 0.1602 |
|            | AUC    | 0.4922 | 0.3429 | 0.4922 | 0.3139 | 0.1933 |
|            | Prec   | 0.6250 | 0.2324 | 0.7695 | 0.4922 | 0.1055 |
Approach abbreviations: SMO, SMOTE; ADA, ADASYN; T-L, Tomek-links; B-SMO, Borderline-SMOTE; K-SMO, KMeansSMOTE; LR, Logistic Regression; RF, Random Forest; XGB, XGBoost; GM, G-mean; F1, F1 score; Prec, Precision.
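Interpreting Table 3 reduces to comparing each p-value against α = 0.05: for logistic regression on G-mean, for instance, SMOTE, Tomek-links, and KMeansSMOTE all differ significantly from the Ad-RuLer control. A small helper illustrating that decision rule (the p-values below are the LR/GM row of Table 3):

```python
def significant_vs_control(p_values, alpha=0.05):
    """Return the methods whose Wilcoxon signed-rank p-value against the
    control algorithm (Ad-RuLer) falls below the significance threshold."""
    return sorted(method for method, p in p_values.items() if p < alpha)

# Logistic regression, G-mean row of Table 3
lr_gmean = {"SMO": 0.0380, "ADA": 0.2324, "T-L": 0.0371,
            "B-SMO": 0.2070, "K-SMO": 0.0020}
flagged = significant_vs_control(lr_gmean)
```

The p-values themselves come from `scipy.stats.wilcoxon` or an equivalent paired, non-parametric test applied to the per-dataset scores of Ad-RuLer versus each competitor.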
Zhang, X.; Paz, I.; Nebot, À.; Mugica, F.; Romero, E. Ad-RuLer: A Novel Rule-Driven Data Synthesis Technique for Imbalanced Classification. Appl. Sci. 2023, 13, 12636. https://doi.org/10.3390/app132312636