Article

Ad-RuLer: A Novel Rule-Driven Data Synthesis Technique for Imbalanced Classification

Soft Computing Research Group at Intelligent Data Science, Artificial Intelligence Research Center, Universitat Politècnica de Catalunya, 08003 Barcelona, Spain
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(23), 12636; https://doi.org/10.3390/app132312636
Submission received: 5 October 2023 / Revised: 7 November 2023 / Accepted: 21 November 2023 / Published: 23 November 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract
When classifiers face imbalanced class distributions, they often misclassify minority class samples, consequently diminishing the predictive performance of machine learning models. Existing oversampling techniques predominantly rely on the selection of neighboring data via interpolation, with less emphasis on uncovering the intrinsic patterns and relationships within the data. In this research, we demonstrate the usefulness of an algorithm named RuLer for the problem of classification with imbalanced data. RuLer is a learning algorithm initially designed to recognize new sound patterns within the context of the performative artistic practice known as live coding. This paper demonstrates that this algorithm, once adapted (Ad-RuLer), has great potential to address the problem of oversampling imbalanced data. An extensive comparison with other mainstream resampling algorithms (SMOTE, ADASYN, Tomek-links, Borderline-SMOTE, and KMeansSMOTE), using different classifiers (logistic regression, random forest, and XGBoost), is performed on several real-world datasets with different degrees of data imbalance. The experimental results indicate that Ad-RuLer serves as an effective oversampling technique with extensive applicability.

1. Introduction

In recent years, the explosive growth of big data, amplified by significant strides in artificial intelligence, has led industries, from finance and healthcare to online platforms, to produce unprecedented amounts of data. As we navigate this data-rich environment, the challenge of imbalanced class distribution has emerged prominently. This issue, characterized by certain classes being notably underrepresented compared to others, negatively impacts the accuracy of predictive models.
Traditional machine learning approaches were conceived, assuming a roughly balanced data distribution. However, real-world datasets frequently deviate from this assumption. In some domains, achieving precise classification of minority classes is critical, given the steep repercussions associated with misclassification [1,2]. For instance, in the healthcare sector, the number of patients diagnosed with specific diseases (positive samples) is often significantly lower than those without such conditions (negative samples). Consequently, standard classifiers, which tend to favor the majority class, are prone to misclassification. Such inaccuracies may lead to critical delays in medical treatments, jeopardizing patient health. Similarly, in finance, only a small percentage of customers might have poor credit ratings, but misclassifying them can expose institutions to significant risks. Beyond the algorithmic challenges, these imbalances have tangible adverse effects in areas like financial fraud detection, intrusion detection, and product quality assessments.
To address class imbalances, researchers have developed data resampling techniques, including oversampling [3,4,5], undersampling [6,7,8], and hybrid sampling [9,10,11]. In recent years, the oversampling technique has attracted widespread attention within the academic community due to its ability to amplify underrepresented classes, enhancing the classifier’s sensitivity towards minority instances. This technique proves particularly advantageous in scenarios where each instance of the minority class is crucial, such as rare disease diagnosis or financial fraud detection.
Current oversampling methods primarily rely on interpolation techniques, aiming to achieve class balance by synthesizing samples for the minority class. However, these approaches also have some limitations, including over-constraint, low-efficiency expansion, and over-generalization [12]. Despite these challenges, there is still a relative dearth of research on alternative oversampling strategies, with rule-based oversampling algorithms, in particular, being less explored.
Recent studies have demonstrated the advantages inherent in rule-based oversampling methods, which are characterized by their robust adaptability, consistent performance, and superior interpretability [13]. Such attributes indicate that rule-based strategies have substantial potential for addressing issues of data imbalance. An in-depth exploration of these methods stands to not only enrich the existing research of oversampling techniques, but also potentially offer new insights into fields such as imbalanced data classification and predictive modeling.
In this study, we present Ad-RuLer, an innovative adaptive rule-based oversampling approach. Distinct from the prevalent interpolation-based data resampling methods, Ad-RuLer employs an iterative comparison mechanism to rapidly extract intrinsic rules from datasets, which are then used to synthesize new instances for the minority class. This algorithm builds on the principles of RuLer, an algorithm originally designed for detecting novel sound patterns in the field of live-coding performance art [14]. In this study, we apply RuLer for the first time to the problem of data oversampling, addressing the challenge of imbalanced data classification. Ad-RuLer has been benchmarked against several mainstream data resampling techniques on ten diverse imbalanced datasets. Our findings highlight its superior performance and notable application potential. These findings not only validate the effectiveness of the Ad-RuLer method in addressing issues of data imbalance, but also contribute a novel perspective to the methodologies for handling imbalanced data.
To summarize, the main contributions of this research are as follows:
  • To tackle the challenge of data imbalance, we propose Ad-RuLer, an innovative adaptive rule-based oversampling algorithm. This method extracts intrinsic rules from minority class samples and, based on these rules, synthesizes additional minority class instances.
  • To ensure that the synthesized data maintains complete balance, a random sampling step is incorporated into the rule-based synthesizer to keep the number of minority class samples generated via rule extraction aligned with the majority class samples in the original dataset.
  • Ad-RuLer is extensively evaluated using various machine learning models and real-world datasets, demonstrating its effectiveness and good performance in comparison to existing mainstream data resampling methods. Our research findings suggest that Ad-RuLer may serve as an alternative option to conventional oversampling techniques.
The remainder of this paper is structured as follows: Section 2 introduces the relevant literature; Section 3 delves into the intricacies of Ad-RuLer; Section 4 presents the chosen datasets and our comparative analysis; and in Section 5, we discuss the experimental results and outline future work, while in Section 6, we draw our conclusions.

2. Related Work

In many application scenarios, the goal of imbalance learning is to optimize the classification performance of minority class samples and the overall classification performance. Research to address this class imbalance problem can be summarized into three methods: data-based [15,16,17], model-based [18,19,20], and hybrid approaches [21]. Model-based methods focus on algorithmic improvements to the imbalance problem, with cost-sensitive learning being a prime example. This strategy assigns different weights to the samples within a dataset, particularly in situations of class imbalance where minority class samples are prone to misclassification. Therefore, by constructing an appropriate cost-sensitive matrix to adjust the misclassification cost of samples from different classes, it provides an alternative way to balance the dataset. On the other hand, data-based methods primarily construct a new balanced dataset by altering the original data distribution. These methods can be further categorized into oversampling, undersampling, and hybrid sampling. Oversampling aims to increase the minority class samples in the original dataset by creating additional new samples. Undersampling, however, reduces the sample size of the original dataset by eliminating redundant majority class samples. Hybrid sampling combines the strategies of undersampling and oversampling to achieve a balanced treatment of the original dataset.
In the realm of undersampling techniques, the random undersampling (RUS) algorithm emerges as a foundational approach. It aims to achieve class balance by randomly discarding a portion of the majority class samples. However, the stochastic nature of RUS can inadvertently omit crucial characteristics embedded within the samples. Tomek-link pairs were introduced to address this limitation, as highlighted by [22]. This method advocates the removal of majority class samples present within Tomek-link pairs, building on the premise that such samples might represent noise or belong to boundary regions. Yet, the presence of Tomek-link pairs in datasets can sometimes limit the improvements offered by this technique in terms of classification accuracy. Branching into more advanced strategies, Yen and Lee [23] proposed an undersampling technique rooted in the principles of clustering and inter-sample distances, while factoring in the disparity in class distribution. With clustering as its cornerstone, it guides the removal of select majority class samples to balance the dataset. In a parallel vein, Lin et al. [24] employed clustering to identify majority class samples, subsequently culling their quantity by opting for cluster centroids. Nevertheless, while undersampling approaches are adept at countering class imbalance, the removal of majority class samples might inadvertently lead to the loss of critical information from the dataset. This issue becomes more pronounced in scenarios with already limited sample sizes.
Oversampling methods, as the mainstream solution for imbalanced classification, are versatile and compatible with various machine learning classifiers. Chawla et al. [25] introduced the Synthetic Minority Oversampling Technique (SMOTE). SMOTE synthesizes minority class samples via interpolation between a data point and its neighbors, achieving sample balance. It overcomes the issue of duplicating sample data in random oversampling (ROS) algorithms, enabling the creation of diverse new samples based on linear interpolation. However, SMOTE’s random selection and generation of sample points do not fully consider the internal data structure, potentially leading to misclassification. To address this, Han et al. [26] proposed the Borderline-SMOTE algorithm, a boundary-based method that combines SMOTE with boundary information. To identify boundary samples more accurately, Barua et al. [27] introduced an optimized oversampling method, the Majority Weighted Minority Over-sampling Technique (MWMOTE). This algorithm takes into account the neighboring information of both majority and minority class samples, bolstering boundary identification via reciprocal validation of neighbor conditions. Subsequently, the identified boundary samples of minority classes are clustered, and interpolation is performed within each cluster to generate new minority class samples. Expanding on this concept, Zhang et al. [28] proposed the Certainty Guided Minority OverSampling (CGMOS) algorithm, which places a particular emphasis on minority class samples. The algorithm determines the weight of each minority class sample based on relative certainty. These weights then inform the probability of selecting a seed sample for new sample synthesis. The next step involves generating new minority class samples via interpolation, guided by the calculated probabilities. Douzas et al. [29] suggested an oversampling method combining k-means clustering with SMOTE to perform space division, effectively avoiding noise generation and addressing both inter-class and intra-class imbalances. He et al. [30] proposed the ADASYN algorithm, an adaptive synthetic sampling method that dynamically determines the number of synthetic samples based on the sample distribution, synthesizing more minority class samples in hard-to-classify regions to achieve data balance.
In recent years, some new approaches to imbalanced learning have emerged in the literature [31,32]. A novel hybrid sampling method was proposed by Mirzaei et al. [33]. This method employs k-means clustering to categorize samples, followed by calculating the density of each sub-cluster. By selectively oversampling within less populated minority class sub-clusters and simultaneously undersampling from denser majority class segments, this method adeptly enhances the representation of minority samples while trimming superfluous majority class data, optimizing the overall classification results. Ai-jun and Peng [34] presented a method that integrates ROS, k-means clustering, and a support vector machine hybrid model. This approach first oversamples the minority class samples randomly and then uses k-means clustering to identify the majority class samples closer to the boundary. The advantage of this strategy is that it decreases the risk of accidentally deleting important samples from the majority class, which enhances the efficacy of sample selection.
Deep learning has increasingly become a focal point in research on imbalanced data classification in the past few years. Inspired by the random covering problem, Cui et al. [35] modeled the data sampling process by associating each sample with a small neighboring region, dynamically guiding the sampling through the computation of an effective sample size. Mullick et al. [36] utilized generative models to create new samples for minority classes using convex combinations of existing examples, while Kim et al. [37] sought to synthesize minority class samples by introducing learnable noise to majority class samples. Based on contrastive learning, some research aims to enhance the precision of models in classifying minority classes by improving the capability of representation learning [38,39]. In the realm of graph deep learning for imbalanced classification, Shi et al. [40] proposed a Dual-Regularized Graph Convolutional Network (DRGCN), which employs conditional adversarial training and distribution alignment training to differentiate nodes of various classes and balance the learning between majority and minority categories. The GraphSMOTE framework, introduced by Zhao et al. [41], encodes the similarity between nodes within an embedding space and synthesizes new nodes and edges to create a balanced graph for node classification. Qu et al. [42] introduced the ImGAGN model, generating nodes to emulate the attribute distribution and network topology of minority classes, which then facilitates the training of a Graph Convolutional Network (GCN) discriminator on a synthetic balanced network to distinguish between real and synthetic nodes, as well as between minority and majority nodes.
In the context of semi-supervised learning with imbalanced data, a novel approach for semi-supervised medical image classification, named Adaptive Blended Consistency Loss (ABCL), has been proposed, effectively addressing class imbalance by adaptively blending the target class distribution according to the class frequency obtained by Huynh et al. [43]. Hyun et al. [44] tackled the challenge of class imbalance in semi-supervised learning (SSL) by introducing Class-Imbalanced Semi-Supervised Learning (CISSL) and proposing a new Suppressed Consistency Loss (SCL) that is robust to class imbalance in both labeled and unlabeled data.
Currently, there is sparse research on rule-based oversampling algorithms, and the gap in this research field deserves further exploration. Among the few existing studies, Liu et al. [45] proposed an oversampling algorithm based on fuzzy rules to address the issue of class imbalance. This method generates fuzzy rules and synthesizes new minority class instances to tackle the class imbalance problem. Moreover, it can handle imbalanced data with missing values. However, this method is limited to imbalanced data with numerical variables and cannot handle categorical variables. Additionally, it requires manual selection of the number of fuzzy partitions, lacking the ability for adaptive selection. In summary, harnessing data relationships and rules for synthetic data generation is a promising strategy that warrants deeper investigation. Exploring this approach could offer novel alternatives to current oversampling techniques.

3. Method

In this section, we elaborate on the proposed Ad-RuLer method, which is based on the RuLer inductive rule-learning algorithm [14,46]. This algorithm takes labeled minority class data as the input and generates corresponding IF-THEN rules. It iteratively compares the similarity of input minority class data, extracting rule entries that satisfy a user-defined threshold. New rules are then created by taking the union of the corresponding sets of extracted rules and eliminating redundant ones. Finally, the newly created rules are utilized to guide the synthesis of minority class data. The following are the main implementation details of the Ad-RuLer method.

3.1. Overview of Existing Resampling Methods

Before delving into the details of the Ad-RuLer method, we provide an overview of several prevalent resampling techniques. These techniques will later serve as benchmarks for a comparative performance assessment with Ad-RuLer in the subsequent experimental sections. These methods include SMOTE, ADASYN, Tomek-links, Borderline-SMOTE, and KMeansSMOTE.
  • SMOTE: This oversampling technique generates new synthetic samples by interpolating between minority class instances, aiming to increase the representation of the underrepresented class. While widely applicable, SMOTE can introduce suboptimal samples in the presence of noise or significant class overlap.
  • ADASYN: This algorithm is an extension of SMOTE. ADASYN adjusts the number of synthetic samples according to the learning difficulty of individual minority class samples. It generates more synthetic data for those samples that are harder to learn, potentially leading to an improved classifier performance.
  • Tomek-links: This algorithm enhances class boundaries by identifying and removing Tomek-links, which are pairs of the nearest neighboring samples that belong to different classes. This method refines the dataset rather than augmenting it with new samples, potentially improving classifier boundaries without changing class distributions.
  • Borderline-SMOTE: This technique focuses on the border regions around minority class samples, oversampling those likely to be misclassified. Borderline-SMOTE attempts to directly enhance the quality of the classification boundary.
  • KMeansSMOTE: This method integrates K-Means clustering with SMOTE to perform oversampling within each identified cluster of the minority class, thereby preserving the inherent clustering structure of the data while synthesizing new samples.
For a more detailed introduction to these resampling methods, please refer to [22,25,26,29,30]. Distinct from the above-mentioned methods, the Ad-RuLer approach focuses on the rule-based structure inherent within the dataset. It employs an iterative process to pairwise compare various rules, ultimately generating new rules that satisfy the predefined conditions. These rules are then transformed into synthetic minority class samples, thus providing a new perspective and novel solutions to addressing the challenges associated with imbalanced datasets.

3.2. Ad-RuLer Oversampling Approach

Figure 1 depicts the entire framework of the Ad-RuLer approach. As can be seen from the figure, the rule extraction process constitutes the core of the system, incorporating two essential steps: New Rule Generation and Dissimilarity Measure. These steps form an interactive process that proceeds until the generation of all candidate rules is complete. Initially, Rule Extraction involves transforming minority class samples into IF-THEN rules. Subsequently, through pairwise comparisons, new rules are continually generated in an iterative process that continues until no further candidate rules emerge. During the Dissimilarity Measure procedure, the Hamming distance is employed to quantify the dissimilarity between rules. The process iteratively constructs new candidate rules by calculating the union of the corresponding sets of extracted rules, generating all candidates that meet a predefined dissimilarity threshold d, while the parameter ratio controls the quantity of synthesized rules. These synthesized rules are then reconverted into the minority class samples. If the majority and minority class samples within the data are not yet fully balanced, random sampling techniques are applied to achieve balance among the classes of the synthesized data. The Ad-RuLer method comprises several key procedures: Rule Extraction, Dissimilarity Measure, New Rule Generation, and Balanced Resampling.
Below, we provide a more detailed description of each procedure. The pseudocode for the data synthesis process is illustrated in Algorithm 1.
Rule extraction: Each data instance is treated as an IF-THEN rule, represented as an array of size N. The first N−1 entries are the rule antecedents, and the Nth entry is the rule consequent, assigned to the label [14]. The rule extraction process iteratively compares different rules to identify patterns. For instance, the rule r = [{1}, {3}, positive] indicates “if the first attribute is 1 and the second attribute is 3, then the assigned label is positive”. Rules are stored in a list and can be accessed via their indices. The rule r = [{1, 2, 3}, {5}, …, {7}, positive] means IF r[1] = 1 or 2 or 3 AND r[2] = 5 AND … AND r[N−1] = 7, THEN the label = positive.
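The rule encoding just described can be sketched in a few lines of Python (the helper name `make_rule` is ours for illustration, not from the RuLer implementation):

```python
# Sketch: encoding an instance as an IF-THEN rule, following the paper's
# description: a list of N entries, the first N-1 being antecedent value sets
# and the last the class label.
def make_rule(instance, label):
    """Turn a raw data row into a rule: each attribute value becomes a singleton set."""
    return [{v} for v in instance] + [label]

rule = make_rule([1, 3], "positive")
# rule == [{1}, {3}, "positive"], read as:
# IF attr1 = 1 AND attr2 = 3 THEN label = positive
print(rule)
```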
Dissimilarity measure: The dissimilarity measure function takes a pair of rules, r1 and r2, of the same class, and if the calculated dissimilarity is less than a user-defined threshold d, it creates a new rule by taking the union of the corresponding sets of the two rules. For instance, if r1 = [{1}, {3, 5, 7}, positive] and r2 = [{1, 3}, {7, 11}, positive], then r1,2 = [{1, 3}, {3, 5, 7, 11}, positive]. This procedure proceeds via pairwise comparison and iteration until no new rules can be created, returning all candidate rules that meet the conditions. The dissimilarity(r1, r2) is calculated by counting the number of empty intersections between the corresponding sets of the two rules. For example, if r1 = [{1}, {4, 5}, positive] and r2 = [{1, 2}, {5, 6}, positive], then dissimilarity(r1, r2) = 0. If r1 = [{1}, {4, 5}, positive] and r2 = [{1, 2}, {6}, positive], then dissimilarity(r1, r2) = 1.
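A minimal sketch of this step, assuming the set-based rule encoding described above (the function names `dissimilarity` and `merge_rules` are illustrative, not from the authors' code):

```python
# Sketch of the dissimilarity measure: count empty intersections between the
# corresponding antecedent sets of two same-class rules.
def dissimilarity(r1, r2):
    """Number of antecedent positions whose value sets do not overlap."""
    return sum(1 for s1, s2 in zip(r1[:-1], r2[:-1]) if not (s1 & s2))

def merge_rules(r1, r2):
    """Union of corresponding sets; applied when dissimilarity(r1, r2) < d."""
    return [s1 | s2 for s1, s2 in zip(r1[:-1], r2[:-1])] + [r1[-1]]

r1 = [{1}, {4, 5}, "positive"]
r2 = [{1, 2}, {5, 6}, "positive"]
print(dissimilarity(r1, r2))  # 0: both antecedent positions overlap
print(merge_rules(r1, r2))    # [{1, 2}, {4, 5, 6}, 'positive']
```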
New rule generation: During the rule generation process, candidate rules are first checked for contradictions, eliminating those with identical parameter values but different labels. In the rule generation function, ratio is a user-defined parameter, with ratio ∈ [0, 1], which controls the quantity of synthesized rules: the proportion of the data generable by a candidate rule that is present in the original data must be greater than or equal to this value. A ratio = 1 signifies that 100% of the instances covered by the candidate rule must be present in the input data to accept the rule. A ratio = 0.5 indicates that at least 50% of those instances must be present in the input data to accept the candidate rule. Take, for instance, the potential rule r1,2 = [{1, 3}, {5, 6}, positive]. The complete set of data that could be generated by r1,2 encompasses [{1}, {5}, positive], [{1}, {6}, positive], [{3}, {5}, positive], and [{3}, {6}, positive]. If we set the ratio to 0.5, half of these data instances must be present in the input data; if this condition is not met, the potential rule is discarded.
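The ratio-based acceptance test can be sketched as follows (helper names are illustrative; the actual implementation may enumerate instances differently):

```python
# Sketch of the ratio-based acceptance check: a candidate rule is kept only if
# the fraction of instances it can generate that already appear in the input
# data is at least `ratio`.
from itertools import product

def covered_fraction(rule, data):
    """Fraction of the rule's generable instances present in the input data."""
    generable = [list(combo) for combo in product(*[sorted(s) for s in rule[:-1]])]
    present = sum(1 for inst in generable if inst in data)
    return present / len(generable)

def accept_rule(rule, data, ratio):
    return covered_fraction(rule, data) >= ratio

candidate = [{1, 3}, {5, 6}, "positive"]
data = [[1, 5], [3, 6]]                   # 2 of the 4 generable instances
print(covered_fraction(candidate, data))  # 0.5
print(accept_rule(candidate, data, 0.5))  # True
```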
Balanced resampling: In the RuLer system, the synthesis of all potential instances of the minority class is achieved using the extracted patterns. By using the ratio parameter, the volume of the data generated can be adjusted to approximate the desired quantity as closely as possible. However, exact control over the volume of produced data is not feasible. To compensate for this variation, Ad-RuLer introduces a resampling phase. The initial step involves the calculation of n, the difference between the number of synthesized samples X_r and the target sample size S. If n > 0, RUS is applied: n instances are randomly removed from the synthesized data, thereby reducing the volume to the balanced level. Conversely, if n < 0, ROS is implemented: |n| instances are randomly replicated from the synthesized data, effectively increasing the minority class volume to meet the targeted level. If n equals zero, the synthesized instances are in perfect alignment with the targeted count, meaning the dataset is already balanced and requires no further adjustment.
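A minimal sketch of this balancing step (the helper is illustrative, built on Python's random module):

```python
# Sketch of the balanced-resampling step: align the synthesized minority sample
# list with the target size S via random under/oversampling.
import random

def balance(X_r, S, seed=0):
    rng = random.Random(seed)
    n = len(X_r) - S
    if n > 0:                        # too many synthesized samples: RUS
        return rng.sample(X_r, S)
    if n < 0:                        # too few: ROS, replicate |n| instances
        return X_r + rng.choices(X_r, k=-n)
    return X_r                       # already balanced, no adjustment

synthesized = [[1, 5], [1, 6], [3, 5]]
print(len(balance(synthesized, 5)))  # 5
print(len(balance(synthesized, 2)))  # 2
```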
Algorithm 1 Rule-based Data Synthesizer
Require: X: input data; d ∈ ℕ: user-defined dissimilarity threshold; ratio ∈ [0, 1]: ratio for creating new rules; S: required sample size
Ensure: oversampled data X_r
 1: rules ← RulesCreation(X)
 2: newRules ← []
 3: for i = 0 to size of rules do
 4:     r1 ← rules[i]
 5:     for j = i + 1 to size of rules do
 6:         r2 ← rules[j]
 7:         pattern ← dissimilarity(r1, r2, d)
 8:         if pattern then
 9:             rule ← createRule(r1, r2, ratio, rules)
10:             if rule then
11:                 newRules.append(rule)
12:             end if
13:         end if
14:     end for
15: end for
16: rules.append(newRules)
17: rules ← deleteRedundant(rules)
18: X_r ← samplesCreation(rules)
19: n ← NumberDifference(X_r, S)
20: if n > 0 then
21:     X_r ← randomUndersampling(X_r, n)
22: else if n < 0 then
23:     X_r ← randomOversampling(X_r, abs(n))
24: else
25:     X_r ← X_r
26: end if
27: return X_r

4. Experiment

In this section, we present the results of experiments conducted on ten distinct real-world imbalanced datasets to evaluate the effectiveness of Ad-RuLer. The performance of Ad-RuLer is benchmarked against five other resampling methods. The statistical significance of the performance differences between Ad-RuLer and the other resampling algorithms across various metrics was validated using the Wilcoxon signed-rank test. Furthermore, the Friedman test, followed by Nemenyi post hoc analysis, was employed to conduct a rank-based comparison of the resampling methods across different metrics.
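The two significance tests named above are available in SciPy; the sketch below runs them on illustrative, made-up per-dataset AUC series (the Nemenyi post hoc step requires an additional package such as scikit-posthocs and is not shown):

```python
# Sketch: Wilcoxon signed-rank test between two paired metric series, and the
# Friedman test across several methods. The AUC values are fabricated examples.
from scipy.stats import friedmanchisquare, wilcoxon

auc_adruler = [0.91, 0.88, 0.93, 0.85, 0.90, 0.87, 0.92, 0.89]
auc_smote   = [0.89, 0.86, 0.90, 0.84, 0.88, 0.86, 0.90, 0.88]
auc_adasyn  = [0.88, 0.87, 0.89, 0.83, 0.87, 0.85, 0.89, 0.87]

stat, p = wilcoxon(auc_adruler, auc_smote)
print("Wilcoxon p =", round(p, 4))   # reject the null hypothesis if p < 0.05

stat, p = friedmanchisquare(auc_adruler, auc_smote, auc_adasyn)
print("Friedman p =", round(p, 4))
```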

4.1. Datasets

Among all the datasets used in this study, the Heart Disease and Breast Cancer datasets were sourced from the Kaggle repository, while the remaining datasets were acquired from the KEEL repository. The Heart Disease dataset, originally a multi-class dataset, was transformed into a binary classification problem by selecting samples with "num" as 0 (indicating no heart disease) and "num" as 1 (indicating stage-1 heart disease). The rest of the datasets are binary classification datasets. The statistical characteristics of these datasets, including the imbalance ratio (IR), are detailed in Table 1.
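The binarization of the Heart Disease dataset described above amounts to a simple filter; a sketch with pandas on stand-in data (the column name "num" follows the paper, while the values shown are fabricated for illustration):

```python
# Sketch: keep only samples with num == 0 (no disease) or num == 1 (stage-1),
# turning the multi-class target into a binary one.
import pandas as pd

df = pd.DataFrame({"age": [63, 54, 61, 48], "num": [0, 1, 3, 1]})  # stand-in data
binary = df[df["num"].isin([0, 1])].reset_index(drop=True)
print(binary["num"].tolist())  # [0, 1, 1]
```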

4.2. Classifiers and Resampling Approaches

For a comprehensive evaluation and comparison of Ad-RuLer with other oversampling algorithms, we implemented five widely used oversampling algorithms: SMOTE, ADASYN, Tomek-links, Borderline-SMOTE, and KMeansSMOTE. We employed three classifiers: logistic regression, random forest, and XGBoost, and combined them with Ad-RuLer and the five resampling algorithms, respectively, across the ten datasets. The optimal hyperparameters for each model were determined using five-fold cross-validation on the training set. Each dataset was partitioned into 75 % for training and 25 % for testing, where each method was executed for 50 iterations to ensure robustness in performance calculation.
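A sketch of this protocol with scikit-learn, using an illustrative toy dataset and hyperparameter grid (the paper's actual grids are given in its Appendix B):

```python
# Sketch of the evaluation protocol: 75/25 split, 5-fold CV for hyperparameter
# selection on the training portion. A resampler (Ad-RuLer or a baseline) would
# be applied to the training data before fitting.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# (Resampling of X_tr, y_tr would happen here.)
grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.score(X_te, y_te), 3))
```

In the paper's setup this procedure is repeated for 50 iterations per method to stabilize the reported performance.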

4.3. Evaluation Metrics

Handling imbalanced datasets poses unique challenges, especially when distinguishing between the majority (often labeled as the negative class) and minority classes (labeled as the positive class). While accuracy is a frequently used metric, it is not always appropriate for imbalanced classification scenarios. Instead, four specific metrics, namely G-mean, F1 score, AUC, and Precision, offer a more comprehensive assessment for such cases.
The AUC calculates the area under the ROC curve, using the True Positive Rate (TPR) and the False Positive Rate (FPR)—making it a widely accepted metric for classification performance assessment. Precision, shown in Equation (1), defines the proportion of actual positive instances among all the predicted positive cases. A high Precision underscores the model’s reliability in its positive predictions. G-mean, elucidated in Equation (2), computes the geometric mean of Sensitivity and Specificity. This measurement provides a balanced perspective on the performance across both classes. The F1 score, given in Equation (5), stands as the harmonic mean of Precision and Sensitivity, encapsulating the model’s overall classification prowess.
To ensure rigorous comparisons, we employ the Wilcoxon signed-rank test and the Friedman test. The former tests for performance similarities between two algorithms, while the latter discerns significant performance variations across multiple algorithms. If the Friedman test detects significant differences, the Nemenyi post hoc analysis is used to locate these disparities precisely.
Precision = TP / (TP + FP)        (1)
where TP and FP represent true positives and false positives, respectively.
G-mean = √(Specificity × Sensitivity)        (2)
where Sensitivity and Specificity are computed according to Equations (3) and (4), respectively.
Sensitivity = TP / (TP + FN)        (3)
Specificity = TN / (TN + FP)        (4)
F1 = (2 × Precision × Sensitivity) / (Precision + Sensitivity)        (5)
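These four metrics can be computed directly from predictions; a sketch using scikit-learn for Precision, F1, and AUC, with G-mean derived from the confusion matrix per Equations (2)-(4) (the labels and scores below are illustrative):

```python
# Computing the four evaluation metrics from predictions.
from math import sqrt

from sklearn.metrics import confusion_matrix, f1_score, precision_score, roc_auc_score

y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.2, 0.6, 0.3, 0.8, 0.9, 0.4, 0.7]  # positive-class probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)          # Eq. (3)
specificity = tn / (tn + fp)          # Eq. (4)
g_mean = sqrt(specificity * sensitivity)

print(round(precision_score(y_true, y_pred), 3))  # Eq. (1): 0.75
print(round(g_mean, 3))                           # Eq. (2): 0.75
print(round(f1_score(y_true, y_pred), 3))         # Eq. (5): 0.75
print(round(roc_auc_score(y_true, y_score), 3))   # 0.938
```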

4.4. Experimental Results

We conducted a comprehensive comparison of various resampling algorithms applied to ten datasets, evaluating their performance with logistic regression, random forest, and XGBoost as classifiers. The results using logistic regression are presented in Table 2. For the performance comparison results using random forest and XGBoost classifiers, please refer to Table A1 and Table A2 in Appendix A. The hyperparameters for the resampling approaches and classifiers employed in this study are shown in Table A3 in Appendix B. The experimental results highlight the superior performance of Ad-RuLer in most cases, compared to other resampling algorithms. For instance, when using logistic regression as the classifier, Ad-RuLer outperformed other resampling algorithms on all four evaluation metrics in the Breast Cancer and Car-good datasets. In the Heart Disease, Vehicle0, and Abalone9-18 datasets, apart from the Precision, the other three metrics all exceeded those of other resampling algorithms. When adopting random forest as the classifier, Ad-RuLer outperformed other resampling algorithms on all four metrics in the Heart Disease dataset. In the Vehicle0, Yeast3, Yeast-0-2-5-7-9vs3-6-8, Abalone9-18, and Car-good datasets, at least two metrics exceeded those of other resampling algorithms. In Figure 2, the synthesized minority class data samples obtained via Ad-RuLer for the Yeast3 and Heart Disease datasets were visualized together with the original data samples using the t-SNE technique [47], to give an idea of the distribution of the new data generated via Ad-RuLer.
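The t-SNE projection used for Figure 2 can be reproduced in outline with scikit-learn (the data below are synthetic stand-ins, not the paper's datasets):

```python
# Sketch: embed original and synthesized samples together with t-SNE and
# project them into 2-D for visual comparison.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
original = rng.normal(0.0, 1.0, size=(60, 8))     # stand-in original minority data
synthesized = rng.normal(0.5, 1.0, size=(40, 8))  # stand-in synthesized data

combined = np.vstack([original, synthesized])
embedding = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(combined)
print(embedding.shape)  # (100, 2)

# With matplotlib, one would scatter embedding[:60] and embedding[60:] in two colors.
```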
Figure 3 and Figure 4 illustrate the impact of varying the parameters d and ratio on the performance metrics, using the Heart Disease and Yeast3 datasets as illustrative cases. We employed logistic regression as the classifier and, holding all other parameters constant, assessed the effect of d and ratio, respectively, on each performance metric. Figure 3 reveals that on the Yeast3 dataset, both the F1 score and Precision first increase and then decrease as d is varied, while the AUC remains largely invariant. On the Heart Disease dataset, despite the relatively minor fluctuations of the performance metrics in response to changes in d, Precision peaks at d = 5 and declines as d increases further. With respect to the ratio parameter, the results on the Yeast3 dataset show that Precision, G-mean, and F1 score all reach their maximum at a ratio of 0.1, then gradually diminish and stabilize as the ratio increases. In addition, the AUC is more stable under changes in ratio than the other performance metrics. In summary, examining the trends of the performance metrics under changes in d and ratio, it can be concluded that the performance of the Ad-RuLer algorithm does not fluctuate significantly with the adjustment of these two parameters, demonstrating Ad-RuLer's robustness in terms of overall performance.
Figure 5 displays the average rankings of each algorithm across all datasets and classifiers. As expected, Ad-RuLer leads the other algorithms in the average rankings of G-mean, F1 score, and AUC, with average rankings of 2.43, 2.33, and 2.10, respectively. In the average ranking of Precision, Ad-RuLer ranks second, behind only the KmeansSMOTE algorithm, with an average ranking of 3.07. Table 3 presents the results of the Wilcoxon signed-rank test over all datasets, with Ad-RuLer as the reference algorithm and a significance level of α = 0.05. In this test, the null hypothesis states that no significant difference exists between the performance metrics of Ad-RuLer and those of the specific comparative algorithm. The results in Table 3 indicate that, with the logistic regression classifier, the p-values for the AUC metric obtained against the five other resampling algorithms are all below 0.05. This statistically significant finding leads us to reject the null hypothesis and conclude that Ad-RuLer differs significantly in AUC from all other algorithms. Similarly, we observed that Ad-RuLer exhibits statistically significant differences in G-mean with SMOTE, Tomek-links, and KMeansSMOTE and, in terms of Precision, significant differences with ADASYN and Borderline-SMOTE. With random forest as the classifier, Ad-RuLer exhibits significant differences in G-mean and F1 score compared to Borderline-SMOTE and KMeansSMOTE. With XGBoost as the classifier, Ad-RuLer shows a significant difference in G-mean relative to Borderline-SMOTE and KMeansSMOTE.
In addition to the Wilcoxon signed-rank test, we conducted the Friedman test on the average rankings of all resampling algorithms across all classifiers and datasets. The null hypothesis is that there are no significant differences in the average rankings across the different performance metrics. Our analysis revealed p-values of 0.0002 for the G-mean and 0.0012 for the F1 score, both markedly below 0.05. This led us to reject the null hypothesis for these metrics, confirming statistically significant differences among the resampling algorithms on the G-mean and F1 score. Regarding the AUC, the p-value was 0.1462, exceeding the 0.05 threshold; we therefore lack sufficient statistical evidence to claim significant differences in average rankings among the resampling algorithms for AUC, and the null hypothesis for the AUC is retained. For Precision, although Ad-RuLer ranked second among all algorithms, closely following KmeansSMOTE, the computed p-value of 0.0005, substantially below the 0.05 significance level, compels us to reject the null hypothesis, indicating that at least one algorithm deviates significantly from the others on Precision.
However, the Friedman test does not indicate which specific algorithms exhibit significant differences. Therefore, we employed the Nemenyi post hoc test for multiple pairwise comparisons. In the Nemenyi post hoc test, as shown in Figure 6, the algorithms are listed on the vertical axis, with ranking values on the horizontal axis. The average ranking of each algorithm is represented by a point extended to the left and right into a line segment representing a 95% confidence interval. Two algorithms are considered statistically significantly different when their line segments (i.e., 95% confidence intervals) do not overlap. The figure shows that Ad-RuLer exhibits a significant difference from KmeansSMOTE in terms of G-mean. Although Ad-RuLer's average ranking in G-mean, F1 score, and AUC is better than that of the other resampling algorithms, further significant differences cannot be determined from the current analysis. Given the conservative nature of the Nemenyi post hoc test, the significance of Ad-RuLer's performance relative to the other resampling algorithms remains to be conclusively established; future experiments on a broader range of datasets are therefore necessary to robustly test these relationships.
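The Nemenyi intervals in Figure 6 derive from a critical difference CD = q_α · sqrt(k(k+1)/(6N)): two algorithms differ significantly when their average ranks differ by more than CD. A small sketch follows; the q_α value used in the example is the standard α = 0.05 critical value for six compared algorithms from Demšar's tables, stated here as an assumption rather than taken from this paper:

```python
import math

def nemenyi_cd(q_alpha, k, n):
    """Critical difference for the Nemenyi post hoc test, where k is
    the number of compared algorithms and n is the number of
    dataset/classifier combinations they were ranked on."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# Example: k = 6 resampling algorithms compared over n = 30
# dataset/classifier combinations, with q_0.05 = 2.850 for k = 6.
cd = nemenyi_cd(2.850, 6, 30)
```

Because CD shrinks only with the square root of N, the test is conservative on small benchmark collections, which is consistent with the inconclusive pairwise comparisons reported above.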

5. Discussion

This study introduces a novel oversampling technique named Ad-RuLer. This method, grounded in the extraction of rules from datasets, aims to generate new data instances to enhance the classification performance of minority class instances in imbalanced datasets. Comprehensive experimental results reveal that, in comparison to prevailing oversampling algorithms, Ad-RuLer excels across various evaluation metrics, highlighting its potential application value and offering a novel approach to addressing data imbalance challenges.
To delve deeper into Ad-RuLer's performance under varied dataset imbalances, the datasets were categorized by their imbalance ratio (IR) into two groups: high imbalance (IR > 8) and low imbalance (IR ≤ 8), each comprising five datasets. Figure 7 and Figure 8 depict the average ranking results for these two groups, respectively. Ad-RuLer leads in the AUC within the low imbalance group, ranks second in both Precision and G-mean, and third in F1 score. In the high imbalance group, Ad-RuLer dominates in G-mean, AUC, and F1 score, and ranks second in Precision. These findings underscore Ad-RuLer's robust performance even on datasets with pronounced imbalances, with no deterioration as the imbalance ratio increases, further accentuating its broad applicability.
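The grouping rule can be reproduced from the sample counts in Table 1 with a one-line computation (illustrative Python; the function name is our own, and the example counts are taken from the Yeast3 and Pima rows of Table 1):

```python
def imbalance_group(n_majority, n_minority, threshold=8.0):
    """Compute the imbalance ratio IR = majority / minority counts and
    assign the dataset to the 'high' (IR > threshold) or 'low' group."""
    ir = n_majority / n_minority
    return ir, ("high" if ir > threshold else "low")

# Yeast3: 1321 majority vs. 163 minority samples -> high imbalance group.
# Pima:    500 majority vs. 268 minority samples -> low imbalance group.
```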
Additionally, when integrated with various classifiers, Ad-RuLer consistently demonstrates an exceptional predictive performance, emphasizing its reliability and adaptability. Based on the results of the Wilcoxon Signed-rank test, we observed that under the logistic regression classifier, Ad-RuLer exhibits a more significant difference compared to other algorithms, particularly in the AUC, where Ad-RuLer significantly outperforms all other methods.
Nevertheless, this research has its limitations. Ad-RuLer's performance in some specific scenarios, such as high-dimensional datasets or datasets with prominent class overlap, has not been thoroughly explored; future research will delve into these scenarios. Moreover, this study evaluated Ad-RuLer on ten datasets. To provide a more holistic assessment of its capabilities, subsequent research aims to test it across a wider array of datasets. Furthermore, there is an intent to deploy Ad-RuLer in specialized application domains, such as addressing the severe data imbalance problem in the medical domain, and to consider integrating it with other techniques, such as cost-sensitive learning, to further enhance its efficacy.

6. Conclusions

In this study, it has been demonstrated that Ad-RuLer, an approach based on inductive rule learning, can be used as an alternative option to conventional methodologies for classifying imbalanced data. As an efficient oversampling method, Ad-RuLer substantially enhances the predictive performance of the classifier when confronted with imbalanced datasets. Comparative experiments with other data resampling methods on ten real-world datasets with different imbalance ratios demonstrate the effectiveness and potential of our approach. Future research will focus on assessing the efficacy of Ad-RuLer in high-dimensional datasets and scenarios with significant class overlap, expanding the scope to address practical problems in specialized domains, such as healthcare and financial fraud, where data imbalance issues are more pronounced. Furthermore, the potential enhancement of Ad-RuLer in complex data settings will be investigated via its integration with cost-sensitive learning techniques and dimensionality reduction techniques, such as PCA and t-SNE.

Author Contributions

Conceptualization, X.Z., À.N., I.P., F.M. and E.R.; methodology, X.Z. and À.N.; software, X.Z. and I.P.; validation, X.Z. and À.N.; formal analysis, X.Z. and À.N.; investigation, X.Z., I.P. and À.N.; resources, À.N., F.M. and E.R.; data curation, X.Z.; writing—original draft preparation, X.Z.; writing—review and editing, X.Z. and À.N.; supervision, À.N., F.M. and E.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are available at https://sci2s.ugr.es/keel/imbalanced.php (accessed on 18 September 2023) and https://www.kaggle.com/datasets (accessed on 18 September 2023).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SMOTE: Synthetic Minority Oversampling Technique
ADASYN: Adaptive Synthetic Oversampling
RUS: Random Undersampling
ROS: Random Oversampling
TPR: True Positive Rate
FPR: False Positive Rate

Appendix A

Table A1. Performance comparison between different data resampling approaches using random forest.
Datasets | Algorithms | G-Mean | F1 Score | AUC | Precision
--- | --- | --- | --- | --- | ---
Heart Disease | Ad-RuLer | 0.766 | 0.717 | 0.851 | 0.714
 | SMOTE | 0.750 | 0.696 | 0.832 | 0.706
 | ADASYN | 0.750 | 0.696 | 0.830 | 0.699
 | Tomek-links | 0.746 | 0.691 | 0.829 | 0.700
 | Borderline-SMOTE | 0.745 | 0.691 | 0.831 | 0.699
 | KMeansSMOTE | 0.742 | 0.687 | 0.832 | 0.704
Breast Cancer | Ad-RuLer | 0.946 | 0.934 | 0.985 | 0.942
 | SMOTE | 0.952 | 0.941 | 0.988 | 0.947
 | ADASYN | 0.956 | 0.943 | 0.988 | 0.938
 | Tomek-links | 0.951 | 0.938 | 0.987 | 0.938
 | Borderline-SMOTE | 0.952 | 0.940 | 0.988 | 0.941
 | KMeansSMOTE | 0.948 | 0.937 | 0.987 | 0.950
Pima | Ad-RuLer | 0.745 | 0.672 | 0.830 | 0.628
 | SMOTE | 0.744 | 0.671 | 0.829 | 0.653
 | ADASYN | 0.737 | 0.661 | 0.822 | 0.638
 | Tomek-links | 0.749 | 0.676 | 0.829 | 0.654
 | Borderline-SMOTE | 0.737 | 0.662 | 0.818 | 0.633
 | KMeansSMOTE | 0.720 | 0.644 | 0.829 | 0.677
Vehicle0 | Ad-RuLer | 0.963 | 0.935 | 0.996 | 0.918
 | SMOTE | 0.963 | 0.927 | 0.994 | 0.898
 | ADASYN | 0.968 | 0.933 | 0.995 | 0.898
 | Tomek-links | 0.963 | 0.927 | 0.994 | 0.897
 | Borderline-SMOTE | 0.965 | 0.936 | 0.996 | 0.915
 | KMeansSMOTE | 0.948 | 0.915 | 0.992 | 0.904
Ecoli1 | Ad-RuLer | 0.877 | 0.789 | 0.950 | 0.746
 | SMOTE | 0.886 | 0.784 | 0.953 | 0.717
 | ADASYN | 0.895 | 0.785 | 0.950 | 0.704
 | Tomek-links | 0.890 | 0.786 | 0.954 | 0.716
 | Borderline-SMOTE | 0.889 | 0.782 | 0.944 | 0.705
 | KMeansSMOTE | 0.863 | 0.773 | 0.954 | 0.747
Yeast3 | Ad-RuLer | 0.911 | 0.782 | 0.973 | 0.713
 | SMOTE | 0.874 | 0.785 | 0.966 | 0.788
 | ADASYN | 0.883 | 0.774 | 0.964 | 0.746
 | Tomek-links | 0.875 | 0.781 | 0.966 | 0.778
 | Borderline-SMOTE | 0.884 | 0.781 | 0.962 | 0.759
 | KMeansSMOTE | 0.829 | 0.754 | 0.964 | 0.828
Yeast-0-2-5-7-9vs3-6-8 | Ad-RuLer | 0.887 | 0.784 | 0.957 | 0.760
 | SMOTE | 0.877 | 0.781 | 0.951 | 0.778
 | ADASYN | 0.884 | 0.773 | 0.947 | 0.747
 | Tomek-links | 0.880 | 0.784 | 0.949 | 0.779
 | Borderline-SMOTE | 0.864 | 0.770 | 0.944 | 0.783
 | KMeansSMOTE | 0.856 | 0.802 | 0.940 | 0.878
Abalone9-18 | Ad-RuLer | 0.606 | 0.399 | 0.811 | 0.419
 | SMOTE | 0.636 | 0.350 | 0.842 | 0.303
 | ADASYN | 0.645 | 0.350 | 0.841 | 0.296
 | Tomek-links | 0.620 | 0.331 | 0.846 | 0.287
 | Borderline-SMOTE | 0.582 | 0.335 | 0.848 | 0.335
 | KMeansSMOTE | 0.397 | 0.225 | 0.802 | 0.401
Car-good | Ad-RuLer | 0.969 | 0.652 | 0.992 | 0.490
 | SMOTE | 0.725 | 0.594 | 0.987 | 0.706
 | ADASYN | 0.736 | 0.576 | 0.985 | 0.633
 | Tomek-links | 0.732 | 0.583 | 0.987 | 0.653
 | Borderline-SMOTE | 0.732 | 0.581 | 0.986 | 0.639
 | KMeansSMOTE | 0.663 | 0.543 | 0.986 | 0.707
winequality-red-4 | Ad-RuLer | 0.216 | 0.086 | 0.747 | 0.307
 | SMOTE | 0.305 | 0.118 | 0.742 | 0.125
 | ADASYN | 0.346 | 0.141 | 0.745 | 0.151
 | Tomek-links | 0.292 | 0.120 | 0.741 | 0.129
 | Borderline-SMOTE | 0.107 | 0.049 | 0.745 | 0.091
 | KMeansSMOTE | 0.208 | 0.082 | 0.755 | 0.109
Table A2. Performance comparison between different data resampling approaches using XGBoost.
Datasets | Algorithms | G-Mean | F1 Score | AUC | Precision
--- | --- | --- | --- | --- | ---
Heart Disease | Ad-RuLer | 0.743 | 0.690 | 0.823 | 0.691
 | SMOTE | 0.729 | 0.672 | 0.810 | 0.673
 | ADASYN | 0.731 | 0.674 | 0.812 | 0.680
 | Tomek-links | 0.734 | 0.677 | 0.815 | 0.689
 | Borderline-SMOTE | 0.734 | 0.677 | 0.810 | 0.681
 | KMeansSMOTE | 0.735 | 0.678 | 0.814 | 0.687
Breast Cancer | Ad-RuLer | 0.947 | 0.936 | 0.988 | 0.946
 | SMOTE | 0.959 | 0.950 | 0.992 | 0.954
 | ADASYN | 0.959 | 0.946 | 0.992 | 0.935
 | Tomek-links | 0.954 | 0.943 | 0.991 | 0.944
 | Borderline-SMOTE | 0.952 | 0.940 | 0.991 | 0.942
 | KMeansSMOTE | 0.957 | 0.948 | 0.991 | 0.957
Pima | Ad-RuLer | 0.726 | 0.649 | 0.809 | 0.626
 | SMOTE | 0.720 | 0.641 | 0.804 | 0.623
 | ADASYN | 0.717 | 0.637 | 0.800 | 0.611
 | Tomek-links | 0.722 | 0.643 | 0.808 | 0.630
 | Borderline-SMOTE | 0.706 | 0.624 | 0.796 | 0.602
 | KMeansSMOTE | 0.715 | 0.636 | 0.808 | 0.647
Vehicle0 | Ad-RuLer | 0.955 | 0.933 | 0.993 | 0.909
 | SMOTE | 0.965 | 0.935 | 0.992 | 0.913
 | ADASYN | 0.968 | 0.935 | 0.992 | 0.904
 | Tomek-links | 0.964 | 0.934 | 0.991 | 0.913
 | Borderline-SMOTE | 0.962 | 0.933 | 0.993 | 0.914
 | KMeansSMOTE | 0.961 | 0.930 | 0.992 | 0.911
Ecoli1 | Ad-RuLer | 0.868 | 0.784 | 0.952 | 0.754
 | SMOTE | 0.876 | 0.780 | 0.958 | 0.733
 | ADASYN | 0.876 | 0.773 | 0.955 | 0.714
 | Tomek-links | 0.870 | 0.773 | 0.958 | 0.729
 | Borderline-SMOTE | 0.877 | 0.773 | 0.953 | 0.713
 | KMeansSMOTE | 0.866 | 0.786 | 0.961 | 0.781
Yeast3 | Ad-RuLer | 0.884 | 0.756 | 0.969 | 0.708
 | SMOTE | 0.853 | 0.755 | 0.960 | 0.763
 | ADASYN | 0.857 | 0.748 | 0.958 | 0.740
 | Tomek-links | 0.854 | 0.751 | 0.960 | 0.752
 | Borderline-SMOTE | 0.855 | 0.756 | 0.959 | 0.762
 | KMeansSMOTE | 0.832 | 0.743 | 0.960 | 0.783
Yeast-0-2-5-7-9vs3-6-8 | Ad-RuLer | 0.888 | 0.790 | 0.939 | 0.774
 | SMOTE | 0.884 | 0.776 | 0.938 | 0.753
 | ADASYN | 0.865 | 0.733 | 0.939 | 0.701
 | Tomek-links | 0.883 | 0.775 | 0.936 | 0.753
 | Borderline-SMOTE | 0.873 | 0.764 | 0.936 | 0.753
 | KMeansSMOTE | 0.880 | 0.806 | 0.926 | 0.826
Abalone9-18 | Ad-RuLer | 0.683 | 0.370 | 0.799 | 0.392
 | SMOTE | 0.702 | 0.393 | 0.838 | 0.322
 | ADASYN | 0.701 | 0.394 | 0.830 | 0.320
 | Tomek-links | 0.692 | 0.388 | 0.835 | 0.324
 | Borderline-SMOTE | 0.657 | 0.385 | 0.844 | 0.350
 | KMeansSMOTE | 0.562 | 0.357 | 0.801 | 0.453
Car-good | Ad-RuLer | 0.981 | 0.754 | 0.995 | 0.615
 | SMOTE | 0.874 | 0.764 | 0.993 | 0.776
 | ADASYN | 0.856 | 0.738 | 0.992 | 0.748
 | Tomek-links | 0.874 | 0.764 | 0.993 | 0.776
 | Borderline-SMOTE | 0.871 | 0.755 | 0.993 | 0.760
 | KMeansSMOTE | 0.865 | 0.732 | 0.991 | 0.725
winequality-red-4 | Ad-RuLer | 0.283 | 0.127 | 0.710 | 0.156
 | SMOTE | 0.203 | 0.071 | 0.687 | 0.071
 | ADASYN | 0.246 | 0.085 | 0.672 | 0.088
 | Tomek-links | 0.272 | 0.095 | 0.665 | 0.097
 | Borderline-SMOTE | 0.087 | 0.032 | 0.695 | 0.046
 | KMeansSMOTE | 0.189 | 0.067 | 0.700 | 0.078

Appendix B

Table A3. Hyperparameter for resampling approach and classifiers used in this study.
Heart. Resampling: Ad-RuLer (d = 3, ratio = 0.2); SMOTE (k_n = 7); ADASYN (n_n = 5); Tomek-links (none); B-SMOTE (k_n = 5, m_n = 15); KMeansSMOTE (k_n = 3, k_e = 2). Classifiers: LR (penalty = 'l2', C = 2); RF (n_estimators = 600, max_depth = 5, min_samples_split = 5, min_samples_leaf = 5); XGB (max_depth = 4, learning_rate = 0.05, n_estimators = 500, reg_lambda = 0.5).
Breast. Resampling: Ad-RuLer (d = 1, ratio = 1); SMOTE (k_n = 3); ADASYN (n_n = 8); Tomek-links (none); B-SMOTE (k_n = 10, m_n = 5); KMeansSMOTE (k_n = 3, k_e = 4). Classifiers: LR (penalty = 'l2', C = 1); RF (n_estimators = 500, max_depth = 8, min_samples_split = 5, min_samples_leaf = 3); XGB (max_depth = 15, learning_rate = 0.1, n_estimators = 500, reg_lambda = 0.1).
Pima. Resampling: Ad-RuLer (d = 3, ratio = 0.1); SMOTE (k_n = 5); ADASYN (n_n = 10); Tomek-links (none); B-SMOTE (k_n = 5, m_n = 5); KMeansSMOTE (k_n = 3, k_e = 4). Classifiers: LR (penalty = 'l2', C = 1.5); RF (n_estimators = 600, max_depth = 5, min_samples_split = 10, min_samples_leaf = 3); XGB (max_depth = 15, learning_rate = 0.05, n_estimators = 500, reg_lambda = 1).
Vehicle0. Resampling: Ad-RuLer (d = 12, ratio = 0.1); SMOTE (k_n = 5); ADASYN (n_n = 8); Tomek-links (none); B-SMOTE (k_n = 5, m_n = 5); KMeansSMOTE (k_n = 4, k_e = 2). Classifiers: LR (penalty = 'l2', C = 1.5); RF (n_estimators = 500, max_depth = 15, min_samples_split = 15, min_samples_leaf = 3); XGB (max_depth = 20, learning_rate = 0.1, n_estimators = 600, reg_lambda = 2).
Ecoli1. Resampling: Ad-RuLer (d = 4, ratio = 0.2); SMOTE (k_n = 8); ADASYN (n_n = 6); Tomek-links (none); B-SMOTE (k_n = 4, m_n = 10); KMeansSMOTE (k_n = 6, k_e = 3). Classifiers: LR (penalty = 'l2', C = 2); RF (n_estimators = 600, max_depth = 15, min_samples_split = 20, min_samples_leaf = 10); XGB (max_depth = 10, learning_rate = 0.1, n_estimators = 300, reg_lambda = 0.5).
Yeast3. Resampling: Ad-RuLer (d = 4, ratio = 0.1); SMOTE (k_n = 3); ADASYN (n_n = 6); Tomek-links (none); B-SMOTE (k_n = 3, m_n = 10); KMeansSMOTE (k_n = 6, k_e = 2). Classifiers: LR (penalty = 'l2', C = 3); RF (n_estimators = 600, max_depth = 20, min_samples_split = 15, min_samples_leaf = 3); XGB (max_depth = 20, learning_rate = 0.1, n_estimators = 300, reg_lambda = 0.1).
Yeast-0-2-5-7-9. Resampling: Ad-RuLer (d = 4, ratio = 0.2); SMOTE (k_n = 7); ADASYN (n_n = 10); Tomek-links (none); B-SMOTE (k_n = 3, m_n = 8); KMeansSMOTE (k_n = 10, k_e = 3). Classifiers: LR (penalty = 'l2', C = 2); RF (n_estimators = 600, max_depth = 25, min_samples_split = 10, min_samples_leaf = 3); XGB (max_depth = 10, learning_rate = 0.1, n_estimators = 300, reg_lambda = 2).
Abalone9. Resampling: Ad-RuLer (d = 6, ratio = 0.1); SMOTE (k_n = 5); ADASYN (n_n = 5); Tomek-links (none); B-SMOTE (k_n = 5, m_n = 8); KMeansSMOTE (k_n = 10, k_e = 4). Classifiers: LR (penalty = 'l2', C = 3); RF (n_estimators = 500, max_depth = 30, min_samples_split = 5, min_samples_leaf = 3); XGB (max_depth = 20, learning_rate = 0.1, n_estimators = 200, reg_lambda = 0.5).
Car-good. Resampling: Ad-RuLer (d = 3, ratio = 0.3); SMOTE (k_n = 10); ADASYN (n_n = 5); Tomek-links (none); B-SMOTE (k_n = 8, m_n = 3); KMeansSMOTE (k_n = 12, k_e = 4). Classifiers: LR (penalty = 'l2', C = 2); RF (n_estimators = 500, max_depth = 40, min_samples_split = 20, min_samples_leaf = 10); XGB (max_depth = 5, learning_rate = 0.01, n_estimators = 200, reg_lambda = 3).
Winequality. Resampling: Ad-RuLer (d = 6, ratio = 0.1); SMOTE (k_n = 3); ADASYN (n_n = 5); Tomek-links (none); B-SMOTE (k_n = 8, m_n = 5); KMeansSMOTE (k_n = 8, k_e = 3). Classifiers: LR (penalty = 'l2', C = 0.5); RF (n_estimators = 500, max_depth = 20, min_samples_split = 10, min_samples_leaf = 3); XGB (max_depth = 5, learning_rate = 0.01, n_estimators = 100, reg_lambda = 0.5).
Abbreviations for datasets: Abalone9, Abalone9-18; Winequality, Winequality-red-4. Abbreviations for classifier: LR, Logistic Regression; RF, Random Forest; XGB, XGBoost. Abbreviations for parameter: k_n, k_neighbors; n_n, n_neighbors; m_n, m_neighbors; k_e, kmeans_estimator.

References

  1. Gupta, A.; Lohani, M.; Manchanda, M. Financial fraud detection using naive bayes algorithm in highly imbalance data set. J. Discret. Math. Sci. Cryptogr. 2021, 24, 1559–1572. [Google Scholar] [CrossRef]
  2. Gu, Q.; Cai, Z.; Zhu, L.; Huang, B. Data mining on imbalanced data sets. In Proceedings of the 2008 International Conference on Advanced Computer Theory and Engineering, Phuket, Thailand, 20–22 December 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 1020–1024. [Google Scholar]
  3. Jiang, Z.; Pan, T.; Zhang, C.; Yang, J. A new oversampling method based on the classification contribution degree. Symmetry 2021, 13, 194. [Google Scholar] [CrossRef]
  4. Gonzalez-Cuautle, D.; Hernandez-Suarez, A.; Sanchez-Perez, G.; Toscano-Medina, L.K.; Portillo-Portillo, J.; Olivares-Mercado, J.; Perez-Meana, H.M.; Sandoval-Orozco, A.L. Synthetic minority oversampling technique for optimizing classification tasks in botnet and intrusion-detection-system datasets. Appl. Sci. 2020, 10, 794. [Google Scholar] [CrossRef]
  5. Liu, J.; Gao, Y.; Hu, F. A fast network intrusion detection system using adaptive synthetic oversampling and LightGBM. Comput. Secur. 2021, 106, 102289. [Google Scholar] [CrossRef]
  6. Guzmán-Ponce, A.; Valdovinos, R.M.; Sánchez, J.S.; Marcial-Romero, J.R. A new under-sampling method to face class overlap and imbalance. Appl. Sci. 2020, 10, 5164. [Google Scholar] [CrossRef]
  7. Dai, Q.; Liu, J.w.; Liu, Y. Multi-granularity relabeled under-sampling algorithm for imbalanced data. Appl. Soft Comput. 2022, 124, 109083. [Google Scholar] [CrossRef]
  8. Aridas, C.K.; Karlos, S.; Kanas, V.G.; Fazakis, N.; Kotsiantis, S.B. Uncertainty based under-sampling for learning naive bayes classifiers under imbalanced data sets. IEEE Access 2019, 8, 2122–2133. [Google Scholar] [CrossRef]
  9. Jiang, K.; Wang, W.; Wang, A.; Wu, H. Network intrusion detection combined hybrid sampling with deep hierarchical network. IEEE Access 2020, 8, 32464–32476. [Google Scholar] [CrossRef]
  10. Xu, Z.; Shen, D.; Nie, T.; Kou, Y. A hybrid sampling algorithm combining M-SMOTE and ENN based on Random forest for medical imbalanced data. J. Biomed. Informatics 2020, 107, 103465. [Google Scholar] [CrossRef]
  11. Sowah, R.A.; Kuditchar, B.; Mills, G.A.; Acakpovi, A.; Twum, R.A.; Buah, G.; Agboyi, R. HCBST: An efficient hybrid sampling technique for class imbalance problems. ACM Trans. Knowl. Discov. Data (TKDD) 2021, 16, 1–37. [Google Scholar] [CrossRef]
  12. Zhu, T.; Lin, Y.; Liu, Y. Improving interpolation-based oversampling for imbalanced data learning. Knowl.-Based Syst. 2020, 187, 104826. [Google Scholar] [CrossRef]
  13. Alkan, O.; Wei, D.; Mattetti, M.; Nair, R.; Daly, E.; Saha, D. FROTE: Feedback rule-driven oversampling for editing models. Proc. Mach. Learn. Syst. 2022, 4, 276–301. [Google Scholar]
  14. Paz, I. On-the-Fly Synthesizer Programming with Rule Learning. Ph.D. Thesis, Universitat Politècnica de Catalunya—BarcelonaTech, Catalonia, Spain, 2021. [Google Scholar]
  15. Islam, A.; Belhaouari, S.B.; Rehman, A.U.; Bensmail, H. KNNOR: An oversampling technique for imbalanced datasets. Appl. Soft Comput. 2022, 115, 108288. [Google Scholar] [CrossRef]
  16. Zhu, T.; Lin, Y.; Liu, Y. Synthetic minority oversampling technique for multiclass imbalance problems. Pattern Recognit. 2017, 72, 327–340. [Google Scholar] [CrossRef]
  17. Vuttipittayamongkol, P.; Elyan, E. Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inf. Sci. 2020, 509, 47–70. [Google Scholar] [CrossRef]
  18. Li, F.; Zhang, X.; Zhang, X.; Du, C.; Xu, Y.; Tian, Y.C. Cost-sensitive and hybrid-attribute measure multi-decision tree over imbalanced data sets. Inf. Sci. 2018, 422, 242–256. [Google Scholar] [CrossRef]
  19. Zhang, C.; Tan, K.C.; Li, H.; Hong, G.S. A cost-sensitive deep belief network for imbalanced classification. IEEE Trans. Neural Networks Learn. Syst. 2018, 30, 109–122. [Google Scholar] [CrossRef]
  20. Peng, P.; Zhang, W.; Zhang, Y.; Xu, Y.; Wang, H.; Zhang, H. Cost sensitive active learning using bidirectional gated recurrent neural networks for imbalanced fault diagnosis. Neurocomputing 2020, 407, 232–245. [Google Scholar] [CrossRef]
  21. Leevy, J.L.; Khoshgoftaar, T.M.; Bauder, R.A.; Seliya, N. A survey on addressing high-class imbalance in big data. J. Big Data 2018, 5, 1–30. [Google Scholar] [CrossRef]
  22. Kubat, M.; Matwin, S. Addressing the curse of imbalanced training sets: One-sided selection. Proc. ICML Citeseer 1997, 97, 179. [Google Scholar]
  23. Yen, S.J.; Lee, Y.S. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst. Appl. 2009, 36, 5718–5727. [Google Scholar] [CrossRef]
  24. Lin, W.C.; Tsai, C.F.; Hu, Y.H.; Jhang, J.S. Clustering-based undersampling in class-imbalanced data. Inf. Sci. 2017, 409, 17–26. [Google Scholar] [CrossRef]
  25. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  26. Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the Advances in Intelligent Computing: International Conference on Intelligent Computing, ICIC 2005, Hefei, China, 23–26 August 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887. [Google Scholar]
  27. Barua, S.; Islam, M.M.; Yao, X.; Murase, K. MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 2012, 26, 405–425. [Google Scholar] [CrossRef]
  28. Zhang, X.; Ma, D.; Gan, L.; Jiang, S.; Agam, G. Cgmos: Certainty guided minority oversampling. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, 24–28 October 2016; pp. 1623–1631. [Google Scholar]
  29. Douzas, G.; Bacao, F.; Last, F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf. Sci. 2018, 465, 1–20. [Google Scholar] [CrossRef]
  30. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 1322–1328. [Google Scholar]
  31. Maldonado, S.; López, J.; Vairetti, C. An alternative SMOTE oversampling strategy for high-dimensional datasets. Appl. Soft Comput. 2019, 76, 380–389. [Google Scholar] [CrossRef]
  32. Li, M.; Xiong, A.; Wang, L.; Deng, S.; Ye, J. ACO Resampling: Enhancing the performance of oversampling methods for class imbalance classification. Knowl.-Based Syst. 2020, 196, 105818. [Google Scholar] [CrossRef]
  33. Mirzaei, B.; Nikpour, B.; Nezamabadi-pour, H. CDBH: A clustering and density-based hybrid approach for imbalanced data classification. Expert Syst. Appl. 2021, 164, 114035. [Google Scholar] [CrossRef]
  34. Ai-jun, L.; Peng, Z. Research on Unbalanced Data Processing Algorithm Base Tomeklinks-Smote. In Proceedings of the 2020 3rd International Conference on Artificial Intelligence and Pattern Recognition, Xiamen China, 26–28 June 2020; pp. 13–17. [Google Scholar]
  35. Cui, Y.; Jia, M.; Lin, T.Y.; Song, Y.; Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9268–9277. [Google Scholar]
  36. Mullick, S.S.; Datta, S.; Das, S. Generative adversarial minority oversampling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1695–1704. [Google Scholar]
  37. Kim, J.; Jeong, J.; Shin, J. M2m: Imbalanced classification via major-to-minor translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13896–13905. [Google Scholar]
  38. Cui, J.; Zhong, Z.; Liu, S.; Yu, B.; Jia, J. Parametric contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 715–724. [Google Scholar]
  39. Wang, P.; Han, K.; Wei, X.S.; Zhang, L.; Wang, L. Contrastive learning based hybrid networks for long-tailed image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 943–952. [Google Scholar]
  40. Shi, M.; Tang, Y.; Zhu, X.; Wilson, D.; Liu, J. Multi-class imbalanced graph convolutional network learning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), Yokohama, Japan, 11–17 July 2020. [Google Scholar]
  41. Zhao, T.; Zhang, X.; Wang, S. Graphsmote: Imbalanced node classification on graphs with graph neural networks. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Virtual, 8–12 March 2021; pp. 833–841. [Google Scholar]
  42. Qu, L.; Zhu, H.; Zheng, R.; Shi, Y.; Yin, H. Imgagn: Imbalanced network embedding via generative adversarial graph networks. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual, 14–18 August 2021; pp. 1390–1398. [Google Scholar]
  43. Huynh, T.; Nibali, A.; He, Z. Semi-supervised learning for medical image classification using imbalanced training data. Comput. Methods Programs Biomed. 2022, 216, 106628. [Google Scholar] [CrossRef]
  44. Hyun, M.; Jeong, J.; Kwak, N. Class-imbalanced semi-supervised learning. arXiv 2020, arXiv:2002.06815. [Google Scholar]
  45. Liu, G.; Yang, Y.; Li, B. Fuzzy rule-based oversampling technique for imbalanced and incomplete data learning. Knowl.-Based Syst. 2018, 158, 154–174. [Google Scholar] [CrossRef]
  46. Paz, I.; Nebot, À.; Mugica, F.; Romero, E. On-The-Fly Syntheziser Programming with Fuzzy Rule Learning. Entropy 2020, 22, 969. [Google Scholar] [CrossRef] [PubMed]
  47. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. The flowchart of Ad-RuLer for imbalanced data processing.
Figure 2. The synthetic data generated via Ad-RuLer for the Yeast3 dataset (a) and Heart Disease dataset (b).
Figure 3. Impact of variations in the d parameter on the performance metrics of the Ad-RuLer algorithm.
Figure 4. Impact of variations in the ratio parameter on the performance metrics of the Ad-RuLer algorithm.
Figure 5. Average ranking of resampling algorithms for all three classifiers on all datasets. (a) Average ranking for G-mean. (b) Average ranking for F1 score. (c) Average ranking for AUC. (d) Average ranking for Precision.
Figure 6. Critical differences between different resampling algorithms on all datasets using Nemenyi post hoc test. (a) Critical differences calculation for G-mean. (b) Critical differences calculation for F1 score. (c) Critical differences calculation for AUC. (d) Critical differences calculation for Precision.
Figure 7. Average rankings of resampling methods in low IR dataset group.
Figure 8. Average rankings of resampling methods in high IR dataset group.
Table 1. Details of the dataset used in the study.
| Num | Dataset     | S    | S_min | S_maj | F  | IR    |
|-----|-------------|------|-------|-------|----|-------|
| 1   | Heart       | 676  | 265   | 411   | 15 | 1.55  |
| 2   | Breast      | 569  | 212   | 357   | 31 | 1.68  |
| 3   | Pima        | 768  | 268   | 500   | 8  | 1.87  |
| 4   | Vehicle0    | 846  | 199   | 647   | 18 | 3.25  |
| 5   | Ecoli1      | 336  | 77    | 259   | 7  | 3.36  |
| 6   | Yeast3      | 1484 | 163   | 1321  | 8  | 8.10  |
| 7   | Yeast-0-2   | 1004 | 99    | 905   | 8  | 9.14  |
| 8   | Abalone9    | 731  | 42    | 689   | 8  | 16.40 |
| 9   | Car-good    | 1728 | 69    | 1659  | 6  | 24.04 |
| 10  | Winequality | 1599 | 53    | 1546  | 11 | 29.17 |
Title Abbreviations: S, Total samples; S min , Minority class samples; S maj , Majority class samples; F, Number of features; IR, Imbalance Ratio. Abbreviations for datasets: Heart, Heart Disease; Breast, Breast Cancer; Yeast-0-2, Yeast-0-2-5-7-9vs3-6-8; Abalone9, Abalone9-18; Winequality, winequality-red-4.
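The imbalance ratio in Table 1 is simply the majority-to-minority sample count ratio; for example, Abalone9-18 has 689 majority and 42 minority samples, giving IR = 689/42 ≈ 16.40. A one-line helper for reference:

```python
def imbalance_ratio(s_min, s_maj):
    """IR = majority class size / minority class size, as in Table 1."""
    return s_maj / s_min

# Abalone9-18 from Table 1: 42 minority vs. 689 majority samples
ir = imbalance_ratio(42, 689)  # ~16.40
```

The datasets in Table 1 span IR values from 1.55 (Heart) to 29.17 (winequality-red-4), which is what allows the paper to compare resampling methods across low- and high-imbalance regimes.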
Table 2. Performance comparison between different data resampling approaches using logistic regression.
| Dataset                | Algorithm        | G-Mean | F1 Score | AUC   | Precision |
|------------------------|------------------|--------|----------|-------|-----------|
| Heart Disease          | Ad-RuLer         | 0.769  | 0.721    | 0.841 | 0.708     |
|                        | SMOTE            | 0.756  | 0.703    | 0.824 | 0.693     |
|                        | ADASYN           | 0.761  | 0.709    | 0.823 | 0.690     |
|                        | Tomek-links      | 0.753  | 0.700    | 0.823 | 0.693     |
|                        | Borderline-SMOTE | 0.758  | 0.706    | 0.821 | 0.688     |
|                        | KMeansSMOTE      | 0.748  | 0.694    | 0.821 | 0.719     |
| Breast Cancer          | Ad-RuLer         | 0.970  | 0.963    | 0.993 | 0.968     |
|                        | SMOTE            | 0.936  | 0.919    | 0.988 | 0.918     |
|                        | ADASYN           | 0.936  | 0.913    | 0.986 | 0.879     |
|                        | Tomek-links      | 0.932  | 0.913    | 0.985 | 0.909     |
|                        | Borderline-SMOTE | 0.933  | 0.910    | 0.985 | 0.874     |
|                        | KMeansSMOTE      | 0.931  | 0.915    | 0.986 | 0.925     |
| Pima                   | Ad-RuLer         | 0.733  | 0.659    | 0.827 | 0.632     |
|                        | SMOTE            | 0.729  | 0.652    | 0.817 | 0.605     |
|                        | ADASYN           | 0.726  | 0.651    | 0.813 | 0.586     |
|                        | Tomek-links      | 0.736  | 0.660    | 0.818 | 0.612     |
|                        | Borderline-SMOTE | 0.733  | 0.660    | 0.812 | 0.585     |
|                        | KMeansSMOTE      | 0.719  | 0.642    | 0.826 | 0.662     |
| Vehicle0               | Ad-RuLer         | 0.977  | 0.941    | 0.996 | 0.897     |
|                        | SMOTE            | 0.960  | 0.931    | 0.993 | 0.915     |
|                        | ADASYN           | 0.958  | 0.922    | 0.994 | 0.894     |
|                        | Tomek-links      | 0.959  | 0.929    | 0.993 | 0.913     |
|                        | Borderline-SMOTE | 0.959  | 0.927    | 0.994 | 0.905     |
|                        | KMeansSMOTE      | 0.959  | 0.929    | 0.993 | 0.913     |
| Ecoli1                 | Ad-RuLer         | 0.864  | 0.749    | 0.946 | 0.667     |
|                        | SMOTE            | 0.860  | 0.731    | 0.946 | 0.638     |
|                        | ADASYN           | 0.877  | 0.739    | 0.943 | 0.622     |
|                        | Tomek-links      | 0.861  | 0.732    | 0.946 | 0.637     |
|                        | Borderline-SMOTE | 0.891  | 0.759    | 0.940 | 0.642     |
|                        | KMeansSMOTE      | 0.846  | 0.728    | 0.944 | 0.669     |
| Yeast3                 | Ad-RuLer         | 0.890  | 0.685    | 0.963 | 0.570     |
|                        | SMOTE            | 0.885  | 0.685    | 0.964 | 0.575     |
|                        | ADASYN           | 0.907  | 0.655    | 0.965 | 0.509     |
|                        | Tomek-links      | 0.885  | 0.685    | 0.964 | 0.576     |
|                        | Borderline-SMOTE | 0.906  | 0.650    | 0.964 | 0.502     |
|                        | KMeansSMOTE      | 0.860  | 0.717    | 0.952 | 0.669     |
| Yeast-0-2-5-7-9vs3-6-8 | Ad-RuLer         | 0.882  | 0.631    | 0.929 | 0.500     |
|                        | SMOTE            | 0.882  | 0.661    | 0.923 | 0.535     |
|                        | ADASYN           | 0.855  | 0.512    | 0.922 | 0.365     |
|                        | Tomek-links      | 0.876  | 0.663    | 0.923 | 0.538     |
|                        | Borderline-SMOTE | 0.864  | 0.562    | 0.916 | 0.430     |
|                        | KMeansSMOTE      | 0.596  | 0.500    | 0.920 | 0.878     |
| Abalone9-18            | Ad-RuLer         | 0.834  | 0.425    | 0.937 | 0.291     |
|                        | SMOTE            | 0.792  | 0.355    | 0.884 | 0.238     |
|                        | ADASYN           | 0.794  | 0.348    | 0.884 | 0.231     |
|                        | Tomek-links      | 0.791  | 0.352    | 0.883 | 0.235     |
|                        | Borderline-SMOTE | 0.723  | 0.309    | 0.851 | 0.211     |
|                        | KMeansSMOTE      | 0.562  | 0.357    | 0.801 | 0.453     |
| Car-good               | Ad-RuLer         | 0.970  | 0.581    | 0.978 | 0.410     |
|                        | SMOTE            | 0.968  | 0.561    | 0.976 | 0.393     |
|                        | ADASYN           | 0.967  | 0.556    | 0.977 | 0.389     |
|                        | Tomek-links      | 0.968  | 0.561    | 0.974 | 0.393     |
|                        | Borderline-SMOTE | 0.968  | 0.559    | 0.978 | 0.391     |
|                        | KMeansSMOTE      | 0.941  | 0.558    | 0.974 | 0.410     |
| winequality-red-4      | Ad-RuLer         | 0.646  | 0.121    | 0.705 | 0.068     |
|                        | SMOTE            | 0.656  | 0.125    | 0.695 | 0.070     |
|                        | ADASYN           | 0.658  | 0.124    | 0.697 | 0.070     |
|                        | Tomek-links      | 0.651  | 0.124    | 0.705 | 0.069     |
|                        | Borderline-SMOTE | 0.646  | 0.158    | 0.711 | 0.094     |
|                        | KMeansSMOTE      | 0.642  | 0.136    | 0.703 | 0.078     |
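All four metrics in Table 2 can be derived from a binary confusion matrix with the minority class treated as positive (AUC additionally requires the predicted scores, so it is omitted here). The sketch below uses the standard textbook definitions; in practice, libraries such as scikit-learn provide equivalent functions:

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """G-mean, F1, and precision from a binary confusion matrix,
    with the minority class as the positive class."""
    sensitivity = tp / (tp + fn)   # recall on the minority class
    specificity = tn / (tn + fp)   # recall on the majority class
    precision = tp / (tp + fp)
    g_mean = math.sqrt(sensitivity * specificity)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return g_mean, f1, precision

# Hypothetical confusion matrix: 40 TP, 10 FN, 20 FP, 130 TN
g, f1, prec = imbalance_metrics(tp=40, fn=10, fp=20, tn=130)
```

G-mean is the preferred headline metric here because it penalizes a classifier that sacrifices minority recall for majority accuracy, which is exactly the failure mode imbalanced data induces.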
Table 3. Results of the Wilcoxon signed-rank test performed on all datasets, with Ad-RuLer serving as the control algorithm. A p-value below 0.05 for a particular method indicates a significant performance difference between that method and Ad-RuLer.
| Classifier | Metric | SMO    | ADA    | T-L    | B-SMO  | K-SMO  |
|------------|--------|--------|--------|--------|--------|--------|
| LR         | GM     | 0.0380 | 0.2324 | 0.0371 | 0.2070 | 0.0020 |
|            | F1     | 0.0856 | 0.0039 | 0.1386 | 0.0840 | 0.0644 |
|            | AUC    | 0.0108 | 0.0010 | 0.0173 | 0.0328 | 0.0020 |
|            | Prec   | 0.2754 | 0.0039 | 0.2324 | 0.0273 | 0.0506 |
| RF         | GM     | 0.6784 | 0.9219 | 0.7671 | 0.0371 | 0.0039 |
|            | F1     | 0.2754 | 0.1601 | 0.3139 | 0.0195 | 0.0195 |
|            | AUC    | 0.3223 | 0.2135 | 0.3223 | 0.1727 | 0.1309 |
|            | Prec   | 0.9219 | 0.3223 | 0.7695 | 0.7695 | 0.5566 |
| XGB        | GM     | 0.4316 | 0.2754 | 0.3222 | 0.0371 | 0.0371 |
|            | F1     | 0.6250 | 0.1309 | 0.4316 | 0.1614 | 0.1602 |
|            | AUC    | 0.4922 | 0.3429 | 0.4922 | 0.3139 | 0.1933 |
|            | Prec   | 0.6250 | 0.2324 | 0.7695 | 0.4922 | 0.1055 |
Approach abbreviations: SMO, SMOTE; ADA, ADASYN; T-L, Tomek-links; B-SMO, Borderline-SMOTE; K-SMO, KMeansSMOTE; LR, Logistic Regression; RF, Random Forest; XGB, XGBoost; GM, G-mean; F1, F1 score; Prec, Precision.
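Interpreting Table 3 reduces to comparing each p-value against α = 0.05: for logistic regression on G-mean, for instance, SMOTE, Tomek-links, and KMeansSMOTE all differ significantly from the Ad-RuLer control. A small helper illustrating that decision rule (the p-values below are the LR/GM row of Table 3):

```python
def significant_vs_control(p_values, alpha=0.05):
    """Return the methods whose Wilcoxon signed-rank p-value against the
    control algorithm (Ad-RuLer) falls below the significance threshold."""
    return sorted(method for method, p in p_values.items() if p < alpha)

# Logistic regression, G-mean row of Table 3
lr_gmean = {"SMO": 0.0380, "ADA": 0.2324, "T-L": 0.0371,
            "B-SMO": 0.2070, "K-SMO": 0.0020}
flagged = significant_vs_control(lr_gmean)
```

The p-values themselves come from `scipy.stats.wilcoxon` or an equivalent paired, non-parametric test applied to the per-dataset scores of Ad-RuLer versus each competitor.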
Zhang, X.; Paz, I.; Nebot, À.; Mugica, F.; Romero, E. Ad-RuLer: A Novel Rule-Driven Data Synthesis Technique for Imbalanced Classification. Appl. Sci. 2023, 13, 12636. https://doi.org/10.3390/app132312636