1. Introduction
With the rapid development of artificial intelligence technology, large-scale data processing and analysis are widely used in many fields of production and life, such as fault detection, medical diagnosis, and malicious script identification. However, because actual data are difficult to sample and fault samples are scarce, the data used for fault prediction and diagnosis often suffer from class imbalance [1]. An imbalanced dataset is one in which the numbers of samples in the different classes differ greatly: the class with significantly fewer samples is called the minority (positive) class, and the class with many samples is called the majority (negative) class [2]. Because of this large disparity in class sizes, the imbalance problem is inherently asymmetric, and the asymmetric distribution of samples across classes can significantly degrade the performance of artificial intelligence algorithms in tasks such as fault diagnosis and fault prediction. When applied to imbalanced datasets, such algorithms tend to make biased judgments that favor the majority class, resulting in poor performance on the minority class. Yet it is often the minority class samples that carry the crucial information. For example, in the monitoring and maintenance of railway signal equipment, equipment faults are rare events, but if a fault is misjudged as no fault, the best maintenance window may be missed, leading to significant losses of life and property [3].
Mainstream classification methods, such as support vector machines, random forests, and K-nearest neighbors, are designed to maximize overall classification accuracy. On imbalanced data, this often yields high accuracy on the majority class but low accuracy on the minority class, which degrades the overall performance of the classifier [4,5,6]. How to improve classifier performance on imbalanced data has therefore become a research hotspot in machine learning and data mining [7,8].
There are two main approaches to imbalanced dataset classification: algorithm-level improvement and data-level improvement.
In algorithm-level improvement, the characteristics of imbalanced datasets are analyzed, and traditional classification methods are modified, or new classification algorithms are designed, to meet the performance requirements of imbalanced data classification. Common approaches include cost-sensitive learning [9], ensemble learning [10], and one-class learning [11].
In data-level improvement, the general idea is to change the data distribution, transforming the imbalanced dataset into one with a smaller imbalance ratio through oversampling or undersampling. Oversampling increases the number of minority class samples by specific methods, so that together with the original majority class samples they form a new, relatively balanced dataset. Undersampling does the opposite: it discards a certain number of majority class samples, so that the remaining majority class samples together with the minority class samples form a new, relatively balanced dataset.
Data-level improvements offer a convenient and flexible way to enhance classification performance on imbalanced datasets. Compared with undersampling, oversampling retains all of the original data and offers greater applicability and reliability, so most researchers choose to increase the number of minority class samples. Oversampling ameliorates the imbalance of the dataset, and training an artificial intelligence model on the rebalanced dataset can greatly improve its accuracy on diagnosis and prediction problems.
The most representative and widely used oversampling algorithm is the Synthetic Minority Oversampling Technique (SMOTE) [12]. However, because SMOTE does not further screen the minority samples, it synthesizes new samples somewhat blindly and cannot effectively amplify the boundary features between the majority and minority classes. Many scholars have therefore improved the algorithm. Han [13] proposed the Borderline-SMOTE algorithm, which oversamples only among the minority class samples at the decision boundary and does not further analyze the other types of minority class samples. He [14] introduced the idea of adaptivity into oversampling and proposed the ADASYN algorithm, which assigns higher weights to minority samples whose neighborhoods contain more majority class samples; however, this also amplifies the possibility of generating noise samples. Another weighting-based approach is Sebastián's FW-SMOTE [15]. Assuming that not all features are equally important, it uses a weighted Minkowski distance to define the neighborhood of each minority class instance, prioritizing the features most relevant to the classification task in an attempt to improve predictive performance on both low- and high-dimensional datasets. However, the method may not be effective when its assumptions do not hold.
Other improvement strategies focus on ensuring the safety of the newly generated samples. Bunkhumpornpat [16] proposed the Safe-Level-SMOTE method, which rates the safety of each minority class instance individually; by considering the safe-level ratio of instances, it prevents synthetic samples from being generated in inappropriate positions. The ASN-SMOTE method proposed by Yi [17] introduces a noise-filtering step and an adaptive neighbor selection mechanism, effectively avoiding the generation of synthetic minority instances inside majority class regions. Generating new minority class samples within a safer range avoids interference from synthetic samples, but in practice, adding many minority class samples to regions where minority samples are already dense provides little meaningful information to the classifier.
To address the above problems, this paper proposes an enhanced reclassification SMOTE algorithm, Gaussian interpolation SMOTE (GI-SMOTE), to obtain more effective synthetic balanced datasets. The method stems from the idea of optimizing the interpolation process: it leverages the properties of the Gaussian distribution to bias synthetic samples toward the class boundaries, helping the classifier make more precise decisions there. First, the K-Nearest Neighbor (KNN) [18] algorithm is used to partition the minority samples and select minority class samples of characteristic types for subsequent processing. Then, during interpolation, different data generation strategies are chosen according to the pairwise combination of minority sample types, and the simple uniform random distribution of SMOTE is replaced by a Gaussian probability distribution for specific combinations. The proposed algorithm is compared with the classical SMOTE [12], Borderline-SMOTE [13], and ADASYN [14] algorithms through experiments on imbalanced benchmark datasets. The experimental results show that the proposed method effectively addresses the problems faced by the traditional algorithms and improves the effectiveness of classification algorithms on imbalanced datasets, verifying the feasibility of improving oversampling algorithms by optimizing the interpolation process.
3. GI-SMOTE Algorithm
In order to make the newly generated samples contain more of the information represented by the minority class samples, this paper proposes a reclassification SMOTE algorithm that incorporates the probability characteristics of the Gaussian distribution. On the one hand, the idea of the KNN algorithm is used to classify each minority class sample according to the types of its neighbors in the overall dataset, and the isolated minority class samples that fall among the majority class samples are filtered out. This avoids generating noise signals from such samples and solves the first problem of the SMOTE algorithm described above. On the other hand, special treatment is given to minority class samples whose nearest neighbors contain a higher proportion of majority class samples. Using a Gaussian distribution instead of a uniform random distribution for interpolation emphasizes the class boundary features while generating fewer samples that could interfere with classification performance, thereby addressing the second problem of the SMOTE algorithm.
3.1. Classification Stage
In dataset $D$, $X$ is used to denote the minority sample set, $Y$ the majority sample set, and the numbers of minority and majority samples are $m$ and $n$, respectively.
At this stage (Algorithm 1), we classify the minority class samples separately. The distances between each minority class sample and all sample points are calculated, and the KNN algorithm is used to find the $k$ nearest neighbors of each sample $x_i$. The value of $k$ can be adjusted according to the distribution characteristics of the dataset. The number of minority class samples among the $k$ nearest neighbors is denoted $k'$, and the number of majority class samples is denoted $k''$, so that $k' + k'' = k$.
According to the ratio of $k''$ to $k$, each minority class sample is divided into three categories, denoted "Noise", "Danger", and "Safe". If $k'' = k$, the sample point is completely surrounded by majority class samples and is classified as "Noise". If $0 \le k'' < k/2$, the sample point has only a small probability of being misclassified and is regarded as "Safe". If $k/2 \le k'' < k$, the sample has a non-negligible probability of being misclassified as a majority class sample and is regarded as "Danger". The classification strategy is shown in Figure 1.
Algorithm 1 Pseudo Code for the Classification Stage

Input: minority sample set X; majority sample set Y; number of minority samples m; number of majority samples n; number of nearest neighbors k for each minority sample; number of sample features f
Output: Noise sample set X_noise; Danger sample set X_danger; Safe sample set X_safe

 1: X_noise ← ∅; X_danger ← ∅; X_safe ← ∅
 2: D ← X ∪ Y
 3: i ← 1
 4: while i ≤ m do
 5:     j ← 1
 6:     while j ≤ m + n do
 7:         d_ij ← distance(x_i, s_j)    ▹ s_j is a sample in D
 8:         j ← j + 1
 9:     end while
10:     get the k nearest neighbors of x_i by sorting the distances d_ij
11:     i ← i + 1
12: end while
13: i ← 1
14: while i ≤ m do
15:     k′ ← the number of minority samples among the k nearest neighbors of x_i
16:     k″ ← the number of majority samples among the k nearest neighbors of x_i
17:     if k″ = k then
18:         X_noise ← X_noise ∪ {x_i}
19:     else if 0 ≤ k″ < k/2 then
20:         X_safe ← X_safe ∪ {x_i}
21:     else if k/2 ≤ k″ < k then
22:         X_danger ← X_danger ∪ {x_i}
23:     end if
24:     i ← i + 1
25: end while
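For concreteness, the following is a minimal Python sketch of the classification stage under stated assumptions: NumPy arrays, Euclidean distance, and the Noise/Safe/Danger thresholds described above. The function name `classify_minority` and its defaults are illustrative, not part of the original pseudocode.

```python
import numpy as np

def classify_minority(X_min, X_maj, k=5):
    """Split minority samples into Noise / Danger / Safe sets (sketch of Algorithm 1)."""
    D = np.vstack([X_min, X_maj])
    # label 1 marks a minority sample, 0 a majority sample
    labels = np.array([1] * len(X_min) + [0] * len(X_maj))
    noise, danger, safe = [], [], []
    for i, x in enumerate(X_min):
        dist = np.linalg.norm(D - x, axis=1)   # distances to every sample in D
        dist[i] = np.inf                       # exclude x itself from its neighbors
        nn = np.argsort(dist)[:k]              # indices of the k nearest neighbors
        k2 = int(np.sum(labels[nn] == 0))      # majority neighbors, k'' in the text
        if k2 == k:
            noise.append(x)                    # completely surrounded by majority samples
        elif k2 < k / 2:
            safe.append(x)                     # small probability of misclassification
        else:
            danger.append(x)                   # near the class boundary
    return np.array(noise), np.array(danger), np.array(safe)
```

Note that the `else` branch places samples with exactly half majority neighbors into the Danger set, matching the $k/2 \le k'' < k$ rule.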
3.2. Data Synthesis Stage
In order to prevent the newly synthesized samples from carrying interference information, the improved oversampling method does not use "Noise" samples in the data synthesis stage. The "Danger" and "Safe" samples are combined into a new minority sample set $X'$ for the interpolation process. For each "Danger" sample point, $k$ nearest neighbors are selected from the sample set $X'$, and new samples are synthesized using different strategies according to the type of the chosen nearest neighbor.
At this stage (Algorithm 2), following the SMOTE requirement that the interpolation parameter take values in the interval (0, 1), the idea of the Gaussian distribution is introduced: an operator with the randomness characteristics of a Gaussian distribution replaces the uniformly distributed random operator when synthesizing new samples.
If the neighbor sample $\hat{x}_{ij}$ corresponding to the "Danger" sample $x_i$ belongs to "Safe", then, in order to make the newly synthesized point carry more attribute information of the "Danger" sample, the new sample is synthesized by interpolation using (2):

$p = x_i + \delta \cdot (\hat{x}_{ij} - x_i)$    (2)

In the above formula, $\delta$ is a random number generated in the interval (0, 1) by a Gaussian distribution with mean 0 and standard deviation $\sigma_1$. The mean of the Gaussian distribution is set to 0 because new samples should be generated as close to the "Danger" sample as possible. According to the characteristics of the Gaussian distribution, $P(\mu - 3\sigma_1 < \delta < \mu + 3\sigma_1) \approx 99.7\%$; since only positive draws are kept, the probability of $\delta$ taking a value in the interval $(0, 3\sigma_1)$ is also approximately $99.7\%$.
Algorithm 2 Pseudo Code for the Data Synthesis Stage

Input: Safe sample set X_safe; Danger sample set X_danger; standard deviations σ1 and σ2; number of nearest neighbors k for each Danger sample
Output: a new synthetic sample p

 1: X′ ← X_safe ∪ X_danger
 2: get x_i from X_danger    ▹ i is a subscript of samples
 3: get x̂_ij from the k nearest neighbors of x_i in X′    ▹ j is a subscript of the nearest neighbors of x_i
 4: δ ← 0
 5: r ← 0
 6: if x̂_ij ∈ X_safe then
 7:     while δ ≤ 0 or δ ≥ 1 do
 8:         δ ← random.gauss(0, σ1)
 9:     end while
10:     p ← x_i + δ(x̂_ij − x_i)
11: else if x̂_ij ∈ X_danger then
12:     while δ ≤ 0 or δ ≥ 1 do
13:         r ← random.randint(0, 1)
14:         if r = 0 then
15:             δ ← random.gauss(0, σ2)
16:         else if r = 1 then
17:             δ ← random.gauss(1, σ2)
18:         end if
19:     end while
20:     p ← x_i + δ(x̂_ij − x_i)
21: end if
If the nearest neighbor sample $\hat{x}_{ij}$ corresponding to the "Danger" sample $x_i$ also belongs to "Danger", the new sample is synthesized by interpolation using (3):

$p = x_i + \delta' \cdot (\hat{x}_{ij} - x_i)$    (3)

where $\delta'$ is drawn from a mixture of two Gaussian distributions, one with mean 0 and the other with mean 1, both with standard deviation $\sigma_2$. Generally, $\sigma_2$ is set in advance to a small fixed value. The two Gaussian distributions are each adopted with probability 1/2. In this way, the newly synthesized data lie close to one of the two "Danger" samples and thus carry less disturbing information.
The newly synthesized minority samples are combined with the initial samples to obtain a new balanced dataset $D'$.
4. Experiments
In order to verify the effectiveness of the improved algorithm, we compared its performance with no-sampling, SMOTE, Borderline-SMOTE, and ADASYN on imbalanced datasets. A number of representative imbalanced datasets from UCI [19] and KEEL [20] were collected as experimental objects. We divide each dataset into a training set and a test set according to the "80-20 rule" [2] for the comparison experiments; the "80-20 rule" refers to the practice of using approximately 80% of the available data for training and 20% for testing when training and evaluating machine learning models. To overcome the randomness of the synthesis algorithms, the results below are averages over 10 trials. In this experiment, the values of $k$ and the oversampling amount were set according to prior oversampling research [14,21,22] and experimental experience, and fixed values of the standard deviation parameters $\sigma_1$ and $\sigma_2$ were used when oversampling these datasets with the GI-SMOTE algorithm. The values of $\sigma_1$ and $\sigma_2$ were determined subjectively based on experimental experience. It is worth noting that the optimal standard deviation parameters can vary across datasets, depending on the distribution characteristics of the minority and majority class samples within each dataset.
4.1. Visual Presentation
In order to show more intuitively the advantages of the GI-SMOTE algorithm in synthesizing new samples, we visualize the data distributions produced by no-sampling, SMOTE, and GI-SMOTE on three different datasets. Since the datasets themselves are high-dimensional, principal component analysis (PCA) [23,24] is used as a dimension-reduction tool to project their distributions into two dimensions [25]. The distributions of these samples are shown in Figure 2, where orange dots represent minority samples, blue dots represent majority samples, and green dots represent synthetic samples. The new synthetic samples of the GI-SMOTE method are more compact than those of the SMOTE method, which avoids blurring the class boundary. In particular, GI-SMOTE generates fewer new samples near the "Safe" minority samples, so its synthetic samples carry more of the information that differentiates the majority and minority classes. Moreover, GI-SMOTE generates no new samples around the noise points.
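A minimal version of this visualization, assuming scikit-learn's PCA and matplotlib, with the colors matching the description of Figure 2:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_distribution(X_maj, X_min, X_syn, title):
    """Project majority, minority, and synthetic samples to 2D and scatter-plot them."""
    pca = PCA(n_components=2).fit(np.vstack([X_maj, X_min]))  # fit on original data only
    for data, color, label in [(X_maj, "tab:blue", "majority"),
                               (X_min, "tab:orange", "minority"),
                               (X_syn, "tab:green", "synthetic")]:
        pts = pca.transform(data)
        plt.scatter(pts[:, 0], pts[:, 1], s=10, c=color, label=label)
    plt.title(title)
    plt.legend()
    plt.show()
```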
4.2. Evaluation Indicator
For imbalanced datasets, the traditional accuracy metric for classification results is not applicable [2]. In this paper, three indexes based on the confusion matrix, G-mean, F-measure, and AUC [26], are introduced as the evaluation indicators of this experiment.
The concept of the confusion matrix is shown in Table 1, where TP is the number of positive samples correctly classified by the classifier, TN is the number of negative samples correctly classified, FP is the number of negative samples incorrectly classified as positive, and FN is the number of positive samples incorrectly classified as negative.
G-mean measures the overall classification accuracy of the classifier on both positive and negative samples; the larger the index, the better the classification performance. G-mean is composed of two metrics, the true positive rate $TPR$ and the true negative rate $TNR$ [27]. The calculation formulas are shown in (4)–(6):

$TPR = \dfrac{TP}{TP + FN}$    (4)

$TNR = \dfrac{TN}{TN + FP}$    (5)

$\text{G-mean} = \sqrt{TPR \times TNR}$    (6)
F-measure measures the classification accuracy on the positive class samples; the larger the index, the better the classification performance. F-measure is composed of two metrics, $Precision$ and $Recall$ (where $Recall = TPR$). The calculation formulas are shown in (7) and (8):

$Precision = \dfrac{TP}{TP + FP}$    (7)

$\text{F-measure} = \dfrac{2 \times Precision \times Recall}{Precision + Recall}$    (8)
AUC is the area under the receiver operating characteristic (ROC) curve; a larger AUC indicates better classification performance. The ordinate of the ROC curve is TPR, the probability that positive class samples are correctly classified, calculated as in (9); the abscissa is FPR, the probability that negative class samples are misclassified, calculated as in (10):

$TPR = \dfrac{TP}{TP + FN}$    (9)

$FPR = \dfrac{FP}{FP + TN}$    (10)
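These three indicators can be computed directly from a classifier's predictions; the following is a sketch using scikit-learn, assuming the minority class is labeled 1:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

def imbalance_metrics(y_true, y_pred, y_score):
    """Return (G-mean, F-measure, AUC) for a binary problem with minority class 1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)           # Eqs. (4)/(9): recall on the positive class
    tnr = tn / (tn + fp)           # Eq. (5): specificity on the negative class
    g_mean = np.sqrt(tpr * tnr)    # Eq. (6)
    f_measure = f1_score(y_true, y_pred)    # Eqs. (7)-(8)
    auc = roc_auc_score(y_true, y_score)    # Eqs. (9)-(10) via the ROC curve
    return g_mean, f_measure, auc
```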
4.3. Preparation of Data
Representative datasets are selected from the databases as experimental objects: spambase, ecoli1, glass0, vehicle0, vehicle1, and yeast1; see Table 2 for a description of these datasets. Before performing the oversampling experiments, the data are preprocessed. First, each dataset is divided into a majority class and a minority class, and labels are added accordingly. Second, all samples within the same dataset must have a consistent number of features. Finally, the datasets are divided into training sets and test sets.
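A short sketch of this preprocessing, assuming NumPy arrays and a known minority label (both illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def prepare(X, y, minority_label):
    """Binarize labels (minority -> 1, rest -> 0), check feature consistency,
    and split into training and test sets."""
    assert X.ndim == 2, "all samples in a dataset must share one feature count"
    y_bin = (y == minority_label).astype(int)
    return train_test_split(X, y_bin, test_size=0.2, stratify=y_bin)
```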
5. Results and Discussion
Since SMOTE, Borderline-SMOTE, ADASYN, and GI-SMOTE are oversampling algorithms and have no classification function of their own, a KNN classifier [28] and an SVM classifier [29] were used, following research [17], to complete the control experiments and obtain the three evaluation indicators G-mean, F-measure, and AUC. The experimental results are shown in Table 3 and Table 4.
From Table 3 and Table 4, it can be concluded that the datasets processed by the GI-SMOTE oversampling algorithm yield significant improvements in G-mean, F-measure, and AUC compared with the original datasets, and GI-SMOTE performs better than the other oversampling algorithms in most cases. The experimental results show that GI-SMOTE achieves the best performance on at least two of the evaluation indicators on all six public datasets with both the KNN and SVM classifiers; on ecoli1, vehicle0, and yeast1, all three evaluation indicators of GI-SMOTE are optimal under both classifiers. Compared with the traditional SMOTE algorithm, GI-SMOTE improves the evaluation indicators by roughly 2% or more. On the one hand, the algorithm determines the class boundary between the majority and minority class samples more effectively and thus obtains better performance. On the other hand, the algorithm generalizes better across datasets and does not exhibit the overfitting that the ADASYN algorithm shows on individual datasets.
6. Conclusions
For the widespread problem of imbalanced dataset processing in real production and life, this paper analyzes the performance shortcomings of the traditional SMOTE algorithm at the data level, classifies the minority samples and configures a corresponding synthesis scheme for each type, and proposes an improved GI-SMOTE algorithm based on KNN and on optimizing the interpolation process. Experiments on various imbalanced datasets show that the improved algorithm can effectively improve classification performance on imbalanced datasets. Employing this improved algorithm to address imbalanced datasets leads to a notable improvement in the accuracy of artificial intelligence models on problems such as behavior prediction, fraud detection, and medical diagnosis.
This paper demonstrates that optimizing an oversampling algorithm by adjusting its interpolation process is feasible, providing new insights for future improvements of oversampling algorithms. However, the standard deviation parameters $\sigma_1$ and $\sigma_2$ used in the interpolation process must be set manually in advance, and their optimal values differ across datasets. A further research direction will be how to determine the most appropriate standard deviation parameters adaptively. In addition, compared with traditional oversampling algorithms, the proposed method requires more computing time, so optimizing the running time of the oversampling algorithm will also be one of the future research directions.