Article

Improved Oversampling Algorithm for Imbalanced Data Based on K-Nearest Neighbor and Interpolation Process Optimization

1 School of Railway Transportation, Shanghai Institute of Technology, Shanghai 201418, China
2 China Railway Siyuan Survey and Design Group Co., Ltd., Wuhan 430063, China
* Author to whom correspondence should be addressed.
Symmetry 2024, 16(3), 273; https://doi.org/10.3390/sym16030273
Submission received: 25 January 2024 / Revised: 22 February 2024 / Accepted: 23 February 2024 / Published: 26 February 2024
(This article belongs to the Section Computer)

Abstract

The problems of imbalanced datasets are generally considered asymmetric issues. In asymmetric problems, artificial intelligence models may exhibit different biases or preferences when dealing with different classes. When addressing class imbalance learning problems, a classification model pays too much attention to the majority class samples and cannot guarantee the classification performance of the minority class samples, which are often the more valuable ones. By synthesizing minority class samples and changing the data distribution, imbalanced datasets can be optimized. Traditional oversampling algorithms suffer from blindness and boundary ambiguity when synthesizing new samples. A modified reclassification algorithm based on the Gaussian distribution is put forward. First, the minority class samples are reclassified by the KNN algorithm. Then, different synthesis strategies are selected according to the combination of the minority class samples, and under certain classification conditions the Gaussian distribution is used in place of the uniform random distribution for the interpolation operation, reducing the possibility of generating noise samples. The experimental results indicate that the proposed oversampling algorithm achieves a performance improvement of 2–8% in evaluation metrics, including G-mean, F-measure, and AUC, compared to traditional oversampling algorithms.

1. Introduction

With the rapid development of artificial intelligence technology, extensive data processing and analysis are widely used in various fields of production and life, such as fault detection, medical diagnosis, and virus script judgment. However, because actual data are difficult to sample and fault information samples are few, the data used for fault prediction and diagnosis analysis suffer from class imbalance [1]. An imbalanced dataset is one in which there is a large gap in the number of samples across categories: the category with a significantly small number of samples is called the minority (positive) class, and the category with a large number of samples is called the majority (negative) class [2]. The problem of imbalanced datasets is characterized as an asymmetric issue because of the substantial disparity in the number of samples among different classes. This asymmetry in the sample distribution across classes can significantly impact the performance of artificial intelligence algorithms in tasks such as fault diagnosis and fault prediction. When handling imbalanced datasets, such algorithms may therefore make biased judgments that favor the majority class, resulting in suboptimal performance for the minority class. However, it is often the minority class samples that contain crucial information. For example, in the monitoring and maintenance of railway signal equipment, equipment faults are minority events, but if a fault is misjudged as no fault, the best maintenance time may be missed, resulting in significant personnel and property losses [3].
Mainstream classification methods, such as support vector machines, random forests, and K-nearest neighbors, are designed to maximize overall classification accuracy, which often yields high classification accuracy for the majority class but low accuracy for the minority class and thus reduces the overall performance of the classifier [4,5,6]. Therefore, how to effectively improve classifier performance on imbalanced data has become a hot spot in the fields of machine learning and data mining [7,8].
There are two main methods to deal with unbalanced dataset classification: algorithm-based improvement and data-based improvement.
In algorithm-based improvement, through analyzing the characteristics of unbalanced datasets, traditional classification methods are improved, or new classification algorithms are designed to meet the performance requirements of unbalanced data classification. Common algorithms include cost-sensitive learning [9], ensemble learning [10], single-class learning [11], and other methods.
For data-based improvement, the general idea is to change the data distribution, transforming the imbalanced dataset into one with a relatively small imbalance rate through oversampling or undersampling. Oversampling increases the number of minority class samples through specific methods, so that, combined with the original majority class samples, a new and relatively balanced dataset is obtained. Undersampling is the opposite: it selectively discards a certain number of samples from the existing majority class, so that the remaining majority class samples together with the minority class samples form a new, relatively balanced dataset.
Improvements at the data level offer a more convenient and flexible way to enhance classification performance on imbalanced datasets. Compared to undersampling, oversampling methods not only retain all of the original data but also offer greater applicability and reliability. Most scholars therefore focus their research on increasing the number of minority class samples. Oversampling can ameliorate the imbalance of the dataset, and training an artificial intelligence model on the optimized dataset can greatly improve its accuracy in diagnosis and prediction problems.
The most representative and widely used algorithm is the Synthetic Minority Over-sampling Technique (SMOTE) [12]. However, because it does not further screen the minority samples, there is a certain blindness in the process of synthesizing new samples, and the boundary features between majority and minority samples cannot be effectively amplified. Many scholars have therefore improved the algorithm. Han [13] put forward the Borderline-SMOTE algorithm, which oversamples only the minority class samples at the decision boundary to generate new samples but does not analyze the other types of minority class samples. He [14] introduced the idea of adaptive algorithms into oversampling and proposed the ADASYN algorithm, which assigns higher weights to minority samples whose neighborhoods contain more majority class samples; this, however, also amplifies the possibility of generating noise samples. Another approach that introduces the concept of weighting is Sebastián’s FW-SMOTE [15] method. Assuming that not all features are equally important, it uses a weighted Minkowski distance to define the neighborhood of each minority class instance. This prioritizes features more relevant to the classification task, attempting to achieve improved predictive performance in both low- and high-dimensional datasets. However, this method may not be effective when its assumptions are not met.
More improvement strategies focus on ensuring the safety of newly generated samples. Bunkhumpornpat [16] proposed the Safe-Level-SMOTE method, which individually rates the safety of instances in the minority class. By considering the safe level ratio of instances, this approach prevents the generation of synthetic samples in inappropriate positions. The ASN-SMOTE method proposed by Yi [17] introduces a noise filtering step and incorporates an adaptive neighbor selection mechanism. This method effectively avoids generating synthetic minority class instances in majority class regions. Generating new minority class samples within a safer range can effectively avoid interference from synthetic samples. However, in practice, adding too many minority class samples in regions where minority class samples are already dense may not provide meaningful information value for the classifier.
Based on the above problems, this paper proposes an enhanced reclassification SMOTE algorithm, Gaussian interpolation SMOTE (GI-SMOTE), to obtain more effective synthetic balanced datasets. The method stems from the idea of optimizing the interpolation process, leveraging the properties of the Gaussian distribution to bias synthetic samples towards the class boundaries and thus help the classifier make more precise decisions at class borders. First, the K-Nearest Neighbor (KNN) [18] algorithm is used to divide the minority samples and select the minority class samples of each characteristic type for subsequent processing. Then, during the interpolation operation, different data generation strategies are chosen according to the pairwise combination of minority sample types, and under specific combinations the simple uniform random distribution in SMOTE is replaced by a Gaussian probability distribution. The improved algorithm proposed in this paper is compared with the classical SMOTE [12], Borderline-SMOTE [13], and ADASYN [14] algorithms through experiments on imbalanced datasets from public repositories. The experimental results show that the proposed method can effectively deal with the problems faced by traditional algorithms and improve the effectiveness of classification algorithms on imbalanced datasets. This paper verifies the feasibility of improving oversampling algorithms by optimizing the interpolation process.

2. Preliminaries

2.1. SMOTE Algorithm

In order to overcome the overfitting caused by duplicating minority class samples in random oversampling, Chawla [12] proposed the SMOTE algorithm in 2002. The basic principle of the algorithm is to find neighbors within the minority class samples and interpolate between them so as to generate new minority class samples. The specific steps are as follows: assuming an oversampling rate of n, for each minority sample, n neighboring points are selected from its k nearest neighbors of the same minority class, and a new minority sample is synthesized between the two selected minority samples according to (1):

p = x_i + random(0, 1) × (y_ij − x_i)    (1)

where random(0, 1) represents a random value within the interval (0, 1), x_i is the selected minority sample, and y_ij is the selected neighbor.
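As an illustration, a minimal Python sketch of this interpolation step is given below. It is a simplified reading of (1), not the reference implementation; the neighbor search via scikit-learn's NearestNeighbors is an assumption made for brevity.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_per_sample, k=5, seed=0):
    # X_min: minority samples, one row per sample
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1 because each point is its own nearest neighbor
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for i, x in enumerate(X_min):
        for j in rng.choice(idx[i, 1:], size=n_per_sample):   # pick n neighbors among the k nearest
            y = X_min[j]
            synthetic.append(x + rng.uniform(0.0, 1.0) * (y - x))   # Equation (1)
    return np.vstack(synthetic)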
Combining the newly generated minority samples with the original sample set greatly reduces the imbalance rate of the resulting dataset and makes subsequent classification algorithms more effective. The SMOTE algorithm is widely used in many fields. However, it also has some problems:
First, the newly synthesized samples may be noise samples. From a geometric point of view, if the selected sample x_i or its neighboring point y_ij falls into the majority class region and is surrounded by majority class samples, the generated new sample is likely to become a noise sample that does not carry the characteristics of the minority class and interferes with the accuracy of the classification algorithm.
Second, the problem of fuzzy boundaries arises. The SMOTE algorithm ignores the boundary features of imbalanced sample sets. Too many new samples can be generated at locations far from the class boundaries, which may significantly affect the performance of the classification algorithm and blur the boundary between the two classes.
Therefore, although the SMOTE algorithm makes up for the deficiencies of random oversampling, it has a certain blindness in generating new data because it does not consider the distribution characteristics of the minority samples within the overall sample set.

2.2. KNN Algorithm

The KNN algorithm [18] is a simple and powerful supervised machine learning algorithm that can be applied to classification problems. Its core idea is the following: if most of the k nearest neighbors of a sample in the feature space belong to a certain class, then the sample also belongs to this class. When computing the nearest neighbors of a sample, the Euclidean distance is usually chosen to measure the distance between two samples. To classify a sample, we calculate the distance between this sample and every sample in the dataset, find the k closest samples, count the class labels of these k neighbors, and assign the new sample to the class that occurs most often.
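The procedure can be sketched in a few lines of Python (a NumPy-only illustration of the idea above, not the classifier used in the experiments):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # Euclidean distance from the query point x to every training sample
    dists = np.linalg.norm(X_train - x, axis=1)
    # indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # majority vote among the k neighbor labels
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]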

3. GI-SMOTE Algorithm

In order to make the newly generated samples carry more of the information represented by the minority class samples, this paper proposes a reclassification SMOTE algorithm combined with the probability features of the Gaussian distribution. On the one hand, the idea of the KNN algorithm is used to judge and classify each minority class sample according to the types of its neighbors in the overall dataset, and the isolated minority class samples that fall inside the majority class region are filtered out to avoid generating noise samples from them, solving the first problem of the SMOTE algorithm described above. On the other hand, special treatment is given to minority class samples that have a higher proportion of majority class samples among their nearest neighbors. Using the Gaussian distribution, instead of the uniform random distribution, for interpolation emphasizes the class boundary features while generating fewer samples that may interfere with classification performance, thereby addressing the second problem of the SMOTE algorithm mentioned above.

3.1. Classification Stage

In dataset D, X = {x_1, x_2, ..., x_n1} denotes the minority sample set, Y = {y_1, y_2, ..., y_n2} denotes the majority sample set, and the numbers of minority and majority samples are n_1 and n_2, respectively.
At this stage (Algorithm 1), we classify the minority class samples separately. The distances between each minority class sample x_i and all sample points are calculated, and the KNN algorithm is used to find the k_1 nearest neighbors of x_i. The value of k_1 can be adjusted according to the distribution characteristics of the dataset. The number of minority class samples among the k_1 nearest neighbors is denoted m_1 and the number of majority class samples m_2, so that k_1 = m_1 + m_2.
According to the ratio of m_1 to m_2, each minority class sample is assigned to one of three categories, denoted “Noise”, “Danger”, and “Safe”. If m_1 = 0, the sample point is completely surrounded by majority class samples and is classified as “Noise”. If m_1 > m_2, the sample point has a small probability of being misclassified and is regarded as “Safe”. If 0 < m_1 ≤ m_2, the sample has a non-negligible probability of being misclassified as a majority class sample and is regarded as “Danger”. The classification strategy is shown in Figure 1.
Algorithm 1 Pseudo Code for Classification Stage

Input: Minority sample set X; majority sample set Y; n_1 = number of minority samples; n_2 = number of majority samples; k_1 = number of nearest neighbors for each minority sample; n = number of sample features
Output: Noise sample set D_Noise; Danger sample set D_Danger; Safe sample set D_Safe

1: D ← X ∪ Y
2: N ← size(D)
3: i ← 1
4: while i ≤ n_1 do
5:     j ← 1
6:     while j ≤ N do
7:         dist_ij ← √( Σ_{m=1}^{n} (x_im − s_jm)² )        ▹ s_jm is the m-th feature of sample s_j in D
8:         j ← j + 1
9:     end while
10:    get the k_1 nearest neighbors of x_i by sorting dist_i
11:    i ← i + 1
12: end while
13: i ← 1
14: while i ≤ n_1 do
15:     m_1 ← the number of minority samples among the k_1 nearest neighbors of x_i
16:     m_2 ← the number of majority samples among the k_1 nearest neighbors of x_i
17:     if m_1 == 0 then
18:         add x_i to D_Noise
19:     else if m_1 > m_2 then
20:         add x_i to D_Safe
21:     else                                                 ▹ 0 < m_1 ≤ m_2
22:         add x_i to D_Danger
23:     end if
24:     i ← i + 1
25: end while
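A compact Python sketch of this classification stage is shown below. It follows Algorithm 1 but, as an assumption made for brevity, delegates the distance computation and sorting to scikit-learn's NearestNeighbors instead of the explicit loops; the function name classify_minority is illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def classify_minority(X_min, X_maj, k1=5):
    # Stack minority and majority samples and remember which rows are minority
    D = np.vstack([X_min, X_maj])
    is_minority = np.array([True] * len(X_min) + [False] * len(X_maj))
    # k1 + 1 neighbors because each minority sample is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k1 + 1).fit(D)
    _, idx = nn.kneighbors(X_min)
    noise, danger, safe = [], [], []
    for i, neighbors in enumerate(idx[:, 1:]):        # drop the sample itself
        m1 = int(is_minority[neighbors].sum())        # minority neighbors
        m2 = k1 - m1                                  # majority neighbors
        if m1 == 0:
            noise.append(X_min[i])                    # "Noise": surrounded by the majority class
        elif m1 > m2:
            safe.append(X_min[i])                     # "Safe": small risk of misclassification
        else:
            danger.append(X_min[i])                   # "Danger": 0 < m1 <= m2
    as_array = lambda rows: np.array(rows).reshape(-1, X_min.shape[1])
    return as_array(noise), as_array(danger), as_array(safe)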

3.2. Data Synthesis Stage

In order to prevent the newly synthesized samples from carrying interfering information, the improved oversampling method does not use “Noise” samples in the data synthesis stage. The “Danger” and “Safe” samples are combined into a new minority sample set (D_syn in Algorithm 2) for the interpolation process. For each “Danger” sample point, k_2 nearest neighbors are selected from this set, and new samples are synthesized using different strategies according to the type of the chosen nearest neighbor.
At this stage (Algorithm 2), keeping the requirement of the SMOTE algorithm that the interpolation parameter random(0, 1) lies within the interval (0, 1), the idea of the Gaussian distribution is introduced: a gauss operator with the randomness characteristics of a Gaussian distribution replaces the uniformly distributed random operator when synthesizing new samples.
If the neighbor sample y_ij corresponding to the “Danger” sample x_i belongs to “Safe”, then, in order to give the newly synthesized point more of the attribute information of the “Danger” sample, the new sample is synthesized by interpolation using (2):
p = x_i + gauss_1(0, 1) × (y_ij − x_i)    (2)
In the above formula, gauss_1(0, 1) is a random number generated in the interval (0, 1) from a Gaussian distribution with mean 0 and standard deviation σ_1. The mean is set to 0 because new samples should be generated as close to the “Danger” sample as possible. According to the properties of the Gaussian distribution, P(μ − σ ≤ x ≤ μ + σ) ≈ 0.6826, so the probability of gauss_1(0, 1) taking a value in the interval (0, σ_1) is also approximately 0.6826.
Algorithm 2 Pseudo Code for Data Synthesis Stage

Input: Safe sample set D_Safe; Danger sample set D_Danger; standard deviations σ_1 and σ_2; k_2 = number of nearest neighbors for each Danger sample
Output: A new synthetic sample p

1: D_syn ← D_Safe ∪ D_Danger
2: get x_i from D_Danger                                   ▹ i is a subscript of samples
3: get y_ij from the k_2 nearest neighbors of x_i, with y_ij ∈ D_syn        ▹ j is a subscript of the nearest neighbors of x_i
4: gauss_1 ← 0
5: gauss_2 ← 0
6: if y_ij ∈ D_Safe then
7:     while gauss_1 ≤ 0 or gauss_1 ≥ 1 do
8:         gauss_1 ← random.gauss(0, σ_1)
9:     end while
10:    p ← x_i + gauss_1 × (y_ij − x_i)
11: else if y_ij ∈ D_Danger then
12:    while gauss_2 ≤ 0 or gauss_2 ≥ 1 do
13:        a ← random.randint(0, 1)
14:        if a == 0 then
15:            gauss_2 ← random.gauss(0, σ_2)
16:        else
17:            gauss_2 ← random.gauss(1, σ_2)
18:        end if
19:    end while
20:    p ← x_i + gauss_2 × (y_ij − x_i)
21: end if
If the nearest neighbor sample y_ij corresponding to the “Danger” sample x_i also belongs to “Danger”, the new sample is synthesized by interpolation using (3):

p = x_i + gauss_2(0, 1) × (y_ij − x_i)    (3)

where gauss_2(0, 1) is drawn from a mixture of two Gaussian distributions, one with mean 0 and the other with mean 1, both with standard deviation σ_2 (generally σ_2 < σ_1); each of the two distributions is selected with probability 1/2. In this way, the newly synthesized data lie closer to one of the two “Danger” samples and thus carry less disturbing information.
The newly synthesized minority samples are combined with the initial samples to obtain a new balanced dataset D_new.
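A minimal Python sketch of this synthesis stage is given below, assuming the Danger and Safe sets produced by the classification sketch above; the function and parameter names (synthesize, n_new) are illustrative, and the rejection loop mirrors the truncation to (0, 1) in Algorithm 2.

import random
import numpy as np
from sklearn.neighbors import NearestNeighbors

def truncated_gauss(mu, sigma):
    # Redraw until the value falls strictly inside (0, 1), as in Algorithm 2
    g = random.gauss(mu, sigma)
    while g <= 0 or g >= 1:
        g = random.gauss(mu, sigma)
    return g

def synthesize(D_danger, D_safe, n_new, k2=5, sigma1=0.4, sigma2=0.25):
    D_syn = np.vstack([D_safe, D_danger])          # rows 0..len(D_safe)-1 are "Safe"
    n_safe = len(D_safe)
    nn = NearestNeighbors(n_neighbors=k2 + 1).fit(D_syn)
    new_samples = []
    for _ in range(n_new):
        x = D_danger[random.randrange(len(D_danger))]      # pick a "Danger" sample
        _, idx = nn.kneighbors(x.reshape(1, -1))
        j = int(random.choice(idx[0, 1:]))                 # one of its k2 nearest neighbors
        y = D_syn[j]
        if j < n_safe:                                     # neighbor is "Safe": Equation (2)
            g = truncated_gauss(0.0, sigma1)
        else:                                              # neighbor is "Danger": Equation (3)
            g = truncated_gauss(random.randint(0, 1), sigma2)
        new_samples.append(x + g * (y - x))
    return np.array(new_samples)

Appending the returned array to the original training samples then yields the balanced set D_new described above.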

4. Experiments

In order to verify the effectiveness of the improved algorithm, we compared its performance with no-sampling, SMOTE, Borderline-SMOTE, and ADASYN when dealing with imbalanced datasets. A number of representative imbalanced datasets from UCI [19] and KEEL [20] were collected as experimental objects. We divided each dataset into a training set and a test set according to the “80-20 rule” [2] for the comparison experiments: approximately 80% of the available data is used for training and 20% for testing the machine learning models. To overcome the randomness of the synthesis algorithms, the reported results are averages over 10 trials. Following previous oversampling research [14,21,22] and experimental experience, we set k_1 = 5 and k_2 = 5; when the datasets are oversampled with the GI-SMOTE algorithm, the parameter σ_1 is set to 0.4 and σ_2 to 0.25. The values of σ_1 and σ_2 were chosen based on experimental experience. It is worth noting that the optimal standard deviation parameters can vary across datasets, depending on the distribution characteristics of the minority and majority class samples within each dataset.
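The evaluation protocol can be sketched as follows; the oversample callable stands in for GI-SMOTE or any of the baselines (a hypothetical interface, not the exact experimental script), and scikit-learn provides the split and the KNN classifier.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def run_trials(X, y, oversample, n_trials=10, metric=None):
    # Average a metric over repeated 80/20 train-test splits
    scores = []
    for seed in range(n_trials):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=seed)
        X_bal, y_bal = oversample(X_tr, y_tr)      # e.g. GI-SMOTE with k1 = k2 = 5, sigma1 = 0.4, sigma2 = 0.25
        clf = KNeighborsClassifier(n_neighbors=5).fit(X_bal, y_bal)
        y_pred = clf.predict(X_te)
        score = metric(y_te, y_pred) if metric else clf.score(X_te, y_te)
        scores.append(score)
    return float(np.mean(scores))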

4.1. Visual Presentation

In order to show more intuitively the advantages of the GI-SMOTE algorithm in synthesizing new samples, we visualize the data distribution after applying no-sampling, SMOTE, and GI-SMOTE to three different datasets. Since the datasets are high-dimensional, principal component analysis (PCA) [23,24] is used as a dimension-reduction tool to project them into two dimensions [25]. The distributions of these samples are shown in Figure 2, where orange dots represent the minority samples, blue dots represent the majority samples, and green dots represent the synthetic samples. It can be seen that the synthetic samples of the GI-SMOTE method are more compact than those of the SMOTE method, which avoids blurring the class boundary. In particular, for the GI-SMOTE method, fewer new samples appear near the “Safe” minority samples, so the synthetic samples carry more of the information that distinguishes the majority from the minority class. Moreover, for the GI-SMOTE method, no new samples are generated around the noise points.
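A brief sketch of the two-dimensional projection used for Figure 2, assuming scikit-learn's PCA and matplotlib (an illustration of the visualization idea, not the exact plotting script; colors follow the figure caption):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_2d(X_maj, X_min, X_syn, title=""):
    # Fit the projection on the original (majority + minority) data only
    pca = PCA(n_components=2).fit(np.vstack([X_maj, X_min]))
    for X, color, label in [(X_maj, "tab:blue", "majority"),
                            (X_min, "tab:orange", "minority"),
                            (X_syn, "tab:green", "synthetic")]:
        Z = pca.transform(X)
        plt.scatter(Z[:, 0], Z[:, 1], s=10, c=color, label=label)
    plt.title(title)
    plt.legend()
    plt.show()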

4.2. Evaluation Indicator

For imbalanced datasets, the traditional classification accuracy is not an appropriate evaluation indicator [2]. In this paper, three indexes based on the confusion matrix, G-mean, F-measure, and AUC [26], are introduced as the evaluation indicators of this experiment.
The concept of the confusion matrix is shown in Table 1, where TP is the number of positive samples correctly classified by the classifier, TN is the number of negative samples correctly classified by the classifier, FP is the number of negative samples incorrectly classified by the classifier, and FN is the number of positive samples incorrectly classified by the classifier.
G-mean measures the overall classification performance of the classifier on positive and negative samples; the larger the index, the better the classification performance. G-mean is the geometric mean of two metrics, Specificity and Recall [27]. The calculation formulas are shown in (4)–(6):

G-mean = √(Specificity × Recall)    (4)

Specificity = TN / (TN + FP)    (5)

Recall = TP / (TP + FN)    (6)

F-measure reflects the classification accuracy on the positive class samples; the larger the index, the better the classification performance. F-measure combines two metrics, Precision and Recall. The calculation formulas are shown in (7) and (8):

F-measure = (2 × Precision × Recall) / (Precision + Recall)    (7)

Precision = TP / (TP + FP)    (8)

AUC is the area under the receiver operating characteristic (ROC) curve; a larger AUC indicates better classification performance. The ordinate of the ROC curve is the true positive rate TPR, the probability of correctly classifying positive class samples, given in (9). The abscissa is the false positive rate FPR, the probability of misclassifying negative class samples, given in (10).

TPR = TP / (TP + FN)    (9)

FPR = FP / (FP + TN)    (10)
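These indicators can be computed from classifier outputs as in the following sketch (scikit-learn is assumed; the AUC is computed from decision scores rather than hard labels):

import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    # Minority class is taken as the positive class (label 1)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    recall = tp / (tp + fn)                    # Equation (6)
    specificity = tn / (tn + fp)               # Equation (5)
    g_mean = np.sqrt(recall * specificity)     # Equation (4)
    f_measure = f1_score(y_true, y_pred)       # Equations (7) and (8)
    auc = roc_auc_score(y_true, y_score)       # area under the ROC curve
    return g_mean, f_measure, auc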

4.3. Preparation of Data

Representative datasets are selected from the repositories as experimental objects: spambase, ecoli1, glass0, vehicle0, vehicle1, and yeast1. A description of these datasets is given in Table 2. Before the oversampling experiments, the data need to be preprocessed. First, each dataset is labeled according to the majority class and the minority class. Second, we ensure that all samples of the same dataset have a consistent number of features. Finally, the datasets are divided into training sets and test sets.

5. Results and Discussion

Since SMOTE, Borderline-SMOTE, ADASYN, and GI-SMOTE are oversampling algorithms without classification functions of their own, a KNN classifier [28] and an SVM classifier [29] were used, following research [17], to complete the control experiments and obtain the three evaluation indicators G-mean, F-measure, and AUC. The experimental results are shown in Table 3 and Table 4.
From Table 3 and Table 4, it can be concluded that the datasets processed by the GI-SMOTE oversampling algorithm show significant improvements in G-mean, F-measure, and AUC compared with the original datasets. Compared with other oversampling algorithms, the GI-SMOTE algorithm performs better in most cases. The experimental results show that GI-SMOTE achieves the best performance on at least two evaluation indicators on all six public datasets with both the KNN classifier and the SVM classifier. On the public datasets ecoli1, vehicle0, and yeast1, all three evaluation indicators of GI-SMOTE are optimal under both classifiers. Compared with the traditional SMOTE algorithm, GI-SMOTE achieves a 2–8% improvement in the evaluation indicators. On the one hand, the algorithm determines the class boundary between the majority and minority class samples more effectively and thus obtains better performance. On the other hand, the algorithm generalizes better across various datasets and does not exhibit the overfitting that the ADASYN algorithm shows on individual datasets.

6. Conclusions

For the problem of imbalanced dataset processing, which widely exists in actual production and life, this paper analyzes the shortcomings of the traditional SMOTE algorithm at the data level, classifies the minority samples, configures corresponding synthesis schemes, and proposes an improved GI-SMOTE algorithm based on KNN and interpolation process optimization. Experiments on various imbalanced datasets show that the improved algorithm can effectively improve classification performance on imbalanced datasets. By employing this improved algorithm to address imbalanced datasets, there is a notable enhancement in the accuracy of artificial intelligence models when dealing with issues such as behavior prediction, fraud detection, medical diagnosis, and so on.
This paper demonstrates that optimizing the oversampling algorithm by adjusting the interpolation process is feasible, providing new insights for future improvements in oversampling algorithms. However, the standard deviation parameters σ_1 and σ_2 used in the interpolation process need to be set manually in advance, and their optimal values often differ across datasets. A further research direction will be how to determine the most appropriate standard deviation parameters adaptively. In addition, compared with traditional oversampling algorithms, the proposed method requires more computing time; how to reduce the running time of the oversampling algorithm will also be one of the future research directions.

Author Contributions

Conceptualization, Y.C. and J.Z.; methodology, Y.C.; software, Y.C.; validation, Y.C. and C.H.; formal analysis, J.Z. and L.L.; investigation, Y.C.; resources, J.Z.; data curation, Y.C.; writing—original draft preparation, Y.C.; writing—review and editing, J.Z.; visualization, Y.C.; supervision, J.Z. and L.L.; project administration, J.Z. and L.L.; funding acquisition, J.Z. and L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the China National Railway Group Co., Ltd. Technology Research and Development Program Project (No. N2022G048), the Shanghai Science and Technology Commission—“Belt and Road” China-Laos Railway Project International Joint Laboratory (No. 21210750300), and the Shanghai Science and Technology Commission—Research on Key Technologies of Intelligent Operation and Maintenance of Rail Transit (No. 20090503100).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

Author Lihai Liu was employed by the company China Railway Siyuan Survey and Design Group Co. Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
2. Gao, Y.; Liu, Q. An Over Sampling Method of Unbalanced Data Based on Ant Colony Clustering. IEEE Access 2021, 9, 130990–130996.
3. Lin, H.; Hu, N.; Lu, R.; Yuan, T.; Zhao, Z.; Bai, W.; Lin, Q. Fault Diagnosis of a Switch Machine to Prevent High-Speed Railway Accidents Combining Bi-Directional Long Short-Term Memory with the Multiple Learning Classification Based on Associations Model. Machines 2023, 11, 1027.
4. Wan, C.; Lee, L.; Rajkumar, R.; Isa, D. A hybrid text classification approach with low dependency on parameter by integrating K-nearest neighbor and support vector machine. Expert Syst. Appl. 2012, 15, 11880–11888.
5. Zhang, N.; Niu, M.; Wan, F.; Lu, J.; Wang, Y.; Yan, X.; Zhou, C. Hazard Prediction of Water Inrush in Water-Rich Tunnels Based on Random Forest Algorithm. Appl. Sci. 2024, 14, 867.
6. Li, Y.; Wang, C.; Liu, Y. Classification of Coal Bursting Liability Based on Support Vector Machine and Imbalanced Sample Set. Minerals 2023, 13, 15.
7. Jason, V.H.; Taghi, K. Knowledge discovery from imbalanced and noisy data. Data Knowl. Eng. 2009, 68, 1513–1542.
8. Lu, H.; Vaidya, J.; Atluri, V.; Hong, Y. Constraint-Aware Role Mining via Extended Boolean Matrix Decomposition. IEEE Trans. Dependable Secur. Comput. 2012, 9, 655–669.
9. Huang, Y. Cost-sensitive incremental Classification under the MapReduce framework for Mining Imbalanced Massive Data Streams. J. Discret. Math. Sci. Cryptogr. 2015, 18, 177–194.
10. Schapire, R.E. A brief introduction to boosting. IJCAI 1999, 99, 1401–1406.
11. Zhu, W.; Zhong, P. A new one-class SVM based on hidden information. Knowl.-Based Syst. 2014, 60, 35–43.
12. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
13. Han, H.; Wang, W.; Mao, B. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005; pp. 878–887.
14. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328.
15. Maldonado, S.; Vairetti, C.; Fernandez, A.; Herrera, F. FW-SMOTE: A feature-weighted oversampling approach for imbalanced classification. Pattern Recognit. 2022, 124, 108511.
16. Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. Safe-Level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Proceedings of the Advances in Knowledge Discovery and Data Mining: 13th Pacific-Asia Conference, Bangkok, Thailand, 27–30 April 2009; pp. 475–482.
17. Yi, X.; Xu, Y.; Hu, Q.; Krishnamoorthy, S.; Li, W.; Tang, Z. ASN-SMOTE: A synthetic minority oversampling method with adaptive qualified synthesizer selection. Complex Intell. Syst. 2022, 8, 2247–2272.
18. Hwang, W.J.; Wen, K.W. Fast kNN classification algorithm based on partial distance search. Electron. Lett. 1998, 34, 2062–2063.
19. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/datasets (accessed on 10 June 2023).
20. Alcalá-Fdez, J.; Fernández, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L.; Herrera, F. KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. J. Mult.-Valued Log. Soft Comput. 2011, 17, 255–287.
21. Pradipta, G.A.; Wardoyo, R.; Musdholifah, A.; Sanjaya, I.N.H. Radius-SMOTE: A New Oversampling Technique of Minority Samples Based on Radius Distance for Learning From Imbalanced Data. IEEE Access 2021, 9, 74763–74777.
22. Naseriparsa, M.; Al-Shammari, A.; Sheng, M.; Zhang, Y.; Zhou, R. RSMOTE: Improving classification performance over imbalanced medical datasets. Health Inf. Sci. Syst. 2020, 8, 22.
23. Moore, B. Principal component analysis in linear systems: Controllability, observability, and model reduction. IEEE Trans. Autom. Control 1981, 26, 17–32.
24. Burohman, A.M.; Besselink, B.; Scherpen, J.M.A.; Camlibel, M.K. From Data to Reduced-Order Models via Generalized Balanced Truncation. IEEE Trans. Autom. Control 2023, 68, 6160–6175.
25. Bao, Y.; Yang, S. Two Novel SMOTE Methods for Solving Imbalanced Classification Problems. IEEE Access 2023, 11, 5816–5823.
26. Su, C.; Chen, L.; Yih, Y. Knowledge acquisition through information granulation for imbalanced data. Expert Syst. Appl. 2006, 31, 531–541.
27. Zhang, Z.; Li, J. Synthetic Minority Oversampling Technique Based on Adaptive Local Mean Vectors and Improved Differential Evolution. IEEE Access 2022, 10, 74045–74058.
28. Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
29. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
Figure 1. Classification of minority class samples.
Figure 2. Discrete point models based on three oversampling methods in two dimensions. Orange dots denote minority samples, blue dots denote majority samples, and green dots are new synthetic samples.
Table 1. Confusion matrix.

                         Predicted Positive Class    Predicted Negative Class
Actual Positive Class    TP                          FN
Actual Negative Class    FP                          TN
Table 2. Dataset characteristics.

Name        Instances    Features    Imbalance Ratio (IR)
spambase    4601         57          1.54
ecoli1      336          7           3.36
glass0      214          9           2.06
vehicle0    846          18          3.25
vehicle1    846          18          2.90
yeast1      1484         8           2.46
Table 3. Comparison between the GI-SMOTE algorithm and other oversampling algorithms with classification algorithm of KNN.

Dataset     Algorithm           G-Mean    F-Measure    AUC
spambase    No-sampling         0.827     0.570        0.684
            SMOTE               0.844     0.622        0.713
            Borderline-SMOTE    0.858     0.652        0.737
            ADASYN              0.874     0.683        0.765
            GI-SMOTE            0.886     0.708        0.786
ecoli1      No-sampling         0.892     0.713        0.795
            SMOTE               0.900     0.727        0.810
            Borderline-SMOTE    0.892     0.713        0.795
            ADASYN              0.906     0.737        0.820
            GI-SMOTE            0.911     0.946        0.930
glass0      No-sampling         0.837     0.588        0.700
            SMOTE               0.901     0.750        0.775
            Borderline-SMOTE    0.880     0.700        0.775
            ADASYN              0.901     0.750        0.812
            GI-SMOTE            0.935     0.829        0.875
vehicle0    No-sampling         0.925     0.804        0.855
            SMOTE               0.943     0.832        0.890
            Borderline-SMOTE    0.933     0.807        0.870
            ADASYN              0.949     0.838        0.900
            GI-SMOTE            0.949     0.842        0.900
vehicle1    No-sampling         0.865     0.634        0.748
            SMOTE               0.913     0.705        0.833
            Borderline-SMOTE    0.910     0.708        0.829
            ADASYN              0.911     0.746        0.830
            GI-SMOTE            0.922     0.715        0.849
yeast1      No-sampling         0.811     0.548        0.657
            SMOTE               0.819     0.586        0.670
            Borderline-SMOTE    0.829     0.613        0.688
            ADASYN              0.876     0.685        0.729
            GI-SMOTE            0.896     0.729        0.802
Table 4. Comparison between the GI-SMOTE algorithm and other oversampling algorithms with classification algorithm of SVM.

Dataset     Algorithm           G-Mean    F-Measure    AUC
spambase    No-sampling         0.955     0.881        0.913
            SMOTE               0.962     0.922        0.925
            Borderline-SMOTE    0.966     0.903        0.933
            ADASYN              0.937     0.829        0.878
            GI-SMOTE            0.970     0.916        0.940
ecoli1      No-sampling         0.728     0.481        0.530
            SMOTE               0.728     0.481        0.530
            Borderline-SMOTE    0.762     0.519        0.580
            ADASYN              0.734     0.506        0.539
            GI-SMOTE            0.768     0.532        0.590
glass0      No-sampling         0.775     0.333        0.600
            SMOTE               0.949     0.889        0.900
            Borderline-SMOTE    0.955     0.878        0.913
            ADASYN              0.955     0.895        0.912
            GI-SMOTE            0.955     0.895        0.912
vehicle0    No-sampling         0.975     0.931        0.950
            SMOTE               0.980     0.941        0.960
            Borderline-SMOTE    0.981     0.946        0.962
            ADASYN              0.985     0.951        0.970
            GI-SMOTE            0.987     0.970        0.975
vehicle1    No-sampling         0.865     0.632        0.748
            SMOTE               0.901     0.688        0.812
            Borderline-SMOTE    0.891     0.671        0.793
            ADASYN              0.903     0.739        0.815
            GI-SMOTE            0.911     0.704        0.831
yeast1      No-sampling         0.781     0.416        0.610
            SMOTE               0.815     0.585        0.665
            Borderline-SMOTE    0.814     0.594        0.662
            ADASYN              0.835     0.614        0.697
            GI-SMOTE            0.846     0.627        0.715