1. Introduction
With the rapid development of artificial intelligence technology, large-scale data processing and analysis are widely used in many fields of production and life, such as fault detection, medical diagnosis, and malicious script identification. However, because actual data are difficult to sample and fault samples are scarce, the data used for fault prediction and diagnosis often suffer from class imbalance [1]. An imbalanced dataset is one in which the numbers of samples in the different classes differ greatly: the class with significantly fewer samples is called the minority (positive) class, and the class with many samples is called the majority (negative) class [2]. Because of this large disparity in class sizes, the imbalance problem is inherently asymmetric, and the asymmetric distribution of samples across classes can significantly degrade the performance of artificial intelligence algorithms in tasks such as fault diagnosis and fault prediction. When applied to imbalanced datasets, such algorithms tend to make biased judgments that favor the majority class, resulting in poor performance on the minority class. Yet it is often the minority class samples that carry the crucial information. For example, in the monitoring and maintenance of railway signal equipment, equipment faults are rare events, but if a fault is misjudged as no fault, the best maintenance window may be missed, leading to significant losses of life and property [3].
Mainstream classification methods, such as support vector machines, random forests, and K-nearest neighbors, are designed to maximize overall classification accuracy. On imbalanced data, this often yields high accuracy on the majority class but low accuracy on the minority class, which degrades the overall performance of the classifier [4,5,6]. How to improve classifier performance on imbalanced data has therefore become a research hotspot in machine learning and data mining [7,8].
There are two main approaches to imbalanced dataset classification: algorithm-level improvement and data-level improvement.
In algorithm-level improvement, the characteristics of imbalanced datasets are analyzed, and traditional classification methods are modified, or new classification algorithms are designed, to meet the performance requirements of imbalanced data classification. Common approaches include cost-sensitive learning [9], ensemble learning [10], and one-class learning [11].
In data-level improvement, the general idea is to change the data distribution, transforming the imbalanced dataset into one with a smaller imbalance ratio through oversampling or undersampling. Oversampling increases the number of minority class samples by specific methods, so that together with the original majority class samples they form a new, relatively balanced dataset. Undersampling does the opposite: it discards a certain number of majority class samples, so that the remaining majority class samples together with the minority class samples form a new, relatively balanced dataset.
Data-level improvements offer a convenient and flexible way to enhance classification performance on imbalanced datasets. Compared with undersampling, oversampling retains all of the original data and offers greater applicability and reliability, so most researchers choose to increase the number of minority class samples. Oversampling ameliorates the imbalance of the dataset, and training an artificial intelligence model on the rebalanced dataset can greatly improve its accuracy on diagnosis and prediction problems.
The most representative and widely used oversampling algorithm is the Synthetic Minority Oversampling Technique (SMOTE) [12]. However, because SMOTE does not further screen the minority samples, it synthesizes new samples somewhat blindly and cannot effectively amplify the boundary features between the majority and minority classes. Many scholars have therefore improved the algorithm. Han [13] proposed the Borderline-SMOTE algorithm, which oversamples only among the minority class samples at the decision boundary and does not further analyze the other types of minority class samples. He [14] introduced the idea of adaptivity into oversampling and proposed the ADASYN algorithm, which assigns higher weights to minority samples whose neighborhoods contain more majority class samples; however, this also amplifies the possibility of generating noise samples. Another weighting-based approach is Sebastián's FW-SMOTE [15]. Assuming that not all features are equally important, it uses a weighted Minkowski distance to define the neighborhood of each minority class instance, prioritizing the features most relevant to the classification task in an attempt to improve predictive performance on both low- and high-dimensional datasets. However, the method may not be effective when its assumptions do not hold.
Other improvement strategies focus on ensuring the safety of the newly generated samples. Bunkhumpornpat [16] proposed the Safe-Level-SMOTE method, which rates the safety of each minority class instance individually; by considering the safe-level ratio of instances, it prevents synthetic samples from being generated in inappropriate positions. The ASN-SMOTE method proposed by Yi [17] introduces a noise-filtering step and an adaptive neighbor selection mechanism, effectively avoiding the generation of synthetic minority instances inside majority class regions. Generating new minority class samples within a safer range avoids interference from synthetic samples, but in practice, adding many minority class samples to regions where minority samples are already dense provides little meaningful information to the classifier.
To address the above problems, this paper proposes an enhanced reclassification SMOTE algorithm, Gaussian interpolation SMOTE (GI-SMOTE), to obtain more effective synthetic balanced datasets. The method stems from the idea of optimizing the interpolation process: it leverages the properties of the Gaussian distribution to bias synthetic samples toward the class boundaries, helping the classifier make more precise decisions there. First, the K-Nearest Neighbor (KNN) [18] algorithm is used to partition the minority samples and select minority class samples of characteristic types for subsequent processing. Then, during interpolation, different data generation strategies are chosen according to the pairwise combination of minority sample types, and the simple uniform random distribution of SMOTE is replaced by a Gaussian probability distribution for specific combinations. The proposed algorithm is compared with the classical SMOTE [12], Borderline-SMOTE [13], and ADASYN [14] algorithms through experiments on imbalanced benchmark datasets. The experimental results show that the proposed method effectively addresses the problems faced by the traditional algorithms and improves the effectiveness of classification algorithms on imbalanced datasets, verifying the feasibility of improving oversampling algorithms by optimizing the interpolation process.
3. GI-SMOTE Algorithm
In order to make the newly generated samples contain more of the information represented by the minority class samples, this paper proposes a reclassification SMOTE algorithm that incorporates the probability characteristics of the Gaussian distribution. On the one hand, the idea of the KNN algorithm is used to classify each minority class sample according to the types of its neighbors in the overall dataset, and the isolated minority class samples that fall among the majority class samples are filtered out. This avoids generating noise signals from such samples and solves the first problem of the SMOTE algorithm described above. On the other hand, special treatment is given to minority class samples whose nearest neighbors contain a higher proportion of majority class samples. Using a Gaussian distribution instead of a uniform random distribution for interpolation emphasizes the class boundary features while generating fewer samples that could interfere with classification performance, thereby addressing the second problem of the SMOTE algorithm.
3.1. Classification Stage
In dataset $D$, $X$ is used to denote the minority sample set, $Y$ the majority sample set, and the numbers of minority and majority samples are $m$ and $n$, respectively.
At this stage (Algorithm 1), we classify the minority class samples separately. The distances between each minority class sample and all sample points are calculated, and the KNN algorithm is used to find the $k$ nearest neighbors of each sample $x_i$. The value of $k$ can be adjusted according to the distribution characteristics of the dataset. The number of minority class samples among the $k$ nearest neighbors is denoted $k'$, and the number of majority class samples is denoted $k''$, so that $k' + k'' = k$.
According to the ratio of $k''$ to $k$, each minority class sample is divided into three categories, denoted "Noise", "Danger", and "Safe". If $k'' = k$, the sample point is completely surrounded by majority class samples and is classified as "Noise". If $0 \le k'' < k/2$, the sample point has only a small probability of being misclassified and is regarded as "Safe". If $k/2 \le k'' < k$, the sample has a non-negligible probability of being misclassified as a majority class sample and is regarded as "Danger". The classification strategy is shown in Figure 1.
Algorithm 1 Pseudo Code for the Classification Stage

Input: minority sample set X; majority sample set Y; number of minority samples m; number of majority samples n; number of nearest neighbors k for each minority sample; number of sample features f
Output: Noise sample set X_noise; Danger sample set X_danger; Safe sample set X_safe

 1: X_noise ← ∅; X_danger ← ∅; X_safe ← ∅
 2: D ← X ∪ Y
 3: i ← 1
 4: while i ≤ m do
 5:     j ← 1
 6:     while j ≤ m + n do
 7:         d_ij ← distance(x_i, s_j)    ▹ s_j is a sample in D
 8:         j ← j + 1
 9:     end while
10:     get the k nearest neighbors of x_i by sorting the distances d_ij
11:     i ← i + 1
12: end while
13: i ← 1
14: while i ≤ m do
15:     k′ ← the number of minority samples among the k nearest neighbors of x_i
16:     k″ ← the number of majority samples among the k nearest neighbors of x_i
17:     if k″ = k then
18:         X_noise ← X_noise ∪ {x_i}
19:     else if 0 ≤ k″ < k/2 then
20:         X_safe ← X_safe ∪ {x_i}
21:     else if k/2 ≤ k″ < k then
22:         X_danger ← X_danger ∪ {x_i}
23:     end if
24:     i ← i + 1
25: end while
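For concreteness, the following is a minimal Python sketch of the classification stage under stated assumptions: NumPy arrays, Euclidean distance, and the Noise/Safe/Danger thresholds described above. The function name `classify_minority` and its defaults are illustrative, not part of the original pseudocode.

```python
import numpy as np

def classify_minority(X_min, X_maj, k=5):
    """Split minority samples into Noise / Danger / Safe sets (sketch of Algorithm 1)."""
    D = np.vstack([X_min, X_maj])
    # label 1 marks a minority sample, 0 a majority sample
    labels = np.array([1] * len(X_min) + [0] * len(X_maj))
    noise, danger, safe = [], [], []
    for i, x in enumerate(X_min):
        dist = np.linalg.norm(D - x, axis=1)   # distances to every sample in D
        dist[i] = np.inf                       # exclude x itself from its neighbors
        nn = np.argsort(dist)[:k]              # indices of the k nearest neighbors
        k2 = int(np.sum(labels[nn] == 0))      # majority neighbors, k'' in the text
        if k2 == k:
            noise.append(x)                    # completely surrounded by majority samples
        elif k2 < k / 2:
            safe.append(x)                     # small probability of misclassification
        else:
            danger.append(x)                   # near the class boundary
    return np.array(noise), np.array(danger), np.array(safe)
```

Note that the `else` branch places samples with exactly half majority neighbors into the Danger set, matching the $k/2 \le k'' < k$ rule.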
3.2. Data Synthesis Stage
In order to prevent the newly synthesized samples from carrying interference information, the improved oversampling method does not use "Noise" samples in the data synthesis stage. The "Danger" and "Safe" samples are combined into a new minority sample set $X'$ for the interpolation process. For each "Danger" sample point, $k$ nearest neighbors are selected from the sample set $X'$, and new samples are synthesized using different strategies according to the type of the chosen nearest neighbor.
At this stage (Algorithm 2), following the SMOTE requirement that the interpolation parameter take values in the interval (0, 1), the idea of the Gaussian distribution is introduced: an operator with the randomness characteristics of a Gaussian distribution replaces the uniformly distributed random operator when synthesizing new samples.
If the neighbor sample $\hat{x}_{ij}$ corresponding to the "Danger" sample $x_i$ belongs to "Safe", then, in order to make the newly synthesized point carry more attribute information of the "Danger" sample, the new sample is synthesized by interpolation using (2):

$p = x_i + \delta \cdot (\hat{x}_{ij} - x_i)$    (2)

In the above formula, $\delta$ is a random number generated in the interval (0, 1) by a Gaussian distribution with mean 0 and standard deviation $\sigma_1$. The mean of the Gaussian distribution is set to 0 because new samples should be generated as close to the "Danger" sample as possible. According to the characteristics of the Gaussian distribution, $P(\mu - 3\sigma_1 < \delta < \mu + 3\sigma_1) \approx 99.7\%$; since only positive draws are kept, the probability of $\delta$ taking a value in the interval $(0, 3\sigma_1)$ is also approximately $99.7\%$.
Algorithm 2 Pseudo Code for the Data Synthesis Stage

Input: Safe sample set X_safe; Danger sample set X_danger; standard deviations σ1 and σ2; number of nearest neighbors k for each Danger sample
Output: a new synthetic sample p

 1: X′ ← X_safe ∪ X_danger
 2: get x_i from X_danger    ▹ i is a subscript of samples
 3: get x̂_ij from the k nearest neighbors of x_i in X′    ▹ j is a subscript of the nearest neighbors of x_i
 4: δ ← 0
 5: r ← 0
 6: if x̂_ij ∈ X_safe then
 7:     while δ ≤ 0 or δ ≥ 1 do
 8:         δ ← random.gauss(0, σ1)
 9:     end while
10:     p ← x_i + δ(x̂_ij − x_i)
11: else if x̂_ij ∈ X_danger then
12:     while δ ≤ 0 or δ ≥ 1 do
13:         r ← random.randint(0, 1)
14:         if r = 0 then
15:             δ ← random.gauss(0, σ2)
16:         else if r = 1 then
17:             δ ← random.gauss(1, σ2)
18:         end if
19:     end while
20:     p ← x_i + δ(x̂_ij − x_i)
21: end if
If the nearest neighbor sample $\hat{x}_{ij}$ corresponding to the "Danger" sample $x_i$ also belongs to "Danger", the new sample is synthesized by interpolation using (3):

$p = x_i + \delta' \cdot (\hat{x}_{ij} - x_i)$    (3)

where $\delta'$ is drawn from a mixture of two Gaussian distributions, one with mean 0 and the other with mean 1, both with standard deviation $\sigma_2$. Generally, $\sigma_2$ is set in advance to a small fixed value. The two Gaussian distributions are each adopted with probability 1/2. In this way, the newly synthesized data lie close to one of the two "Danger" samples and thus carry less disturbing information.
The newly synthesized minority samples are combined with the initial samples to obtain a new balanced dataset $D'$.
4. Experiments
In order to verify the effectiveness of the improved algorithm, we compared its performance with no-sampling, SMOTE, Borderline-SMOTE, and ADASYN on imbalanced datasets. A number of representative imbalanced datasets from UCI [19] and KEEL [20] were collected as experimental objects. We divide each dataset into a training set and a test set according to the "80-20 rule" [2] for the comparison experiments; the "80-20 rule" refers to the practice of using approximately 80% of the available data for training and 20% for testing when training and evaluating machine learning models. To overcome the randomness of the synthesis algorithms, the results below are averages over 10 trials. In this experiment, the values of $k$ and the oversampling amount were set according to prior oversampling research [14,21,22] and experimental experience, and fixed values of the standard deviation parameters $\sigma_1$ and $\sigma_2$ were used when oversampling these datasets with the GI-SMOTE algorithm. The values of $\sigma_1$ and $\sigma_2$ were determined subjectively based on experimental experience. It is worth noting that the optimal standard deviation parameters can vary across datasets, depending on the distribution characteristics of the minority and majority class samples within each dataset.
4.1. Visual Presentation
In order to show more intuitively the advantages of the GI-SMOTE algorithm in synthesizing new samples, we visualize the data distributions produced by no-sampling, SMOTE, and GI-SMOTE on three different datasets. Since the datasets themselves are high-dimensional, principal component analysis (PCA) [23,24] is used as a dimension-reduction tool to project their distributions into two dimensions [25]. The distributions of these samples are shown in Figure 2, where orange dots represent minority samples, blue dots represent majority samples, and green dots represent synthetic samples. The new synthetic samples of the GI-SMOTE method are more compact than those of the SMOTE method, which avoids blurring the class boundary. In particular, GI-SMOTE generates fewer new samples near the "Safe" minority samples, so its synthetic samples carry more of the information that differentiates the majority and minority classes. Moreover, GI-SMOTE generates no new samples around the noise points.
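A minimal version of this visualization, assuming scikit-learn's PCA and matplotlib, with the colors matching the description of Figure 2:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_distribution(X_maj, X_min, X_syn, title):
    """Project majority, minority, and synthetic samples to 2D and scatter-plot them."""
    pca = PCA(n_components=2).fit(np.vstack([X_maj, X_min]))  # fit on original data only
    for data, color, label in [(X_maj, "tab:blue", "majority"),
                               (X_min, "tab:orange", "minority"),
                               (X_syn, "tab:green", "synthetic")]:
        pts = pca.transform(data)
        plt.scatter(pts[:, 0], pts[:, 1], s=10, c=color, label=label)
    plt.title(title)
    plt.legend()
    plt.show()
```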
4.2. Evaluation Indicator
For imbalanced datasets, the traditional accuracy metric for classification results is not applicable [2]. In this paper, three indexes based on the confusion matrix, G-mean, F-measure, and AUC [26], are introduced as the evaluation indicators of this experiment.
The concept of the confusion matrix is shown in Table 1, where TP is the number of positive samples correctly classified by the classifier, TN is the number of negative samples correctly classified, FP is the number of negative samples incorrectly classified as positive, and FN is the number of positive samples incorrectly classified as negative.
G-mean measures the overall classification accuracy of the classifier on both positive and negative samples; the larger the index, the better the classification performance. G-mean is composed of two metrics, the true positive rate $TPR$ and the true negative rate $TNR$ [27]. The calculation formulas are shown in (4)–(6):

$TPR = \dfrac{TP}{TP + FN}$    (4)

$TNR = \dfrac{TN}{TN + FP}$    (5)

$\text{G-mean} = \sqrt{TPR \times TNR}$    (6)
F-measure measures the classification accuracy on the positive class samples; the larger the index, the better the classification performance. F-measure is composed of two metrics, $Precision$ and $Recall$ (where $Recall = TPR$). The calculation formulas are shown in (7) and (8):

$Precision = \dfrac{TP}{TP + FP}$    (7)

$\text{F-measure} = \dfrac{2 \times Precision \times Recall}{Precision + Recall}$    (8)
AUC is the area under the receiver operating characteristic (ROC) curve; a larger AUC indicates better classification performance. The ordinate of the ROC curve is TPR, the probability that positive class samples are correctly classified, calculated as in (9); the abscissa is FPR, the probability that negative class samples are misclassified, calculated as in (10):

$TPR = \dfrac{TP}{TP + FN}$    (9)

$FPR = \dfrac{FP}{FP + TN}$    (10)
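These three indicators can be computed directly from a classifier's predictions; the following is a sketch using scikit-learn, assuming the minority class is labeled 1:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

def imbalance_metrics(y_true, y_pred, y_score):
    """Return (G-mean, F-measure, AUC) for a binary problem with minority class 1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)           # Eqs. (4)/(9): recall on the positive class
    tnr = tn / (tn + fp)           # Eq. (5): specificity on the negative class
    g_mean = np.sqrt(tpr * tnr)    # Eq. (6)
    f_measure = f1_score(y_true, y_pred)    # Eqs. (7)-(8)
    auc = roc_auc_score(y_true, y_score)    # Eqs. (9)-(10) via the ROC curve
    return g_mean, f_measure, auc
```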
4.3. Preparation of Data
Representative datasets are selected from the databases as experimental objects: spambase, ecoli1, glass0, vehicle0, vehicle1, and yeast1; see Table 2 for a description of these datasets. Before performing the oversampling experiments, the data are preprocessed. First, each dataset is divided into a majority class and a minority class, and labels are added accordingly. Second, all samples within the same dataset must have a consistent number of features. Finally, the datasets are divided into training sets and test sets.
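A short sketch of this preprocessing, assuming NumPy arrays and a known minority label (both illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def prepare(X, y, minority_label):
    """Binarize labels (minority -> 1, rest -> 0), check feature consistency,
    and split into training and test sets."""
    assert X.ndim == 2, "all samples in a dataset must share one feature count"
    y_bin = (y == minority_label).astype(int)
    return train_test_split(X, y_bin, test_size=0.2, stratify=y_bin)
```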
5. Results and Discussion
Since SMOTE, Borderline-SMOTE, ADASYN, and GI-SMOTE are oversampling algorithms and have no classification function of their own, a KNN classifier [28] and an SVM classifier [29] were used, following research [17], to complete the control experiments and obtain the three evaluation indicators G-mean, F-measure, and AUC. The experimental results are shown in Table 3 and Table 4.
From Table 3 and Table 4, it can be concluded that the datasets processed by the GI-SMOTE oversampling algorithm yield significant improvements in G-mean, F-measure, and AUC compared with the original datasets, and GI-SMOTE performs better than the other oversampling algorithms in most cases. The experimental results show that GI-SMOTE achieves the best performance on at least two of the evaluation indicators on all six public datasets with both the KNN and SVM classifiers; on ecoli1, vehicle0, and yeast1, all three evaluation indicators of GI-SMOTE are optimal under both classifiers. Compared with the traditional SMOTE algorithm, GI-SMOTE improves the evaluation indicators by roughly 2% or more. On the one hand, the algorithm determines the class boundary between the majority and minority class samples more effectively and thus obtains better performance. On the other hand, the algorithm generalizes better across datasets and does not exhibit the overfitting that the ADASYN algorithm shows on individual datasets.
6. Conclusions
For the widespread problem of imbalanced dataset processing in real production and life, this paper analyzes the performance shortcomings of the traditional SMOTE algorithm at the data level, classifies the minority samples and configures a corresponding synthesis scheme for each type, and proposes an improved GI-SMOTE algorithm based on KNN and on optimizing the interpolation process. Experiments on various imbalanced datasets show that the improved algorithm can effectively improve classification performance on imbalanced datasets. Employing this improved algorithm to address imbalanced datasets leads to a notable improvement in the accuracy of artificial intelligence models on problems such as behavior prediction, fraud detection, and medical diagnosis.
This paper demonstrates that optimizing an oversampling algorithm by adjusting its interpolation process is feasible, providing new insights for future improvements of oversampling algorithms. However, the standard deviation parameters $\sigma_1$ and $\sigma_2$ used in the interpolation process must be set manually in advance, and their optimal values differ across datasets. A further research direction will be how to determine the most appropriate standard deviation parameters adaptively. In addition, compared with traditional oversampling algorithms, the proposed method requires more computing time, so optimizing the running time of the oversampling algorithm will also be one of the future research directions.