Article

Addressing Class Overlap under Imbalanced Distribution: An Improved Method and Two Metrics

Zhuang Li, Jingyan Qin, Xiaotong Zhang and Yadong Wan
1 School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
2 School of Mechanical Engineering, University of Science and Technology Beijing, Beijing 100083, China
3 Shunde Graduate School, University of Science and Technology Beijing, Foshan 528000, China
4 Beijing Advanced Innovation Center for Materials Genome Engineering, University of Science and Technology Beijing, Beijing 100083, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Symmetry 2021, 13(9), 1649; https://doi.org/10.3390/sym13091649
Submission received: 3 August 2021 / Revised: 31 August 2021 / Accepted: 3 September 2021 / Published: 7 September 2021
(This article belongs to the Section Mathematics)

Abstract

Class imbalance, as a phenomenon of asymmetry, adversely affects the performance of most machine learning algorithms, and class overlap is another important factor that degrades classification performance. This paper deals with the two factors simultaneously, addressing class overlap under imbalanced distribution. A theoretical analysis of the existing class overlap metrics is first conducted. Then, an improved method and the corresponding metrics for evaluating class overlap under imbalanced distributions are proposed based on this analysis. A well-known collection of imbalanced datasets is used to compare the performance of the different metrics, and the performance is evaluated with the Pearson correlation coefficient and the ξ correlation coefficient. The experimental results demonstrate that the proposed class overlap metrics outperform the compared metrics on the imbalanced datasets, and the Pearson correlation coefficient with the AUC metric of eight algorithms is improved by 34.7488% on average.

1. Introduction

Machine learning has been widely applied to solve problems in various fields. One of the common and important challenges in these applications is classification under imbalanced distribution [1]. Imbalance is encountered in a large number of applications where the samples of interest are rare, such as disease diagnosis, financial fraud detection, and network intrusion detection [2]. The data distributions in these fields are asymmetric in that the number of positive samples of interest is much smaller than the number of negative samples. Most standard classification algorithms are designed under an assumption of symmetry, i.e., a relatively balanced class distribution or equal misclassification costs [3], so their performance degrades to some extent when handling imbalanced problems. Hence, building symmetry into machine learning for data with asymmetric distributions is an important research topic [4]. In [5], a novel class imbalance reduction algorithm is proposed to restore symmetry by considering the distribution properties of the dataset, improving performance in software defect prediction. Many other methods have also been proposed to handle the imbalance problem; see [2,6] for comprehensive reviews.
Besides imbalance, class overlap is another important factor that affects classification performance [7]. The research of Liu et al. [8] demonstrated that a sample is often misclassified if it lies on a class overlapping boundary. Oh [9] proposed the R value, based on the ratio of the overlapping area to the whole dataset, and the experimental results show that the R value is strongly correlated with classification accuracy. Denil and Trappenberg [10] gave a systematic analysis of imbalance and overlap, showing that overlap has a greater influence on classification performance than imbalance in isolation and that performance decreases significantly when overlap and imbalance coexist. Some research has also been conducted on classifying datasets that exhibit both class overlap and imbalance [7,11].
Classification methods for datasets with class overlap and imbalance are important, but so are methods for quantitatively estimating the class overlap level of imbalanced datasets [9]. Such estimates help characterize the datasets and thus guide the design of suitable methods for better classification performance. Klomsae et al. [12] adopted the R value to indicate the expected classification performance of a dataset and proposed a string grammar fuzzy-possibilistic C-medians algorithm to handle overlapping data. Methods based on the R value have also been proposed for feature selection [13,14], feature construction [15], and data sampling [16] to achieve better classification performance. Later, Borsos et al. [17] analyzed the shortcomings of the R value for estimating the overlap level of imbalanced datasets and extended it to the R_aug value. Their experimental results demonstrate that the R_aug value has a stronger correlation with classification performance and that it also performs better for algorithm selection. In addition, some feature selection studies have been conducted based on the R_aug value [18,19].
The R_aug value performs well in addressing class overlap under imbalanced distribution. However, the experimental results in [17] show that the absolute values of the Pearson correlation coefficients between R_aug and the classification performances are lower than 0.7, and the correlation coefficients vary across algorithms. Therefore, both the correlation with classification performance and the generalization ability across different classification algorithms need to be improved. For this purpose, this paper first conducts a theoretical analysis of the existing class overlap metrics and then proposes an improved method to measure class overlap for imbalanced datasets. Based on the proposed method, the R and R_aug values are extended to ImR and ImR_aug for better estimation of the class overlap level of imbalanced datasets. Comparison experiments are conducted on a well-known collection of imbalanced datasets, and eight commonly used classification algorithms are adopted to obtain the classification performance. The performances of the different overlap metrics are evaluated with the Pearson correlation coefficient and the ξ correlation coefficient against the classification performance. The experimental results demonstrate the excellent performance of the proposed metrics, which indicates the superiority of the proposed method.
The contributions of this paper can be summarized as follows:
  • A theoretical analysis of the existing class overlap measure, the R value, is presented.
  • A novel method, along with two metrics for estimating the class overlap of imbalanced datasets, is proposed based on the theoretical analysis.
  • The two proposed class overlap metrics are verified to have higher correlations with the classification performance on imbalanced datasets.
The rest of the paper is organized as follows. The existing overlap metrics, the R and R_aug values, are introduced in Section 2. Section 3 presents a theoretical analysis of the R value; then, an improved method and two corresponding overlap metrics for imbalanced datasets are proposed based on this analysis. Section 4 describes the experiments, including the experimental setup, the adopted datasets, and the performance evaluation. The experimental results and discussions are given in Section 5. Finally, conclusions are drawn in Section 6.

2. The Existing Overlap Metrics

Two metrics have been proposed to estimate the level of class overlap: the R value and the R_aug value. These two metrics are introduced in this section. Before that, some notation is defined as follows:
  • N: the number of samples
  • n: the number of classes
  • C_l: the set of samples belonging to class l, l ∈ [1, n]
  • U: the set of all samples, U = C_1 ∪ ... ∪ C_i ∪ ... ∪ C_n
  • x_i^l: the i-th sample in C_l
  • k-NN(x_i^l, U − C_l): the set of samples among the k nearest neighbors of x_i^l that belong to a class different from C_l
  • λ(z): a 0-1 function that returns 1 when z > 0 and 0 otherwise
  • θ: a threshold value within the range [0, k/2]

2.1. The R Value

The original R value was proposed by Oh [9] based on the assumption that a sample from class C_l overlaps with other classes if at least θ + 1 of its k nearest neighbors belong to a class other than C_l. The R values of class C_l and of the whole dataset f are defined in Equations (1) and (2), respectively:
R(C_l) = \frac{1}{|C_l|} \sum_{i=1}^{|C_l|} \lambda\big( |k\text{-NN}(x_i^l, U - C_l)| - \theta \big)    (1)
R(f) = \frac{1}{|U|} \sum_{l=1}^{n} \sum_{i=1}^{|C_l|} \lambda\big( |k\text{-NN}(x_i^l, U - C_l)| - \theta \big)    (2)
According to this definition, the R value can be interpreted as the ratio of samples lying in the overlapping area, and its range is [0, 1]. Two parameters, k and θ, need to be predefined to calculate the R value. According to [9], the R value is strongly correlated with the accuracy of the Support Vector Machine (SVM), Artificial Neural Network (ANN), and K Nearest Neighbor (KNN) algorithms when k = 7 and θ = 3. With this parameter setting, a sample is considered to be in the overlapping area if at least four of its seven nearest neighbors belong to another class.
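For concreteness, the sketch below shows one way Equations (1) and (2) can be computed with a k-nearest-neighbor search. It assumes a NumPy feature matrix X and an integer label vector y; the use of scikit-learn's NearestNeighbors and the function names are choices of this sketch, not details taken from [9].

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def r_value(X, y, k=7, theta=3):
    """Whole-dataset R value, Equation (2): the fraction of samples whose k
    nearest neighbors contain more than theta samples from other classes."""
    X, y = np.asarray(X), np.asarray(y)
    # k + 1 neighbors, because the query point itself is returned first
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbor_labels = y[idx[:, 1:]]                   # drop the sample itself
    foreign = (neighbor_labels != y[:, None]).sum(axis=1)
    return (foreign - theta > 0).mean()               # lambda(|k-NN| - theta), averaged

def r_value_per_class(X, y, k=7, theta=3):
    """Per-class R values, Equation (1), returned as {class label: R(C_l)}."""
    X, y = np.asarray(X), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    foreign = (y[idx[:, 1:]] != y[:, None]).sum(axis=1)
    return {c: float((foreign[y == c] - theta > 0).mean()) for c in np.unique(y)}
```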

2.2. The R_aug Value

The results in [9] also show that the R value is most strongly correlated with the classification accuracy of the majority class. Borsos et al. [17] evaluated the R value on imbalanced datasets with different imbalance ratios; their experiments showed that the R value remains almost constant while the classification performance decreases as the imbalance ratio (IR) grows. They therefore analyzed the R value in a simple case with only two classes, the majority class and the minority class, denoted as C_N and C_P, respectively. The R value of the whole dataset can then be calculated as in Equation (3):
R = \frac{1}{|C_N| + |C_P|} \left( \sum_{i=1}^{|C_N|} \lambda\big( |k\text{-NN}(x_i^N, C_P)| - \theta \big) + \sum_{i=1}^{|C_P|} \lambda\big( |k\text{-NN}(x_i^P, C_N)| - \theta \big) \right) = \frac{1}{|C_P| + |C_N|} \big( |C_N| \cdot R(C_N) + |C_P| \cdot R(C_P) \big)    (3)
By introducing the imbalance ratio IR = |C_N| / |C_P| into Equation (3), the equation can be simplified to Equation (4):
R = \frac{1}{IR + 1} \big( IR \cdot R(C_N) + R(C_P) \big)    (4)
Since the weight of the majority-class R value is IR, the R value of the whole dataset is dominated by the R value of the majority class when the dataset has a large IR. In fact, a sample in the majority class has a low probability of being recognized as lying in the overlap area, while a sample in the minority class has a high probability. Therefore, the weight of the majority-class R value should be smaller than that of the minority class. Based on this analysis, Borsos et al. [17] proposed the augmented R value defined in Equation (5):
R_{aug} = \frac{1}{IR + 1} \big( R(C_N) + IR \cdot R(C_P) \big)    (5)
It can be seen that the R_aug value is dominated by the R value of the minority class for datasets with large IR, and it equals the R value for balanced datasets with IR = 1. The experimental results in [17] demonstrated that it achieves a stronger correlation with the classification performance evaluated by the area under the Receiver Operating Characteristic (ROC) curve.
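For the binary case, the per-class R values can be combined into R and R_aug as in Equations (4) and (5). The following sketch does so directly from the neighbor counts; the labels used for the majority and minority classes are passed explicitly, since the label convention is an assumption of this example rather than something fixed by [17].

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def r_and_r_aug(X, y, k=7, theta=3, majority_label=0, minority_label=1):
    """R (Equation (4)) and R_aug (Equation (5)) for a binary dataset."""
    X, y = np.asarray(X), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    foreign = (y[idx[:, 1:]] != y[:, None]).sum(axis=1)   # foreign neighbors per sample
    overlap = foreign - theta > 0
    r_maj = overlap[y == majority_label].mean()           # R(C_N)
    r_min = overlap[y == minority_label].mean()           # R(C_P)
    ir = (y == majority_label).sum() / (y == minority_label).sum()  # imbalance ratio
    r = (ir * r_maj + r_min) / (ir + 1)                   # Equation (4)
    r_aug = (r_maj + ir * r_min) / (ir + 1)               # Equation (5)
    return r, r_aug
```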

3. The Proposed Method and Corresponding Overlap Metrics

In this section, a theoretical analysis of the R value is first conducted. Then, an improved method to recognize the overlap area for imbalanced datasets is proposed, and the corresponding overlap metrics are introduced.

3.1. Theoretical Analysis of the R Value

Consider a dataset with N samples X = {x_1, ..., x_i, ..., x_N}, and denote by r_i the number of samples belonging to the same class as sample x_i. In an ideal non-overlapping data distribution, the samples are distributed so well that the r_i nearest samples of x_i, together with x_i itself, are all in the same class. In a real data distribution, however, some of the r_i nearest samples of x_i may belong to a different class.
For simplicity, only the k_i nearest samples (k_i < r_i) of the sample x_i in the real data distribution are considered. As shown in Figure 1, let P_i denote the set of the r_i nearest samples of x_i in the ideal data distribution, and let Q_i denote the set of the k_i nearest samples of x_i in the real data distribution. Any sample other than x_i can then be categorized as x^TP, x^FP, x^FN, or x^TN. The meanings of these four kinds of samples are as follows:
  • x^TP: a sample that is in both P_i and Q_i
  • x^TN: a sample that is in neither P_i nor Q_i
  • x^FN: a sample that is in P_i but not in Q_i
  • x^FP: a sample that is in Q_i but not in P_i
Denote the numbers of x^TP, x^FN, x^FP, and x^TN samples as N_i^TP, N_i^FN, N_i^FP, and N_i^TN, respectively. According to the definition of the R value, the contribution of the sample x_i to the R value is determined by N_i^TP and N_i^FP, since k_i = N_i^TP + N_i^FP. This contribution can also be determined from a probability distribution. In the following, it is analyzed through the distance between the real data distribution and the ideal data distribution from the perspective of probability.
The Kullback-Leibler (K-L) divergence is commonly used to measure the distance between distributions. To compute it, denote by p_{j|i} and q_{j|i} the conditional probabilities of sample x_j given x_i under the ideal and real data distributions, respectively. Following [20], the conditional probability of the samples in P_i and Q_i is set to an equal non-zero value, and the conditional probability of the other samples is set to a near-zero probability governed by a small constant δ. The detailed conditional probabilities p_{j|i} and q_{j|i} are given in Equations (6) and (7):
p_{j|i} = \begin{cases} \dfrac{1-\delta}{r_i}, & \text{if } x_j \in P_i \\[4pt] \dfrac{\delta}{N - r_i - 1}, & \text{otherwise} \end{cases}    (6)
q_{j|i} = \begin{cases} \dfrac{1-\delta}{k_i}, & \text{if } x_j \in Q_i \\[4pt] \dfrac{\delta}{N - k_i - 1}, & \text{otherwise} \end{cases}    (7)
Based on these conditional probabilities, the distance, namely the K-L divergence D(q_{j|i}, p_{j|i}), can be defined as in Equation (8). Due to the near-zero value of δ, the distance can be simplified to Equation (9). According to Equation (9), N_i^TP and N_i^FP dominate the distance, which is consistent with the definition of the R value:
D(q_{j|i}, p_{j|i}) = \sum_{j \ne i,\, x_j \in P_i,\, x_j \in Q_i} q_{j|i} \log \frac{q_{j|i}}{p_{j|i}} + \sum_{j \ne i,\, x_j \notin P_i,\, x_j \in Q_i} q_{j|i} \log \frac{q_{j|i}}{p_{j|i}} + \sum_{j \ne i,\, x_j \in P_i,\, x_j \notin Q_i} q_{j|i} \log \frac{q_{j|i}}{p_{j|i}} + \sum_{j \ne i,\, x_j \notin P_i,\, x_j \notin Q_i} q_{j|i} \log \frac{q_{j|i}}{p_{j|i}}    (8)
D(q_{j|i}, p_{j|i}) \approx N_i^{TP} \left\{ \frac{1-\delta}{k_i} \log \left( \frac{(1-\delta)/k_i}{(1-\delta)/r_i} \right) \right\} + N_i^{FP} \left\{ \frac{1-\delta}{k_i} \log \left( \frac{(1-\delta)/k_i}{\delta/(N - r_i - 1)} \right) \right\}    (9)
Because δ is near zero, Equation (9) can be further simplified to Equation (10). It can be seen that the distance is actually dominated by the ratio of N_i^FP to k_i (with k_i = N_i^FP + N_i^TP). Therefore, a reasonable threshold for judging whether x_i contributes to the R value is 0.5: if N_i^FP / k_i > 0.5, then x_i is in the overlap area. For k_i = 7, N_i^FP / k_i > 0.5 requires N_i^FP to be at least 4, which is consistent with the implementation of the R value in the experiments of [9]:
D(q_{j|i}, p_{j|i}) \approx \frac{N_i^{TP}}{k_i}(1-\delta) \log \frac{r_i}{k_i} + \frac{N_i^{FP}}{k_i}(1-\delta) \left( \log \frac{1-\delta}{\delta} + \log \frac{N - r_i - 1}{k_i} \right) \approx \frac{N_i^{FP}}{k_i}(1-\delta) \log \frac{1-\delta}{\delta}    (10)
In addition, for the KNN algorithm, the sample x_i is correctly classified if N_i^TP > N_i^FP among its k nearest neighbors, which is the opposite of the condition in the definition of the R value. Therefore, the R value is expected to be strongly and negatively correlated with the accuracy of the KNN algorithm.
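The simplification from Equation (8) to Equation (10) can also be checked numerically. The small script below evaluates the exact four-term divergence for illustrative values of N, r_i, k_i, and δ (these numbers are not taken from the paper) and compares it with the final approximation, showing that the divergence is driven mainly by the N_i^FP term as δ approaches zero.

```python
import numpy as np

# Illustrative values only (not from the paper): dataset size N, same-class
# neighborhood size r_i in the ideal distribution, inspected neighbors k_i.
N, r_i, k_i, delta = 200, 99, 7, 1e-9

def kl_divergence(n_fp):
    """Exact four-term K-L divergence of Equation (8) for a sample whose k_i
    nearest neighbors contain n_fp samples from other classes."""
    n_tp = k_i - n_fp
    q_in, q_out = (1 - delta) / k_i, delta / (N - k_i - 1)
    p_in, p_out = (1 - delta) / r_i, delta / (N - r_i - 1)
    d = n_tp * q_in * np.log(q_in / p_in)                       # x_j in P_i and Q_i
    d += n_fp * q_in * np.log(q_in / p_out)                     # x_j in Q_i only
    d += (r_i - n_tp) * q_out * np.log(q_out / p_in)            # x_j in P_i only
    d += (N - 1 - r_i - n_fp) * q_out * np.log(q_out / p_out)   # x_j in neither set
    return d

for n_fp in range(k_i + 1):
    # Equation (10) keeps only the term that blows up as delta -> 0
    approx = n_fp / k_i * (1 - delta) * np.log((1 - delta) / delta)
    print(n_fp, round(kl_divergence(n_fp), 2), round(approx, 2))
```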

3.2. The Proposed Method

From the above theoretical analysis, it can be seen that the class overlap is actually determined not only by N_i^FP but also by N_i^TP, whereas the contribution of N_i^TP is ignored by the R value. As Equation (10) shows, this term can be omitted for balanced datasets because the coefficient log(r_i / k_i) is constant when the same k_i is used for all samples. For imbalanced datasets, however, the coefficient varies between classes due to the different r_i when the same k_i is adopted. To make the coefficients of N_i^TP equal for different classes in an imbalanced dataset, the condition in Equation (11) should be satisfied, from which Equation (12) follows. The same result is obtained if the derivation is conducted with the Hellinger distance, as shown in Appendix A. This indicates that the value of k should be proportional to the number of samples in the class, and a smaller value of k should be adopted for the minority class:
\log \frac{r_N}{k_N} = \log \frac{r_P}{k_P}    (11)
\frac{k_N}{k_P} = \frac{r_N}{r_P} = IR    (12)
According to Equation (12), if k is used for the majority class, then k/IR should be used for the minority class, rounded up since the neighborhood size must be a positive integer. Based on this analysis, an improved method to calculate the overlap of the different classes is proposed in Equation (13), where |C_N| is the number of samples in the majority class. In this way, the samples in the minority class are not so easily considered to be in an overlap area:
R(C_l) = \frac{1}{|C_l|} \sum_{i=1}^{|C_l|} \lambda\left( \left| \left\lceil \frac{k \cdot |C_l|}{|C_N|} \right\rceil\text{-NN}(x_i^l, U - C_l) \right| - \theta \right)    (13)
An intuitive demonstration of how the proposed method works is presented in Figure 2. For a sample x_i^N in the majority class, both k = 3 and k = 5 are suitable for deciding whether x_i^N is in the class overlap region, whereas a sample x_i^P in the minority class would be recognized as being in the overlap region when k = 5. When k = ⌈5/2⌉ = 3, x_i^P is correctly recognized as being in the non-overlap region.

3.3. The Proposed Metrics

Following the proposed method, two improved overlap metrics based on R and R_aug for binary imbalanced datasets can be introduced. The two metrics, denoted ImR and ImR_aug, are defined in Equations (14) and (15):
ImR = \frac{1}{IR + 1} \big( IR \cdot R(C_N) + R(C_P) \big) = \frac{1}{|C_N| + |C_P|} \left( \sum_{i=1}^{|C_N|} \lambda\big( |k\text{-NN}(x_i^N, C_P)| - \theta \big) + \sum_{i=1}^{|C_P|} \lambda\big( |\lceil k/IR \rceil\text{-NN}(x_i^P, C_N)| - \theta \big) \right)    (14)
ImR_{aug} = \frac{1}{IR + 1} \big( R(C_N) + IR \cdot R(C_P) \big) = \frac{1}{|C_N| + |C_P|} \left( \frac{1}{IR} \sum_{i=1}^{|C_N|} \lambda\big( |k\text{-NN}(x_i^N, C_P)| - \theta \big) + IR \sum_{i=1}^{|C_P|} \lambda\big( |\lceil k/IR \rceil\text{-NN}(x_i^P, C_N)| - \theta \big) \right)    (15)
By definition, both metrics equal the original R value when the dataset is balanced (IR = 1). The experimental results in [9,17] demonstrate that R and R_aug are strongly correlated with the accuracy and with the area under the ROC curve (AUC), respectively. Therefore, ImR is expected to be more strongly correlated with the accuracy on imbalanced datasets and ImR_aug more strongly correlated with the AUC.
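A sketch of how Equations (13)-(15) might be implemented is given below. It assumes binary integer labels with the majority and minority labels passed explicitly, and it rounds k/IR up to the nearest integer (the ceiling is an assumption consistent with the requirement that the neighborhood size be a positive integer); θ is kept fixed for both classes, as in the equations.

```python
import math
import numpy as np
from sklearn.neighbors import NearestNeighbors

def im_r_metrics(X, y, k=7, theta=3, majority_label=0, minority_label=1):
    """ImR (Equation (14)) and ImR_aug (Equation (15)) for a binary dataset.
    The minority class is scanned with the reduced neighborhood ceil(k / IR),
    following Equation (13); theta is kept the same for both classes."""
    X, y = np.asarray(X), np.asarray(y)
    n_maj = int((y == majority_label).sum())
    n_min = int((y == minority_label).sum())
    ir = n_maj / n_min
    k_min = max(1, math.ceil(k / ir))        # class-proportional neighborhood size

    def overlap_count(label, k_used):
        # number of samples of the given class that fall in the overlap region
        nn = NearestNeighbors(n_neighbors=k_used + 1).fit(X)
        _, idx = nn.kneighbors(X[y == label])
        foreign = (y[idx[:, 1:]] != label).sum(axis=1)
        return int((foreign - theta > 0).sum())

    o_maj = overlap_count(majority_label, k)
    o_min = overlap_count(minority_label, k_min)
    n = n_maj + n_min
    im_r = (o_maj + o_min) / n                    # Equation (14)
    im_r_aug = (o_maj / ir + ir * o_min) / n      # Equation (15)
    return im_r, im_r_aug
```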

4. Experiment Design

In this section, the experimental setup is first introduced. Then, the datasets adopted in the experiments are presented. Finally, the evaluation of the correlation between the class overlap metrics and the classification performance is described.

4.1. Experiment Setup

The experiments are conducted to verify the effectiveness of the proposed method and the two metrics for addressing class overlap in imbalanced datasets. To this end, not only the correlation of the different overlap metrics with the classification performance but also the time consumption of the overlap metrics and of the classification modeling are compared. The Pearson correlation coefficient and the ξ correlation coefficient are adopted to obtain the correlation results. The Pearson correlation coefficient captures only linear correlations, while the ξ correlation coefficient can capture both linear and nonlinear correlations.
According to the survey of Guo et al. [2], the AUC and accuracy are the most frequently used metrics for evaluating classification performance. The AUC is obtained from the ROC curve, which consists of a series of (false positive rate, true positive rate) points [21] generated by varying the threshold on the prediction probability of the classifier. As the AUC is robust to imbalanced datasets [22], it is recognized as an objective metric and is widely used to evaluate classification performance for imbalanced problems. The accuracy is defined as the ratio of the number of correctly predicted samples to the number of samples in the whole dataset. Although accuracy has been shown to be biased toward the majority class, it is still frequently used in imbalance learning research because it is the most general and intuitive metric [2]. Based on the analysis in Section 3, the proposed metrics ImR and ImR_aug are expected to be strongly correlated with the accuracy and the AUC, respectively; therefore, both performance metrics are adopted for a more complete evaluation of the proposed overlap metrics.
Moreover, to compare generalization ability, eight commonly used algorithms are adopted to obtain the classification performance, and the performances are obtained with 5-fold cross-validation. The eight classification algorithms are k-nearest neighbor (KNN), Naive Bayes (NB), Support Vector Machine with linear kernel (SVM-L), Support Vector Machine with radial basis kernel (SVM-R), Decision Tree (DT), Multilayer Perceptron (MLP), Random Forest (RF), and Adaptive Boosting (AdaB). All methods in the experiments are implemented in Python using packages such as scikit-learn [23]. The parameters of the eight classification algorithms are left at their default values. The same parameter setting, k = 7 and θ = ⌊k/2⌋ = 3, is adopted for all overlap metrics.
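The following sketch illustrates how the per-dataset AUC and accuracy could be collected with 5-fold cross-validation and default scikit-learn models. The mapping of the eight algorithm names to concrete scikit-learn classes (e.g., GaussianNB for NB) is an assumption of this example, not a setting reported in the paper.

```python
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

CLASSIFIERS = {
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "SVM-L": SVC(kernel="linear", probability=True),
    "SVM-R": SVC(kernel="rbf", probability=True),
    "DT": DecisionTreeClassifier(),
    "MLP": MLPClassifier(),
    "RF": RandomForestClassifier(),
    "AdaB": AdaBoostClassifier(),
}

def evaluate_dataset(X, y):
    """Mean 5-fold cross-validated AUC and accuracy of each classifier."""
    results = {}
    for name, clf in CLASSIFIERS.items():
        cv = cross_validate(clf, X, y, cv=5, scoring=("roc_auc", "accuracy"))
        results[name] = (cv["test_roc_auc"].mean(), cv["test_accuracy"].mean())
    return results
```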

4.2. Datasets

In the experiments, a relatively well-known collection of 66 imbalanced classification datasets is used. This collection is available from the KEEL repository and has been adopted in [17,24]. The descriptions of these datasets are shown in Table 1, where #Inst. and #Attrs denote the number of samples and attributes, respectively, and IR denotes the imbalance ratio. The imbalance ratios of the datasets span a very wide range: the minimum is 1.82 and the maximum is 128.87.

4.3. Evaluation of Correlation

The Pearson correlation coefficient [25] measures the strength of the linear relationship between two variables. It is defined in Equation (16), where X and Y are the two variables, X̄ and Ȳ are their mean values, cov(·,·) is the covariance, and σ is the standard deviation:
\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}    (16)
The Pearson correlation coefficient is widely used to measure the linear correlation of two variables. It was used to compare the performance of R and R_aug in [17], and it is also used in the experiments of this paper. The range of the Pearson correlation coefficient ρ is [−1, 1]; the larger its absolute value, the stronger the correlation between the two variables. ρ = 1 indicates a perfect positive correlation, while ρ = −1 indicates a perfect negative correlation. When ρ = 0, no linear correlation can be found between the two variables.
To further verify the linear correlation between the proposed metrics and the classification performance, the p-values of the Pearson correlation results are also compared. A smaller p-value indicates stronger support for the Pearson correlation result. Generally, a p-value smaller than 0.05 means that the linear correlation is statistically significant, and the result is highly significant when the p-value is smaller than 0.01.
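As a small usage example, the Pearson coefficient and its p-value over a collection of datasets can be obtained with scipy.stats.pearsonr; the arrays below contain purely illustrative numbers, one entry per dataset.

```python
from scipy.stats import pearsonr

# One entry per dataset: the overlap metric value and a classifier's mean AUC.
# The numbers below are purely illustrative.
overlap = [0.12, 0.35, 0.08, 0.51, 0.27]
auc = [0.93, 0.71, 0.96, 0.62, 0.80]

rho, p_value = pearsonr(overlap, auc)      # Equation (16) and its p-value
print(f"rho = {rho:.4f}, p = {p_value:.2e}")
```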
Besides the Pearson correlation coefficient, the ξ correlation coefficient is also used to evaluate the relationships of the different overlap metrics with the classification performance. The ξ correlation coefficient, proposed in [26], can measure not only linear but also nonlinear correlations. To calculate the ξ correlation coefficient of a pair of variables (X, Y), the data are rearranged as (X_1, Y_1), ..., (X_n, Y_n) such that X_1 ≤ ... ≤ X_n. Let h_i be the number of indices j such that Y_j ≤ Y_i, and let l_i be the number of indices j such that Y_j ≥ Y_i; the ξ correlation coefficient is then defined as in Equation (17). It lies in the range [0, 1], where ξ(X, Y) = 0 indicates that X and Y are independent and ξ(X, Y) = 1 indicates that Y is a measurable function of X:
\xi(X, Y) = 1 - \frac{n \sum_{i=1}^{n-1} |h_{i+1} - h_i|}{2 \sum_{i=1}^{n} l_i (n - l_i)}    (17)
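A compact implementation of Equation (17) is sketched below; it follows the counts h_i and l_i directly and assumes ties in X are broken by a stable sort, which is a simplification of the randomized tie-breaking in [26].

```python
import numpy as np

def xi_correlation(x, y):
    """Chatterjee's xi correlation coefficient, Equation (17)."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    y = y[np.argsort(x, kind="stable")]           # rearrange so that X is increasing
    h = np.array([(y <= yi).sum() for yi in y])   # h_i = #{j : Y_j <= Y_i}
    l = np.array([(y >= yi).sum() for yi in y])   # l_i = #{j : Y_j >= Y_i}
    return 1 - n * np.abs(np.diff(h)).sum() / (2 * (l * (n - l)).sum())

# Illustrative use with the same toy arrays as in the Pearson example.
print(xi_correlation([0.12, 0.35, 0.08, 0.51, 0.27],
                     [0.93, 0.71, 0.96, 0.62, 0.80]))
```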

5. Results and Discussion

To verify the effectiveness of the proposed method for measuring class overlap, the correlations of the different metrics with both the AUC and the accuracy are compared. As the AUC is a more objective metric than the accuracy for imbalance learning, the correlations of the different overlap metrics with the AUC of the different algorithms are compared first, followed by the correlations with the accuracy. Finally, the time consumption is compared and the overall results are summarized.

5.1. The Correlation Results for the AUC Metric

Figure 3 shows the correlation results of the different overlap metrics with the AUCs of the classification algorithms. The results demonstrate that the ImR_aug metric achieves the best performance among these metrics for all classification algorithms. The ImR metric also obtains better performance than the original R metric with respect to the AUC. Moreover, both the original R and R_aug values have low correlation coefficients with the AUC of the NB algorithm, which shows that the generalization ability of the existing overlap metrics is limited. The correlation coefficients of R and R_aug with the NB algorithm are both largely improved by the proposed method, so ImR and ImR_aug appear to generalize better across these algorithms than R and R_aug, respectively.
As expected, R_aug and ImR_aug achieve much stronger correlations with the AUC of the different algorithms, which is consistent with the result in [17]. The detailed correlation coefficients and p-values of R_aug and ImR_aug with the AUC of the different algorithms are presented in Table 2. The correlation coefficients of R_aug with the AUC of the DT, MLP, RF, and AdaB algorithms are all lower than 0.7 in absolute value, while the corresponding coefficients of ImR_aug all exceed 0.8. On average, ImR_aug achieves a 34.7488% improvement over R_aug. An illustration of the correlation coefficient improvements of ImR_aug over R_aug is shown in Figure 4; the correlation coefficient between R_aug and the AUC of the NB algorithm is the one most improved by the proposed ImR_aug.
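For reference, the improvement column of Table 2 can be reproduced from the coefficients as the relative increase of their absolute values; small deviations from the printed percentages are expected because the displayed coefficients are rounded.

```python
def improvement(rho_old, rho_new):
    """Relative increase of correlation strength, in percent; both coefficients
    are negative here, so their absolute values are compared."""
    return (abs(rho_new) - abs(rho_old)) / abs(rho_old) * 100

print(improvement(-0.7425, -0.8525))   # ~14.81, close to the KNN row of Table 2
```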
In addition, all the p-values of ImR_aug are far less than 0.01, which indicates that ImR_aug does have linear correlations with the AUC of these algorithms, and the p-values of ImR_aug are also much smaller than those of R_aug. Therefore, ImR_aug performs much better than R_aug for estimating the level of class overlap under imbalanced distribution. Moreover, ImR_aug not only achieves a better mean correlation with the AUCs of all classification algorithms but also a smaller standard deviation, which demonstrates its better generalization ability across these algorithms.
Figure 5 shows the ξ correlation coefficients of the AUCs of the different algorithms with R_aug and ImR_aug. The ξ correlation results are similar to the Pearson correlation results: the ξ correlation coefficient of the AUC of the RF algorithm with ImR_aug is the highest, and the coefficient of the NB algorithm with R_aug is largely improved by ImR_aug. In addition, the ξ correlation coefficients of the AUCs of the different algorithms with ImR_aug are all higher than those with R_aug. Therefore, the comparison of ξ correlation coefficients also demonstrates the superiority of ImR_aug.

5.2. The Correlation Results for the Accuracy Metric

Figure 6 shows the correlation results of R and ImR with the accuracy of the classification algorithms. It can be seen that ImR achieves better performance than the original R for most classification algorithms. The ImR_aug measure also obtains better performance than the R_aug measure with respect to the accuracy.
The detailed correlation coefficients and p-values of R and ImR with the accuracy of the different algorithms are shown in Table 3. As expected, ImR is strongly correlated with the accuracy of the KNN algorithm; its Pearson correlation coefficient exceeds 0.9 in absolute value. ImR also achieves high correlation coefficients with the accuracy of the SVM-R, DT, MLP, RF, and AdaB algorithms. Although the correlation coefficients of both R and ImR with the accuracy of the NB algorithm are very low, the coefficient is greatly improved by ImR. On average, ImR achieves an 8.0898% improvement of the Pearson correlation coefficient over R. Therefore, ImR performs better than R for estimating the level of class overlap under imbalanced distribution.
In addition, all the p-values of ImR except that of the NB algorithm are far less than 0.01, which indicates that ImR does have linear correlations with the accuracy of most algorithms, and the p-values of ImR are much smaller than those of R except for the SVM-L and MLP algorithms. Moreover, ImR not only achieves a better mean correlation coefficient with the accuracy of all classification algorithms but also a smaller standard deviation, which demonstrates its better generalization ability across these algorithms.
Figure 7 presents the ξ correlation coefficients of the accuracies of the different algorithms with R and ImR. The ξ correlation results are again similar to the Pearson correlation results. The ξ correlation coefficient of the accuracy of the NB algorithm with R is much smaller than those of the other algorithms, and this coefficient is largely improved by ImR. The ξ correlation coefficient of the accuracy of the SVM-R algorithm with ImR is the highest. Meanwhile, the ξ correlation coefficients of the accuracies of the different algorithms with ImR are all higher than those with R. Therefore, the comparison of ξ correlation coefficients also shows that ImR performs better than R.

5.3. The Comparison of Time Consumption

Figure 8 shows the average time consumption of the different overlap metrics and of the 5-fold cross-validation of the different classification algorithms. The MLP algorithm is the most time consuming due to backpropagation, and the RF and AdaB algorithms are also very time consuming because of ensemble learning. The overlap metrics have similar time consumption, and the time consumed by the KNN algorithm is approximately four times that of an overlap metric. The main reason is that the k-nearest-neighbor search is the most time-consuming step both for the overlap metrics and for the KNN algorithm: the search is conducted once over n samples for an overlap metric, while it is conducted five times over 4n/5 samples for the 5-fold cross-validation of the KNN algorithm. The results are consistent with this analysis, and the proposed overlap metrics are therefore also favorable in terms of time consumption.
To sum up, the proposed metrics ImR and ImR_aug achieve better performance than the original R and R_aug, respectively, for estimating the level of class overlap of imbalanced datasets. Therefore, it can be concluded that the proposed method and metrics are superior for addressing class overlap under imbalanced distribution.

6. Conclusions

In this paper, a theoretical analysis of the existing class overlap metrics is conducted, and an improved method to address class overlap under imbalanced distribution is proposed based on this analysis. The corresponding metrics for estimating the class overlap of imbalanced datasets are then introduced. A well-known collection of imbalanced datasets is used to compare the Pearson correlation coefficients and the ξ correlation coefficients of the different overlap metrics with the classification performance. The experimental results demonstrate that the proposed class overlap metrics outperform the compared metrics on the imbalanced datasets: the Pearson correlation coefficients with the AUC metric and the accuracy metric are improved by 34.7488% and 8.0898% on average, respectively. Therefore, the proposed method and metrics provide a much better estimate of class overlap under imbalanced distribution.
In future work, the proposed metrics can be applied to feature selection and feature construction. They can also be used as meta-features in meta-learning for algorithm selection and parameter optimization.

Author Contributions

Conceptualization, methodology, formal analysis, and original draft preparation, Z.L. and J.Q.; data curation, review and editing, and visualization, X.Z. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partly supported by the National Natural Science Foundation of China (No. 61971031), the National Key R&D Program of China (No. 2018YFB0704300), the Scientific and Technological Innovation Foundation of Shunde Graduate School, USTB (BK20BF009), and the interdisciplinary research project of USTB (FRF-IDRY-19-019).

Data Availability Statement

Data not included in the paper are available from the corresponding author upon reasonable request.

Acknowledgments

This work is partly supported by the National Natural Science Foundation of China (No. 61971031), the National Key R&D Program of China (No. 2018YFB0704300), the Scientific and Technological Innovation Foundation of Shunde Graduate School, USTB (BK20BF009), and the interdisciplinary research project of USTB (FRF-IDRY-19-019).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

If the K-L divergence between the real data distribution and the ideal data distribution is replaced by the Hellinger distance, the distance is given by Equation (A1). It can be simplified to Equation (A2) because δ is a near-zero value:
H(q_{j|i}, p_{j|i}) = \frac{1}{2} \sum_{j \ne i} \left( \sqrt{q_{j|i}} - \sqrt{p_{j|i}} \right)^2 = \sum_{j \ne i,\, x_j \in P_i,\, x_j \in Q_i} \frac{1}{2} \left( \sqrt{q_{j|i}} - \sqrt{p_{j|i}} \right)^2 + \sum_{j \ne i,\, x_j \notin P_i,\, x_j \in Q_i} \frac{1}{2} \left( \sqrt{q_{j|i}} - \sqrt{p_{j|i}} \right)^2 + \sum_{j \ne i,\, x_j \in P_i,\, x_j \notin Q_i} \frac{1}{2} \left( \sqrt{q_{j|i}} - \sqrt{p_{j|i}} \right)^2 + \sum_{j \ne i,\, x_j \notin P_i,\, x_j \notin Q_i} \frac{1}{2} \left( \sqrt{q_{j|i}} - \sqrt{p_{j|i}} \right)^2    (A1)
H(q_{j|i}, p_{j|i}) \approx \frac{N_i^{TP}}{2} \left( \sqrt{\frac{1-\delta}{k_i}} - \sqrt{\frac{1-\delta}{r_i}} \right)^2 + \frac{N_i^{FP}}{2} \left( \sqrt{\frac{1-\delta}{k_i}} - \sqrt{\frac{\delta}{N - r_i - 1}} \right)^2 \approx \frac{N_i^{TP}}{2 k_i} \left( 1 - \sqrt{\frac{k_i}{r_i}} \right)^2 + \frac{N_i^{FP}}{2 k_i}    (A2)
To make the distance fair to the different classes in an imbalanced dataset, the coefficients of N_i^TP for the different classes are set equal, which yields Equation (A3). The result is consistent with that obtained from the K-L divergence:
\frac{k_N}{r_N} = \frac{k_P}{r_P}    (A3)

References

  1. Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 2002, 6, 429–449.
  2. Guo, H.; Li, Y.; Jennifer, S.; Gu, M.; Huang, Y.; Gong, B. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. 2017, 73, 220–239.
  3. Sun, Y.; Wong, A.K.C.; Kamel, M.S. Classification of imbalanced data: A review. Int. J. Pattern Recognit. Artif. Intell. 2009, 23, 687–719.
  4. Dogo, E.M.; Nwulu, N.I.; Twala, B.; Aigbavboa, C. Accessing Imbalance Learning Using Dynamic Selection Approach in Water Quality Anomaly Detection. Symmetry 2021, 13, 818.
  5. Bejjanki, K.K.; Gyani, J.; Gugulothu, N. Class Imbalance Reduction (CIR): A Novel Approach to Software Defect Prediction in the Presence of Class Imbalance. Symmetry 2020, 12, 407.
  6. He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
  7. Xiong, H.; Wu, J.; Liu, L. Classification with Class Overlapping: A Systematic Study. In Proceedings of the 1st International Conference on E-Business Intelligence (ICEBI 2010), Guangzhou, China, 19–21 December 2010; Atlantis Press: Dordrecht, The Netherlands, 2010; pp. 491–497.
  8. Liu, C.L. Partial discriminative training for classification of overlapping classes in document analysis. Int. J. Doc. Anal. Recognit. 2008, 11, 53–56.
  9. Oh, S. A new dataset evaluation method based on category overlap. Comput. Biol. Med. 2011, 41, 115–122.
  10. Denil, M.; Trappenberg, T. Overlap versus imbalance. In Advances in Artificial Intelligence; Canadian AI 2010, Lecture Notes in Computer Science; Farzindar, A., Keselj, V., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6085, pp. 220–231.
  11. Lee, H.K.; Kim, S.B. An overlap-sensitive margin classifier for imbalanced and overlapping data. Expert Syst. Appl. 2018, 98, 72–83.
  12. Klomsae, A.; Auephanwiriyakul, S.; Theera-Umpon, N. A string grammar fuzzy-possibilistic c-medians. Appl. Soft Comput. 2017, 57, 684–695.
  13. Lee, J.; Batnyam, N.; Oh, S. RFS: Efficient feature selection method based on R-value. Comput. Biol. Med. 2013, 43, 91–99.
  14. Wang, X.; Lin, X.; Huang, X.; Yang, Y. Ensemble unsupervised feature selection based on permutation and R-value. In Proceedings of the 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Zhangjiajie, China, 15–17 August 2015.
  15. Li, Z.; He, J.; Zhang, X.; He, J.; Qin, J. Toward high accuracy and visualization: An interpretable feature extraction method based on genetic programming and non-overlap degree. In Proceedings of the 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Seoul, Korea, 16–19 December 2020; pp. 299–304.
  16. Kang, D.; Oh, S. Balanced Training/Test Set Sampling for Proper Evaluation of Classification Models. Intell. Data Anal. 2020, 24, 5–18.
  17. Borsos, Z.; Lemnaru, C.; Potolea, R. Dealing with overlap and imbalance: A new metric and approach. Pattern Anal. Appl. 2018, 21, 381–395.
  18. Fu, G.; Wu, Y.; Zong, M.; Yi, L. Feature selection and classification by minimizing overlap degree for class-imbalanced data in metabolomics. Chemom. Intell. Lab. Syst. 2020, 196, 103906.
  19. Fatima, E.B.; Omar, B.; Abdelmajid, E.M.; Rustam, F.; Mehmood, A.; Choi, G.S. Minimizing the overlapping degree to improve class-imbalanced learning under sparse feature selection: Application to fraud detection. IEEE Access 2021, 9, 28101–28110.
  20. Venna, J.; Peltonen, J.; Nybo, K.; Aidos, H.; Kaski, S. Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization. J. Mach. Learn. Res. 2010, 11, 451–490.
  21. Bradley, A.P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997, 30, 1145–1159.
  22. Luque, A.; Carrasco, A.; Martín, A.; Lama, J.R. Exploring Symmetry of Binary Classification Performance Metrics. Symmetry 2019, 11, 47.
  23. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  24. López, V.; Fernández, A.; García, S.; Palade, V.; Herrera, F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inform. Sci. 2013, 250, 113–141.
  25. Pearson, K. Notes on Regression and Inheritance in the Case of Two Parents. Proc. R. Soc. Lond. 1895, 58, 240–242.
  26. Chatterjee, S. A New Coefficient of Correlation. J. Am. Stat. Assoc. 2020, 1–14.
Figure 1. The comparison analysis of the data distribution.
Figure 2. An intuitive demonstration of how the proposed method works.
Figure 3. The Pearson correlation comparison between different overlap metrics and the AUCs of different algorithms on all datasets.
Figure 4. Correlation coefficient improvements of ImR_aug over R_aug with the AUCs of different algorithms on all datasets.
Figure 5. ξ correlation coefficients of the AUCs of different algorithms with R_aug and ImR_aug.
Figure 6. The correlation comparison of different overlap metrics and classification accuracies of different algorithms on all datasets.
Figure 7. ξ correlation coefficients of the accuracies of different algorithms with R and ImR.
Figure 8. The average time consumption comparison of different overlap metrics and 5-fold cross-validation of different algorithms.
Table 1. The information about the imbalanced datasets.

Dataset | #Inst. | #Attrs | IR | Dataset | #Inst. | #Attrs | IR
Glass1 | 214 | 9 | 1.82 | Glass04vs5 | 92 | 9 | 9.22
Ecoli0vs1 | 220 | 7 | 1.86 | Ecoli0346vs5 | 205 | 7 | 9.25
Wisconsin | 683 | 9 | 1.86 | Ecoli0347vs56 | 257 | 7 | 9.28
Pima | 768 | 8 | 1.9 | Yeast05679vs4 | 528 | 8 | 9.35
Iris0 | 150 | 4 | 2 | Ecoli067vs5 | 220 | 6 | 10
Glass0 | 214 | 9 | 2.06 | Vowel0 | 988 | 13 | 10.1
Yeast1 | 1484 | 8 | 2.46 | Glass016vs2 | 192 | 9 | 10.29
Vehicle1 | 846 | 18 | 2.52 | Glass2 | 214 | 9 | 10.39
Vehicle2 | 846 | 18 | 2.52 | Ecoli0147vs2356 | 336 | 7 | 10.59
Vehicle3 | 846 | 18 | 2.52 | Led7digit02456789vs1 | 443 | 7 | 10.97
Haberman | 306 | 3 | 2.68 | Glass06vs5 | 108 | 9 | 11
Glass0123vs456 | 214 | 9 | 3.19 | Ecoli01vs5 | 240 | 6 | 11
Vehicle0 | 846 | 18 | 3.23 | Glass0146vs2 | 205 | 9 | 11.06
Ecoli1 | 336 | 7 | 3.36 | Ecoli0147vs56 | 332 | 6 | 12.28
New-thyroid2 | 215 | 5 | 4.92 | Cleveland0vs4 | 177 | 13 | 12.62
New-thyroid1 | 215 | 5 | 5.14 | Ecoli0146vs5 | 280 | 6 | 13
Ecoli2 | 336 | 7 | 5.46 | Ecoli4 | 336 | 7 | 13.84
Segment0 | 2308 | 19 | 6.01 | Yeast1vs7 | 459 | 8 | 13.87
Glass6 | 214 | 9 | 6.38 | Shuttle0vs4 | 1829 | 9 | 13.87
Yeast3 | 1484 | 8 | 8.11 | Glass4 | 214 | 9 | 15.47
Ecoli3 | 336 | 7 | 8.19 | Page-blocks13vs2 | 472 | 10 | 15.85
Page-blocks0 | 5472 | 10 | 8.77 | Abalone9vs18 | 731 | 8 | 16.68
Ecoli034vs5 | 200 | 7 | 9 | Glass016vs5 | 184 | 9 | 19.44
Yeast2vs4 | 514 | 8 | 9.08 | Shuttle2vs4 | 129 | 9 | 20.5
Ecoli067vs35 | 222 | 7 | 9.09 | Yeast1458vs7 | 693 | 8 | 22.1
Ecoli0234vs5 | 202 | 7 | 9.1 | Glass5 | 214 | 9 | 22.81
Glass015vs2 | 172 | 9 | 9.12 | Yeast2vs8 | 482 | 8 | 23.1
Yeast0359vs78 | 506 | 8 | 9.12 | Yeast4 | 1484 | 8 | 28.41
Yeast02579vs368 | 1004 | 8 | 9.14 | Yeast1289vs7 | 947 | 8 | 30.56
Yeast0256vs3789 | 1004 | 8 | 9.14 | Yeast5 | 1484 | 8 | 32.78
Ecoli046vs5 | 203 | 6 | 9.15 | Ecoli0137vs26 | 281 | 7 | 39.15
Ecoli01vs235 | 244 | 7 | 9.17 | Yeast6 | 1484 | 8 | 39.15
Ecoli0267vs35 | 224 | 7 | 9.18 | Abalone19 | 4174 | 8 | 128.87
Table 2. The detailed Pearson correlation coefficient results of R_aug and ImR_aug with the AUCs of different algorithms on all data sets.

Algorithm | R_aug | ImR_aug | Improvement
KNN | −0.7425 (9.62 × 10^−13) | −0.8525 (1.09 × 10^−19) | 14.8108%
NB | −0.2900 (1.82 × 10^−2) | −0.6125 (4.60 × 10^−8) | 111.2128%
SVM-L | −0.6985 (7.03 × 10^−11) | −0.7718 (3.30 × 10^−14) | 10.5068%
SVM-R | −0.8176 (5.48 × 10^−17) | −0.8783 (3.51 × 10^−22) | 7.4199%
DT | −0.6253 (2.00 × 10^−8) | −0.8598 (2.38 × 10^−20) | 37.5162%
MLP | −0.6695 (7.94 × 10^−10) | −0.8477 (2.81 × 10^−19) | 26.6104%
RF | −0.6986 (6.92 × 10^−11) | −0.8936 (5.99 × 10^−24) | 27.9070%
AdaB | −0.6163 (3.62 × 10^−8) | −0.8752 (7.49 × 10^−22) | 42.0063%
Mean | −0.6448 (-) | −0.8239 (-) | 34.7488%
SD | 0.1571 (-) | 0.0930 (-) | -
The values in brackets are the p-values; in the original table, bold values indicate the better results.
Table 3. The detailed Pearson correlation coefficient results of R and ImR with the accuracies of different algorithms on all data sets.

Algorithm | R | ImR | Improvement
KNN | −0.9085 (6.11 × 10^−26) | −0.9125 (1.53 × 10^−26) | 0.4441%
NB | −0.1091 (3.83 × 10^−1) | −0.1742 (1.62 × 10^−1) | 59.6436%
SVM-L | −0.7787 (1.40 × 10^−14) | −0.7674 (5.67 × 10^−14) | −1.4452%
SVM-R | −0.9145 (7.76 × 10^−27) | −0.9192 (1.33 × 10^−27) | 0.5234%
DT | −0.8376 (1.86 × 10^−18) | −0.8498 (1.84 × 10^−19) | 1.4656%
MLP | −0.8229 (2.31 × 10^−17) | −0.8160 (7.05 × 10^−17) | −0.8434%
RF | −0.8672 (4.77 × 10^−21) | −0.9014 (6.04 × 10^−25) | 3.9400%
AdaB | −0.8451 (4.56 × 10^−19) | −0.8535 (8.83 × 10^−20) | 0.9902%
Mean | −0.7393 (-) | −0.7545 (-) | 8.0898%
SD | 0.2809 (-) | 0.2609 (-) | -
The values in brackets are the p-values; in the original table, bold values indicate the better results.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
