Article

An AdaBoost Method with K′K-Means Bayes Classifier for Imbalanced Data

Department of Statistics, Beijing Jiaotong University, Beijing 100044, China
*
Author to whom correspondence should be addressed.
Mathematics 2023, 11(8), 1878; https://doi.org/10.3390/math11081878
Submission received: 24 March 2023 / Revised: 14 April 2023 / Accepted: 14 April 2023 / Published: 15 April 2023
(This article belongs to the Section Computational and Applied Mathematics)

Abstract

This article proposes a new AdaBoost method with a k′ k-means Bayes classifier for imbalanced data. It reduces the imbalance degree of the training data through the k′ k-means Bayes method and then deals with the imbalanced classification problem using multiple iterations with weight control, achieving a good effect without losing any raw data information or needing to generate additional data manually. The effectiveness of the proposed method is verified by comparison with other traditional methods in numerical experiments. In the NSL-KDD data experiment, the F-score values of each minority class are also greater than those of the other methods.

1. Introduction

Classification technology is widely used in finance [1], medicine [2], meteorology [3], physics [4], road safety [5,6,7], and many other fields. The problem of imbalanced classification has long attracted attention in machine learning and artificial intelligence. In fact, imbalanced data occur more often than balanced data, and the minority category is usually of greater concern. For example, the number of fraudulent transactions in a bank is far lower than that of normal ones, but abnormal transactions bring losses to the bank and have to be detected. The current number of COVID-19 patients is far lower than that of healthy people, but it is particularly important to detect COVID-19 patients as accurately as possible. In the field of road safety, fatal crashes represent roughly 2–3% of all crashes; however, the correct classification of this minority class matters more. Faced with imbalanced data, traditional algorithms cannot detect the minority classes accurately; they can only achieve high accuracy and recall on the majority classes. Hence, we need to improve the traditional algorithms and pay more attention to the detection of the minority classes rather than the majority classes.
Many improved methods based on data and on algorithms have been proposed. Resampling methods based on data mainly include over-sampling [8], under-sampling [9], and data synthesis [10]. The over-sampling method reduces the imbalance degree by copying the minority data, but it easily causes over-fitting. The under-sampling method deletes much of the majority data to achieve balance between categories, which makes it easy to lose important data information. Data synthesis methods such as SMOTE (Synthetic Minority Oversampling Technique) [11] and ADASYN (Adaptive Synthetic Sampling) [12] synthesize new data based on the raw data, which avoids the duplication and loss of data of the over-sampling and under-sampling methods and gives better results. On the algorithm side, there are mainly cost-sensitive learning approaches [13], boosting algorithms [14], bagging algorithms [15], and random forest algorithms [16]. The cost-sensitive learning approach obtains higher recall and accuracy by assigning costs to misclassification. The boosting method increases the weights of misclassified samples to improve the prediction results. The random forest improves the prediction result by aggregating the decisions of multiple decision trees. However, most of the existing methods perform well on binary problems but poorly on multi-class problems.
A new method that improves both the algorithm and the data is proposed in this paper. First, we reduce the imbalance degree between categories with a clustering method without changing the data information; second, we increase the training weights of samples that are difficult to predict so that they are more likely to be accurately predicted by the AdaBoost algorithm. Additionally, owing to the advantage of naive Bayes [17] in handling multiple classes, this algorithm can solve both binary and multi-class classification problems.
Given an imbalanced data set, we follow the k′ k-means method proposed in [18] to reduce the imbalance degree between categories to a reasonable level, and then use the AdaBoost method [19] with the naive Bayes method as the base classifier to improve the prediction of the minority classes. The key to this method is to restore the new categories generated by clustering to the raw labels at each iteration, because we do not care whether the predictions of the newly generated categories are correct; what we really care about is whether the raw minority categories can be accurately identified.
In the following sections, we introduce three methods in Section 2: the naive Bayes method, the k′ k-means method, and the proposed AdaBoost method with a k′ k-means Bayes classifier. Then, in Section 3, we demonstrate the superiority of the proposed method through numerical experiments. Finally, the last section summarizes the conclusions.

2. The AdaBoost k′ k-Means Bayes Method

2.1. The Naive Bayes Classification

There is a training set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $N$ is the number of samples, $x_i \in \mathbb{R}^m$ is an $m$-dimensional feature vector, and $y_i \in \{0, 1, 2, \ldots, t\}$ is the class label. By the Bayes theorem, a test sample $x \in \mathbb{R}^m$ is predicted to belong to the class $y_j$ with the highest posterior probability conditional on $x$. The predicted class is denoted as follows,
$$\psi(x) = \arg\max_j P(y_j \mid x) = \arg\max_j \frac{P(y_j)\, P(x \mid y_j)}{P(x)},$$
where $P(y_j)$ is the prior probability of class $y_j$, $P(x \mid y_j)$ is the conditional probability of $x$ given $y_j$, and $P(x)$ is the prior probability of $x$, also called the normalization factor.
Making the naive assumption of class-conditional independence reduces the computation in evaluating $P(y_j) P(x \mid y_j)$. Mathematically, it means that
$$P(x \mid y_j) = \prod_{i=1}^{m} P(x_i \mid y_j),$$
where $x_i$ denotes the value of the $i$-th attribute of the test sample $x$ and $P(x_i \mid y_j)$ denotes the conditional probability of the $i$-th attribute given $y_j$.
To calculate each $P(x_i \mid y_j)$, a numerical variable is usually assumed to follow a normal distribution with the following probability density function:
$$P(x_i \mid y_j) = \frac{1}{\sqrt{2\pi}\,\sigma(i \mid y_j)} \exp\left\{-\frac{[x_i - \mu(i \mid y_j)]^2}{2\,\sigma(i \mid y_j)^2}\right\},$$
where $\mu(i \mid y_j)$ is the mean and $\sigma(i \mid y_j)^2$ is the variance of the $i$-th variable for class $y_j$.
Hence, the classification result of the naive Bayes method can be achieved as follows:
$$\psi(x) = \arg\max_j P(y_j) \prod_{i=1}^{m} P(x_i \mid y_j).$$
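To make the rule above concrete, the following is a minimal sketch of a Gaussian naive Bayes classifier; the class name `GaussianNaiveBayes`, its methods, and the use of NumPy are our own illustrative choices rather than the authors' implementation.

```python
import numpy as np

class GaussianNaiveBayes:
    """Minimal Gaussian naive Bayes following the equations above (illustrative sketch)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Prior P(y_j), per-class mean mu(i|y_j) and variance sigma^2(i|y_j) for each attribute
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.vars_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_])
        return self

    def predict(self, X):
        # log P(y_j) + sum_i log P(x_i | y_j), maximized over the classes j
        log_prior = np.log(self.priors_)                                   # shape (C,)
        log_like = -0.5 * (np.log(2 * np.pi * self.vars_[None, :, :])
                           + (X[:, None, :] - self.means_[None, :, :]) ** 2
                           / self.vars_[None, :, :]).sum(axis=2)           # shape (n, C)
        return self.classes_[np.argmax(log_prior[None, :] + log_like, axis=1)]
```

Calling `fit` estimates the priors, means, and variances, and `predict` returns the label maximizing the log-posterior, which is numerically safer than multiplying the raw probabilities.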

2.2. The k′ k-Means Bayes Method

In many classification problems, the number of normal samples is usually very large, and the largest category is called the majority; the number of abnormal samples that we really care about is very small, and the smallest category is called the minority. Let $N(y_j)$ represent the number of samples in the $j$-th class; then the size of the majority category is $\max\{N(y_j)\}$ and the size of the minority category is $\min\{N(y_j)\}$. The degree of imbalance $\zeta$ is defined as follows:
$$\zeta = \max_j\{N(y_j)\} \,/\, \min_j\{N(y_j)\}.$$
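As a small illustration, the imbalance degree can be computed directly from the label counts; the helper below is a sketch that assumes a NumPy array of labels.

```python
import numpy as np

def imbalance_degree(y):
    """zeta = max_j N(y_j) / min_j N(y_j), computed from the class counts in y."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.min()

# e.g., class sizes (25000, 4950, 50), as in Section 3.2, give zeta = 500
```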
Obviously, a large value of $\zeta$ means that the category sizes vary greatly. First, the majority class is split into $k$ subsets using the k-means method, and the imbalance degree $\zeta$ is recalculated. When $\zeta$ falls within a reasonable range, the k′ k-means method stops. Generally speaking, an imbalance degree greater than 99 is considered an extreme imbalance classification problem. If $\zeta > 99$, the threshold is set to 50; otherwise, it is set to $\zeta/2$. Note that $\max\{N(y_j)\}$ needs to be recalculated after each update, whereas $\min\{N(y_j)\}$ is fixed. Since the k′ k-means method eliminates the influence of the imbalanced data, the smaller categories do not have to be divided from their raw categories into even smaller subsets, which effectively prevents invalid loops in the program. Finally, after the k′ k-means method, the raw data set is updated into $T' = \{(x_1, y'_1), (x_2, y'_2), \ldots, (x_N, y'_N)\}$, where $y'_j \in \{1, 2, \ldots, k'k + t\}$, i.e., there are $k'k + t$ categories. A definite one-to-many mapping relationship $f$ between $y_j$ and $y'_j$ is denoted as $y_j = f(y'_j)$. After the relabeling, the Bayes classifier computes the probability that a test sample belongs to each clustered category, and the sample is predicted to belong to the category with the maximum probability. For a test sample $x_0$, the final predicted label $\hat{y}$ is obtained as
$$\hat{y}(x_0) = f(\psi(x_0)).$$
The procedure of the k′ k-means Bayes method is presented as follows.
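The algorithm listing itself is not reproduced here; the sketch below reconstructs the loop from the description above (repeatedly split the current majority class into k clusters until ζ falls below the threshold, while recording the mapping f back to the raw labels). All function and variable names, and the choice of scikit-learn's KMeans, are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kprime_kmeans_relabel(X, y, k=10, zeta_threshold=50, max_rounds=100):
    """Sketch of the k' k-means relabeling step: repeatedly split the current
    majority class into k clusters until the imbalance degree zeta <= threshold.
    Returns the new labels y_new and the mapping f from new labels to raw labels."""
    y_new = y.astype(object).copy()                  # new (clustered) labels
    f = {label: label for label in np.unique(y)}     # f: new label -> raw label
    for _ in range(max_rounds):
        labels, counts = np.unique(y_new, return_counts=True)
        zeta = counts.max() / counts.min()
        if zeta <= zeta_threshold:
            break
        majority = labels[np.argmax(counts)]
        idx = np.where(y_new == majority)[0]
        clusters = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
        for c in range(k):
            new_label = f"{majority}_{c}"            # subset c of the old majority
            y_new[idx[clusters == c]] = new_label
            f[new_label] = f[majority]               # subsets map back to the raw label
        del f[majority]
    return y_new, f
```

The base Bayes classifier is then trained on (X, y_new), and a prediction ψ(x₀) on the clustered labels is mapped back through ŷ(x₀) = f(ψ(x₀)).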
The proposed k′ k-means Bayes method keeps the complete information of the raw data: no samples are deleted or added. Applying the k-means method k′ times eliminates the impact of data imbalance on the model, so the classification boundary is learned in a relatively balanced sample environment. It thus deals with imbalanced data effectively from the data perspective.

2.3. The AdaBoost k′ k-Means Bayes Classification

AdaBoost (Adaptive Boosting) is an iterative algorithm. It assigns higher weights to hard-to-classify samples at each round and thereby obtains a series of base classifiers. Let $\psi(x)$ obtained by the k′ k-means Bayes method in Section 2.2 act as the base classifier in AdaBoost. First, initialize the weight distribution of the training data $T$ as follows:
$$D_1 = (\theta_{11}, \theta_{12}, \ldots, \theta_{1i}, \ldots, \theta_{1N}) = \left(\tfrac{1}{N}, \tfrac{1}{N}, \ldots, \tfrac{1}{N}\right).$$
For the $m$-th iteration, the weight distribution of the training data is denoted as $D_m$. The k′ k-means Bayes classifier is represented as $\psi_m(x)$, whose weight coefficient $\alpha_m$ and classification error $e_m$ on the training data are given as
$$\alpha_m = \frac{1}{2}\log\frac{1 - e_m}{e_m}, \qquad e_m = P\big(\psi_m(x_i) \neq y_i\big).$$
Then, the weight distribution of the data is updated as follows
$$D_{m+1} = (\theta_{(m+1),1}, \theta_{(m+1),2}, \ldots, \theta_{(m+1),i}, \ldots, \theta_{(m+1),N}), \qquad \theta_{(m+1),i} = \frac{\theta_{mi}}{Z_m}\exp\big(-\alpha_m y_i \psi_m(x_i)\big),$$
where $Z_m = \sum_{i=1}^{N} \theta_{mi} \exp\big(-\alpha_m y_i \psi_m(x_i)\big)$ is the normalization factor.
When the classification result is wrong, $y_i \psi_m(x_i) < 0$, which makes $\theta_{(m+1),i}$ larger. This matches our expectation of increasing the weights of the misclassified samples for the next classifier.
Finally, the AdaBoost classifier based on the k′ k-means Bayes classifier is represented as
$$H(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \alpha_m \psi_m(x)\right).$$
From the algorithmic perspective, the prediction results for difficult training samples are improved in this section by increasing the weights of misclassified data. With the naive Bayes method as the base classifier and the imbalance degree effectively reduced by the k′ k-means method, the combination with AdaBoost can significantly improve the classification of imbalanced data.
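For illustration, the following sketch shows only the weight-update mechanics of the equations above for binary labels in {−1, +1}; the paper's full method additionally maps the clustered labels back to the raw labels at every iteration. The names `adaboost_train` and `base_fit` are our assumptions.

```python
import numpy as np

def adaboost_train(X, y, base_fit, M=10):
    """Sketch of the AdaBoost loop in Section 2.3 for labels y in {-1, +1}.
    base_fit(X, y, w) must train a base classifier (here: the k' k-means Bayes
    classifier) on weighted data and return a function psi(X) -> {-1, +1}."""
    N = len(y)
    theta = np.full(N, 1.0 / N)                      # D_1: uniform initial weights
    alphas, classifiers = [], []
    for _ in range(M):
        psi_m = base_fit(X, y, theta)
        pred = psi_m(X)
        e_m = float(np.sum(theta * (pred != y)))     # weighted training error e_m
        e_m = min(max(e_m, 1e-10), 1 - 1e-10)        # avoid log(0)
        alpha_m = 0.5 * np.log((1 - e_m) / e_m)      # classifier weight alpha_m
        theta = theta * np.exp(-alpha_m * y * pred)  # up-weight misclassified samples
        theta = theta / theta.sum()                  # divide by normalization factor Z_m
        alphas.append(alpha_m)
        classifiers.append(psi_m)
    return lambda X_new: np.sign(sum(a * c(X_new) for a, c in zip(alphas, classifiers)))
```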

3. Numerical Experiments

In this section, the effectiveness and superiority of the proposed method are demonstrated by comparison with the traditional naive Bayes method, the under-sampling method, the over-sampling method, the k-means Bayes method [20], and the AdaBoost Bayes method on simulated data and real data, respectively. Prediction results depend largely on feature engineering; in this paper, however, we do not consider optimization through variable selection and instead focus on the effectiveness of the proposed method relative to the traditional methods. First, in Section 3.2, a multi-class imbalanced simulation data set is considered, and the prediction results of each method are compared. Then, in Section 3.3, the prediction results of the proposed method and the other traditional methods are compared and analyzed on the NSL-KDD network intrusion data.

3.1. Performance Metrics

In this section, we introduce some common metrics for evaluating the performance of imbalanced classification methods. A confusion matrix represents the relationship between predicted and true values, as shown in Table 1. The elements on the main diagonal are the numbers of correctly predicted samples, and the remaining elements correspond to cases where the predicted and true values do not match. From the elements of the matrix, several common evaluation metrics of imbalanced classification algorithms can be calculated.
Using class $c_0$ as an example, we can calculate the following metrics, where FPR refers to the false positive rate:
$$\mathrm{Recall} = \frac{a_{11}}{a_{11} + a_{12} + a_{13}}, \quad \mathrm{Precision} = \frac{a_{11}}{a_{11} + a_{21} + a_{31}}, \quad \mathrm{FPR} = \frac{a_{21} + a_{31}}{a_{11} + a_{21} + a_{31}}, \quad F\text{-}score = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
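As a sketch, these metrics can be computed from a confusion matrix whose rows are true classes and columns are predicted classes; note that FPR here follows the paper's definition above (the complement of precision), and the function name is ours.

```python
import numpy as np

def class_metrics(cm, j=0):
    """Recall, precision, FPR and F-score of class j from confusion matrix cm
    (rows = true classes, columns = predicted classes, as in Table 1)."""
    cm = np.asarray(cm, dtype=float)
    tp = cm[j, j]
    recall = tp / cm[j, :].sum()                   # a_11 / (a_11 + a_12 + a_13)
    precision = tp / cm[:, j].sum()                # a_11 / (a_11 + a_21 + a_31)
    fpr = (cm[:, j].sum() - tp) / cm[:, j].sum()   # (a_21 + a_31) / (a_11 + a_21 + a_31)
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, fpr, f_score
```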

3.2. On Simulation Study

Without loss of generality, a data set of 30,000 samples with 20 feature variables drawn from Gaussian distributions is constructed, containing three categories. The sample sizes of the three categories are $N(c_0) = 25{,}000$, $N(c_1) = 4950$, and $N(c_2) = 50$, respectively. This is obviously an imbalanced multi-class problem, with an imbalance degree as high as $\zeta = 500$. The samples are divided into a training set and a test set in the ratio 5:1, and the prediction results of the six methods are shown in Table 2.
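A possible way to generate such a data set is sketched below; the paper does not report the exact Gaussian means and variances, so the class-dependent location used here is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
sizes = {0: 25_000, 1: 4_950, 2: 50}     # N(c0), N(c1), N(c2); zeta = 25000 / 50 = 500
X_parts, y_parts = [], []
for label, n in sizes.items():
    # 20 Gaussian features per class; the class-specific mean is a placeholder value
    X_parts.append(rng.normal(loc=float(label), scale=1.0, size=(n, 20)))
    y_parts.append(np.full(n, label))
X, y = np.vstack(X_parts), np.concatenate(y_parts)
# split into training and test sets in the ratio 5:1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/6, random_state=0)
```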
From Table 2, it can be seen that when faced with imbalanced data, the naive Bayes method predicts all samples to be the majority class. Since the minority classes account for only a small proportion, the method can still achieve seemingly high overall metrics; however, this is not what we want. Both the k-means Bayes method and the AdaBoost Bayes method fail to detect the $c_2$ class. As can be seen from the confusion matrices in Figure 1, the under-sampling Bayes method has a high recall rate but also a very high false positive rate. The over-sampling Bayes method has a stable false positive rate and detects a moderate number of minority samples. The proposed AdaBoost k′ k-means Bayes method has the largest recall and the lowest false positive rate. Among the six methods, the AdaBoost k′ k-means Bayes method performs best.

3.3. On Real Data Study

In this subsection, real imbalanced multi-class data are used to demonstrate the effectiveness of the proposed method. The data set is the well-known NSL-KDD data set from the Canadian Institute for Cybersecurity (https://www.unb.ca/cic/datasets/nsl.html, accessed on 27 March 2023).
We determine whether a piece of network data is normal behavior or a cyber-attack and predict what type of attack it is. There are 42 variables in the data set, where the target variable, attack type, takes the 5 values "normal", "DOS" (Denial of Service), "U2R" (User-to-Root), "R2L" (Remote-to-Local), and "probe", whose distribution is shown in Table 3. The total sample size of the training data is 125,973, of which the number of normal samples is 67,343, accounting for 53.46%, while the number of "U2R" samples is only 52, accounting for 0.041%. The imbalance degree is $\zeta = 1295$, so the data can be treated as a typical imbalanced multi-class problem. We then compare the proposed method with the traditional naive Bayes, under-sampling, over-sampling, k-means Bayes, and AdaBoost Bayes methods on the NSL-KDD data to examine which one is superior. The results are provided in Table 4.
For this data case, we are more concerned with how well the attack types are detected than the normal type. As shown in Table 4, the type "normal" always performs well, and the type "DOS", which has a large amount of data, performs well under all six methods; however, what we are more concerned about is the performance on the other three types with smaller amounts of data. The AdaBoost Bayes method has the highest recall on the "U2R" type, but its precision is very low. The method with the best performance on "probe" is the under-sampling method; unfortunately, it performs very poorly on the other types and loses a lot of raw data information in order to achieve balance with "probe". The method proposed in this paper is slightly worse than the AdaBoost Bayes method in the recall of "U2R" but has the best performance among the six methods for the other categories. In terms of the F-score for each category, the proposed method is the best.
Next, we analyze the results of each method using the confusion matrices presented in Figure 2. Larger values on the diagonal indicate better prediction; it is obvious that (a), (b), and (c) do not satisfy this rule. The larger the values off the diagonal, the higher the false positive rate. The results show that the proposed method achieves the highest precision and the best recall across the classes.
There are four parameters that need to be determined, in two steps, for the proposed method. First, the base classifier is determined by choosing appropriate values of $k$ and $k'$. It is well known that one of the difficulties of the k-means clustering method is determining the value of $k$. However, in this paper we only use the k-means method to divide the majority class into smaller subsets in order to reduce the imbalance degree, and we do not care about the rationality or accuracy of the clustering result, which does not affect the prediction results. Therefore, the determination is not as difficult as in ordinary k-means clustering. The values of $k$ and $\zeta$ are interdependent: we cannot determine the value of $k$ subjectively, but we can obtain a reasonable $k$ by adjusting the imbalance threshold of the data. Hence, we can empirically set a reasonable $\zeta$ threshold of 50, take an initial $k$ of $\zeta/50$, and then adjust it.
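A minimal illustration of that initialization is given below; whether the ratio is rounded up or down before adjustment is not stated in the paper, so the ceiling is an assumption.

```python
import math

zeta = 1295             # imbalance degree of the NSL-KDD training data
zeta_threshold = 50     # empirically chosen reasonable imbalance level
k_init = math.ceil(zeta / zeta_threshold)   # initial k (about 26), adjusted afterwards
```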
Second, the two parameters n-estimators and learning-rate need to be determined in the AdaBoost method. The parameter n-estimators is the maximum number of iterations of the base classifier; the larger its value, the more likely the model is to overfit. The parameter learning-rate is the weight shrinkage coefficient of the base classifier, whose meaning is equivalent to that of a regularization term, and its value lies in [0, 1]. These two parameters need to be tuned jointly. For the NSL-KDD data, the values of the four parameters are $k$ = 18, $k'$ = 50, n-estimators = 10, and learning-rate = 0.05.
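For orientation only, the snippet below shows how these two hyperparameters would be wired into scikit-learn's AdaBoostClassifier, with a plain GaussianNB standing in for the paper's k′ k-means Bayes base classifier; this is not the authors' implementation.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

# GaussianNB is only a stand-in for the k' k-means Bayes base classifier
clf = AdaBoostClassifier(
    estimator=GaussianNB(),   # named base_estimator in older scikit-learn releases
    n_estimators=10,          # maximum number of base-classifier iterations
    learning_rate=0.05,       # weight shrinkage applied to each base classifier
)
# clf.fit(X_train, y_train); y_pred = clf.predict(X_test)
```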
Then, after combining the test data and training data of NSL-KDD into a new data set, the generalization ability of the proposed method is examined through multiple cross-validation experiments. The data sets for cross-validation are obtained by changing the percentage of test samples and the random seed. Table 5 shows the results of 6 cross-validation experiments, whose test-data percentages and random seeds are (30%, 100), (35%, 180), (20%, 260), (40%, 40), (20%, 47), and (30%, 106). The fluctuation across experiments is very small and the prediction ability is stable, which shows that the proposed method generalizes well.
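The splitting scheme of these experiments could look like the sketch below, where X_all and y_all denote the combined NSL-KDD training and test data; whether stratified sampling was used is not stated, so none is assumed here.

```python
from sklearn.model_selection import train_test_split

# (test-set percentage, random seed) of the six cross-validation experiments
settings = [(0.30, 100), (0.35, 180), (0.20, 260), (0.40, 40), (0.20, 47), (0.30, 106)]
for test_size, seed in settings:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_all, y_all, test_size=test_size, random_state=seed)
    # fit the AdaBoost k' k-means Bayes model on (X_tr, y_tr), evaluate on (X_te, y_te)
```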

3.4. Experiment Results

In the simulation experiment, the imbalance degree of the generated data is 500. Three of the methods cannot recognize the $c_2$ class at all, while the AdaBoost k′ k-means Bayes method has the largest F-score for each class among the remaining three methods.
In the real data experiment, the imbalance degree of the NSL-KDD data is 1295. Three methods cannot identify the "probe" class at all; the AdaBoost k′ k-means Bayes method has the largest F-score for each class among the six methods.
Based on the classification results of the different methods and the cross-validation experiments, it can be concluded that the proposed method is more effective than the other methods and generalizes well.

4. Conclusions

For the imbalanced multi-class problem, a new AdaBoost method with a k′ k-means Bayes classifier is proposed, which improves both the data and the algorithm to achieve better results than the traditional classification methods. It reduces the imbalance degree of the training data using the k′ k-means Bayes method, which avoids the change of raw information caused by adding or removing samples in resampling methods. It increases the weights of samples that are difficult to predict through multiple iterations, thereby improving their prediction accuracy. It is worth mentioning that in each iteration, the new categories generated by the k′ k-means method need to be mapped back to the raw categories to calculate the prediction results. The performance of the proposed method is shown to be the best through detailed comparisons with other traditional methods on simulation data and the NSL-KDD data.
We have proposed a more effective solution for multi-class imbalanced classification with both continuous and discrete variables, which can be used for both binary and multi-class classification. In the future, it may be possible to generalize this idea and combine it with other classical algorithms to solve more complex problems.

Author Contributions

Methodology, Y.Z. and L.W.; software, Y.Z. and L.W.; writing—original draft preparation, Y.Z. and L.W.; writing—review and editing, Y.Z. and L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 11371051).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Huang, J.; Chai, J.; Cho, S. Deep learning in finance and banking: A literature review and classification. Front. Bus. Res. China 2020, 14, 1–24. [Google Scholar] [CrossRef]
  2. Singh, S.; Jangir, S.; Kumar, M.; Verma, M.; Kumar, S.; Walia, T.; Kamal, S. Feature Importance Score-Based Functional Link Artificial Neural Networks for Breast Cancer Classification. BioMed Res. Int. 2022, 2022, 1–24. [Google Scholar] [CrossRef]
  3. Kumar, S.; Sastry, H.; Marriboyina, V. Information extraction from the agricultural and weather domains using deep learning approaches. Int. J. Softw. Innov. 2022, 10, 1–112. [Google Scholar] [CrossRef]
  4. Lombacher, J.; Hahn, M.; Dickmann, J.; Wöhler, C. Object classification in radar using ensemble methods. Int. J. Softw. Innov. 2017, 87–90. [Google Scholar] [CrossRef]
  5. Rella, R.; Mauriello, F.; Sarkar, S.; Galante, F.; Scarano, A.; Montella, A. Parametric and non-parametric analyses for pedestrian crash severity prediction in Great Britain. Sustainability 2022, 14, 3188. [Google Scholar] [CrossRef]
  6. Gao, L.; Lu, P.; Ren, Y. A deep learning approach for imbalanced crash data in predicting highway-rail grade crossings accidents. Reliab. Eng. Syst. Saf. 2021, 216, 108019. [Google Scholar] [CrossRef]
  7. Yahaya, M.; Jiang, X.; Fu, C.; Bashir, K.; Fan, W. Enhancing crash injury severity prediction on imbalanced crash data by sampling technique with variable selection. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference, Auckland, New Zealand, 27–30 October 2019; pp. 363–368. [Google Scholar]
  8. Junsomboon, N.; Phienthrakul, T. Combining over-sampling and under-sampling techniques for imbalance dataset. In Proceedings of the 9th International Conference on Machine Learning and Computing, Singapore, 24–26 February 2017; pp. 243–247. [Google Scholar]
  9. Tsai, C.; Lin, W.; Hu, Y.; Yao, G. Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Inf. Sci. 2019, 477, 47–54. [Google Scholar] [CrossRef]
  10. Rees, E.; Nightingale, E.; Jafari, Y. COVID-19 length of hospital stay: A systematic review and data synthesis. BMC Med. 2020, 18, 270. [Google Scholar] [CrossRef] [PubMed]
  11. Dablain, D.; Krawczyk, B.; Chawla, N. DeepSMOTE: Fusing deep learning and SMOTE for imbalanced data. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1–15. [Google Scholar] [CrossRef] [PubMed]
  12. Lu, C.; Lin, S.; Liu, X.; Shi, H. Telecom fraud identification based on ADASYN and random forest. In Proceedings of the International Conference on Computer and Communication Systems, Shanghai, China, 15–18 May 2020; pp. 447–452. [Google Scholar]
  13. Mienye, I.; Sun, Y. Performance analysis of cost-sensitive learning methods with application to imbalanced medical data. Inform. Med. Unlocked 2021, 25, 100690. [Google Scholar] [CrossRef]
  14. Tyralis, H.; Papacharalampous, G. Boosting algorithms in energy research: A systematic review. Neural Comput. Appl. 2021, 33, 14101–14117. [Google Scholar] [CrossRef]
  15. Andiojaya, A.; Demirhan, H. A bagging algorithm for the imputation of missing values in time series. Expert Syst. Appl. 2019, 129, 10–26. [Google Scholar] [CrossRef]
  16. Tyralis, H.; Papacharalampous, G.; Langousis, A. A brief review of random forests for water scientists and practitioners and their recent history in water resources. Water 2019, 11, 910. [Google Scholar] [CrossRef] [Green Version]
  17. Salmi, N.; Rustam, Z. Naive Bayes classifier models for predicting the colon cancer. Mater. Sci. Eng. 2019, 546, 052068. [Google Scholar] [CrossRef]
  18. Zhang, Y.; Wang, L. K′ times k-means logistic regression algorithm for imbalanced classification. Commun. Stat. Simul. Comput. 2021, 1–8. [Google Scholar] [CrossRef]
  19. Wang, W.; Sun, D. The improved AdaBoost algorithms for imbalanced data classification. Inf. Sci. 2021, 563, 358–374. [Google Scholar] [CrossRef]
  20. Chen, G.; Liu, Y.; Ge, Z. K-means Bayes algorithm for imbalanced fault classification and big data application. J. Process Control 2019, 81, 54–64. [Google Scholar] [CrossRef]
Figure 1. The confusion matrices of the six methods.
Figure 2. The confusion matrices of the six methods.
Table 1. The confusion matrix of a three-class classification problem.

True Value | Predict c0 | Predict c1 | Predict c2
c0         | a11        | a12        | a13
c1         | a21        | a22        | a23
c2         | a31        | a32        | a33
Table 2. The results of different methods.

Method                     | Classification | FPR   | Recall | Precision | F-Score
Naive Bayes                | c0             | 0.169 | 1      | 0.831     | 0.908
                           | c1             | -     | 0      | 0         | 0
                           | c2             | -     | 0      | 0         | 0
Under-sampling Bayes       | c0             | 0.166 | 0.371  | 0.834     | 0.513
                           | c1             | 0.839 | 0.276  | 0.161     | 0.203
                           | c2             | 0.998 | 0.5    | 0.002     | 0.005
Over-sampling Bayes        | c0             | 0.169 | 0.544  | 0.831     | 0.657
                           | c1             | 0.835 | 0.261  | 0.165     | 0.202
                           | c2             | 0.998 | 0.286  | 0.002     | 0.005
K-means Bayes              | c0             | 0.139 | 0.925  | 0.861     | 0.891
                           | c1             | 0.584 | 0.267  | 0.416     | 0.325
                           | c2             | -     | 0      | 0         | 0
AdaBoost Bayes             | c0             | 0.169 | 0.564  | 0.831     | 0.672
                           | c1             | 0.841 | 0.432  | 0.167     | 0.24
                           | c2             | 1.000 | 0      | 0         | 0
AdaBoost K′K-means Bayes   | c0             | 0.168 | 0.713  | 0.832     | 0.768
                           | c1             | 0.829 | 0.274  | 0.171     | 0.211
                           | c2             | 0.994 | 0.071  | 0.006     | 0.011
Table 3. The distribution of attack type.

Data Set | Classification | Training Data | Percentage | Testing Data | Percentage
NSL-KDD  | normal         | 67,343        | 53.46%     | 9711         | 43.08%
         | DOS            | 45,926        | 36.46%     | 7635         | 33.87%
         | U2R            | 52            | 0.04%      | 200          | 0.89%
         | R2L            | 995           | 0.79%      | 2574         | 11.42%
         | probe          | 11,656        | 9.25%      | 2423         | 10.75%
Table 4. The results of different methods.

Method                     | Classification | FPR   | Recall | Precision | F-Score
Naive Bayes                | normal         | 0.780 | 0.032  | 0.22      | 0.056
                           | DOS            | 0.670 | 0.828  | 0.33      | 0.472
                           | U2R            | 0.992 | 0.06   | 0.008     | 0.014
                           | R2L            | 0.997 | 0      | 0.003     | 0.001
                           | probe          | 1.000 | 0      | 0         | 0
Under-sampling Bayes       | normal         | 0.116 | 0.498  | 0.814     | 0.618
                           | DOS            | 0.780 | 0.163  | 0.22      | 0.187
                           | U2R            | 0.963 | 0.3    | 0.037     | 0.067
                           | R2L            | 0.941 | 0.05   | 0.059     | 0.009
                           | probe          | 0.746 | 0.96   | 0.254     | 0.402
Over-sampling Bayes        | normal         | 0.861 | 0.017  | 0.139     | 0.031
                           | DOS            | 0.510 | 0.816  | 0.49      | 0.612
                           | U2R            | 0.990 | 0.41   | 0.01      | 0.02
                           | R2L            | 0.996 | 0.001  | 0.004     | 0.001
                           | probe          | 1.000 | 0      | 0         | 0
K-means Bayes              | normal         | 0.373 | 0.811  | 0.647     | 0.72
                           | DOS            | 0.348 | 0.009  | 0.184     | 0.016
                           | U2R            | 0.797 | 0.545  | 0.014     | 0.028
                           | R2L            | 0.989 | 0.029  | 0.395     | 0.054
                           | probe          | 1.000 | 0.136  | 0.155     | 0.145
AdaBoost Bayes             | normal         | 0.353 | 0.874  | 0.63      | 0.732
                           | DOS            | 0.816 | 0.728  | 0.652     | 0.688
                           | U2R            | 0.986 | 0.175  | 0.203     | 0.188
                           | R2L            | 0.605 | 0.002  | 0.011     | 0.003
                           | probe          | 0.845 | 0      | 0         | 0
AdaBoost K′K-means Bayes   | normal         | 0.188 | 0.919  | 0.817     | 0.865
                           | DOS            | 0.093 | 0.708  | 0.84      | 0.768
                           | U2R            | 0.273 | 0.055  | 0.367     | 0.096
                           | R2L            | 0.533 | 0.379  | 0.463     | 0.417
                           | probe          | 0.466 | 0.642  | 0.512     | 0.57
Table 5. The results of 6 cross-validation experiments.

Classification | Cross-Validation Number | Recall | Precision | F-Score | Testing Data
normal         | 1 | 0.734 | 0.945 | 0.826 | 23,134
               | 2 | 0.769 | 0.936 | 0.844 | 26,939
               | 3 | 0.777 | 0.948 | 0.854 | 15,488
               | 4 | 0.776 | 0.938 | 0.85  | 30,607
               | 5 | 0.782 | 0.942 | 0.855 | 15,555
               | 6 | 0.765 | 0.936 | 0.842 | 23,221
DOS            | 1 | 0.887 | 0.762 | 0.82  | 16,022
               | 2 | 0.911 | 0.748 | 0.821 | 18,749
               | 3 | 0.873 | 0.774 | 0.82  | 10,692
               | 4 | 0.843 | 0.742 | 0.789 | 21,604
               | 5 | 0.874 | 0.726 | 0.793 | 10,595
               | 6 | 0.889 | 0.755 | 0.816 | 16,017
U2R            | 1 | 0.024 | 0.028 | 0.026 | 82
               | 2 | 0.046 | 0.062 | 0.053 | 87
               | 3 | 0.073 | 0.1   | 0.084 | 55
               | 4 | 0.031 | 0.055 | 0.039 | 98
               | 5 | 0.043 | 0.045 | 0.044 | 46
               | 6 | 0.041 | 0.073 | 0.052 | 74
R2L            | 1 | 0.68  | 0.258 | 0.374 | 1128
               | 2 | 0.656 | 0.29  | 0.402 | 1225
               | 3 | 0.66  | 0.313 | 0.424 | 688
               | 4 | 0.667 | 0.314 | 0.427 | 1418
               | 5 | 0.646 | 0.318 | 0.426 | 673
               | 6 | 0.62  | 0.285 | 0.39  | 1067
probe          | 1 | 0.491 | 0.419 | 0.452 | 4189
               | 2 | 0.396 | 0.471 | 0.43  | 4981
               | 3 | 0.538 | 0.433 | 0.48  | 2780
               | 4 | 0.392 | 0.344 | 0.367 | 5679
               | 5 | 0.339 | 0.366 | 0.352 | 2834
               | 6 | 0.411 | 0.395 | 0.403 | 4176
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
