Article

LIMCR: Less-Informative Majorities Cleaning Rule Based on Naïve Bayes for Imbalance Learning in Software Defect Prediction

1 School of Reliability and Systems Engineering, Beihang University, Beijing 100191, China
2 Beijing Institute of Remote Sensing Equipment, Beijing 100039, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2020, 10(23), 8324; https://doi.org/10.3390/app10238324
Submission received: 16 October 2020 / Revised: 10 November 2020 / Accepted: 19 November 2020 / Published: 24 November 2020
(This article belongs to the Special Issue Artificial Intelligence and Machine Learning in Software Engineering)

Abstract:
Software defect prediction (SDP) is an effective technique to lower software module testing costs. However, imbalanced class distributions exist in almost all SDP datasets and restrict the accuracy of defect prediction. In order to rebalance the data distribution reasonably, we propose a novel resampling method, LIMCR, on the basis of Naïve Bayes to optimize and improve SDP performance. The main idea of LIMCR is to remove less-informative majorities to rebalance the data distribution after evaluating the degree of informativeness of every sample from the majority class. We employ 29 SDP datasets from the PROMISE and NASA repositories and divide them into two parts, small datasets (fewer than 1100 samples) and large datasets (more than 1100 samples). We then conduct experiments comparing the matching of classifiers and imbalance learning methods on small datasets and large datasets, respectively. The results show the effectiveness of LIMCR: LIMCR+GNB performs better than other methods on small datasets, while it is not outstanding on large datasets.

1. Introduction

Software defect prediction (SDP) is an effective technique to lower software module testing costs. It can efficiently identify defect-prone software modules by learning from the defect datasets of previous releases. Existing SDP studies can be divided into four categories: (1) classification, (2) regression, (3) mining association rules, and (4) ranking [1]. Studies in the first category use classification algorithms (also called classifiers) as the prediction algorithms to classify software modules into defect-prone classes (positive or minority class) and non-defective classes (negative or majority class), or into various levels of defect severity. In this paper, the imbalance learning we focus on is based on binary classification.
Commonly in software defect datasets, the number of samples (samples usually refer to software modules in SDP) in the defect-prone class is naturally smaller than the number of samples in the non-defective class. However, most prediction algorithms assume that the number of samples is roughly equal across classes. This contradiction makes prediction algorithms trained on imbalanced software defect datasets generally biased towards the samples in the non-defect-prone class and prone to ignoring the samples in the defect-prone class, i.e., many defect-prone samples might be classified into the non-defect-prone class by prediction algorithms trained on imbalanced datasets. This problem widely occurs in SDP, and it has been shown that reducing the influence of the imbalance problem can improve prediction performance efficiently.
Numerous methods [2,3] have been proposed for tackling imbalance problems in SDP. In imbalance learning research [4,5], methods are divided into two categories: the data level and the algorithm level. Methods at the data level mainly consist of various data resampling techniques. A resampling technique rebalances a dataset by adding minorities (over-sampling methods) or removing majorities (under-sampling methods). For instance, SMOTE [6] is an over-sampling method that generates synthetic samples in the minority class, and NCL [7] is an under-sampling method that removes samples from the majority class. Methods at the algorithm level modify existing classification algorithms so that they are no longer biased towards the samples in the majority class at the expense of the samples in the minority class. Cost-sensitive methods combine both algorithm-level and data-level methods; they consider the different misclassification costs of samples in different classes. For instance, RAMOBoost [8] is an improvement of the Boosting algorithm [9], NBBag [5] is an improved algorithm based on the Bagging algorithm [10], and AdaCost [11] modifies the weight update by adding a cost adjustment function to the AdaBoost algorithm [12]. Boosting, bagging and other ensemble classifiers are frequently selected as the basic classification algorithms to improve because of their high classification performance [4,5]. A proper basic prediction algorithm can perform better on an imbalanced dataset after being improved by imbalance learning. Obviously, the selection of the basic classification algorithm is one of the most important steps for imbalance learning methods at the algorithm level.
Different from algorithm-level methods, methods at the data level can choose and change classifiers flexibly. With the increasing number of imbalance learning studies, researchers have noticed the influence of classifier selection. Numerous empirical studies compare the performance of different techniques to find rules for classifier selection, and the influence factors under consideration include the researcher group [13], the level of class imbalance [14], diversity [15,16], and others [17,18].
Most empirical studies focus on the comparison between resampling methods and the influence factors, while paying less attention to the applicability of resampling methods and their connection with classifiers. In addition, there are almost no resampling methods that quantify the sample information. Motivated by this, we aim to investigate how resampling methods work on datasets with different sample sizes and how they cooperate with various classifiers. Moreover, we aim to propose a novel and effective resampling method that removes less-informative samples of the majority class to rebalance the data distribution. The main contributions of this paper are divided into the following three aspects:
  • We perform an empirical study to investigate the influence of dataset sample size on popular, commonly used classifiers.
  • We present a novel resampling method LIMCR based on Naïve Bayes to solve the class imbalance problem in SDP datasets. The new method outperforms other resampling methods on datasets with small sample size.
  • We evaluate and compare the proposed method with existing well-performing imbalance learning methods, including methods at both the data level and the algorithm level. The experiments show which methods are effective on which kinds of datasets.
The remainder of this paper is organized as follows. Section 2 summarizes related work in the area of imbalance learning. In Section 3, we describe the methodology and procedure of our LIMCR. In Section 4, the experimental setup and results are explained. Finally, the discussion and conclusion are presented in Section 5 and Section 6.

2. Related Works

2.1. Imbalanced Learning Methods

A large number of methods have been developed to address imbalance problems, and all of these methods fall into two basic categories: the data level and the algorithm level. Methods at the data level mainly study the effect of changing the class distribution to deal with imbalanced datasets. It has been empirically proved that applying a preprocessing step to rebalance the class distribution is usually a positive solution [19]. The main advantage of data-level methods is that they are independent of the classifier [4]. Moreover, data-level methods can easily be embedded in ensemble learning algorithms, like algorithm-level methods. Hereafter, representative imbalanced learning methods will be introduced in this section.
Among all the data resampling methods, random over-sampling and random under-sampling are the simplest methods for rebalancing datasets [20]. Although random sampling methods have some defects, they do improve the performance of classifiers. To avoid the drawbacks caused by random methods, researchers have attempted to generate new synthetic samples based on the original dataset and have achieved great success. SMOTE [6] is one of the most classical synthetic over-sampling methods. Numerous methods have been proposed based on SMOTE, such as ADASYN [21], Borderline-SMOTE [22], MWMOTE [23], and Safe-level-SMOTE [24]. The generated samples add essential information to the original dataset, so that the additional bias imposed on the classifier can be alleviated and the overfitting problem to which random over-sampling might lead can be avoided [25].
On the other hand, the resampling methods introduced above have been shown to achieve remarkable performance when embedded in an ensemble algorithm [26,27]. Researchers therefore integrate an oversampling method with an appropriate ensemble method to obtain a stronger approach for solving class imbalance problems. The most widely used ensemble learning algorithms are AdaBoost and Bagging, which are usually combined with resampling methods to form new typical algorithms, such as SMOTEBoost [28], SMOTEBagging [15] and RAMOBoost [8], which perform well on imbalanced datasets.
Undersampling methods are also widely used in imbalance learning, especially in SDP; research [29] has shown that static code features have limited information content and that undersampling performs better than other methods. In earlier studies, researchers preferred identifying redundant samples with clustering or K-nearest neighbor algorithms, for instance, the Condensed Nearest Neighbor Rule (CNN) [30], Tomek links [31], the Edited Nearest Neighbor Rule (ENN) [32], One-Sided Selection (OSS) [33], and the Neighborhood Cleaning Rule (NCL) [7]. As more distribution problems are found in datasets, new and stronger undersampling methods continue to be proposed. One study [34] proposes a set of sample hardness measurements to understand why some samples are harder to classify correctly and removes samples that are suspected of being hard to learn; a similar study [35] has also proved effective for imbalance learning. Undersampling can also be embedded in ensemble algorithms: two algorithms that embed undersampling methods, EasyEnsemble and BalanceCascade [36], are proposed to preserve information to the maximum degree and to reduce the data complexity for efficient computation.

2.2. Software Defect Prediction

The classification problem in SDP is a typical learning problem. Boehm and Basili pointed out that in most cases, 20% of the modules can result in 80% of the software defects [37], which means that software defect data has a naturally imbalanced distribution.
SDP research starts with the selection of software defect metrics. The original defect data are obtained by using specified static software metrics [38]. For instance, the McCabe [38] and Halstead [39] metrics are widely used, and Chidamber and Kemerer's (CK) metrics were proposed to fit the demands of object-oriented (OO) software. Many empirical studies have been conducted on the imbalance problem in SDP. A comprehensive experiment studying the effect of imbalance learning in SDP emphasizes the importance of method selection [40]. The results of the study [41] advocate resampling methods for effective imbalance learning. Meanwhile, many new imbalance learning methods have been proposed for SDP. L. Chen et al. [2] consider the class imbalance problem together with class overlap and integrate neighbor cleaning learning (NCL) and ensemble random under-sampling (ERUS) as a novel approach for SDP. H. N. Tong et al. [1] propose a novel ensemble learning approach for the imbalance and overfitting problems, combine it with a deep learning algorithm, and solve the imbalance problem and high dimensionality simultaneously. S. Kim et al. [42] propose an approach to detect and eliminate noise in defect data. N. Limsettho et al. [3] propose a novel approach named Class Distribution Estimation with Synthetic Minority Oversampling Technique (CDE-SMOTE) to modify the distribution of the training data towards a balanced distribution.

2.3. Classification Algorithms for Class Imbalance

Classification is a form of data analysis that can be used to build a model that minimizes the number of classification errors on a training dataset [43]. Some classifiers are commonly used because of their outstanding performance, e.g., Naïve Bayes [44], the multilayer perceptron [45], K-nearest neighbors [46], logistic regression [47], decision trees [48], support vector machines [49], and backpropagation neural networks [50]. However, it has been confirmed that ensemble algorithms built from a few weak classifiers outperform a single common classifier [4,16] when the training dataset has a class imbalance problem. Random forest is a frequently used ensemble method in machine learning, which ensembles a number of decision trees for classification; however, it is still negatively influenced by an imbalanced class distribution [51]. Facing the imbalance problem, F. Herrera et al. [52] evaluate the performance of diverse approaches for imbalanced classification and use the MapReduce framework to solve the imbalance problem in big data. J. Xiao et al. [51] propose a dynamic classifier ensemble method for imbalanced data (DCEID). This method combines ensemble learning with cost-sensitive learning, which improves classification accuracy effectively.
All of these methods have been proved to improve classifier performance efficiently, but the sample size of SDP defect data and its relationship with classifiers remain unexplored. Moreover, the cooperation between resampling methods and classifiers has received little attention. Therefore, in this paper, we first empirically study the influence of sample size on classifiers and resampling methods; then, we investigate the cooperation between resampling methods and classifiers. Finally, based on the results of the empirical study, we propose a novel resampling method for imbalanced learning in software defect prediction, which can improve prediction results on SDP datasets.

3. Research Methodology

3.1. Overall Structure

In order to solve the class imbalance problem rationally and effectively, we choose to remove less-informative samples of the majority class, instead of deleting randomly, to rebalance the data distribution. Furthermore, we define the informative degree of a given sample by measuring the difference between the conditional probabilities of its feature values under the defective and non-defective classes, which is the main idea of LIMCR. The proposed LIMCR involves three key phases. In the first phase, LIMCR defines the rule for calculating sample information on one feature based on Naïve Bayes. In the second phase, LIMCR aggregates the sample information variable over the features and proposes a new variable describing the sample informative degree. In the third phase, LIMCR analyzes the relationship between this variable and the sample distribution and proposes the definition of less informative majorities. The structure of the proposed method LIMCR is shown in Figure 1.

3.2. Assumptions of the Proposed Method

In order to make the calculation of LIMCR more efficient and applicable to more datasets, the method we propose is based on the following assumptions:
  • All features are independent for the given class label;
  • All features are continuous variables and the likelihood of the features is assumed to be Gaussian;
  • There is only one majority class in datasets.

3.3. Variable of Sample Information for One Feature

A sample $E$ is represented by a set of feature values $X = (x_1, x_2, \ldots, x_m)$ and a class label $Y$, where the value of $Y$ can only be 1 or 0. According to Bayes' theorem, the posterior probability of $Y$ can be calculated as
$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$$
Then the posterior probability of a sample $E_i$ with $m$ features being of class $y$ can be calculated as
$$p(Y = y \mid X_i) = \frac{p(X_i = (X_{i1} = x_{i1}, X_{i2} = x_{i2}, \ldots, X_{im} = x_{im}) \mid Y = y)\, p(Y = y)}{p(X_i = (X_{i1} = x_{i1}, X_{i2} = x_{i2}, \ldots, X_{im} = x_{im}))}$$
Because of the assumption that all features are independent given the class label, the conditional probability $p(X_i \mid Y = y)$ can be calculated as
$$p(X_i = (X_{i1} = x_{i1}, X_{i2} = x_{i2}, \ldots, X_{im} = x_{im}) \mid Y = y) = \prod_{j=1}^{m} p(x_{ij} \mid y)$$
The Naïve Bayes classifier is expressed as
$$f_b(X_i) = \frac{p(Y=1)}{p(Y=0)} \prod_{j=1}^{m} \frac{p(x_{ij} \mid Y=1)}{p(x_{ij} \mid Y=0)} = IR \cdot \prod_{j=1}^{m} \frac{p(x_{ij} \mid Y=1)}{p(x_{ij} \mid Y=0)}$$
where the product runs over the $m$ features and $IR$ here denotes the ratio of the class priors $p(Y=1)/p(Y=0)$. The class label $Y$ of a sample with feature vector $X_i$ is then predicted according to $f_b(X_i)$:
$$Y = \begin{cases} 1, & f_b(X_i) \geq 1 \\ 0, & f_b(X_i) < 1 \end{cases}$$
For a single feature, the bigger the gap between $p(x_{ij} \mid Y=1)$ and $p(x_{ij} \mid Y=0)$, the further the ratio $p(x_{ij} \mid Y=1)/p(x_{ij} \mid Y=0)$ deviates from 1, and the more easily the sample $X_i$ can be correctly classified by the Naïve Bayes classifier. Correspondingly, the more easily a sample is misclassified, the more informative it is. Generally, the conditional probabilities $p(x_{ij} \mid Y=0)$ and $p(x_{ij} \mid Y=1)$ are calculated from the samples in the dataset, and the likelihood of the features is assumed to be Gaussian. When there is only one feature, the conditional probabilities are calculated as
$$p(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$
If a sample with one feature can be confidently classified into its corresponding class (for instance $y = 0$), the conditional probability $p(x_i \mid y = 0)$ should be close to 1 and $p(x_i \mid y = 1)$ should be close to 0, so the difference between them should be close to 1. Samples of this kind cannot provide much effective information for the classifier; they mainly make the sample variance larger.
Each plot in Figure 2 presents two curves, which represent the probability density functions of the two classes in one dimension. It is known that samples in the overlapping area are hard to classify correctly and may disturb model training. The distribution of the original dataset in one dimension resembles the curves in Figure 2a: the distribution of the majorities is dispersive and creates a large overlapping area with the minorities. After removing less informative samples from the majorities, the variance of the majorities becomes smaller and the overlapping area shrinks accordingly, as in Figure 2b. The two figures illustrate that an increase in the sample variance of one class enlarges the overlapping area and makes the learning phase harder. Considering that the number of majority samples in datasets with imbalanced distributions is larger than the number of minority samples, we define an informative variable D for evaluating how informative a majority sample is on one feature.
Definition 1.
Informative variable D. In imbalanced datasets, the variable for evaluating the information of a majority sample on one feature is defined as the difference between its conditional probabilities, $D_{ik} = p(x_{ik} \mid y_i = 0) - p(x_{ik} \mid y_i = 1)$, where $i$ denotes the $i$th sample and $k$ denotes the $k$th feature.
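For concreteness, the following is a minimal Python sketch of Definition 1 under the assumptions of Section 3.2 (Gaussian likelihoods, majority class labelled 0, minority class labelled 1); the function name and the small variance offset are illustrative and not part of the original method description.

```python
import numpy as np
from scipy.stats import norm

def informative_variable_D(X, y):
    """Sketch: D[i, k] = p(x_ik | y=0) - p(x_ik | y=1) for every majority sample i
    and every feature k, with per-class Gaussian likelihoods estimated from the data."""
    X_maj, X_min = X[y == 0], X[y == 1]
    # Per-feature Gaussian parameters of each class, as in Gaussian Naive Bayes.
    mu0, sd0 = X_maj.mean(axis=0), X_maj.std(axis=0) + 1e-9
    mu1, sd1 = X_min.mean(axis=0), X_min.std(axis=0) + 1e-9
    # Conditional densities of every majority sample under both classes.
    p0 = norm.pdf(X_maj, loc=mu0, scale=sd0)
    p1 = norm.pdf(X_maj, loc=mu1, scale=sd1)
    return p0 - p1  # shape: (n_majority_samples, n_features)
```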

3.4. Variable of Sample Informative Degree

In the feature space, the variable D only describes the distribution of one feature and cannot characterize the distribution of the whole sample. Since the Naïve Bayes algorithm assumes that all features are independent given the class label, relationships between features are not involved in our method. Under this assumption, we propose a rule that sums up the informative variable D over all features to obtain the informative degree SUM_D of each majority sample.
The construction of the informative degree SUM_D mainly considers two aspects. One is that the difference between the two conditional probabilities $p(X \mid Y=0)$ and $p(X \mid Y=1)$ might be too small to separate samples with different labels. The other is that the D values from different features might cancel each other out after summation. To avoid these two possible problems, we sum the rank values of D instead of the variable itself. The steps for calculating the informative degree SUM_D of each majority sample are as follows.
  • Order the m values of D for sample i into D_order = {D_i1, D_i2, ..., D_im} by absolute value from the smallest to the largest, so that |D_i1| ≤ |D_i2| ≤ ... ≤ |D_im|. Let ABS_vector denote the absolute values and SIGN_vector denote the signs of the m values.
  • Rank the m values of D_order with the smallest as 1; if some elements have the same value, assign them the average of their ranks. Let Rank_vector denote the ranks.
  • Sum the element-wise products of SIGN_vector and Rank_vector to obtain SUM_D. Let SUM_D(X_i) denote the informative degree of the majority sample X_i.
SUM_D(X_i) quantifies how informative a sample is; in particular, when the classifier is Naïve Bayes, this variable reflects how difficult it is for a Naïve Bayes classifier to learn classification information from the sample. The rank values recorded in Rank_vector clearly distinguish the D values of different features, and the product of Rank_vector and SIGN_vector efficiently avoids the cancellation of D values from different features.
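As an illustration of the three steps above, the following sketch computes SUM_D for one majority sample from its row of D values; rounding D to 3 digits follows the discussion in Section 5.1, while the function name and signature are illustrative.

```python
import numpy as np
from scipy.stats import rankdata

def sum_d(D_row, digits=3):
    """Signed-rank aggregation of one majority sample's D values (one per feature)."""
    d = np.round(D_row, digits)                    # precision of D, see Section 5.1
    ranks = rankdata(np.abs(d), method="average")  # smallest |D| gets rank 1, ties averaged
    return float(np.sum(np.sign(d) * ranks))       # SUM_D: ranks combined with the signs of D
```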

3.5. Finding the Less Informative Majorities

Generally, the bigger SUM_D is, the less informative the majority sample is, so we try to find and remove the majority samples with large SUM_D values. However, another situation has to be noticed: when the SUM_D value is negative, the majority sample lies in the overlapping area or even in the minority class area. Such samples are overlapping samples or noise, and both can have a bad influence on classification performance. Summarizing the rules above, we give the definition of less informative majorities.
Definition 2.
Majority samples in a dataset whose SUM_D value is too large or too small are defined as the less informative samples.
The majorities are ordered by SUM_D, and a specified number of the first few and last few samples of the sequence is removed from the majorities. After the removal, the data distribution variables are recalculated and the procedures introduced above are repeated until the imbalance problem is solved. The main components of LIMCR are described in Algorithm 1.
Algorithm 1: LIMCR: Less-informative majorities cleaning rule.
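The following is a minimal sketch of the cleaning loop described above and summarized in Algorithm 1, reusing informative_variable_D and sum_d from the previous sketches; the stopping criterion (target_ir) and the number of samples removed per iteration (frac_per_iter) are illustrative assumptions rather than values prescribed by the paper.

```python
import numpy as np

def limcr(X, y, target_ir=1.0, frac_per_iter=0.05):
    """Sketch of LIMCR: iteratively drop the majority samples whose SUM_D values
    are the largest (least informative) or the smallest (overlapping/noisy)."""
    X, y = X.copy(), y.copy()
    while (y == 0).sum() / max((y == 1).sum(), 1) > target_ir:
        maj_idx = np.where(y == 0)[0]
        D = informative_variable_D(X, y)                  # one row per majority sample
        scores = np.array([sum_d(row) for row in D])
        order = maj_idx[np.argsort(scores)]               # majority indices, ascending SUM_D
        k = max(1, int(frac_per_iter * len(maj_idx)))     # how many to drop from each end
        drop = np.concatenate([order[:k], order[-k:]])    # Definition 2: too small or too large
        keep = np.setdiff1d(np.arange(len(y)), drop)
        X, y = X[keep], y[keep]                           # recompute distributions next iteration
    return X, y
```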

4. Experiments

4.1. Benchmark Datasets

The datasets we choose in this research are software defect datasets from Marian Jureczko [53], the NASA MDP datasets from Tim Menzies [54] and the Eclipse bug datasets from Thomas Zimmermann [55], used as benchmark data. The datasets of the first two studies can be obtained from the website (https://zenodo.org/search?page=1&size=20&q=software%20defect%20predictio) and the Eclipse bug datasets are downloaded from the Eclipse bug repository (https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/). We investigate the sample size and IR value of each set (108 in total), and the statistical results are shown as pie charts in Figure 3. Two basic distribution characteristics are listed as follows:
  • Most IR (imbalance ratio) values of SDP datasets range from 2 to 100.
  • The sample sizes of SDP datasets in different projects vary hugely. Some small datasets have fewer than 100 samples, while some large datasets have more than 10,000.
In the experiments, we choose 29 of the SDP datasets investigated above as benchmark data. The information of the selected datasets is presented in Table 1. To turn the task into a binary classification problem, samples whose label (the number of bugs) is 1 or greater are regarded as one class and relabeled as "1", while samples with the label "0" are left unchanged.
The imbalance ratio (IR) is defined as the ratio of the number of negative samples to the number of positive samples, and the sample size is the number of samples in a dataset:
$$IR = \frac{\text{number of negative samples}}{\text{number of positive samples}}$$

4.2. Performance Metrics

In the experiments, we exploit four common performance metrics: recall, G-mean, AUC and the Balanced Accuracy Score (balancedscore) [56]. The larger the value of each of these metrics, the better the performance of the classifier. All these metrics are based on the confusion matrix (Table 2).
The defective modules are regarded as buggy (or positive) samples and the non-defective modules as clean (or negative) samples. According to the confusion matrix, PD (the probability of detection, also called recall or TPR), PF (the probability of false alarm, also called FPR) and precision are defined as follows:
$$PD = recall = \frac{TP}{TP + FN}$$
$$PF = \frac{FP}{FP + TN}$$
$$precision = \frac{TP}{TP + FP}$$
Recall and G-mean have been shown to be more suitable for imbalanced learning [20]:
$$G\text{-}mean = \sqrt{recall \times precision}$$
AUC measures the area under the ROC curve, which describes the trade-off between PD and PF; it can be calculated as follows [1]:
$$AUC = \frac{\sum_{buggy_i} rank(buggy_i) - \frac{M(M+1)}{2}}{M \times N}$$
where $\sum_{buggy_i} rank(buggy_i)$ represents the sum of the ranks of all buggy (or positive) samples, and $M$ and $N$ are the numbers of buggy samples and clean samples, respectively.
The Balanced Accuracy Score (balancedscore) is another accuracy metric, defined as the average recall obtained on each class; it avoids the inflated accuracy caused by imbalanced classes. Assume that $y_i$ is the true value of the $i$th sample and $\omega_i$ is the corresponding sample weight. We adjust the sample weight to $\hat{\omega}_i = \frac{\omega_i}{\sum_j 1(y_j = y_i)\,\omega_j}$, where $1(\cdot)$ is the indicator function. Given the predicted value $\hat{y}_i$, balancedscore is defined as
$$balancedscore(y, \hat{y}, \hat{\omega}) = \frac{1}{\sum_i \hat{\omega}_i} \sum_i 1(\hat{y}_i = y_i)\,\hat{\omega}_i$$
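As a usage sketch, the four metrics can be computed with scikit-learn; the function name is illustrative, the G-mean is computed with the formula given above, and y_score stands for the predicted probability of the buggy class (e.g. clf.predict_proba(X)[:, 1]).

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, precision_score,
                             recall_score, roc_auc_score)

def sdp_metrics(y_true, y_pred, y_score):
    """Recall (PD), G-mean, AUC and balancedscore for one prediction result."""
    rec = recall_score(y_true, y_pred)                    # PD / recall / TPR
    prec = precision_score(y_true, y_pred, zero_division=0)
    return {
        "recall": rec,
        "G-mean": np.sqrt(rec * prec),                    # as defined in this section
        "AUC": roc_auc_score(y_true, y_score),
        "balancedscore": balanced_accuracy_score(y_true, y_pred),
    }
```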

4.3. Research Question and Results

RQ1: Which baseline classifiers do we choose to match the imbalance learning methods with different sample sizes?
Motivation: Classification performance can be affected by the classifier, the imbalance learning method, the sample size and the number of features. To improve the efficiency of the experiments, we perform an empirical study to give priority to classifiers that perform well on SDP datasets. On the other hand, we need to explore the impact of different classifiers on different sample sizes.
Approach: We first conduct a preliminary experiment to show which baseline classifier performs better without any resampling method on the 29 benchmark datasets. We choose nine baseline classifiers, whose parameters are listed in Table 3. All baseline classifiers [57] are implemented in scikit-learn v0.20.3 [58]. The parameters of each classifier are decided by pre-experiments; parameters that have no influence on classification performance are left at their default values.
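A sketch of how such a comparison can be set up with scikit-learn is shown below, assuming X and y hold the features and labels of one benchmark dataset. Default parameters are used as stand-ins because Table 3 is not reproduced here, and the inclusion of KNN as the ninth classifier is an assumption.

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Default-parameter stand-ins for the nine baseline classifiers of Table 3.
classifiers = {
    "GNB": GaussianNB(),
    "DTC": DecisionTreeClassifier(),
    "ABC": AdaBoostClassifier(),
    "Bgg": BaggingClassifier(),
    "RF": RandomForestClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(max_iter=500),
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(probability=True),
}
for name, clf in classifiers.items():
    recall = cross_val_score(clf, X, y, cv=5, scoring="recall").mean()
    print(f"{name}: mean recall = {recall:.3f}")
```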
Results: Table 4 summarizes the results of the nine classifiers on the 29 datasets. The best result is highlighted in bold-face type. The differences between the results of the classifiers are analyzed by using the Friedman and Wilcoxon statistical tests [59].
The performance of the classifiers is measured by recall, G-mean and AUC; the results of these three metrics are quite similar, so we present the recall values of the classifiers in Table 4. The average values of each algorithm are listed after the results of the datasets, followed by the average ranks calculated as in the Friedman test; the lower the average rank, the better the classifier. From the recall values of each dataset, we can see clearly that GNB performs better than the other basic classifiers on most of the datasets when the size of the dataset is around 100 to 1100, while ABC and DTC perform better on most datasets when the sample size is larger than 1100. Moreover, from the average results in the last two rows of Table 4, GNB attains the highest average recall value over all datasets, but ABC gets the best Friedman rank value among the nine classifiers. We present the G-mean and AUC results together with recall in Figure 4 and Figure 5; the figures show a similar trend for the three metrics.
For more detail, we divide the datasets into two parts according to sample size and analyze the differences among these classifiers on the two parts, respectively. The datasets with a sample size smaller than 1100 are called small datasets; otherwise, we call them large datasets. The average values and Friedman ranks are recalculated for the two parts of the datasets in Table 5.
From Table 5 we can clearly see that GNB performs best among the nine selected classifiers when the sample size of a dataset is small, with the boundary between small and large sample sizes set at 1100. ABC and DTC perform best on datasets of large sample size; the differences among the classifiers' average results are analyzed in Table 6.
From the results on small datasets, the Friedman test rejects the null hypothesis that the nine classifiers have no significant difference (the p-values of the three metrics are all smaller than 0.00001). The Nemenyi post hoc analysis (critical difference of the three metrics CD = 3.003, α = 0.05) shows that ABC, Bgg, GNB, LR and DTC are significantly better than the others. According to the average ranks and the results on the datasets, ABC and GNB seem slightly better than the others, and GNB seems slightly better than ABC. In order to verify whether GNB is significantly better than ABC, we perform a further paired Wilcoxon test whose null hypothesis is that there is no significant difference between ABC and GNB. The p-values for the three metrics are 0.017, 0.053 and 0.088, respectively. According to the Wilcoxon test, GNB performs significantly better than ABC on the defective-sample detection metric (recall), and slightly better on the overall accuracy metrics (G-mean and AUC). Considering that the cost of a false negative is much higher than that of a false positive, we attach more importance to recall. Therefore, we regard GNB as the best of the nine basic classifiers on datasets with a small sample size. However, we also see that all nine classifiers perform poorly; even GNB has a relatively low AUC on small datasets. This indicates that the performance of the classifiers is restricted by the class imbalance problem, and that there is great room for improvement once the class imbalance problem is handled reasonably. From the results on large datasets, we observe that GNB performs worse than DTC and ABC. In the Friedman test, all the p-values of the three metrics are smaller than 0.00001, which shows that there are significant differences between the nine classifiers; the Nemenyi post hoc analysis (critical difference of the three metrics CD = 3.332, α = 0.05) then shows that LR, MLP, RF and SVC perform significantly worse than the other classifiers, so these classifiers are unsuitable for SDP datasets with a large sample size. The paired Wilcoxon test shows that the differences between ABC and DTC are not significant, because the null hypothesis of no significant difference between ABC and DTC cannot be rejected (the p-values of recall, G-mean and AUC are 0.433, 0.396 and 0.753, respectively). Furthermore, combined with the average ranks, we find that ABC performs well on both small and large datasets (ranking second on small datasets and first on large datasets); the Wilcoxon rank sum test supports the null hypothesis of no significant difference between large and small datasets for ABC (the p-values of recall, G-mean and AUC are 0.558, 0.661 and 0.539, respectively). This result reflects that the performance of ABC is not affected by the sample size of the datasets.
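The Friedman and paired Wilcoxon comparisons above can be reproduced with scipy; the sketch below assumes a hypothetical table recall_table (rows = datasets, columns = classifiers) holding the per-dataset recall values, and the file name is illustrative. The Nemenyi post hoc analysis is not part of scipy and is omitted here.

```python
import pandas as pd
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical table of per-dataset recall values (rows = datasets, columns = classifiers).
recall_table = pd.read_csv("recall_per_dataset.csv", index_col=0)

# Friedman test over all classifiers: are the per-dataset rankings significantly different?
stat, p = friedmanchisquare(*(recall_table[c] for c in recall_table.columns))
print(f"Friedman test: p = {p:.6f}")

# Paired Wilcoxon signed-rank test between two specific classifiers, e.g. GNB vs. ABC.
stat, p = wilcoxon(recall_table["GNB"], recall_table["ABC"])
print(f"Wilcoxon GNB vs ABC: p = {p:.3f}")
```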
RQ2: How does LIMCR perform compared with other imbalance learning methods on datasets with small sample sizes?
Motivation: We selected the baseline classifier GNB for small sample sizes in RQ1. Now, we need to validate the effectiveness of our proposed LIMCR.
Approach: To solve the class imbalance problem, researchers usually use two kinds of methods: resampling methods at the data level and classification methods at the algorithm level. Resampling methods are usually sorted into three categories: over-sampling methods, under-sampling methods, and combinations of over- and under-sampling methods. The question of which method is more suitable for the class imbalance problem has been discussed in many studies [4,60,61]. To evaluate the effectiveness of LIMCR, we employ six baseline imbalance learning methods, whose parameters are listed in Table 7. All these baseline methods are implemented in the Imbalanced-learn module in Python [62].
In this experiment, we compare the performance metrics (balancedscore and G-mean) of our LIMCR with the baseline imbalance learning methods on the small benchmark datasets. All imbalance learning methods, including LIMCR, are combined with the baseline classifier GNB.
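A sketch of this setup with imbalanced-learn is given below, assuming X and y hold one small benchmark dataset; the mapping of the labels in Table 7 to imbalanced-learn classes (e.g. SMOE to SMOTE, B-SMO to BorderlineSMOTE) and the default parameters are assumptions made for illustration.

```python
from imblearn.over_sampling import SMOTE, BorderlineSMOTE
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import InstanceHardnessThreshold, NeighbourhoodCleaningRule
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Illustrative subset of the baseline resampling methods, each paired with GNB.
resamplers = {
    "SMOE": SMOTE(),
    "B-SMO": BorderlineSMOTE(),
    "IHT": InstanceHardnessThreshold(),
    "NCL": NeighbourhoodCleaningRule(),
}
for name, sampler in resamplers.items():
    # The imblearn pipeline applies resampling only to the training folds.
    pipe = make_pipeline(sampler, GaussianNB())
    score = cross_val_score(pipe, X, y, cv=5, scoring="balanced_accuracy").mean()
    print(f"{name}+GNB: balancedscore = {score:.3f}")
```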
Results: Table 8 and Table 9 present the results of our LIMCR and the baseline imbalance learning methods on small datasets in terms of balancedscore and G-mean, respectively. We notice that the average balancedscore and G-mean of LIMCR are 0.701 and 0.69, which is better than the other baseline imbalance learning methods.
In the further study, the p-values of the Friedman test on these two performance metrics are both smaller than 0.00001, which shows that significant differences exist among the seven methods. The Nemenyi post hoc test gives CD = 2.326 at α < 0.05. According to the Nemenyi post hoc test, we underline the average ranks that are significantly worse than our LIMCR. The results reflect that the ensemble algorithms perform significantly worse than the resampling methods combined with GNB. In order to find out whether there is any significant difference between the resampling methods, we perform Wilcoxon signed-rank tests between our LIMCR and the other resampling methods; the p-values of each test are listed in Table 10.
From Table 10 we can learn that SMOE shows no significant difference from LIMCR, but in terms of the Friedman average ranks, LIMCR performs slightly better than SMOE. The overall metrics balancedscore and G-mean of IHT are significantly worse than those of LIMCR. The other two methods, B-SMO and NCL, perform significantly worse than LIMCR on most datasets for balancedscore and G-mean. Therefore, we conclude that LIMCR performs better than most of the imbalance learning methods.
RQ3: How does LIMCR work with other classifiers?
Motivation: In the principle analysis, our LIMCR is based on Bayesian probability, and GNB is the most suitable classifier for it. However, we expect that LIMCR still has good performance when it is combined with other classifiers.
Approach: In this experiment, we choose another two well-performing classifiers, ABC and DTC, and combine them with the three resampling methods LIMCR, SMOE and IHT on the small datasets in terms of balancedscore and G-mean for further comparison.
Results: Table 11 and Table 12 show the results of matching the three classifiers with the three resampling methods on small datasets in terms of balancedscore and G-mean. We notice that the average balancedscore value of LIMCR+GNB is 0.701, which is equal to that of IHT+ABC and higher than the other combinations. Simultaneously, the average G-mean of LIMCR+GNB is 0.69, while it is 0.696 for IHT+ABC.
From the average scores and Friedman average ranks we see that the combination of LIMCR and GNB still performs better than the others except for G-mean: IHT+ABC ranks first on G-mean, and LIMCR+GNB is slightly worse than it.
The Friedman test results are shown in Table 13. The column named Total is the Friedman test among all nine methods; the p-value for the metric G-mean is 0.123, which is larger than 0.05 and means there is no significant difference among the nine methods, i.e., LIMCR with other classifiers can perform as well as LIMCR with GNB and the other methods in terms of G-mean. The result of the Nemenyi post hoc test (CD = 3.12, α < 0.05) supports this. The columns LIMCR, SMOE and IHT represent the Friedman tests among the classifiers that share the same resampling method; for instance, the p-value in column LIMCR is the result of the Friedman test among GNB, DTC and ABC combined with the same resampling method LIMCR. For LIMCR, the p-value on balancedscore is smaller than 0.05, which suggests that the combination LIMCR+GNB performs significantly better than the other combinations of LIMCR. For the other resampling methods, the p-values on balancedscore and G-mean are all larger than 0.05, which suggests that for SMOE and IHT there is no significant difference among their combinations with different classifiers from the perspective of balancedscore and G-mean. We can draw a conclusion from this experiment: LIMCR, SMOE and IHT all show no significant difference on some metrics when combined with different classifiers, but the opposite holds on other metrics. We pay more attention to this difference: the performance of data resampling methods may change when different classifiers are used in imbalance learning; therefore, when the dataset has a small sample size, it is necessary to choose GNB or Naïve Bayes as the basic classifier in imbalance learning.
RQ4: Does the number of features of the datasets have an influence on LIMCR?
Motivation: Our proposed method LIMCR is strongly related to the features of the datasets. In the former experiments we used datasets with the same feature dimension and did not know whether the number of features influences the performance of LIMCR.
Approach: In this experiment, we retain the k (k = 4, 8, 12, 16 and 20) highest scoring features to observe the variation in the performance of LIMCR. We exploit the feature selection method SelectKBest from the feature selection module in scikit-learn v0.20.3 [58]; the dependence score between each feature and the class label is measured by the chi-square statistic [57]. The main reason we choose this method is the convenience of selecting a certain number of features in the experiment, and it removes features by univariate analysis, which suits our datasets.
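A sketch of this feature selection step is shown below; the MinMaxScaler is added here only because chi2 requires non-negative inputs, and is an assumption about preprocessing rather than a step described in the paper.

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

X_scaled = MinMaxScaler().fit_transform(X)          # chi2 needs non-negative features
for k in (4, 8, 12, 16, 20):
    X_k = SelectKBest(chi2, k=k).fit_transform(X_scaled, y)
    # X_k is then rebalanced with LIMCR and classified with GNB, as in RQ2.
```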
Results: The results are shown in Table 14 and Table 15. We notice that the average balancedscore values vary from 0.636 to 0.662 and the average G-mean from 0.534 to 0.631. When the number of features is 16, the average balancedscore and G-mean values are the highest.
From the p-values of the Friedman test in Table 16 we know that, for the metric balancedscore, there is no significant difference between the five levels of the number of features. According to the result of the Nemenyi post hoc test (CD = 1.575), only the settings whose Friedman average ranks exceed the best one by more than 1.575 are significantly worse. For the metric G-mean, only the datasets with four features perform significantly worse than the best one, and the others show no significant difference. When the number of features declines, the precision score increases while the overall performance declines. In summary, it can be believed that the number of features has no influence on the performance of LIMCR unless the number of features falls below a certain value, which in this experiment is 4 or 5.
RQ5: How does LIMCR work with datasets with large sample sizes?
Motivation: From RQ1 we know that sample size has a great influence on classifier selection in imbalance learning, and our proposed LIMCR has been proved to perform well on small datasets. However, how LIMCR performs on datasets with large sample sizes also needs to be investigated.
Approach: In this experiment we combine LIMCR with three classifiers: GNB, ABC and DTC. IHT combined with the same classifiers is used for comparison. The datasets are the large datasets introduced in RQ1.
Results: The results of the comparison between the six combined methods are listed in Table 17 and Table 18.
From Table 17 and Table 18, for both balancedscore and G-mean, IHT achieves higher results than LIMCR when using the same classifier. In other words, the resampling method IHT performs better than our LIMCR according to the average scores and Friedman average ranks. Meanwhile, the Nemenyi post hoc test (CD = 2.015) shows that all classifiers combined with LIMCR perform significantly worse than the best combination on balancedscore. From this we conclude that the proposed LIMCR performs well when the sample size is small (generally smaller than 1100) but becomes worse than IHT when the sample size increases (generally beyond 1100). So LIMCR can achieve better performance with a typical classifier when the sample size is small.

5. Discussion

5.1. Why Keep 3 Digits for the Informative Variable D?

As mentioned in Section 3.3, the rank values of the informative variable D over the features of a sample have a great effect on estimating how informative the sample is; moreover, the precision of the variable D affects the rank values directly. Therefore, it is necessary to discuss a proper value of this parameter (the precision of the variable D). The aim of this discussion is to present the effects of different precisions of D on the performance of the proposed LIMCR. Considering the space limitation, we randomly select the results of three datasets with different sample sizes (the sample sizes of synapse-1.0, PC4 and prop2 are 157, 1270 and 23,014, respectively) and the average result of the 29 datasets introduced in Section 4.1. We choose GNB as the classifier and evaluate the performance with balancedscore, precision, recall and G-mean. The precision of D is varied from 0 to 5 with an increment of 1. The experimental results are presented in Figure 6.
From the figures we notice that when this parameter equals 3, LIMCR performs stably and better than with most other values on the average result. The metrics except precision show an increasing trend as the value of this parameter increases; inversely, the precision metric decreases as the parameter value increases. In order to obtain the globally optimal performance, we choose 3 as the generic value of this parameter.

5.2. Threats to Validity

There are still several potential limitations in our study which are shown as follows.
  • The quality and quantity of the datasets for the empirical study might be insufficient. Although we have collected more than 100 datasets to illustrate the distribution of sample size and imbalance ratio in most SDP datasets, and 29 datasets for the investigations in the empirical study, it is still hard to confirm whether these datasets are representative of the characteristics of SDP data.
  • The generalization of our method might be limited. The method we propose focuses on binary classification; it improves the performance of predicting whether a sample (software module) has any defects but cannot predict the number of defects in it. More types of defect datasets should be considered in the future to reduce this threat.
  • The performance evaluation metrics we selected might be partial. Many metrics, such as PD and MCC, have been used in binary classification for SDP research. At the same time, F1 is also widely used in SDP, but we do not employ it, as it has been shown to be biased and unreliable [63]. Although we have considered selecting evaluation metrics from two aspects, overall performance and one-class accuracy, the limited number of metrics still poses some threats to construct validity.
  • The practical significance of LIMCR in software engineering might be extended. Project members can obtain information on possible defect-prone modules of the software before failures occur by using defect prediction techniques, but LIMCR has not been applied to predict defect classes/severities [64]. In addition, it is worth studying the performance of LIMCR with different prediction models (within a single project, cross-project predictions) [65]. Meanwhile, how to cooperate with instance deletion, missing-value replacement and normalization issues mentioned in [66], as well as defect prediction cost effectiveness [67], also needs further research.

6. Conclusions

The performance of a defect prediction model is influenced by the sample size of the dataset, the selection of the classifier and the data resampling method. In our empirical study, we compared the performance of nine popular classifiers on 29 software datasets with sample sizes ranging from 100 to 20,000 to study the influence of sample size and classifiers. The major conclusion of this part is that GNB performs well on small datasets, but its performance deteriorates when the sample size of the dataset grows beyond 1100. Another classifier, ABC, performs stably with different sample sizes and obtains relatively better results on large datasets among the classifiers. On this basis, in order to obtain the expected matching on small datasets, we proposed a new resampling method, LIMCR, motivated by the good performance of GNB. LIMCR is intended for SDP datasets with small sample sizes and is designed as the resampling method to be paired with the classifier GNB. The results of the comparison experiments confirm that the performance of LIMCR is better than that of the other resampling methods, and that the matching between GNB and LIMCR is the best solution for the imbalance problem in SDP datasets with small sample sizes. Besides, we also designed experiments to investigate how LIMCR performs with other classifiers, with feature selection and with data of large sample size. The results can be summarized as follows.
  • LIMCR together with the classifier GNB is a better solution for the imbalance problem on SDP datasets with small sample sizes, performing slightly better than SMOE+GNB.
  • In terms of the metric G-mean, LIMCR performs equally well when it cooperates with other classifiers. In terms of the metric balancedscore, GNB performs significantly better than the other classifiers when cooperating with LIMCR.
  • The number of features in a dataset has no influence on LIMCR, but the performance turns significantly worse when the number of features is less than 5.
  • When the sample size is bigger than 1100, the performance of LIMCR is worse than that of IHT, so in that case IHT is recommended as the imbalanced learning method for SDP.
Although our proposed LIMCR does not outperform on all datasets, the results of our research emphasize the importance of the characteristics of the datasets. There is no all-purpose imbalance learning method; choosing methods appropriately is also important. In the future, we plan to extend our research to cover other data distribution problems such as the overlapping problem and high dimensionality. We will update LIMCR to solve more combined problems and make it suitable for more SDP datasets.

Author Contributions

Conceptualization, Y.W.; methodology, Y.W. and J.Y.; software, J.Y.; validation, Y.W. and J.Y.; formal analysis, J.Y.; investigation, J.Y. and S.C.; resources, S.C.; data curation, S.C.; writing–original draft preparation, J.Y. and S.C.; writing–review and editing, Y.W.; visualization, J.Y.; supervision, B.L.; project administration, Y.W.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Aerospace Science Foundation of China (grant number 2017ZD51052).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tong, H.; Liu, B.; Wang, S. Software defect prediction using stacked denoising autoencoders and two-stage ensemble learning. Inf. Softw. Technol. 2018, 96, 94–111.
  2. Chen, L.; Fang, B.; Shang, Z.; Tang, Y. Tackling class overlap and imbalance problems in software defect prediction. Softw. Qual. J. 2018, 26, 97–125.
  3. Limsettho, N.; Bennin, K.E.; Keung, J.W.; Hata, H.; Matsumoto, K. Cross project defect prediction using class distribution estimation and oversampling. Inf. Softw. Technol. 2018, 100, 87–102.
  4. Galar, M.; Fernandez, A.; Barrenechea, E.; Bustince, H.; Herrera, F. A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2011, 42, 463–484.
  5. Błaszczyński, J.; Stefanowski, J. Neighbourhood sampling in bagging for imbalanced data. Neurocomputing 2015, 150, 529–542.
  6. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
  7. Laurikkala, J. Improving identification of difficult small classes by balancing class distribution. In Proceedings of the Conference on Artificial Intelligence in Medicine in Europe, Cascais, Portugal, 1–4 July 2001; pp. 63–66.
  8. Chen, S.; He, H.; Garcia, E.A. RAMOBoost: Ranked minority oversampling in boosting. IEEE Trans. Neural Netw. 2010, 21, 1624–1642.
  9. Freund, Y.; Schapire, R. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference (ICML 1996), Bari, Italy, 3–6 July 1996; pp. 148–156.
  10. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
  11. Fan, W.; Stolfo, S.J.; Zhang, J.; Chan, P.K. AdaCost: Misclassification Cost-Sensitive Boosting. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML ’99); Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1999; Volume 99, pp. 97–105.
  12. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
  13. Lusa, L.; Blagus, R. Class prediction for high-dimensional class-imbalanced data. BMC Bioinform. 2010, 11, 523.
  14. García, V.; Sánchez, J.S.; Mollineda, R.A. On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl. Based Syst. 2012, 25, 13–21.
  15. Wang, S.; Yao, X. Diversity analysis on imbalanced data sets by using ensemble models. In Proceedings of the 2009 IEEE Symposium on Computational Intelligence and Data Mining, Nashville, TN, USA, 30 March–2 April 2009; pp. 324–331.
  16. Díez-Pastor, J.F.; Rodríguez, J.J.; García-Osorio, C.I.; Kuncheva, L.I. Diversity techniques improve the performance of the best imbalance learning ensembles. Inf. Sci. 2015, 325, 98–117.
  17. Weiss, G.M.; Provost, F. The Effect of Class Distribution on Classifier Learning; Technical Report ML-TR-44; Department of Computer Science, Rutgers University: New Brunswick, NJ, USA, 2001. [Google Scholar]
  18. Khoshgoftaar, T.M.; Van Hulse, J.; Napolitano, A. Comparing boosting and bagging techniques with noisy and imbalanced data. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2010, 41, 552–568. [Google Scholar] [CrossRef]
  19. More, A. Survey of resampling techniques for improving classification performance in unbalanced datasets. arXiv 2016, arXiv:1608.06048. [Google Scholar]
  20. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar]
  21. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008. [Google Scholar]
  22. Han, H.; Wang, W.; Mao, B. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Proceedings of the International Conference on Intelligent Computing 2005, Hefei, China, 23–26 August 2005; pp. 878–887. [Google Scholar]
  23. Barua, S.; Islam, M.M.; Yao, X.; Murase, K. MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 2012, 26, 405–425. [Google Scholar] [CrossRef]
  24. Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. Safe-level-smote: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Bangkok, Thailand, 27–30 April 2009; pp. 475–482. [Google Scholar]
  25. Batista, G.E.; Prati, R.C.; Monard, M.C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29. [Google Scholar] [CrossRef]
  26. Sun, Z.; Song, Q.; Zhu, X.; Sun, H.; Xu, B.; Zhou, Y. A novel ensemble method for classifying imbalanced data. Pattern Recognit. 2015, 48, 1623–1637. [Google Scholar] [CrossRef]
  27. Wang, S.; Yao, X. Using class imbalance learning for software defect prediction. IEEE Trans. Reliab. 2013, 62, 434–443. [Google Scholar] [CrossRef] [Green Version]
  28. Chawla, N.V.; Lazarevic, A.; Hall, L.O.; Bowyer, K.W. SMOTEBoost: Improving prediction of the minority class in boosting. In Proceedings of the European Conference on Principles of Data Mining and Knowledge Discovery, Cavtat-Dubrovnik, Croatia, 22–26 September 2003; pp. 107–119. [Google Scholar]
  29. Menzies, T.; Turhan, B.; Bener, A.; Gay, G.; Cukic, B.; Jiang, Y. Implications of ceiling effects in defect predictors. In Proceedings of the 4th International Workshop on Predictor Models in Software Engineering, Leipzig, Germany, 12–13 May 2008; pp. 47–54. [Google Scholar]
  30. Hart, P. The condensed nearest neighbor rule (Corresp.). IEEE Trans. Inf. Theory 1968, 14, 515–516. [Google Scholar] [CrossRef]
  31. Tomek, I. Two modifications of CNN. IEEE Trans. Syst. Man Cybern. 1976. [Google Scholar] [CrossRef] [Green Version]
  32. Wilson, D.L. Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 1972, 408–421. [Google Scholar] [CrossRef] [Green Version]
  33. Kubat, M.; Matwin, S. Addressing the Curse of Imbalanced Training Sets: One-Sided Selection. In Proceedings of the 14th International Conference on Machine Learning (ICML 1997), Nashville, TN, USA, 8–12 July 1997; Volume 97, pp. 179–186. [Google Scholar]
  34. Smith, M.R.; Martinez, T.; Giraud-Carrier, C. An instance level analysis of data complexity. Mach. Learn. 2014, 95, 225–256. [Google Scholar] [CrossRef] [Green Version]
  35. Gupta, S.; Gupta, A. A set of measures designed to identify overlapped instances in software defect prediction. Computing 2017, 99, 889–914. [Google Scholar] [CrossRef]
  36. Liu, X.Y.; Wu, J.; Zhou, Z.H. Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B Cybern. 2008, 39, 539–550. [Google Scholar]
  37. Wang, T.; Zhang, Z.; Jing, X.; Zhang, L. Multiple kernel ensemble learning for software defect prediction. Autom. Softw. Eng. 2016, 23, 569–590. [Google Scholar] [CrossRef]
  38. McCabe, T.J. A complexity measure. IEEE Trans. Softw. Eng. 1976, 308–320. [Google Scholar] [CrossRef]
  39. Maurice, H.H. Elements of Software Science; Elsevier: New York, NY, USA, 1977. [Google Scholar]
40. Song, Q.; Guo, Y.; Shepperd, M. A comprehensive investigation of the role of imbalanced learning for software defect prediction. IEEE Trans. Softw. Eng. 2018, 45, 1253–1269.
41. Malhotra, R.; Khanna, M. An empirical study for software change prediction using imbalanced data. Empir. Softw. Eng. 2017, 22, 2806–2851.
42. Kim, S.; Zhang, H.; Wu, R.; Gong, L. Dealing with noise in defect prediction. In Proceedings of the 2011 33rd International Conference on Software Engineering (ICSE), Honolulu, HI, USA, 21–28 May 2011; pp. 481–490.
43. Wang, H.; Khoshgoftaar, T.M.; Napolitano, A. Software measurement data reduction using ensemble techniques. Neurocomputing 2012, 92, 124–132.
44. John, G.H.; Langley, P. Estimating continuous distributions in Bayesian classifiers. arXiv 2013, arXiv:1302.4964.
45. Haykin, S. Neural Networks: A Comprehensive Foundation; Prentice-Hall, Inc.: Upper Saddle River, NJ, USA, 2007.
46. Aha, D.W.; Kibler, D.; Albert, M.K. Instance-based learning algorithms. Mach. Learn. 1991, 6, 37–66.
47. Le Cessie, S.; Van Houwelingen, J.C. Ridge estimators in logistic regression. J. R. Stat. Soc. Ser. C Appl. Stat. 1992, 41, 191–201.
48. Safavian, S.R.; Landgrebe, D. A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 1991, 21, 660–674.
49. Saunders, C.; Stitson, M.O.; Weston, J.; Holloway, R.; Bottou, L.; Scholkopf, B.; Smola, A. Support Vector Machine. Comput. Sci. 2002, 1, 1–28.
50. Haykin, S.; Network, N. A comprehensive foundation. Neural Netw. 2004, 2, 41.
51. Xiao, J.; Xie, L.; He, C.; Jiang, X. Dynamic classifier ensemble model for customer classification with imbalanced class distribution. Expert Syst. Appl. 2012, 39, 3668–3675.
52. Del Río, S.; López, V.; Benítez, J.M.; Herrera, F. On the use of MapReduce for imbalanced big data using Random Forest. Inf. Sci. 2014, 285, 112–137.
53. Jureczko, M.; Madeyski, L. Towards identifying software project clusters with regard to defect prediction. In Proceedings of the 6th International Conference on Predictive Models in Software Engineering, Timisoara, Romania, 12–13 September 2010; pp. 1–10.
54. Menzies, T.; Di Stefano, J.S. How good is your blind spot sampling policy. In Proceedings of the Eighth IEEE International Symposium on High Assurance Systems Engineering, Tampa, FL, USA, 25–26 March 2004; pp. 129–138.
55. Zimmermann, T.; Premraj, R.; Zeller, A. Predicting defects for eclipse. In Proceedings of the Third International Workshop on Predictor Models in Software Engineering (PROMISE’07: ICSE Workshops 2007), Minneapolis, MN, USA, 20–26 May 2007; p. 9.
56. Brodersen, K.H.; Ong, C.S.; Stephan, K.E.; Buhmann, J.M. The balanced accuracy and its posterior distribution. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 3121–3124.
57. Buitinck, L.; Louppe, G.; Blondel, M.; Pedregosa, F.; Mueller, A.; Grisel, O.; Niculae, V.; Prettenhofer, P.; Gramfort, A.; Grobler, J.; et al. API design for machine learning software: Experiences from the scikit-learn project. arXiv 2013, arXiv:1309.0238.
58. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
59. Japkowicz, N.; Shah, M. Evaluating Learning Algorithms: A Classification Perspective; Cambridge University Press: Cambridge, UK, 2011.
60. Stefanowski, J.; Wilk, S. Selective pre-processing of imbalanced data for improving classification performance. In Proceedings of the International Conference on Data Warehousing and Knowledge Discovery, Turin, Italy, 1–5 September 2008; pp. 283–292.
61. Kamei, Y.; Monden, A.; Matsumoto, S.; Kakimoto, T.; Matsumoto, K.I. The effects of over and under sampling on fault-prone module detection. In Proceedings of the First International Symposium on Empirical Software Engineering and Measurement (ESEM 2007), Madrid, Spain, 20–21 September 2007; pp. 196–204.
62. Lemaître, G.; Nogueira, F.; Aridas, C.K. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 2017, 18, 559–563.
63. Yao, J.; Shepperd, M. Assessing software defection prediction performance: Why using the Matthews correlation coefficient matters. In Proceedings of the Evaluation and Assessment in Software Engineering, Trondheim, Norway, 15–17 April 2020; ACM: New York, NY, USA, 2020; pp. 120–129.
64. Janczarek, P.; Sosnowski, J. Investigating software testing and maintenance reports: Case study. Inf. Softw. Technol. 2015, 58, 272–288.
65. Korpalski, M.; Sosnowski, J. Correlating software metrics with software defects. In Proceedings of the Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2018, Wilga, Poland, 1 October 2018; International Society for Optics and Photonics: Bellingham, WA, USA, 2018; Volume 10808, p. 108081P.
66. Pandey, S.K.; Mishra, R.B.; Tripathi, A.K. BPDET: An effective software bug prediction model using deep representation and ensemble learning techniques. Expert Syst. Appl. 2020, 144, 113085.
67. Hryszko, J.; Madeyski, L. Assessment of the software defect prediction cost effectiveness in an industrial project. In Software Engineering: Challenges and Solutions; Springer: New York, NY, USA, 2017; pp. 77–90.
Figure 1. Overall structure of the proposed method LIMCR.
Figure 2. Distribution of overlapping area. (a) Distribution of original dataset in one dimension; (b) removing less-informative samples in majorities.
Figure 3. Data distribution of SDP datasets.
Figure 4. Average Friedman rank value of 3 metrics for 9 classifiers.
Figure 5. Average result value of 3 metrics for 9 classifiers.
Figure 6. Result of LIMCR with different precision of variable D.
Table 1. Description of software defect datasets.
Dataset | Abbreviation | Features | Sample Size | Label = 1 | Label = 0 | IR
synapse-1.2 | synapse-1.2 | 20 | 256 | 86 | 170 | 1.977
jedit-3.2 | jedit-3.2 | 20 | 272 | 90 | 182 | 2.022
synapse-1.1 | synapse-1.1 | 20 | 222 | 60 | 162 | 2.700
log4j-1.0 | log4j-1.0 | 20 | 135 | 34 | 101 | 2.971
jedit-4.0 | jedit-4.0 | 20 | 306 | 75 | 231 | 3.080
ant-1.7 | ant-1.7 | 20 | 745 | 166 | 579 | 3.488
camel-1.6 | camel-1.6 | 20 | 965 | 188 | 777 | 4.133
camel-1.4 | camel-1.4 | 20 | 872 | 145 | 727 | 5.014
Xerces-1.3 | Xerces-1.3 | 20 | 453 | 69 | 384 | 5.565
Xalan-2.4 | Xalan-2.4 | 20 | 723 | 110 | 613 | 5.573
jedit-4.2 | jedit-4.2 | 20 | 367 | 48 | 319 | 6.646
arc | arc | 20 | 234 | 27 | 207 | 7.667
synapse-1.0 | synapse-1.0 | 20 | 157 | 16 | 141 | 8.813
tomcat | tomcat | 20 | 858 | 77 | 781 | 10.143
camel-1.0 | camel-1.0 | 20 | 339 | 13 | 326 | 25.077
PC5 | PC5 | 38 | 1694 | 458 | 1236 | 2.699
KC1 | KC1 | 21 | 1162 | 294 | 868 | 2.952
JM1 | JM1 | 21 | 7720 | 1612 | 6108 | 3.789
prop-5 | prop-5 | 20 | 8516 | 1299 | 7217 | 5.556
prop-1 | prop-1 | 20 | 18,471 | 2738 | 15,733 | 5.746
eclipse-metrics-files-3.0 | ec-3.0 | 198 | 10,593 | 1568 | 9025 | 5.7557
eclipse-metrics-files-2.0 | ec-2.0 | 198 | 6729 | 975 | 5754 | 5.902
PC4 | PC4 | 37 | 1270 | 176 | 1094 | 6.216
PC3 | PC3 | 37 | 1053 | 130 | 923 | 7.100
prop-3 | prop-3 | 20 | 10,274 | 1180 | 9094 | 7.707
eclipse-metrics-files-2.1 | ec-2.1 | 198 | 7888 | 854 | 7034 | 8.237
prop-2 | prop-2 | 20 | 23,014 | 2431 | 20,583 | 8.467
prop-4 | prop-4 | 20 | 8718 | 840 | 7878 | 9.379
MC1 | MC1 | 38 | 1952 | 36 | 1916 | 53.222
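The imbalance ratio (IR) column in Table 1 is the number of non-defective modules (Label = 0) divided by the number of defective ones (Label = 1). A minimal sketch of the computation (the helper name imbalance_ratio is ours, not from the paper):

```python
import numpy as np

def imbalance_ratio(y):
    """IR as reported in Table 1: #non-defective (label 0) / #defective (label 1)."""
    y = np.asarray(y)
    return np.sum(y == 0) / np.sum(y == 1)

# Example: synapse-1.2 has 86 defective and 170 non-defective modules.
y_synapse_12 = np.array([1] * 86 + [0] * 170)
print(round(imbalance_ratio(y_synapse_12), 3))  # 1.977
```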
Table 2. Confusion matrix.
 | Actual Buggy | Actual Clean
Predict Buggy (Positive) | TP | FP
Predict Clean (Negative) | FN | TN
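All metrics reported in the following tables derive from the confusion matrix in Table 2: recall = TP/(TP + FN), specificity = TN/(TN + FP), G-mean is their geometric mean, and Balancedscore (balanced accuracy [56]) is their arithmetic mean. A short sketch with toy labels (1 = buggy, 0 = clean) checking the hand-computed values against the scikit-learn and imbalanced-learn implementations:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, recall_score
from imblearn.metrics import geometric_mean_score

# Toy labels for illustration only (1 = buggy, 0 = clean).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)                  # probability of detecting a buggy module
specificity = tn / (tn + fp)             # probability of detecting a clean module
g_mean = np.sqrt(recall * specificity)
balanced = 0.5 * (recall + specificity)  # "Balancedscore" in the result tables

assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(g_mean, geometric_mean_score(y_true, y_pred))
assert np.isclose(balanced, balanced_accuracy_score(y_true, y_pred))
```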
Table 3. Selection of classifiers and parameters.
Classifier | Abbreviation | Parameters
AdaBoostClassifier | ABC | n_estimators = 5
BaggingClassifier | Bgg | base_estimator = DecisionTree, n_estimators = 18
GaussianNB | GNB | var_smoothing = 1 × 10^-5
KNeighborsClassifier | Knn | n_neighbors = 10
RandomForest | RF | n_estimators = 50, max_depth = 3, min_samples_leaf = 4
SVC | SVC | default
DecisionTreeClassifier | DTC | max_depth = 3, min_samples_leaf = 4
MLPClassifier | MLP | solver = ‘lbfgs’, hidden_layer_sizes = (10,500)
LogisticRegression | LR | default
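The settings in Table 3 map directly onto scikit-learn estimators [57,58]. A hedged sketch of how the nine classifiers could be instantiated (parameters not listed in Table 3 stay at library defaults; newer scikit-learn releases rename BaggingClassifier's base_estimator argument to estimator):

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Abbreviations and parameter values follow Table 3.
classifiers = {
    "ABC": AdaBoostClassifier(n_estimators=5),
    "Bgg": BaggingClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=18),
    "GNB": GaussianNB(var_smoothing=1e-5),
    "Knn": KNeighborsClassifier(n_neighbors=10),
    "RF": RandomForestClassifier(n_estimators=50, max_depth=3, min_samples_leaf=4),
    "SVC": SVC(),
    "DTC": DecisionTreeClassifier(max_depth=3, min_samples_leaf=4),
    "MLP": MLPClassifier(solver="lbfgs", hidden_layer_sizes=(10, 500)),
    "LR": LogisticRegression(),
}
```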
Table 4. Recall comparison among basic classifiers.
Data | Sample Size | ABC | Bgg | DTC | GNB | Knn | LR | MLP | RF | SVC
log4j-1.0 | 135 | 0.563 | 0.611 | 0.441 | 0.608 | 0.26 | 0.472 | 0.472 | 0.432 | 0.392
synapse-1.0 | 157 | 0.26 | 0.086 | 0.181 | 0.646 | 0.205 | 0.256 | 0.164 | 0.037 | 0
synapse-1.1 | 222 | 0.446 | 0.374 | 0.379 | 0.724 | 0.164 | 0.462 | 0.45 | 0.415 | 0.486
arc | 234 | 0.254 | 0.192 | 0.155 | 0.762 | 0.05 | 0.232 | 0.263 | 0.125 | 0.167
synapse-1.2 | 256 | 0.559 | 0.537 | 0.554 | 0.473 | 0.506 | 0.464 | 0.46 | 0.503 | 0.01
jedit-3.2 | 272 | 0.624 | 0.662 | 0.584 | 0.586 | 0.444 | 0.613 | 0.587 | 0.61 | 0.662
jedit-4.0 | 306 | 0.512 | 0.454 | 0.453 | 0.408 | 0.323 | 0.375 | 0.272 | 0.44 | 0.353
camel-1.0 | 339 | 0.025 | 0.033 | 0 | 0.341 | 0 | 0.06 | 0.25 | 0 | 0
jedit-4.2 | 367 | 0.341 | 0.257 | 0.275 | 0.418 | 0.118 | 0.335 | 0.13 | 0.192 | 0.274
Xerces-1.3 | 453 | 0.393 | 0.295 | 0.39 | 0.457 | 0.033 | 0.319 | 0.047 | 0.298 | 0.409
Xalan-2.4 | 723 | 0.253 | 0.111 | 0.209 | 0.382 | 0.126 | 0.202 | 0.09 | 0.104 | 0.089
ant-1.7 | 745 | 0.459 | 0.475 | 0.445 | 0.555 | 0.413 | 0.358 | 0.405 | 0.422 | 0.388
tomcat | 858 | 0.229 | 0.125 | 0.177 | 0.38 | 0.006 | 0.197 | 0.082 | 0.088 | 0.228
camel-1.4 | 872 | 0.268 | 0.069 | 0.132 | 0.288 | 0.034 | 0.119 | 0.09 | 0.036 | 0.004
camel-1.6 | 965 | 0.194 | 0.119 | 0.144 | 0.248 | 0.025 | 0.138 | 0.104 | 0.044 | 0.036
PC3 | 1053 | 0.325 | 0.131 | 0.317 | 0.876 | 0.137 | 0.142 | 0.035 | 0.002 | 0
KC1 | 1162 | 0.374 | 0.301 | 0.36 | 0.234 | 0.258 | 0.238 | 0.211 | 0.2 | 0.01
PC4 | 1270 | 0.516 | 0.467 | 0.54 | 0.064 | 0.097 | 0.428 | 0.074 | 0.035 | 0.01
PC5 | 1694 | 0.482 | 0.387 | 0.47 | 0.112 | 0.335 | 0.228 | 0.217 | 0.228 | 0.029
MC1 | 1952 | 0.297 | 0.126 | 0.262 | 0.272 | 0.02 | 0.026 | 0.14 | 0 | 0
ec-2.1 | 6729 | 0.222 | 0.173 | 0.289 | 0.253 | 0.212 | 0.134 | 0.043 | 0.066 | 0.009
JM1 | 7720 | 0.344 | 0.191 | 0.332 | 0.062 | 0.198 | 0.102 | 0.154 | 0.073 | 0.011
ec-2.0 | 7888 | 0.389 | 0.352 | 0.422 | 0.304 | 0.339 | 0.248 | 0.114 | 0.186 | 0.047
prop-5 | 8516 | 0.232 | 0.183 | 0.228 | 0.157 | 0.172 | 0.047 | 0.023 | 0.016 | 0.037
prop-4 | 8718 | 0.172 | 0.161 | 0.199 | 0.233 | 0.135 | 0.113 | 0.06 | 0.055 | 0.016
prop-3 | 10,274 | 0.148 | 0.13 | 0.137 | 0.156 | 0.101 | 0.029 | 0.018 | 0.002 | 0.037
ec-3.0 | 10,593 | 0.297 | 0.234 | 0.358 | 0.253 | 0.266 | 0.157 | 0.141 | 0.104 | 0.025
prop-1 | 18,471 | 0.353 | 0.315 | 0.349 | 0.258 | 0.289 | 0.107 | 0.051 | 0.073 | 0.136
prop-2 | 23,014 | 0.33 | 0.317 | 0.362 | 0.208 | 0.203 | 0.037 | 0.024 | 0.014 | 0.063
Average recall | | 0.340 | 0.271 | 0.315 | 0.370 | 0.189 | 0.229 | 0.178 | 0.166 | 0.135
Average Friedman rank | | 2.172 | 4.259 | 3.362 | 3.103 | 6.086 | 5.138 | 6.569 | 7.121 | 7.190
Table 5. Average recall value and Friedman rank of 9 classifiers on different datasets.
Datasets | Average | ABC | Bgg | DTC | GNB | Knn | LR | MLP | RF | SVC
Small datasets | Recall value | 0.357 | 0.283 | 0.302 | 0.510 | 0.178 | 0.297 | 0.244 | 0.234 | 0.219
Small datasets | Rank value | 2.563 | 4.719 | 4.656 | 2.000 | 7.469 | 4.469 | 6.156 | 6.531 | 6.438
Large datasets | Recall value | 0.320 | 0.257 | 0.331 | 0.197 | 0.202 | 0.146 | 0.098 | 0.081 | 0.033
Large datasets | Rank value | 1.692 | 3.692 | 1.769 | 4.462 | 4.385 | 5.962 | 7.077 | 7.846 | 8.115
Table 6. p-Value and CD value of differences among classifier results on different datasets.
Datasets | Metric | Friedman Test | Nemenyi Analysis (CD) | α
Small datasets | Recall | <<0.00001 | 3.003 | α = 0.05
Small datasets | G-mean | <<0.00001 | 3.003 | α = 0.05
Small datasets | AUC | <<0.00001 | 3.003 | α = 0.05
Large datasets | Recall | <<0.00001 | 3.332 | α = 0.05
Large datasets | G-mean | <<0.00001 | 3.332 | α = 0.05
Large datasets | AUC | <<0.00001 | 3.332 | α = 0.05
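The statistics behind Tables 5 and 6 follow the usual Friedman/Nemenyi procedure: rank the classifiers on every dataset, test whether the average ranks differ, and, if they do, compute a critical difference (CD). A sketch with placeholder scores; the q_alpha constant below is the tabulated Nemenyi value for nine classifiers at α = 0.05 (approximately 3.102, per Demšar's tables), so the exact CD values in Table 6 depend on the constant actually used:

```python
import numpy as np
from scipy.stats import friedmanchisquare

# scores[i, j] = recall of classifier j on dataset i (e.g., the 15 small datasets
# and 9 classifiers of Table 4); random placeholder values are used here.
rng = np.random.default_rng(0)
scores = rng.random((15, 9))

stat, p_value = friedmanchisquare(*scores.T)   # one sample per classifier

# Nemenyi critical difference: CD = q_alpha * sqrt(k * (k + 1) / (6 * N))
k, n_datasets = scores.shape[1], scores.shape[0]
q_alpha = 3.102                                # tabulated value for k = 9, alpha = 0.05
cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n_datasets))
print(f"Friedman p = {p_value:.5f}, CD = {cd:.3f}")
```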
Table 7. Baseline imbalance learning methods.
Level | Category | Imbalance Learning Method | Abbreviation | Parameters
Data | Over-sampling | Borderline-SMOTE [22] | B-SMO | Default
Data | Under-sampling | Neighbourhood Cleaning Rule [7] | NCL | Default
Data | Under-sampling | Instance hardness threshold [34] | IHT | Default
Data | Combination of over- and under-sampling methods | SMOTE+ENN [25] | SMOE | Default
Algorithm | Algorithm | RUSBoost | RUSB | Default
Algorithm | Algorithm | EasyEnsemble | EasyE | Default
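All six baselines in Table 7 are available in the imbalanced-learn toolbox [62]. A minimal sketch of their instantiation with default parameters (X and y stand for the metric matrix and defect labels of one dataset and are not defined here):

```python
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.under_sampling import InstanceHardnessThreshold, NeighbourhoodCleaningRule
from imblearn.combine import SMOTEENN
from imblearn.ensemble import EasyEnsembleClassifier, RUSBoostClassifier

# Data-level methods: resample the training set, then fit any classifier on it.
samplers = {
    "B-SMO": BorderlineSMOTE(),
    "NCL": NeighbourhoodCleaningRule(),
    "IHT": InstanceHardnessThreshold(),
    "SMOE": SMOTEENN(),
}
# X_res, y_res = samplers["SMOE"].fit_resample(X, y)

# Algorithm-level methods: ensembles that handle the imbalance internally.
ensembles = {
    "RUSB": RUSBoostClassifier(),
    "EasyE": EasyEnsembleClassifier(),
}
# ensembles["RUSB"].fit(X, y)
```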
Table 8. Balancedscore of GNB with different imbalance learning methods on small datasets.
Dataset | LIMCR | SMOE | IHT | B-SMO | NCL | RUSB | EasyE
synapse-1.2 | 0.729 | 0.699 | 0.715 | 0.627 | 0.711 | 0.537 | 0.682
jedit-3.2 | 0.703 | 0.669 | 0.72 | 0.73 | 0.689 | 0.53 | 0.645
synapse-1.1 | 0.692 | 0.684 | 0.643 | 0.617 | 0.594 | 0.568 | 0.522
log4j-1.0 | 0.772 | 0.645 | 0.602 | 0.684 | 0.782 | 0.511 | 0.623
jedit-4.0 | 0.758 | 0.708 | 0.64 | 0.692 | 0.708 | 0.556 | 0.66
ant-1.7 | 0.78 | 0.754 | 0.717 | 0.709 | 0.763 | 0.588 | 0.668
camel-1.6 | 0.608 | 0.616 | 0.59 | 0.594 | 0.605 | 0.543 | 0.58
camel-1.4 | 0.656 | 0.665 | 0.658 | 0.578 | 0.605 | 0.56 | 0.603
Xerces-1.3 | 0.693 | 0.787 | 0.761 | 0.719 | 0.65 | 0.502 | 0.493
Xalan-2.4 | 0.68 | 0.65 | 0.68 | 0.68 | 0.684 | 0.55 | 0.668
jedit-4.2 | 0.758 | 0.759 | 0.727 | 0.73 | 0.722 | 0.705 | 0.76
arc | 0.561 | 0.659 | 0.583 | 0.443 | 0.616 | 0.622 | 0.554
synapse-1.0 | 0.741 | 0.754 | 0.723 | 0.658 | 0.554 | 0.402 | 0.679
tomcat | 0.735 | 0.857 | 0.605 | 0.687 | 0.677 | 0.488 | 0.691
camel-1.0 | 0.654 | 0.54 | 0.608 | 0.811 | 0.679 | 0.624 | 0.698
Average | 0.701 | 0.696 | 0.665 | 0.664 | 0.669 | 0.552 | 0.635
Friedman rank | 2.467 | 2.767 | 3.933 | 4 | 3.567 | 6.4 | 4.867
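Tables 8 and 9 pair each resampling method with GNB. LIMCR itself is the paper's own cleaning rule and is not shipped with any library, but for the baseline samplers this pairing is naturally expressed as an imbalanced-learn pipeline, so that resampling is applied only to the training folds during cross-validation. A sketch on synthetic data (the real experiments use the datasets of Table 1):

```python
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for one imbalanced SDP dataset (roughly 15% defective modules).
X, y = make_classification(n_samples=500, n_features=20, weights=[0.85], random_state=0)

# SMOTE+ENN (SMOE) followed by GNB; the sampler only sees the training part of each fold.
pipe = Pipeline([("resample", SMOTEENN(random_state=0)), ("clf", GaussianNB())])
scores = cross_val_score(pipe, X, y, cv=5, scoring="balanced_accuracy")
print(round(scores.mean(), 3))
```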
Table 9. G-mean of GNB with different imbalance learning methods on small datasets.
Dataset | LIMCR | SMOE | IHT | B-SMO | NCL | RUSB | EasyE
synapse-1.2 | 0.721 | 0.698 | 0.714 | 0.603 | 0.707 | 0.341 | 0.681
jedit-3.2 | 0.692 | 0.669 | 0.703 | 0.709 | 0.671 | 0.377 | 0.639
synapse-1.1 | 0.672 | 0.677 | 0.643 | 0.616 | 0.59 | 0.402 | 0.517
log4j-1.0 | 0.76 | 0.619 | 0.595 | 0.666 | 0.779 | 0.26 | 0.6
jedit-4.0 | 0.755 | 0.702 | 0.638 | 0.681 | 0.698 | 0.361 | 0.628
ant-1.7 | 0.78 | 0.749 | 0.71 | 0.697 | 0.755 | 0.481 | 0.668
camel-1.6 | 0.555 | 0.578 | 0.589 | 0.501 | 0.531 | 0.431 | 0.569
camel-1.4 | 0.65 | 0.663 | 0.653 | 0.504 | 0.549 | 0.41 | 0.594
Xerces-1.3 | 0.675 | 0.786 | 0.732 | 0.694 | 0.619 | 0.281 | 0.47
Xalan-2.4 | 0.665 | 0.635 | 0.63 | 0.649 | 0.658 | 0.444 | 0.667
jedit-4.2 | 0.757 | 0.759 | 0.673 | 0.73 | 0.699 | 0.699 | 0.75
arc | 0.558 | 0.649 | 0.581 | 0.383 | 0.612 | 0.563 | 0.553
synapse-1.0 | 0.741 | 0.749 | 0.711 | 0.658 | 0.539 | 0 | 0.678
tomcat | 0.735 | 0.857 | 0.604 | 0.661 | 0.659 | 0.162 | 0.67
camel-1.0 | 0.635 | 0.521 | 0.598 | 0.808 | 0.655 | 0.582 | 0.698
Average | 0.69 | 0.687 | 0.652 | 0.637 | 0.648 | 0.386 | 0.625
Friedman rank | 2.533 | 2.6 | 3.8 | 4.2 | 3.833 | 6.633 | 4.4
Table 10. p-Value of Wilcoxon signed-rank tests of resampling methods.
Metric | LIMCR vs. SMOE | LIMCR vs. IHT | LIMCR vs. B-SMO | LIMCR vs. NCL
Balancedscore | 0.532 | 0.043 | 0.031 | 0.041
G-mean | 0.691 | 0.047 | 0.198 | 0.011
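The p-values in Table 10 compare LIMCR with each baseline over the 15 small datasets using a (two-sided) Wilcoxon signed-rank test on paired per-dataset scores. A sketch using the LIMCR and IHT G-mean columns of Table 9 as the paired samples, which should land close to the corresponding entry in Table 10:

```python
from scipy.stats import wilcoxon

# G-mean of GNB on the 15 small datasets (Table 9), paired by dataset.
limcr = [0.721, 0.692, 0.672, 0.760, 0.755, 0.780, 0.555, 0.650,
         0.675, 0.665, 0.757, 0.558, 0.741, 0.735, 0.635]
iht = [0.714, 0.703, 0.643, 0.595, 0.638, 0.710, 0.589, 0.653,
       0.732, 0.630, 0.673, 0.581, 0.711, 0.604, 0.598]

stat, p = wilcoxon(limcr, iht)   # two-sided by default
print(round(p, 3))
```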
Table 11. Balancedscore of matching of classifiers with resampling methods on small datasets.
Datasets | LIMCR+GNB | LIMCR+DTC | LIMCR+ABC | SMOE+GNB | SMOE+DTC | SMOE+ABC | IHT+GNB | IHT+DTC | IHT+ABC
synapse-1.2 | 0.729 | 0.701 | 0.689 | 0.699 | 0.713 | 0.719 | 0.715 | 0.663 | 0.663
jedit-3.2 | 0.703 | 0.743 | 0.73 | 0.669 | 0.698 | 0.716 | 0.72 | 0.731 | 0.733
synapse-1.1 | 0.692 | 0.64 | 0.691 | 0.684 | 0.537 | 0.657 | 0.643 | 0.596 | 0.742
log4j-1.0 | 0.772 | 0.732 | 0.726 | 0.645 | 0.75 | 0.762 | 0.602 | 0.75 | 0.685
jedit-4.0 | 0.758 | 0.666 | 0.734 | 0.708 | 0.63 | 0.652 | 0.64 | 0.659 | 0.71
ant-1.7 | 0.78 | 0.778 | 0.718 | 0.754 | 0.739 | 0.687 | 0.717 | 0.768 | 0.779
camel-1.6 | 0.608 | 0.599 | 0.595 | 0.616 | 0.635 | 0.657 | 0.59 | 0.599 | 0.643
camel-1.4 | 0.656 | 0.525 | 0.529 | 0.665 | 0.588 | 0.612 | 0.658 | 0.591 | 0.597
Xerces-1.3 | 0.693 | 0.659 | 0.666 | 0.787 | 0.698 | 0.745 | 0.761 | 0.684 | 0.682
Xalan-2.4 | 0.68 | 0.639 | 0.549 | 0.65 | 0.657 | 0.7 | 0.68 | 0.708 | 0.725
jedit-4.2 | 0.758 | 0.674 | 0.636 | 0.759 | 0.756 | 0.737 | 0.727 | 0.686 | 0.734
arc | 0.561 | 0.604 | 0.569 | 0.659 | 0.502 | 0.649 | 0.583 | 0.526 | 0.507
synapse-1.0 | 0.741 | 0.679 | 0.652 | 0.754 | 0.688 | 0.696 | 0.723 | 0.732 | 0.759
tomcat | 0.735 | 0.639 | 0.726 | 0.857 | 0.776 | 0.786 | 0.605 | 0.752 | 0.779
camel-1.0 | 0.654 | 0.751 | 0.694 | 0.54 | 0.782 | 0.624 | 0.608 | 0.74 | 0.778
Average | 0.701 | 0.669 | 0.66 | 0.696 | 0.677 | 0.693 | 0.665 | 0.679 | 0.701
Rank | 3.567 | 5.9 | 6.467 | 4.2 | 5.567 | 4.2 | 5.9 | 5.367 | 3.833
Table 12. G-mean of matching of classifiers with resampling methods on small datasets.
Datasets | LIMCR+GNB | LIMCR+DTC | LIMCR+ABC | SMOE+GNB | SMOE+DTC | SMOE+ABC | IHT+GNB | IHT+DTC | IHT+ABC
synapse-1.2 | 0.721 | 0.688 | 0.689 | 0.698 | 0.713 | 0.717 | 0.714 | 0.66 | 0.66
jedit-3.2 | 0.692 | 0.74 | 0.73 | 0.669 | 0.64 | 0.685 | 0.703 | 0.725 | 0.73
synapse-1.1 | 0.672 | 0.639 | 0.69 | 0.677 | 0.511 | 0.638 | 0.643 | 0.552 | 0.734
log4j-1.0 | 0.76 | 0.732 | 0.724 | 0.619 | 0.745 | 0.759 | 0.595 | 0.745 | 0.677
jedit-4.0 | 0.755 | 0.632 | 0.734 | 0.702 | 0.623 | 0.622 | 0.638 | 0.656 | 0.709
ant-1.7 | 0.78 | 0.772 | 0.717 | 0.749 | 0.723 | 0.671 | 0.71 | 0.766 | 0.778
camel-1.6 | 0.555 | 0.592 | 0.594 | 0.578 | 0.606 | 0.644 | 0.589 | 0.599 | 0.638
camel-1.4 | 0.65 | 0.522 | 0.514 | 0.663 | 0.559 | 0.586 | 0.653 | 0.591 | 0.587
Xerces-1.3 | 0.675 | 0.657 | 0.666 | 0.786 | 0.687 | 0.719 | 0.732 | 0.682 | 0.678
Xalan-2.4 | 0.665 | 0.613 | 0.542 | 0.635 | 0.645 | 0.671 | 0.63 | 0.682 | 0.724
jedit-4.2 | 0.757 | 0.674 | 0.636 | 0.759 | 0.743 | 0.73 | 0.673 | 0.686 | 0.725
arc | 0.558 | 0.604 | 0.538 | 0.649 | 0.501 | 0.647 | 0.581 | 0.506 | 0.492
synapse-1.0 | 0.741 | 0.631 | 0.647 | 0.749 | 0.666 | 0.678 | 0.711 | 0.721 | 0.753
tomcat | 0.735 | 0.639 | 0.723 | 0.857 | 0.776 | 0.786 | 0.604 | 0.746 | 0.777
camel-1.0 | 0.635 | 0.75 | 0.686 | 0.521 | 0.781 | 0.582 | 0.598 | 0.737 | 0.778
Average | 0.69 | 0.659 | 0.655 | 0.687 | 0.661 | 0.676 | 0.652 | 0.67 | 0.696
Rank | 3.933 | 6 | 6.1 | 4.2 | 5.433 | 4.733 | 5.8 | 4.933 | 3.867
Table 13. p-Value of Friedman test between different classifiers for resampling methods.
Metrics | Total | LIMCR | SMOE | IHT
Balancedscore | 0.024 | 0.011 | 0.165 | 0.051
G-mean | 0.123 | 0.085 | 0.154 | 0.07
Table 14. Balancedscore of different numbers of features in GNB with LIMCR on small datasets.
Dataset | 4 | 8 | 12 | 16 | 20
synapse-1.2 | 0.642 | 0.649 | 0.638 | 0.659 | 0.68
jedit-3.2 | 0.659 | 0.738 | 0.762 | 0.741 | 0.772
synapse-1.1 | 0.637 | 0.661 | 0.661 | 0.623 | 0.622
log4j-1.0 | 0.643 | 0.649 | 0.673 | 0.542 | 0.577
jedit-4.0 | 0.571 | 0.571 | 0.579 | 0.621 | 0.657
ant-1.7 | 0.716 | 0.768 | 0.773 | 0.761 | 0.767
camel-1.6 | 0.531 | 0.539 | 0.562 | 0.569 | 0.568
camel-1.4 | 0.518 | 0.564 | 0.587 | 0.622 | 0.63
Xerces-1.3 | 0.607 | 0.609 | 0.692 | 0.674 | 0.675
Xalan-2.4 | 0.588 | 0.589 | 0.619 | 0.686 | 0.661
jedit-4.2 | 0.616 | 0.639 | 0.685 | 0.683 | 0.726
arc | 0.699 | 0.693 | 0.646 | 0.631 | 0.59
synapse-1.0 | 0.922 | 0.822 | 0.853 | 0.707 | 0.707
tomcat | 0.614 | 0.621 | 0.64 | 0.666 | 0.704
camel-1.0 | 0.576 | 0.537 | 0.523 | 0.745 | 0.559
Average | 0.636 | 0.643 | 0.66 | 0.662 | 0.66
Rank | 3.9 | 3.333 | 2.567 | 2.767 | 2.433
Table 15. G-mean of different numbers of features in GNB with LIMCR on small datasets.
Dataset | 4 | 8 | 12 | 16 | 20
synapse-1.2 | 0.59 | 0.606 | 0.623 | 0.646 | 0.673
jedit-3.2 | 0.593 | 0.719 | 0.758 | 0.736 | 0.771
synapse-1.1 | 0.552 | 0.607 | 0.654 | 0.584 | 0.607
log4j-1.0 | 0.563 | 0.606 | 0.622 | 0.456 | 0.476
jedit-4.0 | 0.449 | 0.449 | 0.514 | 0.582 | 0.638
ant-1.7 | 0.69 | 0.762 | 0.772 | 0.761 | 0.767
camel-1.6 | 0.314 | 0.354 | 0.467 | 0.45 | 0.449
camel-1.4 | 0.27 | 0.415 | 0.505 | 0.622 | 0.626
Xerces-1.3 | 0.515 | 0.537 | 0.664 | 0.64 | 0.675
Xalan-2.4 | 0.477 | 0.507 | 0.56 | 0.68 | 0.638
jedit-4.2 | 0.512 | 0.553 | 0.627 | 0.644 | 0.69
arc | 0.651 | 0.647 | 0.614 | 0.626 | 0.589
synapse-1.0 | 0.919 | 0.822 | 0.841 | 0.643 | 0.643
tomcat | 0.508 | 0.536 | 0.587 | 0.658 | 0.698
camel-1.0 | 0.405 | 0.389 | 0.523 | 0.741 | 0.487
Average | 0.534 | 0.567 | 0.622 | 0.631 | 0.628
Rank | 4.233 | 3.6 | 2.267 | 2.7 | 2.2
Table 16. p-Value of Friedman test for RQ4.
Metric | Balancedscore | G-Mean
p-value | 0.054 | <0.001
Table 17. Balancedscore of matching of classifiers with LIMCR and IHT on large datasets.
Datasets | LIMCR+GNB | LIMCR+DTC | LIMCR+ABC | IHT+GNB | IHT+DTC | IHT+ABC
JM1 | 0.562 | 0.515 | 0.522 | 0.615 | 0.64 | 0.612
KC1 | 0.631 | 0.634 | 0.617 | 0.628 | 0.62 | 0.647
MC1 | 0.655 | 0.536 | 0.57 | 0.697 | 0.563 | 0.617
PC3 | 0.486 | 0.53 | 0.538 | 0.718 | 0.755 | 0.741
PC4 | 0.577 | 0.717 | 0.759 | 0.65 | 0.826 | 0.799
PC5 | 0.554 | 0.556 | 0.595 | 0.623 | 0.679 | 0.669
prop-1 | 0.612 | 0.652 | 0.626 | 0.63 | 0.69 | 0.692
prop-2 | 0.593 | 0.539 | 0.53 | 0.595 | 0.573 | 0.7
prop-3 | 0.539 | 0.589 | 0.542 | 0.58 | 0.64 | 0.665
prop-4 | 0.606 | 0.607 | 0.597 | 0.632 | 0.691 | 0.704
prop-5 | 0.544 | 0.56 | 0.551 | 0.605 | 0.635 | 0.681
ec-2.0 | 0.694 | 0.624 | 0.683 | 0.674 | 0.686 | 0.685
ec-2.1 | 0.643 | 0.628 | 0.651 | 0.683 | 0.677 | 0.659
ec-3.0 | 0.679 | 0.638 | 0.648 | 0.672 | 0.65 | 0.651
Average | 0.598 | 0.595 | 0.602 | 0.643 | 0.666 | 0.68
Rank | 4.286 | 4.643 | 4.714 | 3 | 2.429 | 1.929
Table 18. G-mean of matching of classifiers with LIMCR and IHT on large datasets.
Datasets | LIMCR+GNB | LIMCR+DTC | LIMCR+ABC | IHT+GNB | IHT+DTC | IHT+ABC
JM1 | 0.398 | 0.187 | 0.231 | 0.559 | 0.628 | 0.596
KC1 | 0.583 | 0.569 | 0.534 | 0.619 | 0.609 | 0.634
MC1 | 0.654 | 0.277 | 0.389 | 0.668 | 0.375 | 0.521
PC3 | 0.207 | 0.27 | 0.35 | 0.713 | 0.744 | 0.724
PC4 | 0.483 | 0.704 | 0.759 | 0.647 | 0.815 | 0.773
PC5 | 0.366 | 0.36 | 0.534 | 0.59 | 0.65 | 0.645
prop-1 | 0.573 | 0.617 | 0.588 | 0.613 | 0.687 | 0.69
prop-2 | 0.506 | 0.307 | 0.516 | 0.546 | 0.452 | 0.697
prop-3 | 0.466 | 0.576 | 0.511 | 0.514 | 0.639 | 0.663
prop-4 | 0.559 | 0.501 | 0.554 | 0.601 | 0.689 | 0.703
prop-5 | 0.396 | 0.558 | 0.546 | 0.575 | 0.633 | 0.66
ec-2.0 | 0.683 | 0.62 | 0.674 | 0.641 | 0.651 | 0.642
ec-2.1 | 0.62 | 0.628 | 0.648 | 0.672 | 0.635 | 0.6
ec-3.0 | 0.662 | 0.638 | 0.643 | 0.64 | 0.599 | 0.597
Average | 0.511 | 0.487 | 0.534 | 0.614 | 0.629 | 0.653
Rank | 4.286 | 4.857 | 3.929 | 3 | 2.571 | 2.357
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
