Article

CODE: A Moving-Window-Based Framework for Detecting Concept Drift in Software Defect Prediction

1 Artificial Intelligence and Intelligent Systems Research Group, School of Innovation, Design and Engineering, Mälardalen University, Högskoleplan 1, 722 20 Västerås, Sweden
2 Department of Electrical and Computer Engineering, Pak-Austria Fachhochschule Institute of Applied Sciences and Technology, Haripur 22621, Pakistan
* Author to whom correspondence should be addressed.
Symmetry 2022, 14(12), 2508; https://doi.org/10.3390/sym14122508
Submission received: 17 October 2022 / Revised: 31 October 2022 / Accepted: 14 November 2022 / Published: 28 November 2022

Abstract

Concept drift (CD) refers to data distributions that may vary after a minimum stable period. CD negatively influences the performance of software defect prediction (SDP) models trained on past datasets when they are applied to new datasets. Previous SDP studies confirm that the accuracy of prediction models is negatively affected by changes in data distributions. Moreover, cross-version (CV) defect data are naturally asymmetric due to their class imbalance. In this paper, a moving-window-based concept-drift detection (CODE) framework is proposed to detect CD in chronologically asymmetric defective datasets and to investigate the feasibility of alleviating CD from the data. The proposed CODE framework consists of four steps: the first pre-processes the defect datasets and forms CV chronological data, the second constructs the CV defect models, the third calculates the test statistics, and the fourth provides a hypothesis-test-based CD detection method. Prior SDP studies have observed that class-rebalancing techniques, applied to make the data more symmetric, improve the prediction performance of the models. The ability of the CODE framework is demonstrated by conducting experiments on 36 versions of 10 software projects. Some of the key findings are: (1) Up to 50% of the chronological-defect datasets are drift-prone when the most popular classifiers from the SDP literature are applied. (2) The class-rebalancing techniques had a positive impact on the prediction performance for CVDP by correctly classifying the CV defective modules and reduced CD by up to 31% on the resampled datasets.

1. Introduction

The software industry's advancement and the widespread use of software applications create massive amounts of data, which are increasingly produced in a streaming format [1]. These circumstances result in non-stationary data. Non-stationary environments (NSE) are unpredictable, and learning from one presents difficulties; for instance, the probabilistic characteristics of the data evolve over time. As a result, a successful prediction model developed in a stationary environment is rendered useless [2], which has been the subject of recent intense research. Furthermore, predicting defects in such a non-stationary environment becomes challenging. Recent frameworks (e.g., [3]) developed for stationary settings are inadequate for handling non-stationary defect data. A practical learning framework is therefore required for the increasing amount of defect data derived from NSE.
The software entities (such as files, packages, and functions) in a software system that are prone to defects are identified through models for software defect prediction (SDP). An accurate prediction model assists developers in concentrating on the anticipated flaws and efficiently using their time and effort. Prediction models collect knowledge about a software project by studying previous software information, and they can then predict whether or not instances introduced in the future will be flawed. There are numerous software defect prediction (SDP) models (e.g., [4,5,6,7,8]), which can be used to create a trustworthy and high-quality software system that uses machine learning to predict faulty modules in the software system. These models forecast defects for upcoming projects using historical data (on which a defect prediction model is trained). Among these SDP techniques, cross-version defect prediction (CVDP) [9] is a process wherein the historical data from prior versions of a project are used to train models to predict defects in the latest version of the same project. As this reflects the actual process of software development, it is more realistic and effective than cross-project [6,10] and cross-validation defect prediction [11,12]. This approach to determining the training and test sets is also referred to as chronological splitting [13,14] in the area of software-effort prediction. In addition to the practical relevance of CVDP, two seemingly contradictory key facts motivated us to conduct our study in the CVDP context. (1) The use of long-standing software results in the co-existence of several versions, and, consequently, the characteristics of development-related data, which are used to build prediction models, might change with each release after a stable period, and (2) at the same time, the defect-inducing patterns of previous versions might be relevant to defect prediction for the latest version. Therefore, a prediction model built upon a prior version of the same project may produce better results than a prediction model that uses data from different projects [15], as the latter would share similar distribution information among releases.
Furthermore, software defect datasets are usually imbalanced between defective and non-defective modules, which makes them asymmetric [16]. Previous studies have sought to improve the performance of SDP models by adopting rebalancing techniques in an effort to make the data more symmetric and easier to handle [16,17]. For example, Tantithamthavorn et al. [17] show the impact of class-rebalancing techniques on SDP models. They point out that drift develops when the class distributions of the training and testing sets are not similar. Furthermore, a change in the prior probabilities of the classes is responsible for drift [18]. Since class-rebalancing techniques try to produce a similar representation of the two classes, they may eliminate drift and impact the performance of CVDP.
In this paper, a moving-window-based concept-drift detection (CODE) framework is proposed to detect concept drift (CD) in chronological-defect datasets. Moreover, a systematic investigation is conducted using the CODE framework to assess how drift-prone the chronological-defect datasets are, and to examine whether class rebalancing can help eliminate CD from these datasets and impact the prediction performance of CVDP. In other words, we assess how drift-prone the resampled datasets are and the resulting impact on the performance of CVDP. Thereby, we designed the following two research questions:
(RQ1)
How drift prone are the chronological-defect datasets?
(RQ2)
Does class rebalancing eliminate CD from chronological-defect datasets and impact the performance of CVDP?
Based on the aforementioned research questions and the proposed CODE framework, we conducted an empirical investigation on 36 versions of 10 open-source projects collected by Madeyski and Jureczko. The investigation demonstrates that up to 50% of the chronological software defect datasets are drift-prone when the most widely known classifiers from the SDP research are applied, and that the class-rebalancing techniques affect the predictive accuracy of CVDP models and eliminate CD from the chronological data by up to 31%.
The rest of this paper is organized as follows: Section 2 presents the theory of concept drift and describes drift detection methods. Section 3 describes the CODE framework and details each step involved in it. Section 4 is devoted to the experiment for evaluating the CODE framework for CVDP. In Section 5, we present the findings from the experiments. Section 6 discusses the results obtained from the experiments. Section 7 highlights the potential threats to validity. Related work on CVDP is presented in Section 8, and the conclusion is presented in Section 9.

2. Theoretical Background

2.1. Concept of Dataset Shift

The recent prevalence of technology has created an enormous amount of data available in the streaming format. In general, streaming data are assumed to be non-stationary [19]. Data shift and concept drift are two ways of characterizing the data-generating process of an NSE [19,20]. The phenomenon has been defined by various researchers in data-stream mining [21,22], and is also known as prior-probability shift [21]. Data drift occurs if the probability of X at time t changes; formally, it can be defined as $P_t(X) \neq P_{t+1}(X)$. Furthermore, data drift can be explained from the perspective of two window-based data distributions known as historical data and new data. In Figure 1, the historical data $D_t$ to $D_{t+6}$ are considered as the old window and the new data $D_{t+7}$ as the recent window, both defined by the user as specified by Lu et al. [22]. The windows continually move as new data appear over time. To adapt to the changes in data distribution, the prediction model needs to be regularly updated, as widely acknowledged in the literature [22,23,24].
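As a rough illustration of this two-window view (not part of the CODE framework itself), the following R sketch compares the distribution of a single metric between a user-defined old window and a recent window using a two-sample Kolmogorov–Smirnov test; the data frame defect_data and its columns version and loc are hypothetical.

```r
# Illustrative two-window check for data drift on one metric (e.g., lines of code).
old_window <- subset(defect_data, version %in% c("1.3", "1.4"))  # historical window
new_window <- subset(defect_data, version == "1.5")              # recent window

# Two-sample KS test on the metric's distribution in the two windows.
ks <- ks.test(old_window$loc, new_window$loc)
if (ks$p.value < 0.05) {
  message("Data drift suspected for metric 'loc' between the two windows")
}
```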

2.2. Data-Drift Detection Methods

Drift detection methods can be categorized into three groups: error-rate-based, distribution-based, and statistical-test (hypothesis-test)-based methods. In this section, we briefly describe some of them.
Error-rate-based methods: These methods focus on monitoring the error rate of the learning algorithms. The drift detection method (DDM), designed for online scenarios, is the most traditional technique in the data-streaming literature. DDM was the first model to define a warning level and a drift level according to changes in the data distribution; the learning model makes a decision whenever new data become available. Another widely used method is the early drift detection method (EDDM) [25]. EDDM considers the average distance between two error rates and the standard deviation of the classification model. Both DDM and EDDM are threshold-dependent and use a single instance at a time to identify drift in streaming data. Adaptive windowing (ADWIN) is a two-window-based drift detection method [26]. A sliding window is maintained to adapt to concept drift, with the window size changing according to the distribution changes. The fast Hoeffding drift detection method (FHDDM) [27] also utilizes a sliding window; it monitors the probability of correct predictions in the current window and compares it with the maximum probability observed so far.
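For concreteness, a schematic DDM-style check is sketched below in R: it tracks the running error rate and its standard deviation and flags a warning or a drift when the rate exceeds the best observed level by two or three standard deviations, respectively. This is a simplified illustration of the general idea, not the exact DDM implementation.

```r
# Schematic DDM-style drift check on a 0/1 vector of misclassifications.
ddm_check <- function(errors) {
  p_min <- Inf; s_min <- Inf
  status <- character(length(errors))
  for (i in seq_along(errors)) {
    p <- mean(errors[1:i])        # running error rate after i examples
    s <- sqrt(p * (1 - p) / i)    # its standard deviation
    if (i < 30) {                 # wait for a minimum number of examples
      status[i] <- "stable"
      next
    }
    if (p + s < p_min + s_min) {  # remember the best (lowest) level seen so far
      p_min <- p; s_min <- s
    }
    if (p + s >= p_min + 3 * s_min) {
      status[i] <- "drift"
    } else if (p + s >= p_min + 2 * s_min) {
      status[i] <- "warning"
    } else {
      status[i] <- "stable"
    }
  }
  status
}
```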
Distribution-based methods: This type of detection method uses a distance metric to quantify the dissimilarity between the current and the historical distribution. If the dissimilarity is large enough to cross a (user-supplied) threshold, the algorithm triggers an alarm to update the model. The users define the historical and recent data used to identify drift. Figure 1 shows a sliding two-window-based data distribution at different timestamps, where the historical data $D_t$ to $D_{t+6}$ are considered as the old window and the new data $D_{t+7}$ as the recent window. More information about distribution-based drift detection methods can be found in the literature [21,22,23,28].
Hypothesis-test-based methods: This type of method also adopts the sliding-window approach, combined with different statistical testing strategies. The statistical test of equal proportions detection (STEPD) assesses two time windows, i.e., the most recent window and the overall window, and computes the change in error rate at each timestamp; the users define the window size. De Lima Cabral et al. [20] use Fisher's exact test with two windows and monitor the prediction results of the two windows. Similar to the work of Cabral et al., our proposed framework adopts the sliding-window strategy.
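As a small illustration of the hypothesis-test idea (a STEPD-like check, not the CODE framework itself), the R sketch below compares the proportion of correct predictions in the recent window against the older window with a test of equal proportions; all counts are made up for the example.

```r
# Correct-prediction counts and window sizes (illustrative values).
recent_ok <- 61;  recent_n <- 90
older_ok  <- 280; older_n  <- 330

# Test of equal proportions between the two windows.
res <- prop.test(x = c(recent_ok, older_ok), n = c(recent_n, older_n))
if (res$p.value < 0.05) {
  message("Significant accuracy change between windows: possible concept drift")
}
```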

3. CODE: Concept-Drift Detection Framework

This section lays out the concept-drift detection (CODE) framework for detecting CD in chronological-defect datasets. This framework has four primary steps: (1) data processing, (2) constructing CV defect models, (3) agreement and disagreement measuring (calculating test statistics), and (4) significance test (hypothesis test). Figure 2 provides an overview of the CODE framework.
Step 1 (Data processing): To reflect practice, we formulate the chronological-defect data as a CV data stream (i.e., a windowed stream). The adopted version-based moving-window approach (Figure 3) used to build the CV classifiers is the one most closely related to actual practice; the windows are created using the method suggested by Klinkenberg et al. [29].
We collect all the available versions from the sources and restructure them in chronological order. The windowing technique manages the versions, and the distributions of old windows are discarded by following the moving-window approach. This window-based operation corresponds to the chronological-splitting procedure [13,14,30] and to the data-management approach used in studies of data-stream mining [29]. The distribution of each version is considered as one window, as illustrated in the sketch below.
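A minimal R sketch of this step, assuming a hypothetical vector of chronologically ordered version names for one project (the real versions come from the corpus in Table 1):

```r
# Version-based moving window: order one project's versions chronologically and
# form consecutive (train, test) pairs; older windows are discarded as the window moves.
versions <- c("ant-1.3", "ant-1.4", "ant-1.5", "ant-1.6", "ant-1.7")

cv_pairs <- data.frame(
  train = versions[-length(versions)],  # window at time t
  test  = versions[-1]                  # window at time t + 1
)
print(cv_pairs)  # ant-1.3 -> ant-1.4, ant-1.4 -> ant-1.5, ...
```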
Step 2 (Construct cross-version defect models): In CVDP, the two nearest versions are considered for defect prediction, keeping in mind that successive versions have comparable characteristics, which aids in accurately training the model, as indicated by Amasaki [31,32,33]. To forecast the defects in the upcoming version, defect prediction models are trained on the previous version. As discussed above, we utilize the version-based moving-window approach to split the versions, because earlier research used a similar approach. For example, the project ant (see Table 1) contains five versions (1.3 to 1.7); the prediction model is trained on version 1.3 and tested on version 1.4, then trained on version 1.4 and tested on version 1.5, and so on.
Step 3 (Test-statistics calculation): Test statistics measure the pairwise agreement and disagreement (e.g., diversity) of two consecutive CV defect predictors. Several studies [18,20] calculated the agreement and disagreement of classifiers to detect drift in data streams. For example, De Lima Cabral et al. [20] calculated the diversity of the classifiers using sliding windows. We adopt a similar strategy to calculate the agreement and disagreement of two consecutive CV defect models using the version-based moving-window approach. As an example, let $X = \{x_1, \ldots, x_n\}$ be a distribution and $o_1 = [o_1(x_1), \ldots, o_1(x_n)]$ a binary vector representing the output of a classifier $c_1$. For a pair of classifiers $c_1$ and $c_2$, let $n^{ab}$ be the number of instances $x_j \in X$ for which $c_1(x_j) = a$ and $c_2(x_j) = b$.
We calculate the test statistics for consecutive pairs of CV defect models. Let $R_i$ and $O_i$ be the distributions obtained from two consecutive CV data points at times $t$ and $t+1$. The two feature spaces are transformed into $V_i = (q_1, r_1, s_1, t_1)$ and $V_j = (q_2, r_2, s_2, t_2)$. For the time window $t$, $(q_1, t_1)$ are the correctly classified positive and negative instances, and $(s_1, r_1)$ are the wrongly classified positive and negative instances, respectively. Similarly, for the time window $t+1$, $(q_2, t_2)$ are the correctly classified positive and negative instances and $(s_2, r_2)$ are the wrongly classified positive and negative instances, respectively. Thereby, the test statistics are calculated.
The agreement and disagreement measures are the consistent and inconsistent decisions among the total considered observations. They reflect the variety of responses of the pair of CV predictors at times $t$ and $t+1$, i.e., the probability of diversity between a pair of CV learners. Thus, from a $2 \times 2$ contingency table (Table 2), the test statistics of correctly and incorrectly classified instances at the $t$ and $t+1$ time windows (i.e., variety) are calculated, where $(q_1 + t_1)$ (i.e., agreement) is the number of correctly classified instances and $(r_1 + s_1)$ (i.e., disagreement) is the number of incorrectly classified instances among the $(q_1 + t_1 + r_1 + s_1)$ total instances at the $t$ time window. Similarly, $(q_2 + t_2)$ (i.e., agreement) is the number of correctly classified instances and $(r_2 + s_2)$ (i.e., disagreement) is the number of incorrectly classified instances among the $(q_2 + t_2 + r_2 + s_2)$ total instances at the $t+1$ time window.
Step 4 (Hypothesis test): Prior research efforts on CD detection [20,34] assume that, in the absence of concept drift, a prediction model developed using historical data will statistically perform identically to another prediction model developed using current data (data from the following period). To measure the CDs, we use the same theory. A CD exists if an SDP model trained on data from a previous version shows a statistically significant variation in prediction accuracy for detecting the defective or non-defective modules of a subsequent version's data. To determine the statistical significance of the differences observed in two successive versions of the data collected from Step 3, we perform Fisher's exact test at a 5% significance level.
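A minimal R sketch of Steps 3 and 4, using the notation from the text with made-up counts: the 2 × 2 contingency table collects the agreement (correctly classified) and disagreement (misclassified) counts of the two consecutive CV predictors, and Fisher's exact test at the 5% level flags a drift.

```r
# Illustrative counts following the notation in the text (q, t = correct; r, s = incorrect).
q1 <- 35; t1 <- 120; r1 <- 18; s1 <- 22   # window t
q2 <- 20; t2 <- 95;  r2 <- 40; s2 <- 45   # window t + 1

contingency <- matrix(c(q1 + t1, r1 + s1,   # agreement / disagreement at t
                        q2 + t2, r2 + s2),  # agreement / disagreement at t + 1
                      nrow = 2, byrow = TRUE,
                      dimnames = list(c("window_t", "window_t_plus_1"),
                                      c("agreement", "disagreement")))

res <- fisher.test(contingency)
drift_detected <- res$p.value < 0.05   # CD flagged when the difference is significant
```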

4. Methodology

4.1. Studied Datasets

When selecting chronological-defect datasets, we considered a corpus of publicly accessible datasets encompassing 36 versions of 10 benchmark open-source software projects. The Jureczko and Madeyski [35] corpus has been extensively used in CVDP studies [36,37,38,39]. We obtained the corpus from the SEACRAFT repository (https://zenodo.org/communities/seacraft/, (accessed on 10 October 2022)) [40], which was formerly known as the PROMISE repository [41]. Table 1 lists the information for these datasets, including the versions and their release dates, as well as the number of problematic modules, the number of faults, and the percentage of faults. Release dates are given in day-month-year format. Our chosen projects have designated versions, which are listed chronologically. The release dates of the versions were gathered from the version-control repositories consulted in the investigation of Bangash et al. [42]. These datasets have 20 static metrics and a labeled feature (i.e., BUG), which are listed in Table 3. The defects (i.e., BUG) are recorded at the class level: if the value of BUG is 0, the module is non-defective; otherwise, it is defective. Therefore, our experiment was conducted for CVDP at the class level.

4.2. Apply Class-Rebalancing Approaches

We applied four class-rebalancing approaches in this empirical study. Random oversampling (OVER) increases the minority class by randomly selecting and duplicating minority samples until their number matches that of the majority class. Random undersampling (UNDER) randomly reduces the majority class to the same number as the minority class; a significant drawback of the UNDER approach is that it loses important information [17]. The synthetic minority oversampling technique (SMOTE) oversamples the minority class by creating “synthetic” instances. Bootstrap-based random over-sampling examples (ROSE) generates synthetic instances to balance the classes. In a practical scenario, class-rebalancing techniques are applied to the training sets [17]; in our experiment, we likewise applied the techniques to the training sets only. To apply OVER, UNDER, and ROSE, we made use of the upSample, downSample, and ROSE function implementations, respectively, provided by the caret package [44]. To apply SMOTE, we utilized the SMOTE function of the R DMwR package [45] with the default settings; the k-neighbour value was set to five.
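The R sketch below shows how such techniques can be applied to the training window only; the data frame train_set and its two-level factor label BUG are hypothetical, the ROSE() call is taken from the ROSE package for illustration, and the exact parameters of the authors' runs are not reproduced here.

```r
library(caret)  # upSample(), downSample()
library(DMwR)   # SMOTE()
library(ROSE)   # ROSE()

x <- train_set[, setdiff(names(train_set), "BUG")]  # predictors only
y <- train_set$BUG                                  # two-level factor label

over  <- upSample(x = x, y = y, yname = "BUG")      # random oversampling (OVER)
under <- downSample(x = x, y = y, yname = "BUG")    # random undersampling (UNDER)
smote <- SMOTE(BUG ~ ., data = train_set, k = 5)    # SMOTE with 5 nearest neighbours
rose  <- ROSE(BUG ~ ., data = train_set)$data       # bootstrap-based ROSE sample

# The test window (next version) is left untouched.
```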

4.3. Construct Cross-Version Defect Models

Numerous techniques are employed in software defect prediction [17,46]. We chose a reasonable number of classification methods for our empirical study. To build the cross-version defect predictors, we selected six machine-learning algorithms: naive Bayes (NB), random forest (RF), decision tree (DT), support vector machine (SVM), k-nearest neighbor (KNN), and logistic regression (LR).
NB, a probability-based method, estimates a score by applying a simplified form of Bayes' law and has been widely used in the defect prediction literature [47]. RF, a powerful machine-learning method, builds multiple decision trees from bootstrap data samples. DT, a decision-based method, produces reasonable prediction results with various kinds of input data. SVM, a kernel-based technique, determines a hyperplane that separates negative and positive instances. LR is a regression-based method that measures the relationship between dependent and independent variables. KNN, an instance-based method, utilizes feature similarity to predict new data based on how close they are to labeled instances. Because the considered methods provide tunable parameters, we used caret's parameter optimization before generating the models via the train function with the method options nb, rf, rpart, svmLinear, knn, and glm, respectively.
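A minimal sketch of this step with caret's train function (hypothetical train_set/test_set holding two consecutive versions; method-specific packages such as klaR for "nb" and kernlab for "svmLinear" must be installed, and the resampling setup shown is illustrative rather than the authors' exact configuration):

```r
library(caret)

methods <- c(NB = "nb", RF = "rf", DT = "rpart",
             SVM = "svmLinear", KNN = "knn", LR = "glm")

ctrl <- trainControl(method = "cv", number = 10)  # illustrative tuning/resampling setup

# Train each classifier on the earlier version and predict the later one.
models <- lapply(methods, function(m) {
  train(BUG ~ ., data = train_set, method = m, trControl = ctrl)
})
predictions <- lapply(models, predict, newdata = test_set)
```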

4.4. Calculate Model Performance

A defect dataset consists of defective and non-defective modules, and the outputs of the prediction models are used to compute the performance measures. In the confusion matrix used to determine the performance of defect prediction classification models, defective modules are viewed as positive and non-defective modules as negative. The results are categorized into four groups: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
The considered performance indicators have been used to precisely detect defective classes in several defect prediction studies [16,17,46,48]. The metrics are computed and defined using the confusion matrix shown in Table 4 to evaluate the defect prediction models. For imbalanced datasets, Menzies et al. [47] suggested considering the probability of false alarms (pf) and recall (pd) as stable metrics. Achieving high recall and precision simultaneously is challenging in practice; the F-measure calculates the harmonic mean of these two measures. Menzies et al. [49] also criticized precision as an unstable measure. We therefore do not consider precision and the F-measure in our empirical study. Measures such as recall (pd), pf, and Gmean are considered in our study. The best predictor is one with a high pd and a low pf, and higher Gmean values indicate better performance. Equations (1) to (3) give the mathematical definitions of pf, pd, and Gmean.
$\mathrm{pf} = \dfrac{FP}{FP + TN}$ (1)

$\mathrm{Recall\;(pd)} = \dfrac{TP}{TP + FN}$ (2)

$\mathrm{Gmean} = \sqrt{\dfrac{TP}{TP + FN} \times \dfrac{TN}{FP + TN}}$ (3)
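For reference, the three measures can be computed directly from confusion-matrix counts; the counts below are made up for illustration.

```r
# Illustrative confusion-matrix counts.
tp <- 40; fp <- 15; tn <- 180; fn <- 25

pf    <- fp / (fp + tn)                            # probability of false alarm, Eq. (1)
pd    <- tp / (tp + fn)                            # recall, Eq. (2)
gmean <- sqrt((tp / (tp + fn)) * (tn / (fp + tn))) # Gmean, Eq. (3)

c(pf = pf, pd = pd, Gmean = gmean)
```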

4.5. Statistical Analysis of the Experiment

4.5.1. Statistical Comparison

To conduct a statistical analysis and establish whether the performance differences between the benchmark approaches are statistically significant, we evaluated the prediction results using the Mann–Whitney U test at a 5% significance level and adopted the win–tie–loss statistic [50]. Using this statistic, we conducted a pairwise comparison between two predictors, i.e., a predictor built without a class-rebalancing technique and a predictor built with a class-rebalancing technique for CVDP. Three counters were computed: if the two compared predictors were not statistically different, the comparison was counted as a tie; otherwise, the significantly better predictor was counted as a win and the other as a loss.
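A sketch of this comparison in R (wilcox.test performs the Mann–Whitney U test; perf_balanced and perf_default are hypothetical vectors of one performance measure collected over the CV pairs, and deciding the winner by the higher median is our illustrative choice):

```r
win_tie_loss <- function(perf_balanced, perf_default, alpha = 0.05) {
  p <- wilcox.test(perf_balanced, perf_default)$p.value  # Mann-Whitney U test
  if (p >= alpha) {
    "tie"                                   # no statistically significant difference
  } else if (median(perf_balanced) > median(perf_default)) {
    "win"                                   # rebalanced predictor significantly better
  } else {
    "loss"
  }
}
```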

4.5.2. Effect-Size Computation

To examine the practical and statistical significance of the empirical results, we adopted a robust effect-size computation method, Cliff's $\delta$, to compute the magnitude of the performance differences. Computing Cliff's $\delta$ effect size was recommended by Kitchenham et al. [51] for its effectiveness in handling tied results. The values of $\delta$ range from zero to plus one (+1). A high value indicates a more practically significant result, whereas a value of zero (0) denotes that the results are identical. We interpret the practical significance using the magnitude thresholds of Romano et al. [52] as follows: negligible effect ($\delta < 0.147$), small effect ($0.147 \leq \delta < 0.33$), medium effect ($0.33 \leq \delta < 0.474$), and large effect ($\delta \geq 0.474$).
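A sketch using the effsize package (the input vectors are the same hypothetical performance values as above; the package reports the magnitude label using the Romano et al. thresholds):

```r
library(effsize)

res <- cliff.delta(perf_balanced, perf_default)
res$estimate   # signed Cliff's delta
res$magnitude  # negligible / small / medium / large, per the Romano et al. thresholds
```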

4.6. Experimental Setup

The primary objective of this empirical study is to apply class-rebalancing strategies to each version in the CVDP setting, assess their influence on CD, and determine how robust that impact is. We therefore conducted an empirical investigation to answer the RQs.
(RQ1) How drift prone are the chronological-defect datasets?
Motivation: CD refers to the fact that data distributions that are being collected may vary after a minimum stable period [28,53]. According to Dong et al. [23], such drift may have a negative impact on how well a model performs in making predictions when it is used with new datasets. Our prior research (i.e., [39,54]) confirms that changes in defect data distributions have a negative impact on prediction model accuracy. However, little is known about how drift prone the chronological defect datasets are. Therefore, we set out to explore the studied chronological datasets through RQ1. We used statistical testing, in particular, to locate CD.
Approach: Prior research on CD detection (e.g., [20,34]) assumes that there is no CD if the performance of a prediction model trained on previous data does not statistically differ from that of a model trained on the subsequent time window (the data of the next period). To measure the CDs in the chronological-defect datasets, we used the same hypothesis. A CD exists if a defect prediction model developed using data from a prior version shows a statistically significant variation in prediction accuracy for detecting the defective or non-defective modules of data from a subsequent version. Section 3 provides an overview of the CD detection process for chronological-defect datasets. We plot the CD detection distributions for each prediction model in Figure 4.
(RQ2) Does class rebalancing eliminate CD from chronological-defect datasets and impact the performance of CVDP?
Motivation: Class-rebalancing approaches aim to rebalance the classes of the training datasets before constructing a model. Such approaches try to produce a similar representation of the defective and non-defective modules (two-class) prior to developing a defect prediction classification model. Turhan [55] suggested that class-rebalancing approaches may influence CD. Tantithamthavorn et al. [17] pointed out that drift develops when the class distributions of training and testing sets are not similar. Again, the change in prior probabilities of classes is responsible for CD [18]. The performance of CVDP may be affected by class-rebalancing approaches because they aim to build two classes that are comparable to each other in representation, which may reduce CD. Additionally, the results of earlier research projects have demonstrated the advantages of modifying class-balance strategies for SDP. Through RQ2, we assess whether class-rebalancing techniques can help reduce CD and impact the performance of CVDP.
Approach: We calculated and compared the results of using class-rebalancing approaches with and without CD detection and CVDP performance values. The temporal sequence of the versions is taken into consideration while creating the CVDP models, as discussed in Section 3, so that a model was trained using one version and tested using defect datasets from a later version.
To assess whether class-rebalancing techniques can help reduce CD, we compare the CD detection results with and without class-rebalancing techniques across the classification techniques. We plot the CD detection distributions obtained while utilizing class-rebalancing techniques in Figure 5. In addition, we computed the differences in the CD detection rate by subtracting the results with class-rebalancing techniques from those without them (Figure 6).
We examined the outcomes of each performance metric with and without class-rebalancing strategies in order to determine whether class-rebalancing techniques have an impact on the performance of CVDP models. Each resampled set produced by a class-rebalancing technique is used in this experimental setup as input to the six defect prediction models under study. Ten benchmark projects with 36 versions were taken into consideration, and the experiment was carried out in the cross-version setting described in Section 4.1. The 36 versions of the 10 projects yield 26 consecutive version pairs (one pair fewer than the number of versions in each project), so there are 26 performance values for each model and measure. We utilized three performance metrics; a total of 468 performance-measure results were therefore produced for the baseline (26 performance values × 6 classifiers × 3 performance measures), and our experiment provides 1872 (468 × 4) performance-measure values for the four class-rebalancing approaches. The distribution of performance differences was then plotted using the boxplots in Figure 7 for each performance metric corresponding to each classifier and class-rebalancing technique.

5. Results

(RQ1) How drift prone are the chronological-defect datasets?
Figure 4 presents the CDs for each version of the considered chronological-defect datasets corresponding to each classification technique. The varying numbers of CDs detected by the prediction models may be the result of the different models' sensitivity to the evaluated datasets. We observe no drift for the ivy defect datasets with any of the classification techniques. On the other hand, all the versions of the log4j and velocity projects are drift-prone across the classification techniques. We notice that the classification techniques attain different CD percentages among all the CV predictors, even when they are applied to the same chronological-defect datasets (Figure 8). We find that 50% of the datasets are drift-prone when KNN is applied, which is the maximum CD detection rate, whereas DT achieves the minimum drift-proneness, i.e., 35%. Overall, we observe that a maximum of 50% of the datasets are drift-prone when the prediction models are built on the default data with KNN. Furthermore, the use of deep-learning-based models could improve the CD detection results, which we consider as a future step of this study. From the experiments, we observe that CD detection in chronological-defect datasets is model-aware. This observation will help practitioners to produce effective drift-adaptation techniques for detecting defects in such a CVDP scenario. Furthermore, this observation agrees with the conclusion of Haug et al. [56] that concept-drift detection is model-dependent. The instability of the CD detection rate among the classification models has practical implications: in a realistic CVDP scenario where false alarms are expensive, model awareness reduces the sensitivity of CD detection in chronological-defect datasets.
(RQ2) Does class rebalancing eliminate CD from chronological-defect datasets and impact the performance of CVDP?
Figure 5 plots the CD detection distributions for each version of the considered defect datasets corresponding to each classifier and each class-rebalancing technique. As in RQ1, the number of CDs identified differs across the classification models when class-rebalancing techniques are considered. Figure 6 presents the percentage CD detection rate of the datasets when the class-rebalancing techniques are applied. The studied class-rebalancing techniques reduce the CDs by balancing the datasets by up to almost 31% (CD detection for the original dataset minus CD detection for the resampled dataset, i.e., 46% − 15.38%) when employing ROSE with RF, which also yields the highest average pd (i.e., 84%) and fulfills the prediction-performance criteria defined by He et al. [48]. We also observe that ROSE with NB achieves a maximum average pd of 84% with a CD elimination rate of 30% (i.e., 42% − 12%). In contrast, when utilizing the default datasets, 42% of the datasets are drift-prone, while a maximum mean pd of 62% is achieved by NB, followed by DT with an average score of 59%. The noteworthy observation is that ROSE has a positive impact on reducing CD by up to 31% and yields up to 84% pd for the CVDP models. Looking at Figure 6, we observe that resampling the chronological datasets with class-rebalancing techniques eliminates CD from the considered datasets and improves prediction performance. This observation guides us to recommend that practitioners use class rebalancing to maintain high prediction performance over time in such a practical CVDP scenario where low pd is costly.
The results of the class-rebalancing approaches applied to the CVDP models are shown in Figure 7 for each of the three performance indicators. For each performance metric, we discuss the experimental results in terms of when the class-rebalancing approaches (1) do not affect and (2) do affect prediction performance.
From Figure 7, we observe that the considered class-rebalancing techniques impacted positively on Gmean for most of the CVDP models’ performance improvement. For the ROSE technique, the Gmean measure is less sensitive. Figure 7 shows that the ROSE class-rebalancing technique yields unstable conclusions when applied to CVDP models. This observation supports the conclusion of Tantithamthavorn et al. [17], who observed that the ROSE technique produces a negative and positive impact on Gmean. From our experiment, we observe that the ROSE technique produces a negative impact when the CVDP models are built with RF. The SMOTE, undersampling, and oversampling techniques substantially improve Gmean and show a positive impact on the performance of CVDP models. The performance values of the Gmean for 75% (i.e., the values of 1st–3rd quartiles) of the CVDP models vary from −5% to 19%. Overall, when employing the SMOTE, undersampling, and oversampling techniques, we find that the performance difference in Gmean for 75% (i.e., the values of 1st–3rd quartiles) of the CVDP models vary from 0% to 21%. These results indicate that these techniques tend to positively impact Gmean when they are employed for CVDP models.
Across all the CV models, the class-rebalancing techniques substantially improve pd (i.e., probability of detection) and achieve a maximum score of, on average, 84%. We find that, while employing the ROSE with NB and RF, the proportion of CV defective modules that are accurately classified improves the models’ performance by, on average, up to 84%. From Figure 7, it is clear that the performance for correctly categorizing the CV faulty modules is improved by the use of the SMOTE, undersampling, and oversampling approaches. Overall, we observe that the performance of pd for 75% (i.e., the values of 1st–3rd quartiles) of the CVDP models vary from 4% to 19%. It is interesting to see that class-rebalancing strategies significantly lower pf and raise pd, indicating that they have a good effect on categorizing the CV faulty modules.
To evaluate statistically whether the effect of the class-rebalancing techniques on the CVDP models is significant, we compare the performance of models trained on the default imbalanced datasets with that of models trained on the rebalanced datasets using the win–tie–loss statistic. In Figure 9, a win (solid green circle) together with a large effect size (empty green box) attached to the circle indicates that the class-rebalancing technique outperformed the model trained on the default (original) dataset; the equivalent black symbols indicate otherwise. Class-rebalancing techniques with more green circles and embedded empty green boxes indicate better performance. The noteworthy findings are: undersampling- and SMOTE-trained models outperform the baseline prediction models for all the performance measures; in some cases, oversampling and SMOTE perform better when considered with Gmean. For all considered performance measures, prediction models based on SMOTE and undersampling statistically outperform the prediction models trained on the default (original) chronological-defect datasets. For the pd performance measure, some losses are recorded when the oversampling method is applied to the CV prediction models. Based on the empirical analysis and statistical tests, we observe the following:
  • The ROSE technique eliminates CD by balancing the datasets by up to 31%. Overall, the added benefit of class-rebalancing techniques is noticeable and reduces CD from the chronological-defect datasets.
  • The experiment results allow us to guide practitioners to maintain the high prediction performance of CVDP models over time to eliminate CDs.
  • The impact of SMOTE and undersampling techniques on CVDP models are statistically and practically significant regarding the considered performance metrics.

6. Discussion

Is retraining a model actionable enough to remove CD? The most frequent solution in such a case is to update the current model to eliminate drift [22,28]. In Figure 10, we illustrate how a model is updated periodically. In Figure 10a, a stationary model is trained on the historical data and tested on the new data; the model is never updated during the learning process. In Figure 10b, the prediction model is updated periodically when new data become available. Lyu et al. [34] considered this strategy to update the models for artificial-intelligence-dependent IT solutions and observed that periodically updated models help to eliminate the drifts. In our experimental setup, we follow the same strategy for defect prediction in the cross-version scenario. The CV models are updated in chronological order by maintaining a version-based moving window (Figure 3). Note that while retraining the model, not all of the historical data from earlier versions are used; Lin et al. [57] suggested that using all historical data may not lead to better performance. Through RQ1, we updated the prediction models by maintaining a sliding window that moves forward in chronological order, where each window refers to one version of the data. Through RQ2, we adopted the class-rebalancing techniques while updating the models. We observe that only retraining the model may not be actionable enough, and that the added class-rebalancing techniques improve prediction performance and reduce drift. In summary, the experimental results reveal that when the studied classification techniques are trained on the default datasets, up to 50% of the datasets are drift-prone; the class-rebalancing methods increase the CVDP models' predictive ability and reduce the drift of the chronological-defect datasets by up to 31%.
By carrying out this extensive and methodical empirical study, we were able to make the following two observations, which we will now address.
Model-awareness. We observe from the experiment that drift detection on the chronological-defect datasets is highly sensitive to the choice of classifier. NB achieves the best results without applying any class-rebalancing techniques, whereas 50% of the datasets are drift-prone with KNN. Among the considered classifiers, DT yields the minimum percentage of drift-proneness, i.e., 35%, although it does not obtain satisfactory prediction performance. Our observation shows that model-awareness reduces the sensitivity of drift detection.
Updating the model is not actionable enough. In the data-stream mining literature, the most advised strategy to adapt to CD is to update the model [28]. Our experiment reveals that updating the model does not always improve the prediction performance and does not reduce the drift in the data. When employing class-rebalancing techniques during the drift detection process, we observe that they eliminate up to 31% of the drift from the chronological-defect datasets and improve prediction performance. Therefore, we suggest adopting class-rebalancing techniques in the process of eliminating CD from chronological-defect datasets and improving prediction performance.
Looking at Figure 4 and Figure 5, we observe that some versions of the projects have CDs while employing the class-rebalancing techniques. However, there were no drifts when the models were updated without applying class-rebalancing techniques. For example, for the ivy project, CDs appear after employing ROSE techniques with KNN. However, there were no drifts when only the models were updated over the versions. This finding encourages us to advise practitioners to take the class-rebalancing strategy into account as an attention approach for CD adaptation. Overall, the class-rebalancing technique can be thought of as one of the properties of drift adaptation in such a CVDP practical environment. From this empirical investigation, we provide guidelines to software-engineering researchers and practitioners—they should be aware of the models and take class rebalancing into consideration as an attention strategy if they want to remove CD from chronological-defect datasets.

7. Threats to Validity

7.1. Construct and Internal Validity

From the available benchmark defect datasets, we chose those that uphold the software development cycle’s chronological order. We consider the chronological-split approach to make the CV scenario. Others might decide to carry out CVDP experiments in a different way. Furthermore, the adopted empirical approach for this study does have an impact on the CD-detection and elimination results for CVDP. It should be noted that our CODE framework relies on the available class labels of testing data. Detection of CD in the absence of class labels is beyond this empirical study, although this is quite an important limitation in practice.
One possible threat concerns the ground-truth information. Chronological-defect data usually does not provide any ground-truth information about the location of CD. From the considered datasets, Xu et al. [36] found that 40% of the modules in the following versions were faulty, indicating distribution disparities. If the versions are thought of as being in a CV data stream, then such disparities may render the versions drift-prone. Based on our previous research [39,54], we affirm the existence of CD in the software defect datasets.
In our comprehensive and systematic empirical study, we characterise the software with static code metrics, which may not capture other aspects captured by process metrics. This is a limitation imposed by the choice of datasets used. However, the considered static code metrics are widely used in the CVDP literature [36,37,38].

7.2. External Validity

We consider a limited number of cross-version systems. As a result, the conclusions may not be generalizable beyond the experimental environment and datasets used. However, to the best of our knowledge, our empirical study considers the chronological-defect datasets that have been used in the studies of CVDP [15,36,37,38,58,59].
In addition, the results of this empirical study rely on cross-version defect prediction. However, there are various defect prediction scenarios in the literature (e.g., just-in-time defect prediction [60] and cross-project defect prediction [12,61]), and the lessons learned from our experimental results may not transfer to them. We studied three performance measures; our results may differ for threshold-independent performance measures (e.g., the area under the receiver operating characteristic curve (AUC) and the Brier score), which can be explored in future work.

8. Related Work

8.1. Cross-Version Defect Prediction

CVDP has attracted significant attention because of its applicability in practice. Among the CVDP literature, most works using machine-learning models assume that the historical data collected from software projects used to build prediction models maintain chronological order.
For a realistic evaluation, prediction models have been applied to defect prediction using multiple versions of software projects [36,58,62]. For example, Bennin et al. [58] carried out a comparative study of 11 density-based prediction models using effort-aware measures; according to the empirical findings, K-star and M5 are the top performers in the CVDP context. Xu et al. [36] mentioned that the differences in prediction performance obtained from their experimental results were not statistically significant among all the methods.
Recent work has either emphasized enhancing the training sets or utilized the cross-project defect prediction (CPDP) approaches for CVDP. By utilizing hybrid active learning and kernel principal component analysis (KPCA), Xu et al. [37] suggested a framework for CVDP in which features are taken from the most recent version and integrated with the prior version to create representative module sets (i.e., mixed training sets). The suggested framework made an effort to improve the training sets. However, considerable effort is required to label selected modules and produce a suitable training set. Another shortcoming of this approach is that it requires one to convert the metrics back to the datasets for an estimator, and it influences the result of estimation while the models are constructed, as described by [59]. Additionally, the study by Amasaki [33] concentrated on the outcomes of CPDP strategies under CVDP. These experiments were carried out using a variety of previous versions. However, some methods from CPDP were not considered when comparing with the proposed methods. Moreover, Shukla et al. [15] addressed the problem of multi-objective optimization in CVDP (i.e., maximizing recall by minimizing cost and misclassification) and conducted experiments on 30 versions from 11 projects.
An insightful work [59] in CVDP is based on the ranking task. In this work, Yang and Wen addressed CVDP as a problem of multicollinearity. They investigated ridge and lasso regression in 41 versions from 11 projects. Ridge regression performed better compared with baseline methods. However, their work focuses on ranking tasks, not on the defect classification.
The works mentioned above do not take into account the real-world scenario that software companies expand with time and that the methodologies created may become outdated at a particular point. The works [33,37,58,59] mentioned were probably created for stationary situations. Changes in distributions can lead to inaccurate results, and even a well-trained prediction model will become outdated in the face of such CD, as noted by Dong et al. [23].

8.2. CVDP Considering Distribution Differences

The works that attempted to eliminate distribution disparities between two nearby versions of the CVDP software system are presented in this part.
Xu et al. [38] used dissimilarity-based sparse subset selection (DS3) in an effort to minimize the disparities in distribution between two versions; 56 versions of 15 projects were used for the CVDP study. They found that the prediction performance of all the models was compromised by distributional differences, and they attempted to reduce the distribution difference between two nearby versions by removing identified modules. However, deleting modules from the prior version hindered the performance of an effective prediction model, and extra effort is required to label the representative modules. To address this problem, Xu et al. [36] conducted a large-scale experiment on 50 versions of 17 projects and presented a two-stage training-subset selection method. In both studies [36,38], modules were removed from project versions in an attempt to alleviate the distribution differences and produce a suitable training set. Inevitably, this caused the loss of data carrying information about the current and prior versions of the projects, which is a drawback to the performance of such models.
Even though the studies above considered the chronology of the projects’ versions, they did not consider the CVDP as a learning scenario where the versions appear after a certain period of time. This may lead to data drift. Consequently, a prediction performance drop could occur in case concept drift occurs in the associated versions.
Concerning concept drift, very few studies examined it in SDP [63,64,65]. For instance, Ekanayake et al. [64] conducted an investigation in open-source projects for change-level defect prediction. They observed the impact of concept drift leading to prediction performance deterioration in SDP. In our prior works [39,54], we also observe the significant impact of concept drift in SDP. To a large extent, these studies guide us to consider the chronological-defect datasets as a CV data stream to reflect the CVDP in practice. In the current state of CVDP research, the detection of CD is required for improving prediction performance in CV settings.

9. Conclusions

In this work, we conducted an investigation on concept drift (CD) and the impact of four class-rebalancing techniques, i.e., SMOTE, ROSE, undersampling, and oversampling, on CD for chronological-defect datasets. We performed empirical analysis on 36 versions of 10 software projects where project versions appear in chronological order after a certain period. We trained the cross-version (CV) models using six classification techniques and evaluated the performance using three performance measures. To examine the CD in the CV data stream, a version-based moving-window strategy was used. Through a comprehensive and systematic empirical study, we observe the following:
  • When using the most-used classifiers from the defect prediction literature, up to 50% of the chronological-defect datasets exhibit drift-prone behavior.
  • The class-rebalancing procedures eliminate CD from the datasets by up to 31% and also enhance the prediction performance of the CVDP models.
  • The class-rebalancing techniques exhibit statistically and practically significant performance improvements when considering recall and Gmean. Additionally, the models yield better performance enhancements when employing the SMOTE and undersampling techniques.
Based on the findings of the experimental results, we make the following suggestions:
  • The class-rebalancing techniques are beneficial when the practitioners wish to increase the ability of correctly classifying the CV defective modules.
  • We suggest adding class-rebalancing techniques in the drift elimination process if practitioners wish to alleviate CD from chronological-defect datasets.
In practice, software quality practitioners are more interested in techniques that can effectively allocate testing resources. Thus, knowledge of the stability of past prediction models is necessary for practitioners in such NSE. To that end, the CODE framework assesses the robustness of class-rebalancing techniques to help practitioners allocate scarce testing resources efficiently. The significant experimental results demonstrate the impact of the proposed framework. Although this empirical study is conducted for CVDP, the proposed CODE framework could be employed for any defect dataset where the data appear over time, provided that labeled data are available.
The study is performed specifically on software defect datasets; therefore, the proposed framework cannot be generalized to other applications. As future work, the proposed framework could be tested on more generalized datasets to increase the applicability of the framework. Moreover, as the proposed framework relies on the labeled data, this could limit its use in specific applications where the labeled data is not available.

Author Contributions

Conceptualization, M.A.K.; methodology, M.A.K.; software, M.A.K.; validation, M.A.K., A.U.R. and S.B.; formal analysis, M.A.K.; investigation, M.A.K.; resources, M.A.K.; data curation, M.A.K.; writing—original draft preparation, M.A.K.; writing—review and editing, M.A.K., A.U.R., M.U.A. and S.B.; visualization, M.A.K.; supervision, S.B.; project administration, M.A.K.; funding acquisition, M.A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gangwar, A.K.; Kumar, S.; Mishra, A. A Paired Learner-Based Approach for Concept Drift Detection and Adaptation in Software Defect Prediction. Appl. Sci. 2021, 11, 6663.
  2. Malialis, K.; Panayiotou, C.G.; Polycarpou, M.M. Nonstationary data stream classification with online active learning and siamese neural networks. Neurocomputing 2022, 512, 235–252.
  3. Pandit, M.; Gupta, D.; Anand, D.; Goyal, N.; Aljahdali, H.M.; Mansilla, A.O.; Kadry, S.; Kumar, A. Towards Design and Feasibility Analysis of DePaaS: AI Based Global Unified Software Defect Prediction Framework. Appl. Sci. 2022, 12, 493.
  4. Pachouly, J.; Ahirrao, S.; Kotecha, K.; Selvachandran, G.; Abraham, A. A systematic literature review on software defect prediction using artificial intelligence: Datasets, Data Validation Methods, Approaches, and Tools. Eng. Appl. Artif. Intell. 2022, 111, 104773.
  5. Alazba, A.; Aljamaan, H. Software Defect Prediction Using Stacking Generalization of Optimized Tree-Based Ensembles. Appl. Sci. 2022, 12, 4577.
  6. Zhao, Y.; Zhu, Y.; Yu, Q.; Chen, X. Cross-Project Defect Prediction Considering Multiple Data Distribution Simultaneously. Symmetry 2022, 14, 401.
  7. Jorayeva, M.; Akbulut, A.; Catal, C.; Mishra, A. Deep Learning-Based Defect Prediction for Mobile Applications. Sensors 2022, 22, 4734.
  8. Pan, C.; Lu, M.; Xu, B.; Gao, H. An Improved CNN Model for Within-Project Software Defect Prediction. Appl. Sci. 2019, 9, 2138.
  9. Kabir, M.A.; Keung, J.; Turhan, B.; Bennin, K.E. Inter-release defect prediction with feature selection using temporal chunk-based learning: An empirical study. Appl. Soft Comput. 2021, 113, 107870.
  10. Luo, H.; Dai, H.; Peng, W.; Hu, W.; Li, F. An Empirical Study of Training Data Selection Methods for Ranking-Oriented Cross-Project Defect Prediction. Sensors 2021, 21, 7535.
  11. Hosseini, S.; Turhan, B.; Gunarathna, D. A Systematic Literature Review and Meta-Analysis on Cross Project Defect Prediction. IEEE Trans. Softw. Eng. 2019, 45, 111–147.
  12. Porto, F.; Minku, L.; Mendes, E.; Simao, A. A systematic study of cross-project defect prediction with meta-learning. arXiv 2018, arXiv:1802.06025.
  13. Lokan, C.; Mendes, E. Investigating the use of moving windows to improve software effort prediction: A replicated study. Empir. Softw. Eng. 2017, 22, 716–767.
  14. Minku, L.; Yao, X. Which models of the past are relevant to the present? A software effort estimation approach to exploiting useful past models. Autom. Softw. Eng. 2017, 24, 499–542.
  15. Shukla, S.; Radhakrishnan, T.; Muthukumaran, K.; Neti, L.B.M. Multi-objective cross-version defect prediction. Soft Comput. 2018, 22, 1959–1980.
  16. Bennin, K.E.; Keung, J.W.; Monden, A. On the relative value of data resampling approaches for software defect prediction. Empir. Softw. Eng. 2019, 24, 602–636.
  17. Tantithamthavorn, C.; Hassan, A.E.; Matsumoto, K. The Impact of Class Rebalancing Techniques on the Performance and Interpretation of Defect Prediction Models. IEEE Trans. Softw. Eng. 2020, 46, 1200–1219.
  18. Mahdi, O.A.; Pardede, E.; Ali, N.; Cao, J. Fast Reaction to Sudden Concept Drift in the Absence of Class Labels. Appl. Sci. 2020, 10, 606.
  19. Ditzler, G.; Roveri, M.; Alippi, C.; Polikar, R. Learning in Nonstationary Environments: A Survey. IEEE Comput. Intell. Mag. 2015, 10, 12–25.
  20. de Lima Cabral, D.R.; de Barros, R.S.M. Concept drift detection based on Fisher’s Exact test. Inf. Sci. 2018, 442–443, 220–234.
  21. Webb, G.I.; Hyde, R.; Cao, H.; Nguyen, H.L.; Petitjean, F. Characterizing concept drift. Data Min. Knowl. Discov. 2016, 30, 964–994.
  22. Lu, J.; Liu, A.; Dong, F.; Gu, F.; Gama, J.; Zhang, G. Learning under Concept Drift: A Review. IEEE Trans. Knowl. Data Eng. 2019, 31, 2346–2363.
  23. Dong, F.; Lu, J.; Li, K.; Zhang, G. Concept drift region identification via competence-based discrepancy distribution estimation. In Proceedings of the 2017 12th International Conference on Intelligent Systems and Knowledge Engineering (ISKE), Nanjing, China, 24–26 November 2017; pp. 1–7.
  24. Rehman, A.U.; Belhaouari, S.B.; Ijaz, M.; Bermak, A.; Hamdi, M. Multi-Classifier Tree With Transient Features for Drift Compensation in Electronic Nose. IEEE Sens. J. 2021, 21, 6564–6574.
  25. Baena-García, M.; del Campo-Ávila, J.; Fidalgo, R.; Bifet, A.; Gavalda, R.; Morales-Bueno, R. Early drift detection method. In Proceedings of the Fourth International Workshop on Knowledge Discovery from Data Streams, Xi’an, China, 14–16 August 2006; Volume 6, pp. 77–86.
  26. Bifet, A.; Gavalda, R. Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM International Conference on Data Mining, Minneapolis, MN, USA, 26–28 April 2007; pp. 443–448.
  27. Pesaranghader, A.; Viktor, H.L. Fast Hoeffding Drift Detection Method for Evolving Data Streams. In Machine Learning and Knowledge Discovery in Databases; Frasconi, P., Landwehr, N., Manco, G., Vreeken, J., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 96–111.
  28. Gama, J.; Zliobaite, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A Survey on Concept Drift Adaptation. ACM Comput. Surv. 2014, 46, 1–37.
  29. Klinkenberg, R.; Joachims, T. Detecting Concept Drift with Support Vector Machines. In Proceedings of the Seventeenth International Conference on Machine Learning, San Francisco, CA, USA, 29 June–2 July 2000; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2000; pp. 487–494.
  30. Lokan, C.; Mendes, E. Investigating the use of duration-based moving windows to improve software effort prediction: A replicated study. Inf. Softw. Technol. 2014, 56, 1063–1075.
  31. Amasaki, S. On Applicability of Cross-Project Defect Prediction Method for Multi-Versions Projects. In Proceedings of the 13th International Conference on Predictive Models and Data Analytics in Software Engineering, Toronto, ON, Canada, 8 November 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 93–96.
  32. Amasaki, S. Cross-Version Defect Prediction Using Cross-Project Defect Prediction Approaches: Does It Work? In Proceedings of the 14th International Conference on Predictive Models and Data Analytics in Software Engineering, Oulu, Finland, 10 October 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 32–41.
  33. Amasaki, S. Cross-version defect prediction: Use historical data, cross-project data, or both? Empir. Softw. Eng. 2020, 25, 1573–1595. [Google Scholar] [CrossRef]
  34. Lyu, Y.; Li, H.; Sayagh, M.; Jiang, Z.M.J.; Hassan, A.E. An Empirical Study of the Impact of Data Splitting Decisions on the Performance of AIOps Solutions. ACM Trans. Softw. Eng. Methodol. 2021, 30, 1–38. [Google Scholar] [CrossRef]
  35. Madeyski, L.; Jureczko, M. Which process metrics can significantly improve defect prediction models? An empirical study. Softw. Qual. J. 2015, 23, 393–422. [Google Scholar] [CrossRef]
  36. Xu, Z.; Li, S.; Luo, X.; Liu, J.; Zhang, T.; Tang, Y.; Xu, J.; Yuan, P.; Keung, J. TSTSS: A two-stage training subset selection framework for cross version defect prediction. J. Syst. Softw. 2019, 154, 59–78. [Google Scholar] [CrossRef]
  37. Xu, Z.; Liu, J.; Luo, X.; Zhang, T. Cross-version defect prediction via hybrid active learning with kernel principal component analysis. In Proceedings of the 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), Campobasso, Italy, 20–23 March 2018; pp. 209–220. [Google Scholar] [CrossRef]
  38. Xu, Z.; Li, S.; Tang, Y.; Luo, X.; Zhang, T.; Liu, J.; Xu, J. Cross Version Defect Prediction with Representative Data via Sparse Subset Selection. In Proceedings of the 26th Conference on Program Comprehension, Gothenburg, Sweden, 27–28 May 2018; ACM: New York, NY, USA, 2018; pp. 132–143. [Google Scholar] [CrossRef]
  39. Kabir, M.A.; Keung, J.W.; Bennin, K.E.; Zhang, M. Assessing the Significant Impact of Concept Drift in Software Defect Prediction. In Proceedings of the 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Milwaukee, WI, USA, 15–19 July 2019; Volume 1, pp. 53–58. [Google Scholar] [CrossRef]
  40. The SEACRAFT Repository of Empirical Software Engineering Data. 2017. Available online: https://zenodo.org/communities/seacraft (accessed on 1 January 2022).
  41. The Promise Repository of Empirical Software Engineering Data. 2005. Available online: http://promise.site.uottawa.ca/SERepository (accessed on 1 January 2022).
  42. Bangash, A.A.; Sahar, H.; Hindle, A.; Ali, K. On the time-based conclusion stability of cross-project defect prediction models. Empir. Softw. Eng. Int. J. 2020, 25, 5047–5083. [Google Scholar] [CrossRef]
  43. Feng, S.; Keung, J.; Yu, X.; Xiao, Y.; Bennin, K.E.; Kabir, M.A.; Zhang, M. COSTE: Complexity-based OverSampling TEchnique to alleviate the class imbalance problem in software defect prediction. Inf. Softw. Technol. 2021, 129, 106432. [Google Scholar] [CrossRef]
  44. Kuhn, M.; Wing, J.; Weston, S.; Williams, A.; Keefer, C.; Engelhardt, A.; Cooper, T.; Mayer, Z.; Kenkel, B.; Team, R.C.; et al. Package ‘caret’. R J. 2020. Available online: http://free-cd.stat.unipd.it/web/packages/caret/caret.pdf (accessed on 1 January 2022).
  45. Torgo, L.; Torgo, M.L. Package ‘DMwR’; Comprehensive R Archive Network: Vienna, Austria, 2013. [Google Scholar]
  46. Bennin, K.E.; Keung, J.; Phannachitta, P.; Monden, A.; Mensah, S. MAHAKIL: Diversity Based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction. IEEE Trans. Softw. Eng. 2018, 44, 534–550. [Google Scholar] [CrossRef]
  47. Menzies, T.; Greenwald, J.; Frank, A. Data Mining Static Code Attributes to Learn Defect Predictors. IEEE Trans. Softw. Eng. 2007, 33, 2–13. [Google Scholar] [CrossRef]
  48. He, Z.; Shu, F.; Yang, Y.; Li, M.; Wang, Q. An investigation on the feasibility of cross-project defect prediction. Autom. Softw. Eng. 2012, 19, 167–199. [Google Scholar] [CrossRef]
  49. Menzies, T.; Dekhtyar, A.; Distefano, J.; Greenwald, J. Problems with Precision: A Response to “Comments on ’Data Mining Static Code Attributes to Learn Defect Predictors”. IEEE Trans. Softw. Eng. 2007, 33, 637–640. [Google Scholar] [CrossRef]
  50. Kocaguneli, E.; Menzies, T.; Keung, J.; Cok, D.; Madachy, R. Active learning and effort estimation: Finding the essential content of software effort estimation data. IEEE Trans. Softw. Eng. 2013, 39, 1040–1053. [Google Scholar] [CrossRef]
  51. Kitchenham, B.; Madeyski, L.; Budgen, D.; Keung, J.; Brereton, P.; Charters, S.; Gibbs, S.; Pohthong, A. Robust Statistical Methods for Empirical Software Engineering. Empir. Softw. Eng. 2017, 22, 579–630. [Google Scholar] [CrossRef] [Green Version]
  52. Romano, J.; Kromrey, J.D.; Coraggio, J.; Skowronek, J. Appropriate statistics for ordinal level data: Should we really be using t-test and Cohen’sd for evaluating group differences on the NSSE and other surveys. In Proceedings of the Annual Meeting of the Florida Association of Institutional Research, Cocoa Beach, FL, USA, 1–3 February 2006; pp. 1–33. [Google Scholar]
  53. Gama, J. Knowledge Discovery from Data Streams; CRC Press: Boca Raton, FL, USA, 2010. [Google Scholar]
  54. Kabir, M.A.; Keung, J.W.; Bennin, K.E.; Zhang, M. A Drift Propensity Detection Technique to Improve the Performance for Cross-Version Software Defect Prediction. In Proceedings of the 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC), Madrid, Spain, 13–17 July 2020; pp. 882–891. [Google Scholar] [CrossRef]
  55. Turhan, B. On the dataset shift problem in software engineering prediction models. Empir. Softw. Eng. 2012, 17, 62–74. [Google Scholar] [CrossRef]
  56. Haug, J.; Kasneci, G. Learning Parameter Distributions to Detect Concept Drift in Data Streams. arXiv 2020, arXiv:2010.09388. [Google Scholar]
  57. Lin, Q.; Hsieh, K.; Dang, Y.; Zhang, H.; Sui, K.; Xu, Y.; Lou, J.G.; Li, C.; Wu, Y.; Yao, R.; et al. Predicting node failure in cloud service systems. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Lake Buena Vista, FL, USA, 4–9 November 2018; pp. 480–490. [Google Scholar]
  58. Bennin, K.E.; Toda, K.; Kamei, Y.; Keung, J.; Monden, A.; Ubayashi, N. Empirical Evaluation of Cross-Release Effort-Aware Defect Prediction Models. In Proceedings of the 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), Vienna, Austria, 1–3 August 2016; pp. 214–221. [Google Scholar] [CrossRef]
  59. Yang, X.; Wen, W. Ridge and Lasso Regression Models for Cross-Version Defect Prediction. IEEE Trans. Reliab. 2018, 67, 885–896. [Google Scholar] [CrossRef]
  60. Fan, Y.; Xia, X.; Alencar da Costa, D.; Lo, D.; Hassan, A.E.; Li, S. The Impact of Changes Mislabeled by SZZ on Just-in-Time Defect Prediction. IEEE Trans. Softw. Eng. 2019, 47, 1559–1586. [Google Scholar] [CrossRef]
  61. Herbold, S.; Trautsch, A.; Grabowski, J. A Comparative Study to Benchmark Cross-Project Defect Prediction Approaches. IEEE Trans. Softw. Eng. 2018, 44, 811–833. [Google Scholar] [CrossRef]
  62. Turhan, B.; Tosun Mısırlı, A.; Bener, A. Empirical evaluation of the effects of mixed project data on learning defect predictors. Inf. Softw. Technol. 2013, 55, 1101–1118. [Google Scholar] [CrossRef]
  63. Ekanayake, J.; Tappolet, J.; Gall, H.C.; Bernstein, A. Tracking concept drift of software projects using defect prediction quality. In Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories, Vancouver, BC, Canada, 16–17 May 2009; pp. 51–60. [Google Scholar] [CrossRef]
  64. Ekanayake, J.; Tappolet, J.; Gall, H.C.; Bernstein, A. Time variance and defect prediction in software projects. Empir. Softw. Eng. 2012, 17, 348–389. [Google Scholar] [CrossRef]
  65. Bennin, K.E.; bin Ali, N.; Börstler, J.; Yu, X. Revisiting the Impact of Concept Drift on Just-in-Time Quality Assurance. In Proceedings of the 2020 IEEE 20th International Conference on Software Quality, Reliability and Security (QRS), Macau, China, 11–14 December 2020. [Google Scholar] [CrossRef]
Figure 1. Two-window-based data distribution at timestamp t + 7 .
Figure 2. An overview of CODE framework.
Figure 3. Window size 1 (i.e., defect data for one version) in the version-based moving-window approach.
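To make the setup of Figure 3 concrete, the following R sketch (not the authors' implementation; data-frame, column, and function names are illustrative assumptions) trains on the defect data of one version and tests on the next, sliding the one-version window across the chronologically ordered releases.

```r
# Version-based moving window of size 1: train on version i (window at time t),
# test on version i + 1 (window at time t + 1).
# `versions` is assumed to be a chronologically ordered list of data frames,
# each holding the static code metrics of Table 3 plus a BUG count column.
library(randomForest)  # an illustrative classifier; the study compares several

moving_window_cvdp <- function(versions) {
  lapply(seq_len(length(versions) - 1), function(i) {
    train <- versions[[i]]
    test  <- versions[[i + 1]]

    # Binarise the BUG count into a defective/clean class label
    train$label <- factor(ifelse(train$BUG > 0, "defective", "clean"))
    test$label  <- factor(ifelse(test$BUG > 0, "defective", "clean"))
    train$BUG <- NULL
    test$BUG  <- NULL

    fit  <- randomForest(label ~ ., data = train)
    pred <- predict(fit, newdata = test)
    data.frame(actual = test$label, predicted = pred)
  })
}
```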
Figure 4. CD detection in the benchmark datasets when all metric sets are considered. A green symbol indicates drift between two successive versions of a project (p-value < 0.05 in Fisher’s exact test); a red symbol indicates no drift (p-value ≥ 0.05).
Figure 5. CD detection in the chronological-defect datasets when class-rebalancing techniques are applied. A red symbol indicates drift between two successive versions of a project (p-value < 0.05 in Fisher’s exact test); a green symbol indicates no drift (p-value ≥ 0.05).
Figure 6. The percentage difference in CD elimination when applying class-rebalancing techniques to the 36 cross-version defect datasets. A higher value is better, i.e., more CD is eliminated.
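Figures 5 and 6 report what happens when the training window is rebalanced before model construction. The R sketch below illustrates that step using SMOTE from the DMwR package cited in [45]; the perc.over/perc.under settings are placeholder assumptions rather than the paper's reported configuration, and only the training window is resampled so that the next-version test window stays untouched.

```r
library(DMwR)          # provides SMOTE(); the package cited in [45]
library(randomForest)  # illustrative classifier, as in the previous sketch

rebalance_and_train <- function(train) {
  train$label <- factor(ifelse(train$BUG > 0, "defective", "clean"))
  train$BUG   <- NULL
  # Oversample the minority (defective) class and undersample the majority;
  # the percentages below are placeholders, not the values used in the paper.
  balanced <- SMOTE(label ~ ., data = train, perc.over = 200, perc.under = 150)
  randomForest(label ~ ., data = balanced)
}
```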
Figure 7. The CVDP performance difference when employing class-rebalancing techniques. A red line indicates no improvement (i.e., a performance difference of zero).
Figure 8. The percentage of CD detected across the prediction models of the 36 chronological versions of the 10 open-source software projects. A higher value is worse, i.e., more CD is detected.
Figure 9. Win–tie–loss comparison (Mann–Whitney U test with Cliff’s | δ | effect-size magnitudes) of cross-version defect prediction models without vs. with class-rebalancing techniques. A fully green box indicates statistically significant values.
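The comparison summarised in Figure 9 pairs a Mann–Whitney U test with Cliff’s | δ | effect size. A small R sketch of such a comparison is given below; the use of the third-party effsize package is an assumption made here for illustration, not necessarily the authors' tooling.

```r
library(effsize)  # provides cliff.delta(); an assumed convenience package

# Compare two vectors of performance scores (e.g., with vs. without
# class rebalancing) across the cross-version prediction models.
compare_scores <- function(with_rebalancing, without_rebalancing, alpha = 0.05) {
  u <- wilcox.test(with_rebalancing, without_rebalancing)  # Mann–Whitney U test
  d <- cliff.delta(with_rebalancing, without_rebalancing)  # Cliff's delta
  list(p_value     = u$p.value,
       significant = u$p.value < alpha,
       delta       = unname(d$estimate),
       magnitude   = as.character(d$magnitude))
}
```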
Figure 10. Scenario for periodically updated models [34].
Table 1. Statistical summary of the studied defect datasets.

Project | Version | Release Date | #Modules | #Defects | Defects (%)
ant | ant-1.3 | 12 August 2003 | 125 | 20 | 15.90%
ant | ant-1.4 | 12 August 2003 | 178 | 40 | 22.50%
ant | ant-1.5 | 12 August 2003 | 293 | 32 | 10.90%
ant | ant-1.6 | 18 December 2003 | 351 | 92 | 26.10%
ant | ant-1.7 | 13 December 2006 | 745 | 166 | 22.30%
camel | camel-1.0 | 19 January 2009 | 339 | 13 | 3.80%
camel | camel-1.2 | 19 January 2009 | 608 | 216 | 35.50%
camel | camel-1.4 | 19 January 2009 | 872 | 145 | 16.60%
camel | camel-1.6 | 17 February 2009 | 965 | 188 | 19.50%
poi | poi-1.5 | 24 June 2007 | 237 | 141 | 59.50%
poi | poi-2.0 | 24 June 2007 | 314 | 37 | 11.80%
poi | poi-2.5 | 24 June 2007 | 385 | 248 | 64.40%
poi | poi-3.0 | 24 June 2007 | 442 | 281 | 63.60%
log4j | log4j-1.0 | 08 January 2001 | 135 | 34 | 25.20%
log4j | log4j-1.1 | 20 May 2001 | 109 | 37 | 33.90%
log4j | log4j-1.2 | 10 May 2002 | 205 | 189 | 92.20%
xerces | xerces-init | 08 November 1999 | 162 | 77 | 47.50%
xerces | xerces-1.2 | 23 June 2000 | 440 | 71 | 16.10%
xerces | xerces-1.3 | 29 November 2000 | 453 | 69 | 15.20%
xerces | xerces-1.4 | 26 January 2001 | 588 | 437 | 74.30%
velocity | velocity-1.4 | 01 December 2006 | 196 | 147 | 75.00%
velocity | velocity-1.5 | 06 March 2007 | 214 | 142 | 66.40%
velocity | velocity-1.6 | 01 December 2008 | 229 | 78 | 34.10%
ivy | ivy-1.1 | 13 June 2005 | 111 | 63 | 56.80%
ivy | ivy-1.4 | 09 November 2006 | 241 | 16 | 6.60%
ivy | ivy-2.0 | 18 January 2009 | 352 | 40 | 11.40%
lucene | lucene-2.0 | 26 May 2006 | 195 | 91 | 46.70%
lucene | lucene-2.2 | 17 June 2007 | 247 | 144 | 58.30%
lucene | lucene-2.4 | 08 October 2008 | 340 | 203 | 59.70%
synapse | synapse-1.0 | 13 June 2007 | 157 | 16 | 10.20%
synapse | synapse-1.1 | 12 November 2007 | 222 | 60 | 27.00%
synapse | synapse-1.2 | 09 June 2008 | 256 | 86 | 33.60%
xalan | xalan-2.4 | 28 August 2002 | 723 | 110 | 15.20%
xalan | xalan-2.5 | 10 April 2003 | 803 | 387 | 48.20%
xalan | xalan-2.6 | 27 February 2004 | 885 | 411 | 46.40%
xalan | xalan-2.7 | 06 August 2005 | 909 | 898 | 98.80%
Table 2. The test statistics of a pair of classifiers at time t and t + 1.

Instances | Window at Time t | Window at Time t + 1
# of correct | (q_1 + t_1) | (q_2 + t_2)
# of incorrect | (r_1 + s_1) | (r_2 + s_2)
Total | (q_1 + t_1 + r_1 + s_1) | (q_2 + t_2 + r_2 + s_2)
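Table 2 is the 2 × 2 contingency table that feeds the hypothesis-test-based detection step; read together with the captions of Figures 4 and 5 (Fisher’s exact test, drift if p-value < 0.05), that step can be sketched in R as follows. Variable and function names are illustrative, not the authors' code.

```r
# Drift check via Fisher's exact test on the correct/incorrect counts of the
# two windows summarised in Table 2.
detect_drift <- function(correct_t, correct_t1, alpha = 0.05) {
  # correct_t / correct_t1: logical vectors, TRUE where the model classified a
  # module correctly in the window at time t and at time t + 1, respectively.
  tab <- matrix(c(sum(correct_t),  sum(correct_t1),
                  sum(!correct_t), sum(!correct_t1)),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("correct", "incorrect"), c("t", "t+1")))
  p <- fisher.test(tab)$p.value
  list(p_value = p, drift_detected = p < alpha)
}
```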
Table 3. Description of the static code metrics [43] used in this study.

Metric | Description
WMC | Weighted methods per class
DIT | Depth of inheritance tree
NOC | Number of children
CBO | Coupling between object classes
RFC | Response for a class
LCOM | Lack of cohesion in methods
Ca | Afferent coupling
Ce | Efferent coupling
NPM | Number of public methods
LCOM3 | Lack of cohesion in methods, different from LCOM
LOC | Number of lines of code
DAM | Data access metric
MOA | Measure of aggregation
MFA | Measure of functional abstraction
CAM | Cohesion among methods of a class
IC | Inheritance coupling
CBM | Coupling between methods
AMC | Average method complexity
MaxCC | Maximum value of CC among the methods in an investigated class
Avg(CC) | Arithmetic mean of CC among the methods in an investigated class
BUG | Number of faults in a class
Table 4. Confusion matrix.

 | Predicted Positive | Predicted Negative
Actual Positive | TP | FN
Actual Negative | FP | TN
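Common evaluation measures are derived from the counts in Table 4. The helper below is a generic illustration of those derivations; which measures the CODE experiments actually report is specified in the paper's evaluation section.

```r
# Evaluation measures computed from the confusion matrix of Table 4.
confusion_measures <- function(tp, fn, fp, tn) {
  recall    <- tp / (tp + fn)   # probability of detection (pd)
  precision <- tp / (tp + fp)
  pf        <- fp / (fp + tn)   # probability of false alarm
  f1        <- 2 * precision * recall / (precision + recall)
  c(recall = recall, precision = precision, pf = pf, f1 = f1)
}

# Example: confusion_measures(tp = 80, fn = 20, fp = 30, tn = 170)
```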
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
