Article

Multi-Source Heterogeneous Kernel Mapping in Software Defect Prediction

School of Reliability and Systems Engineering, Beihang University, Beijing 100191, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(9), 5526; https://doi.org/10.3390/app13095526
Submission received: 24 March 2023 / Revised: 21 April 2023 / Accepted: 26 April 2023 / Published: 28 April 2023

Abstract
Heterogeneous defect prediction (HDP) is a significant research topic in cross-project defect prediction (CPDP), owing to the inconsistency of the metrics used between source and target projects. While most HDP methods aim to improve the performance of models trained on data from a single source project, few studies have investigated how the number of source projects affects predictive performance. In this paper, we propose a new multi-source heterogeneous kernel mapping (MSHKM) algorithm to analyze the effects of different numbers of source projects on prediction results. First, we introduce two strategies based on MSHKM for multi-source HDP. To determine the impact of the number of source projects on the predictive performance of the model, we systematically vary the number of source projects under each strategy. Then, we compare the proposed MSHKM with state-of-the-art HDP methods and within-project defect prediction (WPDP) methods, in terms of three common performance measures, using 28 data sets from five widely used projects. Our results demonstrate that: (1) in the multi-source HDP scenario, strategy 2 outperforms strategy 1; (2) for MSHKM, a lower number of source projects leads to better performance under strategy 1, while n = 4 is the optimal number under strategy 2; (3) MSHKM outperforms related state-of-the-art HDP methods; and (4) MSHKM outperforms WPDP. In summary, our proposed MSHKM algorithm provides a promising solution for heterogeneous cross-project defect prediction, and our findings suggest that the number of source projects should be carefully selected to achieve optimal predictive performance.

1. Introduction

Software defect prediction is a technology that uses historical software defect information to predict the defect state of entities within the software being tested (e.g., classes, files, methods, packages, and so on) [1]. According to the prediction results, software testers can focus on the parts that are more likely to contain defects; more importantly, the prediction results provide guidance for the implementation strategy of test cases, in order to allocate test resources reasonably and improve testing efficiency. Software defect prediction establishes a correlation between historical metrics (also known as features or attributes) and defect information, using machine learning or statistical methods. This correlational relationship is called a software defect prediction model. When an adequate amount of historical version data of the software under test is available, the process of building the software defect prediction model is called within-project defect prediction (WPDP) [2,3,4]. WPDP refers to using a portion of the data set of a given project as the training set to construct a defect prediction model, which is then employed to predict the defect situation of the remaining part. However, it is more often the case that the software being tested has insufficient historical defect data, or is part of a new project without historical data. In this case, metric information and defect data collected from other software (source projects) must be employed to establish a software defect prediction model for the software under test (target project). This process is called cross-project defect prediction (CPDP) [5,6,7]. Furthermore, when the name, order, and number of metric elements across projects are consistent with the metric elements of the tested software, the constructed CPDP model is called a homogeneous defect prediction model, also known as a cross-project defect prediction with common metrics (CPDP-CM) model [8]. In practice, however, there are often large differences in the feature spaces of metrics and in the distribution of data between source and target projects, due to differences in application domains, programming languages, development processes, and coder experience, which poses a challenge for CPDP. Generally speaking, the metrics of the source project and those of the target project differ in type or number; this situation is termed heterogeneous, and traditional CPDP methods are not applicable to it. To resolve the inconsistency of metrics between source and target projects, researchers have proposed heterogeneous defect prediction (HDP) approaches [9,10,11,12,13,14,15,16,17,18,19,20,21,22,23].
Although there have been some studies on HDP, most draw their training data from a single source project, which makes them heavily dependent on the quality of that project's data. If the data from a source project are poorly correlated with the target project, the performance of the classifier may decrease, and negative transfer may even occur. In a multi-source scenario, negative transfer can be reduced by increasing the number of source projects. However, studies on multi-source HDP remain scarce, and the associated models need improvement. Moreover, few studies have focused on how the number of source projects impacts model performance. In practice, guidance on selecting the proper number of source projects can directly help engineers build high-performance models while saving time and improving efficiency.
In this paper, we propose the multi-source heterogeneous kernel mapping (MSHKM) approach, based on heterogeneous mapping (HeMap) [24] and KSETE [17], to predict the defect proneness of each module in the target project by learning from the data sets of multiple source projects whose feature spaces are heterogeneous with respect to that of the target project. The core idea of MSHKM is to simultaneously map the sample data of multiple source projects, each with different metrics, into a common feature space, under the premise of minimal information loss. We apply the MSHKM algorithm to two multi-source mapping strategies and compare them experimentally. Furthermore, we explore the impact of the number of source projects on the MSHKM algorithm by varying this number under both strategies.
In summary, the main contributions of this study are as follows:
(1)
A novel multi-source HDP method, named MSHKM (with two strategies), is proposed, which accounts for multiple inconsistent feature spaces, distribution differences, non-linear correlations, and data imbalance.
(2)
To evaluate the performance of the MSHKM algorithm under different strategies, experiments are performed on 28 public data sets from five projects—NASA [25], SOFTLAB [26], Relink [27], AEEEM [28], and MORPH [29]—under HDP and WPDP scenarios. The experimental results indicate that our proposed MSHKM outperforms the baseline approaches.
(3)
Furthermore, we provide a replication package of MSHKM, including the source code and data sets, which is available on the website: https://doi.org/10.5281/zenodo.7692416 (accessed on 9 March 2023).
The remainder of the paper is structured as follows: The next section will introduce the previous work on HDP and multi-source transfer learning. In Section 3, we propose the research methodology, including definitions, hypotheses, methods, formula derivations, and algorithms. Section 4 describes the conducted experiments, including research questions, benchmark data sets, evaluation measures, and the setup we used in our experiments. Section 5 provides an analysis of the experimental results, and Section 6 details the statistical tests. In Section 7, we discuss the limitations of our study and suggest potential directions for future work. Finally, we conclude the work of this paper in Section 8.

2. Related Works

2.1. Heterogeneous Defect Prediction

There have been several studies on HDP. Nam et al. [9] introduced a method for HDP, utilizing metric matching as its foundation. First, feature selection is carried out on the source data to eliminate irrelevant or redundant features. Second, a metric match analyzer is used to calculate the matching scores between source and target data. Finally, a conventional machine learning method is used to facilitate the model training process and predict the defect tendency.
Jing et al. [10] introduced the statistical canonical correlation analysis (CCA) method into CPDP, and proposed a new method called CCA+. The core idea is to take the canonical correlation coefficient as the measure of disparity between the distributions of the data sets, minimizing it to diminish the distribution discrepancy between the source and target data sets. The specific principles are as follows: First, the source and target data are pre-processed to construct a unified metric representation (UMR). Then, on the basis of the UMR, canonical correlation analysis is carried out to calculate the projection vector and the mapping outcomes of both source and target data under this projection vector. Finally, the defect tendency within the target data set is forecast using nearest neighbor classification in the new projection space obtained by CCA+.
Cheng et al. [11] introduced an innovative support vector machine algorithm for HDP, called the cost-sensitive correlation transfer support vector machine (CCT-SVM). They argued that risk costs deserve particular emphasis in software defect prediction; by biasing the classifier toward identifying defective modules, CCT-SVM effectively reduces the negative impact of data imbalance. The specific method is as follows: First, the UMR technique is employed to enable the comparison of heterogeneous data. Subsequently, drawing upon the UMR result for the heterogeneous data, the CCA technique is utilized to identify a common representation for features extracted from both source and target projects. Finally, they construct a novel SVM model for heterogeneous defect prediction by introducing a correlation regularizer and specific misclassification costs.
Ma et al. [12] put forward a new methodology, KCCA+, by combining a kernel function with CCA+, considering that CCA is essentially a linear transformation, and that there are often complex non-linear relationships between metric elements of different data sets.
Li et al. [13] applied the technique of cost-sensitive learning and combined it with kernel learning technology to suggest a novel analysis approach for HDP, called cost-sensitive transfer kernel canonical correlation analysis (CTKCCA), which aims to address the problem of linear inseparability and alleviate issues related to class imbalance in the software defect prediction process.
Yu et al. [14] presented a heterogeneous cross-project defect tendency prediction algorithm based on feature matching, called feature match transfer (FMT). The principle is as follows: First, the features of the source data are selected, and their distribution curves are calculated; the distribution curves of the features in the target data are calculated in the same way. Then, according to a defined feature distance formula, feature matching between the source data and target data is performed. Finally, the model is trained, and the defects in the target data are predicted on the matched feature space.
Li et al. [15] incorporated the concept of ensemble learning, and proposed a novel HDP method, known as ensemble multi-kernel correlation alignment (EMKCA). EMKCA constructs multiple kernels, based on different types of features extracted from the source and target domains. These kernels are then combined into a single kernel matrix, using an ensemble approach. The kernel matrix is then used to align the source and target domains by maximizing the correlation between them. After that, in [16], they presented a novel two-stage ensemble learning (TSEL) method for HDP. In the first stage, the EMKCA method is employed. In the second stage, the RESample with replacement (RES) technique is adopted to train multiple EMKCA predictors with diversity, which are then aggregated using the average ensemble method.
Tong et al. [17] suggested an innovative approach for HDP, called the kernel spectral embedding transfer ensemble (KSETE). Their approach entails tackling the class imbalance issue within the source data set initially, then combining kernel spectral embedding, transfer learning, and ensemble learning, to identify potential common kernel feature subspaces. Finally, classifiers are built on every common feature space to make predictions for the target.
Xu et al. [18] proposed a new heterogeneous domain adaptation (HDA) method for HDP. HDA is a subfield of domain adaptation. It is utilized to embed cross-project data into a low-dimensional comparable feature space, followed by the measurement of the dissimilarity between the mapped domains based on dictionary learning techniques. Subsequently, in [19], they proposed a novel Multiple-View Spectral Embedding (MVSE) approach for HDP. MVSE combines multiple views of the data into a unified low-dimensional depiction that aptly captures the underlying structure of the data. In terms of HDP, MVSE utilizes a spectral embedding technique to transform the heterogeneous feature set into a coherent space, maximizing the similarity between the two mapped feature sets. The key idea is to perform spectral embedding on each view of the data.
Gong et al. [20] proposed an unsupervised deep domain adaptation method for HDP. Specifically, they first adopt the unified metric representation (UMR) of the source and target project data. Next, a deep autoencoder is employed to acquire feature representations for both source and target projects. Finally, the maximum mean discrepancy (MMD) is introduced as a measure to characterize the disparity between the projects, and an adversarial loss function is utilized to minimize the distance between these feature representations.
Wang et al. [22] presented a few-shot learning-based balanced distribution adaptation (FSLBDA), which requires little labeled data to fit the distribution of the target domain. Specifically, first, extreme gradient boosting is employed to eliminate redundant metrics of the data sets. Then, balanced distribution adaptation is utilized to alleviate the difference between the source and target domains. Finally, the influence of a small training data set can be reduced by adaptive boosting.
Zong et al. [23] recently introduced optimal transport (OT) theory, and proposed two prediction algorithms for HDP using optimal transport: one algorithm is based on the entropic Gromov–Wasserstein (EGW) discrepancy, while the other is the EGW+ transport algorithm. Specifically, the method first computes a transportation plan that minimizes the discrepancy between the source and target distributions. Subsequently, a weighted transfer learning approach is used to transfer knowledge from the source domain to the target domain, based on the transportation plan.
The above HDP methods all use single-source project data as training data. To obtain good model performance, however, the data quality of the source project must be high; if the data do not meet this requirement, negative transfer will occur, and these methods cannot achieve satisfactory model performance.

2.2. Multi-Source Transfer Learning

For the multi-source heterogeneous problem, there have been some studies in other fields. Wu et al. [30] proposed multiple graphs and low-rank embedding (MGLE), which learns representations of data points across multiple graphs, employing a low-rank embedding to acquire a shared representation for the data. Specifically, the method learns a low-dimensional embedding for each graph separately, then combines these embeddings into a shared representation using a weighted sum.
Chai et al. [31] put forward a method for predicting future price fluctuations, using a combination of heterogeneous data sources. The method utilizes natural language processing techniques to extract relevant information from multiple heterogeneous textual data sources, then combines this information with numerical data from financial statements. The combined data are then used to train a machine learning model to predict future price fluctuations.
Zhao et al. [32] proposed a method for constructing an ontology and mapping heterogeneous data from multiple sources, using a hybrid neural network composed of a convolutional neural network (CNN), a recurrent neural network (RNN), and an autoencoder; the network learns a low-dimensional representation of the data, which is used to construct the ontology and map data from the different sources.
The aforementioned methods primarily address multi-source heterogeneous problems in other fields; there have also been some studies applying such ideas to multi-source CPDP.
Chen et al. [33] evaluated the effectiveness of different strategies, with respect to defect prediction, including single-source and multi-source approaches, feature selection, and transfer learning. They concluded that HDP methods based on metric transformation usually have better predictive performance, while HDP methods based on metric selection have better interpretability. Liu et al. [34] evaluated the effectiveness of different multi-source cross-project defect prediction (MSCPDP) models for defect prediction across multiple projects.
However, the above studies considered homogeneous CPDP, rather than heterogeneous CPDP. For multi-source HDP, there are two related studies in the literature: Li et al. [35] presented a multi-source selection-based manifold discriminant alignment (MSMDA) approach to address the challenges of using multiple sources of data for heterogeneous defect prediction while preserving privacy, and Wu et al. [21] proposed a multi-source heterogeneous cross-project method (MHCPDP), implemented using multi-source transfer learning and autoencoder techniques.
Despite the significance of multi-source HDP, there remains a limited number of studies addressing this topic, with associated models requiring further refinement. Furthermore, the impact of varying numbers of source projects on model performance has received little attention in the literature.

3. Research Methodology

This section describes the details of the proposed method, including feature mapping strategies, data pre-processing, and the multi-source heterogeneous kernel mapping (MSHKM) approach.

3.1. Feature Mapping Strategies

To introduce the feature mapping strategies and the MSHKM algorithm, we first identify the target project. Then, we select several source projects in the heterogeneous scenario, where the data feature spaces of the source and target projects differ. The details of the projects and data sets are presented in Section 4.1.
There are two mapping strategies that can be used when conducting a feature space transformation to deal with multi-source HDP.
Strategy 1: Mapping once.
(1)
Input the n source projects simultaneously;
(2)
Map the n source projects and the target project into the same feature space at the same time;
(3)
Use the mapped data of the n source projects together as the training data for a single model.
Strategy 2: Mapping n times.
(1)
Input the n source projects successively;
(2)
Map one source project and the target project into the same feature space each time;
(3)
Use the source project data from each mapping as training data to build one model, such that a total of n models are built after the n mappings.
Figure 1 and Figure 2 present the two strategies of the proposed MSHKM for multi-source HDP, respectively. Each is mainly divided into three parts: data pre-processing, multi-source heterogeneous kernel mapping (MSHKM), and model training.
First, the data sets of the original source and target projects are pre-processed under both strategies. Due to the limitations of the MSHKM algorithm (see Section 3.3), before mapping, it is necessary to ensure that the number of samples in each data set is consistent; thus, an over-sampling or under-sampling method should be adopted. The pink parts in the figures represent generated data, which cannot be used as test data for the model. The two strategies differ as follows: In strategy 1, all of the source project data and the target project data are mapped once by the MSHKM algorithm, after which the projected source project data are used to train a single model. Meanwhile, strategy 2 maps the data of each source project together with the target project using the MSHKM algorithm, for a total of n times, then trains a model on each mapped source data set, building a total of n models. It is important to note that, regardless of the strategy, the generated samples must be stripped from the mapped test data (i.e., the target project data) before testing. A minimal sketch contrasting the two strategies is given below.
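The following sketch, written against scikit-learn, contrasts the two strategies. The helper `mshkm` stands in for the MSHKM projection of Algorithm 1 (sketched in Section 3.3) and is assumed to take the target feature matrix, a list of source feature matrices, and the new dimension k, returning the projected target and source matrices; these names and signatures are illustrative assumptions, not the released implementation.

```python
# Illustrative contrast of the two strategies; `mshkm` is a hypothetical
# stand-in for the MSHKM projection of Algorithm 1 (Section 3.3).
import numpy as np
from sklearn.linear_model import LogisticRegression

def strategy_one(source_sets, X_target, k):
    """Strategy 1: one joint mapping of all n sources, one trained model."""
    B_target, B_sources = mshkm(X_target, [X for X, _ in source_sets], k)
    X_train = np.vstack(B_sources)                        # pooled projected sources
    y_train = np.concatenate([y for _, y in source_sets])
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model.predict(B_target)

def strategy_two(source_sets, X_target, k):
    """Strategy 2: n separate source-target mappings, n trained models."""
    predictions = []
    for X_src, y_src in source_sets:
        B_target, (B_src,) = mshkm(X_target, [X_src], k)  # one mapping per source
        model = LogisticRegression(max_iter=1000).fit(B_src, y_src)
        predictions.append(model.predict(B_target))
    return predictions  # one prediction set per source project
```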

3.2. Data Pre-Processing

Considering the problems associated with class imbalance and possible noise in software defect data, the defect data are pre-processed as follows: (1) redundant instances are eliminated; (2) instances with missing values are removed; (3) if one class in the source data greatly outnumbers the other, the synthetic minority over-sampling technique (SMOTE) [36] is employed to treat the class imbalance; and (4) as the magnitudes of different metrics may vary greatly, Z-score normalization is used to standardize the source and target data sets, such that each metric has a mean of 0 and a variance of 1.
In order to perform the subsequent calculations, it is necessary to ensure that the numbers of instances in the source and target data sets are consistent. To achieve this, we first determine the common size empirically, according to a tradeoff between the sample sizes of the source and target data sets. Data sets exceeding this size are then under-sampled (e.g., by random sampling), while SMOTE is applied to data sets with fewer samples, ensuring that all data sets have the same number of samples. If the sampled data set is the target data set, the synthetic samples generated from it must be removed when predicting the labels of the target data. A sketch of this pipeline follows.
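A minimal pre-processing sketch along these lines, assuming scikit-learn and imbalanced-learn, is shown below; `n_samples` is the empirically chosen common size, and the deduplication and sampling details simplify those described above.

```python
# Hedged sketch of the pre-processing steps of Section 3.2 (simplified).
import numpy as np
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

def preprocess(X, y, n_samples, seed=0):
    X, idx = np.unique(X, axis=0, return_index=True)    # (1) drop redundant instances
    y = y[idx]
    mask = ~np.isnan(X).any(axis=1)                     # (2) drop instances with missing values
    X, y = X[mask], y[mask]
    X, y = SMOTE(random_state=seed).fit_resample(X, y)  # (3) balance the classes
    X = StandardScaler().fit_transform(X)               # (4) Z-score: mean 0, variance 1
    if len(X) > n_samples:                              # under-sample to the common size
        keep = np.random.default_rng(seed).choice(len(X), n_samples, replace=False)
        X, y = X[keep], y[keep]
    return X, y
```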

3.3. Multi-Source Heterogeneous Kernel Mapping

Multi-source heterogeneous kernel mapping (MSHKM) is based on heterogeneous mapping (HeMap) [24], and is similar to KSETE [17]. Unlike KSETE, the input of MSHKM comprises multiple data sets of different dimensions, so its feasibility needs to be proven from a theoretical perspective. The mapping principle of MSHKM is to multiply the source and target data sets, which have different metrics, by their respective projection matrices, in order to obtain data sets in the same metric feature space, as shown in Figure 3. Thus, the key is to determine the projection matrices. It should be noted that the numbers of instances for the source and target projects need to be consistent, so under- or over-sampling technology is used in data pre-processing.
Denote an unlabeled data set from the target project as $T = \{x_t^i\}_{i=1}^{N_t}$, and the labeled data sets from multiple source projects as $S_i = \{x_{S_i}^j\}_{j=1}^{N_{S_i}}$, $Y_{S_i} = \{y_{S_i}^j\}_{j=1}^{N_{S_i}}$, $i = 1, 2, \dots, n$, where $x_t^i$ and $x_{S_i}^j$ denote the $i$th and $j$th instances in $T$ and $S_i$, drawn from the marginal distributions $p_t(x_t)$ and $p_{S_i}(x_{S_i})$, respectively. $y_{S_i}^j$ denotes the label of the $j$th instance in $S_i$. $N_t$ and $N_{S_i}$ are the numbers of instances in $T$ and $S_i$, respectively, and $n$ is the number of source projects. $x_t^i \in \mathbb{R}^{1 \times l_t}$ represents the $l_t$ metric values of the $i$th instance in the target project, while $x_{S_i}^j \in \mathbb{R}^{1 \times l_{S_i}}$ denotes the $l_{S_i}$ metric values of the $j$th instance in the source project $S_i$; $l_{S_i}$ and $l_t$ represent the numbers of metrics for $S_i$ and $T$, respectively. It should be noted that the metric sizes of the target and source data differ (i.e., $\mathbb{R}^{1 \times l_t} \neq \mathbb{R}^{1 \times l_{S_i}}$). Meanwhile, the marginal distribution of the target data, $p_t(x_t)$, and that of the source data, $p_{S_i}(x_{S_i})$, also differ ($p_t(x_t) \neq p_{S_i}(x_{S_i})$). In other words, the metric sets of the source and target data sets are unalike, or heterogeneous. However, we need to train the model reasonably and effectively, using the labeled data sets from multiple source projects, to predict whether the instances of the target project are defective. This requires that the data sets of the source projects and the target project lie in the same feature space and are subject to the same distribution. Consequently, we aim to identify a shared feature space for the target data set and the multiple source data sets. According to the description above, the flowchart of the MSHKM algorithm is shown in Figure 4.
The first step involves mapping the data from the multiple source projects and the target project using a kernel function. Next, we check whether the feature spaces of the mapped data are consistent. If they are consistent, then the problem is homogeneous, and only the distribution differences between the source and target data need to be addressed. However, if they are inconsistent, the MSHKM algorithm is needed. To perform the algorithm, we must ensure that the feature space of the product of the optimal projection matrix and the mapping matrix is consistent with the feature space of the matrix before mapping, as specified in Equation (5). We then optimize the objective function to obtain the optimal projection matrices and their corresponding mapping matrices. Finally, we train the model using the projected source project data, and use the projected target project data for model prediction. The theoretical derivation of the above process is as follows:
Definition 1. Suppose the $i$th source data set is $S_i$ and the target data set is $T$, and let $\Phi(\cdot)$ be a mapping function, where $\Phi(S_i) \in \mathbb{R}^{N_{S_i} \times l_{S_i}}$ and $\Phi(T) \in \mathbb{R}^{N_t \times l_t}$ represent the mapping results on the source and target data sets, respectively. On this basis, the optimal projections $B_{\Phi(S_i)}$ and $B_{\Phi(T)}$ can be obtained through the following optimization objective:

$$\min_{B_{\Phi(S_i)},\, B_{\Phi(T)}} \; L\big(B_{\Phi(S_i)}, \Phi(S_i)\big) + L\big(B_{\Phi(T)}, \Phi(T)\big) + \beta\, D\big(B_{\Phi(S_i)}, B_{\Phi(T)}\big), \tag{1}$$

where $L(\cdot,\cdot)$ represents the weighted sum of the differences between a data set after $\Phi(\cdot)$ mapping and its further-mapped data set, such as $\Phi(T)$ and $B_{\Phi(T)}$; $L(B_{\Phi(S_i)}, \Phi(S_i))$ is defined as follows:

$$L\big(B_{\Phi(S_i)}, \Phi(S_i)\big) = \sum_{i=1}^{n} \theta_i\, \ell\big(B_{\Phi(S_i)}, \Phi(S_i)\big); \tag{2}$$

and $D(\cdot,\cdot)$ represents the weighted sum of the differences between the further-mapped target and source data sets. $\beta$ is a hyper-parameter controlling the degree of similarity between two data sets, with values ranging from 0 to 1. We further define $D(B_{\Phi(S_i)}, B_{\Phi(T)})$ as follows:

$$D\big(B_{\Phi(S_i)}, B_{\Phi(T)}\big) = \sum_{i=1}^{n} \theta_i\, d\big(B_{\Phi(S_i)}, B_{\Phi(T)}\big), \tag{3}$$

where $d(\cdot,\cdot)$ denotes the average difference between the mapped target data set $\Phi(T)$ and the further-mapped source data set $B_{\Phi(S_i)}$, as well as that between the mapped source data set $\Phi(S_i)$ and the further-mapped target data set $B_{\Phi(T)}$. It is defined as follows:

$$d\big(B_{\Phi(S_i)}, B_{\Phi(T)}\big) = \frac{1}{2}\Big(\ell\big(B_{\Phi(S_i)}, \Phi(T)\big) + \ell\big(B_{\Phi(T)}, \Phi(S_i)\big)\Big). \tag{4}$$

It should be noted that $\theta_i$ is the $i$th weight of $\ell(\cdot,\cdot)$, which denotes the importance of the $i$th source data set with respect to the target data set; the sum of all $\theta_i$ is equal to 1. Furthermore, $\ell(\cdot,\cdot)$ denotes the difference between a mapped data set and its further-mapped data set, defined as follows:

$$\begin{aligned} \ell\big(B_{\Phi(S_i)}, \Phi(S_i)\big) &= \big\| B_{\Phi(S_i)} P_{\Phi(S_i)} - \Phi(S_i) \big\|_F^2 \\ \ell\big(B_{\Phi(T)}, \Phi(T)\big) &= \big\| B_{\Phi(T)} P_{\Phi(T)} - \Phi(T) \big\|_F^2 \\ \ell\big(B_{\Phi(S_i)}, \Phi(T)\big) &= \big\| B_{\Phi(S_i)} P_{\Phi(T)} - \Phi(T) \big\|_F^2 \\ \ell\big(B_{\Phi(T)}, \Phi(S_i)\big) &= \big\| B_{\Phi(T)} P_{\Phi(S_i)} - \Phi(S_i) \big\|_F^2, \end{aligned} \tag{5}$$

where $\|\cdot\|_F^2$ denotes the squared Frobenius norm, and $P_{\Phi(S_i)} \in \mathbb{R}^{k \times l_{S_i}}$ and $P_{\Phi(T)} \in \mathbb{R}^{k \times l_t}$ represent the mapping matrices corresponding to the mapped source and target data sets $\Phi(S_i)$ and $\Phi(T)$, respectively. Thus, the optimization objective can be expanded as:

$$\begin{aligned} &\min_{B_{\Phi(S_i)}^{\mathsf{T}} B_{\Phi(S_i)} = I,\; B_{\Phi(T)}^{\mathsf{T}} B_{\Phi(T)} = I} G\big(B_{\Phi(S_i)}, B_{\Phi(T)}, P_{\Phi(S_i)}, P_{\Phi(T)}\big) \\ &\quad = \min_{B_{\Phi(S_i)}^{\mathsf{T}} B_{\Phi(S_i)} = I,\; B_{\Phi(T)}^{\mathsf{T}} B_{\Phi(T)} = I} \big\|\Phi(T) - B_{\Phi(T)} P_{\Phi(T)}\big\|_F^2 + \sum_{i=1}^{n} \theta_i \big\|\Phi(S_i) - B_{\Phi(S_i)} P_{\Phi(S_i)}\big\|_F^2 \\ &\qquad + \beta \sum_{i=1}^{n} \theta_i \Big( \tfrac{1}{2}\big\|\Phi(S_i) - B_{\Phi(T)} P_{\Phi(S_i)}\big\|_F^2 + \tfrac{1}{2}\big\|\Phi(T) - B_{\Phi(S_i)} P_{\Phi(T)}\big\|_F^2 \Big), \end{aligned} \tag{6}$$

where $B_{\Phi(S_i)} \in \mathbb{R}^{N_{S_i} \times k}$ and $B_{\Phi(T)} \in \mathbb{R}^{N_t \times k}$; $\beta$ is a hyperparameter controlling the degree of similarity between $B_{\Phi(S_i)}$ and $B_{\Phi(T)}$, and $\theta_i$ controls the importance of the $i$th source data set.
Lemma 1.
$P_{\Phi(S_i)}$ and $P_{\Phi(T)}$ can be calculated as follows:

$$\begin{aligned} P_{\Phi(S_i)} &= \frac{1}{2+\beta}\Big(2 B_{\Phi(S_i)}^{\mathsf{T}} \Phi(S_i) + \beta B_{\Phi(T)}^{\mathsf{T}} \Phi(S_i)\Big) \\ P_{\Phi(T)} &= \frac{1}{2+\beta}\Big(2 B_{\Phi(T)}^{\mathsf{T}} \Phi(T) + \beta \sum_{i=1}^{n} \theta_i B_{\Phi(S_i)}^{\mathsf{T}} \Phi(T)\Big). \end{aligned} \tag{7}$$
Proof of Lemma 1.
According to the following formulas:

$$\|X\|_F^2 = \mathrm{tr}\big(X^{\mathsf{T}} X\big), \tag{8}$$

$$B_{\Phi(S_i)}^{\mathsf{T}} B_{\Phi(S_i)} = I, \tag{9}$$

$$\mathrm{tr}\big(P_{\Phi(S_i)}^{\mathsf{T}} B_{\Phi(S_i)}^{\mathsf{T}} \Phi(S_i)\big) = \mathrm{tr}\big(B_{\Phi(S_i)}^{\mathsf{T}} \Phi(S_i) P_{\Phi(S_i)}^{\mathsf{T}}\big), \tag{10}$$

Equation (5) can be described in terms of the matrix trace norm:

$$\big\|\Phi(S_i) - B_{\Phi(S_i)} P_{\Phi(S_i)}\big\|_F^2 = \mathrm{tr}\big(\Phi^{\mathsf{T}}(S_i)\Phi(S_i)\big) - 2\,\mathrm{tr}\big(B_{\Phi(S_i)}^{\mathsf{T}}\Phi(S_i)P_{\Phi(S_i)}^{\mathsf{T}}\big) + \mathrm{tr}\big(P_{\Phi(S_i)}^{\mathsf{T}} P_{\Phi(S_i)}\big). \tag{11}$$

The terms $\|\Phi(T) - B_{\Phi(T)} P_{\Phi(T)}\|_F^2$, $\|\Phi(S_i) - B_{\Phi(T)} P_{\Phi(S_i)}\|_F^2$, and $\|\Phi(T) - B_{\Phi(S_i)} P_{\Phi(T)}\|_F^2$ can be expanded similarly. Then, the optimization objective stated in Equation (6) can be reformulated as:

$$\begin{aligned} &\min_{B_{\Phi(S_i)}^{\mathsf{T}} B_{\Phi(S_i)} = I,\; B_{\Phi(T)}^{\mathsf{T}} B_{\Phi(T)} = I} G\big(B_{\Phi(S_i)}, B_{\Phi(T)}, P_{\Phi(S_i)}, P_{\Phi(T)}\big) \\ &\quad = \min \; \Big(1+\tfrac{\beta}{2}\Big)\sum_{i=1}^{n} \theta_i\,\mathrm{tr}\big(\Phi^{\mathsf{T}}(S_i)\Phi(S_i)\big) + \Big(1+\tfrac{\beta}{2}\Big)\mathrm{tr}\big(\Phi^{\mathsf{T}}(T)\Phi(T)\big) \\ &\qquad + \Big(1+\tfrac{\beta}{2}\Big)\sum_{i=1}^{n} \theta_i\,\mathrm{tr}\big(P_{\Phi(S_i)}^{\mathsf{T}}P_{\Phi(S_i)}\big) + \Big(1+\tfrac{\beta}{2}\Big)\mathrm{tr}\big(P_{\Phi(T)}^{\mathsf{T}}P_{\Phi(T)}\big) \\ &\qquad - 2\sum_{i=1}^{n} \theta_i\,\mathrm{tr}\big(B_{\Phi(S_i)}^{\mathsf{T}}\Phi(S_i)P_{\Phi(S_i)}^{\mathsf{T}}\big) - 2\,\mathrm{tr}\big(B_{\Phi(T)}^{\mathsf{T}}\Phi(T)P_{\Phi(T)}^{\mathsf{T}}\big) \\ &\qquad - \beta\sum_{i=1}^{n} \theta_i\,\mathrm{tr}\big(B_{\Phi(S_i)}^{\mathsf{T}}\Phi(T)P_{\Phi(T)}^{\mathsf{T}}\big) - \beta\sum_{i=1}^{n} \theta_i\,\mathrm{tr}\big(B_{\Phi(T)}^{\mathsf{T}}\Phi(S_i)P_{\Phi(S_i)}^{\mathsf{T}}\big). \end{aligned} \tag{12}$$

Taking the derivative of $G$ with respect to $P_{\Phi(S_i)}$ and $P_{\Phi(T)}$, we obtain:

$$\begin{aligned} \nabla G\big(P_{\Phi(S_i)}\big) &= (2+\beta)\,\theta_i P_{\Phi(S_i)} - 2\theta_i B_{\Phi(S_i)}^{\mathsf{T}}\Phi(S_i) - \beta\theta_i B_{\Phi(T)}^{\mathsf{T}}\Phi(S_i) \\ \nabla G\big(P_{\Phi(T)}\big) &= (2+\beta)\, P_{\Phi(T)} - 2 B_{\Phi(T)}^{\mathsf{T}}\Phi(T) - \beta\sum_{i=1}^{n} \theta_i B_{\Phi(S_i)}^{\mathsf{T}}\Phi(T). \end{aligned} \tag{13}$$

Setting $\nabla G(P_{\Phi(T)}) = 0$ and $\nabla G(P_{\Phi(S_i)}) = 0$ at the optimal solution, according to the Karush–Kuhn–Tucker conditions [37], we obtain Equation (7). □
Meanwhile, we obtain the following subordinate consequence:

$$\begin{aligned} B_{\Phi(T)}^{\mathsf{T}}\Phi(T) &= \Big(1+\tfrac{\beta}{2}\Big) P_{\Phi(T)} - \tfrac{\beta}{2}\sum_{i=1}^{n} \theta_i B_{\Phi(S_i)}^{\mathsf{T}}\Phi(T) \\ B_{\Phi(S_i)}^{\mathsf{T}}\Phi(S_i) &= \Big(1+\tfrac{\beta}{2}\Big) P_{\Phi(S_i)} - \tfrac{\beta}{2}\, B_{\Phi(T)}^{\mathsf{T}}\Phi(S_i). \end{aligned} \tag{14}$$

Based on Equations (7), (12) and (14), the solutions for $B_{\Phi(S_i)}$ and $B_{\Phi(T)}$ can be obtained according to the following theorem.
Theorem 1.
The minimization problem in Equation (12) can be converted to the following maximization problem:

$$\min_{B_{\Phi(S_i)}^{\mathsf{T}} B_{\Phi(S_i)} = I,\; B_{\Phi(T)}^{\mathsf{T}} B_{\Phi(T)} = I} G\big(B_{\Phi(S_i)}, B_{\Phi(T)}, P_{\Phi(S_i)}, P_{\Phi(T)}\big) \;\Longleftrightarrow\; \max_{B^{\mathsf{T}} B = I} \mathrm{tr}\big(B^{\mathsf{T}} A B\big), \tag{15}$$

where

$$B = \begin{bmatrix} B_{\Phi(T)} \\ B_{\Phi(S_1)} \\ \vdots \\ B_{\Phi(S_n)} \end{bmatrix}, \qquad A = \begin{bmatrix} A_{00} & A_{01} & \cdots & A_{0n} \\ A_{10} & A_{11} & \cdots & A_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n0} & A_{n1} & \cdots & A_{nn} \end{bmatrix}, \tag{16}$$

and

$$\begin{aligned} A_{00} &= 2\,\Phi(T)\Phi^{\mathsf{T}}(T) + \frac{\beta^2}{2}\sum_{i=1}^{n} \theta_i\,\Phi(S_i)\Phi^{\mathsf{T}}(S_i) = 2K(T,T) + \frac{\beta^2}{2}\sum_{i=1}^{n} \theta_i K(S_i,S_i) \\ A_{ii} &= \frac{\beta^2\theta_i^2}{2}\,\Phi(T)\Phi^{\mathsf{T}}(T) + 2\theta_i\,\Phi(S_i)\Phi^{\mathsf{T}}(S_i) = \frac{\beta^2\theta_i^2}{2}K(T,T) + 2\theta_i K(S_i,S_i) \\ A_{0i} &= A_{i0}^{\mathsf{T}} = \beta\theta_i\,\Phi(T)\Phi^{\mathsf{T}}(T) + \beta\theta_i\,\Phi(S_i)\Phi^{\mathsf{T}}(S_i) = \beta\theta_i K(T,T) + \beta\theta_i K(S_i,S_i) \\ A_{ij} &= A_{ji}^{\mathsf{T}} = \frac{\beta^2\theta_i\theta_j}{2}\,\Phi(T)\Phi^{\mathsf{T}}(T) = \frac{\beta^2\theta_i\theta_j}{2}K(T,T), \quad (i \neq j;\; i, j = 1, 2, \dots, n). \end{aligned} \tag{17}$$
Proof.
Based on Equation (7), the optimization problem in Equation (12) can be expressed as:

$$\begin{aligned} &\min_{B_{\Phi(S_i)}^{\mathsf{T}} B_{\Phi(S_i)} = I,\; B_{\Phi(T)}^{\mathsf{T}} B_{\Phi(T)} = I} G\big(B_{\Phi(S_i)}, B_{\Phi(T)}, P_{\Phi(S_i)}, P_{\Phi(T)}\big) \\ &\quad = \min \; \Big(1+\tfrac{\beta}{2}\Big)\sum_{i=1}^{n} \theta_i\,\mathrm{tr}\big(\Phi(S_i)\Phi^{\mathsf{T}}(S_i)\big) + \Big(1+\tfrac{\beta}{2}\Big)\mathrm{tr}\big(\Phi^{\mathsf{T}}(T)\Phi(T)\big) - \frac{1}{2+\beta}\,\mathrm{tr}\big(B^{\mathsf{T}} A B\big), \end{aligned} \tag{18}$$

where $\mathrm{tr}(\Phi^{\mathsf{T}}(T)\Phi(T))$ and $\mathrm{tr}(\Phi(S_i)\Phi^{\mathsf{T}}(S_i))$ are constants, and $\beta \in [0, 1]$. Therefore, the minimization problem in Equation (18) can be transformed into the maximization problem in Equation (15).
In addition, the matrix $A$ in Equation (16) is symmetric (i.e., $A^{\mathsf{T}} = A$), as:

$$A^{\mathsf{T}} = \begin{bmatrix} A_{00}^{\mathsf{T}} & A_{10}^{\mathsf{T}} & \cdots & A_{n0}^{\mathsf{T}} \\ A_{01}^{\mathsf{T}} & A_{11}^{\mathsf{T}} & \cdots & A_{n1}^{\mathsf{T}} \\ \vdots & \vdots & \ddots & \vdots \\ A_{0n}^{\mathsf{T}} & A_{1n}^{\mathsf{T}} & \cdots & A_{nn}^{\mathsf{T}} \end{bmatrix} = \begin{bmatrix} A_{00} & A_{01} & \cdots & A_{0n} \\ A_{10} & A_{11} & \cdots & A_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n0} & A_{n1} & \cdots & A_{nn} \end{bmatrix} = A. \tag{19}$$

According to the following theorem, we can obtain $B$ in Equation (16). □
Theorem 2 (Ky Fan theorem [38]).
Let $A$ denote a symmetric matrix with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k$ and corresponding eigenvectors $U = [u_1, u_2, \dots, u_k]$. Then,

$$\sum_{i=1}^{k} \lambda_i = \max_{X^{\mathsf{T}} X = I} \mathrm{tr}\big(X^{\mathsf{T}} A X\big), \tag{20}$$

and $X = UQ$, where $Q$ is an arbitrary orthogonal matrix.
We set Q = I in this paper, and the optimal B consists of the top k eigenvectors of A . The proposed MSHKM algorithm is described in Algorithm 1.
Algorithm 1. MSHKM algorithm
Input: Original target data matrix $T$; original source data matrices $S_1, S_2, \dots, S_n$; similarity parameter $\beta$ (default 1); importance parameters $\theta_i$ (default $1/n$); dimension of the new feature space $k$; kernel function $K(x, x')$.
Output: Projected target data matrix $B_{\Phi(T)}$; projected source data matrices $B_{\Phi(S_1)}, B_{\Phi(S_2)}, \dots, B_{\Phi(S_n)}$.
1: Construct the matrix $A$ according to Equations (16) and (17);
2: Calculate the top $k$ eigenvalues of $A$ and the corresponding eigenvectors $U = [u_1, u_2, \dots, u_k]$;
3: $B_{\Phi(T)}$ is the first $l/(n+1)$ rows of $U$, where $l$ is the number of rows of $U$:

$$B_{\Phi(T)} = \begin{bmatrix} U(1,1) & \cdots & U(1,k) \\ \vdots & \ddots & \vdots \\ U\big(\tfrac{l}{n+1},1\big) & \cdots & U\big(\tfrac{l}{n+1},k\big) \end{bmatrix};$$

4: $B_{\Phi(S_i)}$ is the $(i+1)$th block of $l/(n+1)$ rows of $U$:

$$B_{\Phi(S_i)} = \begin{bmatrix} U\big(\tfrac{il}{n+1}+1,1\big) & \cdots & U\big(\tfrac{il}{n+1}+1,k\big) \\ \vdots & \ddots & \vdots \\ U\big(\tfrac{(i+1)l}{n+1},1\big) & \cdots & U\big(\tfrac{(i+1)l}{n+1},k\big) \end{bmatrix};$$

5: Return $B_{\Phi(T)}$ and $B_{\Phi(S_i)}$, $i = 1, 2, \dots, n$.
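A compact sketch of Algorithm 1 under the default settings (β = 1, θi = 1/n, Gaussian kernel) is given below. Note that Φ is never computed explicitly: by Equation (17), only the kernel matrices K(T, T) and K(Si, Si) are needed to build A. The kernel width `gamma` and the function names are assumptions, and all data sets are assumed to share the same number of rows m (see Section 3.2); this is an illustrative sketch, not the released replication package.

```python
# Hedged sketch of Algorithm 1 with beta = 1, theta_i = 1/n, Gaussian kernel.
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(X, gamma=1.0):
    return np.exp(-gamma * cdist(X, X, "sqeuclidean"))

def mshkm(T, sources, k, beta=1.0):
    n, m = len(sources), T.shape[0]
    theta = np.full(n, 1.0 / n)
    KT = gaussian_kernel(T)                               # K(T, T)
    KS = [gaussian_kernel(S) for S in sources]            # K(S_i, S_i)
    A = np.zeros(((n + 1) * m, (n + 1) * m))
    A[:m, :m] = 2 * KT + beta**2 / 2 * sum(t * K for t, K in zip(theta, KS))  # A_00
    for i in range(1, n + 1):
        r, ti = slice(i * m, (i + 1) * m), theta[i - 1]
        A[r, r] = beta**2 * ti**2 / 2 * KT + 2 * ti * KS[i - 1]  # A_ii
        A[:m, r] = beta * ti * (KT + KS[i - 1])                  # A_0i
        A[r, :m] = A[:m, r].T                                    # A_i0
        for j in range(1, n + 1):                                # A_ij, i != j
            if j != i:
                A[r, j * m:(j + 1) * m] = beta**2 * ti * theta[j - 1] / 2 * KT
    vals, vecs = np.linalg.eigh(A)                        # A is symmetric (Eq. 19)
    U = vecs[:, np.argsort(vals)[::-1][:k]]               # top-k eigenvectors (Ky Fan, Q = I)
    B_T = U[:m]                                           # first block: projected target
    B_S = [U[(i + 1) * m:(i + 2) * m] for i in range(n)]  # remaining blocks: sources
    return B_T, B_S
```

Under strategy 2, this routine is called once per source project with n = 1, whereas strategy 1 calls it once with all n sources; the resulting difference in the size of A is the origin of the O(n³m³) versus O(nm³) cost gap discussed in Section 7.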

4. Experiments

This section provides details of the conducted experiments, including the benchmark data sets, evaluation measures, experimental design, experiment results, statistical significance tests, and effect size tests.

4.1. Benchmark Data Sets

In this paper, a total of 28 data sets, which are both publicly accessible and commonly used, from five distinct projects—NASA [25], SOFTLAB [26], Relink [27], AEEEM [28], and MORPH [29]—were employed as benchmark data sets in the experiments. Each project has a heterogeneous metric set; for example, only one metric is common between NASA and AEEEM, namely the lines of code (LOC) metric. The metrics here act as features and are utilized to predict software defects. Specifically, software metrics can be divided into method-level metrics for procedure-oriented software and class-level metrics for object-oriented software, according to the programming language. For procedure-oriented software, a method (or function) is usually called a software module, and the measurement object is a single function; in this case, the software metric is called a method-level metric, for example, the McCabe metrics [39] and Halstead metrics [40]. For object-oriented software, a class is usually called a software module, and the measurement object is each class; in this case, the software metric is called a class-level metric (also known as an object-oriented metric), such as the CK metrics [41]. For further details, Table 1 provides a comprehensive list of the data sets used in this research.
The NASA [25] data sets were collected from 11 subsystems, from which we selected five with 37 metrics each, for convenience of comparison with current HDP methods [13]. These data sets contain static code metrics, such as size, readability, and complexity, as well as information about module bugs.
The SOFTLAB [26] data sets, containing five subsets, were obtained from a Turkish software company (SOFTLAB). We utilized all SOFTLAB data sets in the PROMISE repository. SOFTLAB and NASA share 28 common metrics, including the McCabe and Halstead metrics, but SOFTLAB lacks the complexity metrics included in NASA.
The Relink [27] data sets contain defect information from three open-source projects, and include 26 code complexity metrics for each project.
The AEEEM [28] data sets include information about five Java projects, consisting of 61 metrics, including process metrics, previous defects, source code metrics, entropy of changes, churn of source code metrics, and entropy of source code metrics, for each project.
The MORPH [29] data sets were utilized to deal with privacy problems for defect prediction, which contain several open-source projects and include 20 metrics, such as McCabe metrics and CK metrics.

4.2. Evaluation Measures

To assess the efficacy of our method, three widely recognized performance measures, namely Pd (recall rate), Pf (false positive rate), and GM (geometric mean), were applied. These measures can be characterized in relation to the confusion matrix [42], which consists of the numbers of predicted outcomes of TP (true positive), FP (false positive), TN (true negative), and FN (false negative), as detailed in Table 2.
Here, TP refers to the number of instances predicted as buggy that were actually buggy, FP represents the number of instances predicted as buggy that were actually clean, TN denotes the number of instances predicted as clean that were actually clean, and FN represents the number of instances predicted as clean that were actually buggy.
Pd denotes the proportion of actual buggy instances that were correctly predicted as buggy; higher values of Pd indicate better prediction performance. Pf represents the proportion of actual clean instances that were incorrectly predicted as buggy. GM is the geometric mean of Pd and 1−Pf, which provides an evaluation measure for classifiers that deal with imbalanced data sets. All evaluation measures employed in our experiment range from 0 to 1. For details of the evaluation measures, see Table 3.
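As a quick reference, the following sketch computes the three measures directly from the confusion-matrix counts defined above.

```python
# Pd, Pf, and GM from confusion-matrix counts (Table 3).
import numpy as np

def pd_pf_gm(tp, fp, tn, fn):
    pd = tp / (tp + fn)          # Pd (recall): buggy instances correctly flagged
    pf = fp / (fp + tn)          # Pf: clean instances incorrectly flagged as buggy
    gm = np.sqrt(pd * (1 - pf))  # GM: geometric mean of Pd and 1 - Pf
    return pd, pf, gm
```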

4.3. Experimental Design

The present study aims to address the following four research questions:
  • RQ1. Which strategy is preferable for multi-source HDP?
  • RQ2. How many source projects applied in MSHKM can achieve the optimal prediction performance?
  • RQ3. Does MSHKM outperform existing HDP methods in terms of prediction performance?
  • RQ4. Does MSHKM outperform WPDP in terms of prediction performance?
To conduct our research, 28 data sets from five projects were employed as experimental data for performing HDP. In each turn of the experiment, one data set was designated as the target data set, while the data sets from the other projects were employed as source data sets. For example, when ant1.3 in MORPH was selected as the target data set, the 18 data sets from the other projects (NASA, SOFTLAB, Relink, and AEEEM) were used as source data sets. Before performing MSHKM, the general over-sampling method SMOTE was applied to pre-process all data sets. The number of new samples was set to 500, with 250 positive and 250 negative samples. To ensure fairness, the logistic regression (LR) classifier was used for all baseline algorithms. We also used 90% of the samples as training data, and repeated the process 20 times in each round.
For RQ1 and RQ2, we compared the performance measures under different values of n in MSHKM for the two strategies, where n represents the number of source projects. Each time, we randomly chose i (i = 1, 2, 4, 8, 12, 16, or the maximum available) source projects from those chosen for the experiments, with the aim of determining an appropriate number of source projects for MSHKM. Note that the maximum number of cross-project data sets for NASA, SOFTLAB, and AEEEM was 23 (28 − 5), while that for MORPH was 18 (28 − 10). For MSHKM, we set β = 1 and θi = 1/n in strategy 1, and β = 1 and θi = 1 in strategy 2, by default. Meanwhile, k was set equal to the number of available eigenvalues whose values were not less than 0.001.
For RQ3, three HDP methods (CCA+ [10], HDP-KS [9], and CTKCCA [13]) and one multi-source HDP method (MHCPDP [21]) were chosen as baselines. Another multi-source HDP method, MSMDA [35], was excluded as a baseline, as its results were no better than those of CTKCCA in the comparison of [33]. The best configuration of MSHKM identified when assessing RQ1 and RQ2 (i.e., the better strategy and the optimal number of source projects n) was used. We adopted similar experimental settings and compared our results with those provided in reference [13].
For RQ4, we compared MSHKM with WPDP. During the WPDP experiment, it was necessary to split each data set into training and test parts. To assess the efficacy of the defect prediction model, five-fold cross validation was used: in each round, one fold served as the test set, while the other four were used for training, with a different fold serving as the test set each time. Therefore, five tests were conducted per round, and 100 rounds were carried out in total (i.e., 500 tests). A sketch of this protocol is given below.
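A hedged sketch of one WPDP round under this protocol, reusing the `pd_pf_gm` helper sketched in Section 4.2, might look as follows; the stratified splitting is an assumption, as the paper does not state how the folds were drawn.

```python
# One round of the five-fold WPDP protocol (sketch; stratification assumed).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

def wpdp_round(X, y, seed):
    scores = []
    for train_idx, test_idx in StratifiedKFold(
            n_splits=5, shuffle=True, random_state=seed).split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        tn, fp, fn, tp = confusion_matrix(y[test_idx], pred).ravel()
        scores.append(pd_pf_gm(tp, fp, tn, fn))
    return scores  # five tests per round; 100 rounds give the 500 tests
```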
To implement CCA+, HDP-KS, CTKCCA, MHCPDP, and WPDP, we employed Python, following the prescribed settings outlined in the respective papers. We applied Z-score normalization to all data sets prior to executing these algorithms, as it is widely utilized in the field of software defect prediction [10,13,15,16,17,18,19,21,43]. For the kernel function in MSHKM, the Gaussian kernel function was used. If MSHKM obtains better prediction results than the CCA+, HDP-KS, CTKCCA, MHCPDP, and WPDP methods, then we can conclude that MSHKM effectively improves the performance of the prediction model in the HDP scenario.

5. Results

The average outcomes with different n applied to MSHKM under the two strategies, in terms of Pd, Pf, and GM, for each target project are presented in Appendix A, where n represents the number of source projects, ranging from 1 to 23/18. Strategy 1 is abbreviated as Str. 1, while strategy 2 is abbreviated as Str. 2. It can be noted that the results under strategy 1 and strategy 2 were the same when n = 1, as the two strategies are equivalent for single-source HDP. The result trends of the two strategies with different numbers of source projects are depicted in Figure 5, Figure 6 and Figure 7.
From these tables and figures, we can clearly observe that the values obtained under strategy 2 are better than those under strategy 1, in terms of Pd, Pf, and GM, which answers RQ1. In addition, the figures show the tendencies of Pd, Pf, and GM: under strategy 1, the values of Pd and GM decrease with increasing n, while the opposite occurs for Pf. Under strategy 2, however, the values of Pd and GM increase with n up to n = 4, while the opposite occurs for Pf, after which the values level off. Overall, the best results were achieved when n = 1 under strategy 1, and with n > 2 under strategy 2. From the above, we can answer RQ2: the fewer source projects applied in MSHKM under strategy 1, the better the prediction performance; using just one source project each time achieves the best prediction performance. Under strategy 2, when the number of source projects is less than or equal to 4, prediction performance improves as the number of source projects grows, after which the performance becomes stable. For Pf and GM, the best value is achieved when n = 4; although the best value for Pd is not achieved when n = 4, this point serves as a turning point towards stability. In order to save model training time, a smaller number of training sets should be selected when the prediction effects are similar.
Based on the above experimental results, we can draw the following conclusions. Both of our proposed strategies, combined with MSHKM, can solve the problem of multi-source HDP. However, the predictive performance of strategy 1 worsens as the number of source projects increases, while that of strategy 2 improves. For strategy 1, n = 1 is best; for strategy 2, n = 4 is best. In general, in the multi-source heterogeneous scenario, strategy 2 is the better choice.
Table 4, Table 5 and Table 6 provide the mean values of Pd, Pf, and GM, respectively, when using the six different methods for each target project. The reported MSHKM results are those obtained with n = 4 under strategy 2, the best configuration identified above. The best values for each target project are labeled in bold font, and a value with a dark gray background indicates a significant improvement, whether of MSHKM over a baseline or vice versa. For the meanings of the other backgrounds, refer to the notes in Table 4. From these tables, we can see that the mean values of MSHKM for Pd and GM are higher than those of the baselines, while its Pf values are lower; in other words, MSHKM achieved better results than the other methods. At the same time, the performance of MSHKM was improved by 50.57%, 78.73%, and 43.61%, in terms of Pd, Pf, and GM, respectively, when compared to MHCPDP. Moreover, our MSHKM exceeded WPDP by 17.09%, 65.46%, and 16.65% in Pd, Pf, and GM, respectively. In addition, the numbers and medians of the best results (indicated in bold font) for each method are summarized in Table 7. The counts of the best results for MSHKM, in terms of Pd and GM, were higher than those of the other methods, except for Pf, where CTKCCA achieved a higher count. The medians of the best results for each method are also provided, from which it can be noted that the medians of the best results for WPDP and CTKCCA were better than those for MSHKM. These results indicate that WPDP and CTKCCA can perform better than MSHKM on some of the data sets; however, most of the results obtained by WPDP and CTKCCA were unsatisfactory.
Additionally, Figure 8, Figure 9 and Figure 10 show boxplots of the results of WPDP, CCA+, HDP-KS, CTKCCA, MHCPDP, and MSHKM, in terms of Pd, Pf, and GM, respectively. The plots show that MSHKM had the highest mean values for Pd and GM, as well as the lowest mean value for Pf. Moreover, according to the distributions in the figures, the performance of MSHKM was more stable than that of the other five methods.
According to the above results, we can answer RQ3 and RQ4. For RQ3, the number of best results for MSHKM was higher than those for CCA+, HDP-KS, CTKCCA, and MHCPDP, except for the Pf of CTKCCA; however, all mean values of MSHKM were better than those of CTKCCA. This suggests that MSHKM performs better and is more stable than CTKCCA. For RQ4, both the number of best results and the mean values of MSHKM were higher than those of WPDP.
Based on the above experimental results, we can draw the following conclusions. The proposed MSHKM is superior to the existing HDP methods, including CCA+, HDP-KS, and CTKCCA. In addition, MSHKM significantly outperforms the multi-source HDP method MHCPDP, improving on it by at least 50.57%, 78.73%, and 43.61%, in terms of Pd, Pf, and GM, respectively. Moreover, MSHKM outperforms WPDP, improving by 17.09%, 65.46%, and 16.65%, in terms of Pd, Pf, and GM, respectively.

6. Statistical Testing

6.1. Statistical Significance Test

To assess the statistical significance of the predicted Pd, Pf, and GM, we conducted the non-parametric Friedman test with a post hoc Nemenyi test at a 95% confidence level when comparing WPDP, CCA+, HDP-KS, CTKCCA, MHCPDP, and MSHKM over the 28 data sets [44]. The Friedman test assesses whether differences in average ranks are statistically significant, and has previously been applied in heterogeneous CPDP studies [9,13]. We ranked the values of the different methods for each data set as 6, 5, 4, 3, 2, and 1, from largest to smallest, according to Table 4, Table 5 and Table 6; for example, the ranks of the six methods for Apache in terms of Pd were 5 (WPDP), 1 (CCA+), 2 (HDP-KS), 4 (CTKCCA), 3 (MHCPDP), and 6 (MSHKM). It is worth noting that a rank of 6 represents the best result for Pd and GM, but the worst result for Pf. The results of the Friedman test, reported in Table 8, give the mean rank of each method; the p-values are far less than 0.001. To further analyze these results, we employed the Nemenyi test, whose comparison results for Pd, Pf, and GM are graphically visualized in Figure 11, Figure 12 and Figure 13, respectively. Specifically, the critical difference (CD) is used to connect methods without significant differences; in other words, methods that are connected do not differ significantly [44]. A minimal sketch of the Friedman step is given below, after this subsection's results.
In terms of the Pd mean ranks shown in Figure 11, there were four groups: {MSHKM, WPDP}, {WPDP, CTKCCA}, {CTKCCA, MHCPDP}, and {MHCPDP, CCA+, HDP-KS}; the methods within each group are connected. It should be noted that a higher mean rank (towards the left side of the axis) represents better prediction performance for Pd, which also holds for GM; for Pf, however, the right side indicates better performance. From Figure 11 and Figure 12, we can see that MSHKM significantly outperformed CTKCCA, MHCPDP, CCA+, and HDP-KS for Pd and Pf, but did not significantly outperform WPDP, as they were in the same group. Similarly, for GM, Figure 13 shows that MSHKM outperformed the other five baselines with significant differences.
Overall, MSHKM provided better performance against the HDP baselines and showed comparable results when compared to WPDP, with statistical significance in Pd, Pf, and GM comparisons.
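For reference, the Friedman step can be reproduced with scipy, as in the sketch below (placeholder scores shown; the Nemenyi post hoc step is available in third-party packages such as scikit-posthocs and is omitted here).

```python
# Friedman test over per-data-set scores of the six methods (sketch).
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
# placeholder: one score (e.g., GM) per data set (28) for each of six methods
scores = {m: rng.random(28) for m in
          ["WPDP", "CCA+", "HDP-KS", "CTKCCA", "MHCPDP", "MSHKM"]}
stat, p = friedmanchisquare(*scores.values())
print(f"Friedman chi-square = {stat:.2f}, p = {p:.3g}")
```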

6.2. Effect Size Test

Furthermore, to measure the magnitude of the difference between MSHKM and the baselines, we performed the non-parametric Cliff's delta (δ) effect size test [45]. Cliff's delta estimates the degree to which the values from one method exceed those from another, measured on the closed interval [−1, 1]. An effect size of 1 or −1 indicates no overlap between the two distributions, whereas a value of 0 indicates complete overlap. We applied the evaluation standard for the effect size described in Table 9 [46], and a minimal sketch of the statistic follows.
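The sketch below shows the statistic at its simplest; the thresholds from Table 9 then map |δ| to the effectiveness levels used in this section.

```python
# Cliff's delta: P(a > b) - P(a < b) over all pairs, in [-1, 1] (sketch).
import numpy as np

def cliffs_delta(a, b):
    a, b = np.asarray(a), np.asarray(b)
    greater = (a[:, None] > b[None, :]).sum()
    less = (a[:, None] < b[None, :]).sum()
    return (greater - less) / (len(a) * len(b))
```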
The backgrounds of Table 4, Table 5 and Table 6 show the results of the Cliff's delta test. A cell with a dark gray background indicates that MSHKM provides a large significant improvement over the corresponding method, a medium gray background indicates a moderate significant improvement, and a light gray background indicates a small significant improvement. For a clearer comparison of the effect size results, we counted the number of projects at each effectiveness level, and report the results in Table 10. From Table 10, we can see that the number at the L (large) level was always higher than at the other levels. This indicates that MSHKM achieves large performance improvements compared to the other methods (except for CTKCCA in terms of Pf).

7. Discussion

The MSHKM method proposed in this paper can effectively improve the performance of HDP. However, there are several problems and potential threats to the validity of our empirical study.
Firstly, this article deals with the problem of heterogeneous feature spaces across different projects and proposes two strategies; however, the two strategies have different time complexities for the MSHKM algorithm. Assuming there are n source projects and each kernel matrix has dimension m × m, strategy 1 eigendecomposes a single ((n + 1)m) × ((n + 1)m) matrix, costing O(n³m³), while strategy 2 performs n separate decompositions of smaller matrices, costing O(nm³). Therefore, for situations with few source projects, it is better to choose the more effective strategy, while, for situations with a large number of source projects, the time cost should be considered, and strategy 2 should be chosen as much as possible. Regardless of which strategy is used, the algorithm's complexity increases with the number of source projects, which is the main disadvantage of this work.
For performance measures, we used Pd (also known as recall), Pf, and GM to evaluate the performance of all methods, as they are commonly used performance measures in software defect prediction, especially for the baseline methods used in this paper. More comprehensive performance measures, such as AUC [9], the Matthews correlation coefficient (MCC) [47,48], and balance [49], were not used; however, we intend to use such measures for specific problems in future work, such as using AUC to evaluate classifiers without fixed thresholds. It should be noted that F1 was excluded, as it is biased and not suitable in the presence of the class imbalance problem [47,48]. Similarly, precision was not used, because it is also affected by the imbalance of the class distribution: in the case of class imbalance, the classifier may tend to predict the majority class, resulting in a high precision value. Accuracy can likewise lead to misleading results; for example, when 95% of samples are positive, a classifier that labels all samples as positive still achieves an accuracy of 0.95, yet the model cannot predict negative samples at all.
For the classifier, we employed logistic regression (LR) for all experiments in this paper, since it has been successfully and commonly used in prior HDP research [9,14,15,16,17,18,34,35]. Moreover, the LR classifier does not require parameter setting and hyperparameter adjustment, thus avoiding additional computational costs [43]. In addition, LR classification techniques tend to produce top-performing classifiers in software defect prediction [16,50]. Despite the above reasons, other classifiers still need to be compared on the basis of the novel MSHKM algorithm. Therefore, the selection of a more suitable classifier for use with MSHKM will also be considered as future work.
In addition, with regard to the θ parameter of MSHKM, we set θi = 1/n by default. According to our experimental results, the best results were achieved with n = 1 in strategy 1 and n = 4 in strategy 2 (i.e., θi = 1 or θi = 1/4). In fact, θi is a tunable parameter reflecting the correlations between different data sets. For example, when CM1 is chosen as the target project, MW1 is more similar to it than EQ is as a source project; as a result, it would be reasonable for the θ value of MW1 to be larger than that of EQ. At the same time, the selection of source projects in RQ2 was random, which introduces instability. If the similarity between source and target projects can be determined, the source projects can be selected according to this similarity, which may allow better and more reasonable results to be obtained. More importantly, θi can be re-defined based on the similarity between the source and target projects. Therefore, the similarity between heterogeneous data and the θi values in MSHKM will be studied in future work.

8. Conclusions

HDP is an efficient technique for constructing a suitable defect prediction model when the source and target projects have heterogeneous metrics, potentially decreasing the data cost of the target project, and enhancing the efficiency of software testing, as well as improving software quality.
In this paper, we presented a new multi-source heterogeneous kernel mapping (MSHKM) algorithm for HDP. To analyze the effects of multiple-source projects on the prediction results, two strategies, based on MSHKM, for multi-source HDP were proposed. In addition, the number of source projects in each experiment was changed to determine the rules that affect the prediction performance. Furthermore, we compared the proposed MSHKM with existing HDP methods, including CCA+, HDP-KS, CTKCCA, and MHCPDP, as well as comparing MSHKM with WPDP.
Experiments were performed on 28 data sets from five widely used projects, using three common measures to evaluate the considered methods. We further evaluated the experimental results by performing a non-parametric Friedman test with the Nemenyi post hoc test, as well as the Cliff's delta effect size test. The experimental results demonstrated that (1) in the multi-source HDP scenario, strategy 2 is superior to strategy 1; (2) for MSHKM, the lower the number of source projects, the better the results obtained under strategy 1, while n = 4 was the optimal number for strategy 2; (3) MSHKM can outperform state-of-the-art HDP methods; and (4) MSHKM also performs better than WPDP. From the above, it can be stated that the proposed MSHKM approach can markedly improve software quality, while decreasing both time and human costs.
For future work, we plan to employ other effective classifiers and evaluation measures for comparison. Additionally, the similarity between heterogeneous source and target projects will also be studied to optimize source project selection, as well as the weight parameters for multi-source projects in the MSHKM algorithm.

Author Contributions

Conceptualization, Y.W. and J.Y.; methodology, J.Y.; software, J.Y.; validation, J.Y.; formal analysis, J.Y.; investigation, J.Y.; resources, J.Y.; data curation, J.Y.; writing—original draft preparation, J.Y. and Z.L.; writing—review and editing, Y.W.; visualization, J.Y.; supervision, B.L.; project administration, Y.W.; funding acquisition, B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Defense Research Foundation of China, grant number JZX7Y20220142200501.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this article are publicly available; information about the data is contained within the text.

Conflicts of Interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that we have no professional or other personal interest of any nature in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.

Appendix A

Table A1. Results for Pd under a varying number of source projects n between two strategies in MSHKM.

Target | n=1 Str.1&2 | n=2 Str.1 | n=2 Str.2 | n=4 Str.1 | n=4 Str.2 | n=8 Str.1 | n=8 Str.2 | n=12 Str.1 | n=12 Str.2 | n=16 Str.1 | n=16 Str.2 | n=23/18 Str.1 | n=23/18 Str.2
CM1 | 0.863 | 0.860 | 0.969 | 0.811 | 0.935 | 0.737 | 0.933 | 0.671 | 0.909 | 0.641 | 0.928 | 0.607 | 0.954
MW1 | 0.855 | 0.850 | 0.922 | 0.799 | 0.923 | 0.723 | 0.938 | 0.658 | 0.941 | 0.641 | 0.969 | 0.594 | 0.926
PC1 | 0.874 | 0.843 | 0.892 | 0.798 | 0.915 | 0.719 | 0.926 | 0.662 | 0.928 | 0.621 | 0.925 | 0.581 | 0.928
PC3 | 0.880 | 0.868 | 0.919 | 0.794 | 0.937 | 0.739 | 0.948 | 0.666 | 0.923 | 0.635 | 0.921 | 0.605 | 0.931
PC4 | 0.895 | 0.887 | 0.939 | 0.807 | 0.950 | 0.703 | 0.905 | 0.659 | 0.962 | 0.613 | 0.925 | 0.590 | 0.938
AR1 | 0.937 | 0.928 | 0.876 | 0.847 | 0.875 | 0.740 | 0.906 | 0.662 | 0.923 | 0.597 | 0.926 | 0.571 | 0.928
AR3 | 0.886 | 0.862 | 0.922 | 0.828 | 0.871 | 0.750 | 0.868 | 0.672 | 0.894 | 0.613 | 0.749 | 0.598 | 0.875
AR4 | 0.891 | 0.865 | 0.966 | 0.829 | 0.943 | 0.733 | 0.962 | 0.674 | 0.913 | 0.606 | 0.942 | 0.565 | 0.994
AR5 | 0.967 | 0.950 | 0.911 | 0.909 | 0.864 | 0.746 | 0.871 | 0.617 | 0.884 | 0.578 | 0.856 | 0.500 | 0.873
AR6 | 0.942 | 0.910 | 0.951 | 0.832 | 0.965 | 0.754 | 0.947 | 0.668 | 0.930 | 0.623 | 0.984 | 0.559 | 0.937
Apache | 0.848 | 0.825 | 0.897 | 0.790 | 0.958 | 0.730 | 0.948 | 0.683 | 0.902 | 0.649 | 0.916 | 0.602 | 0.930
Safe | 0.929 | 0.895 | 0.944 | 0.868 | 0.935 | 0.732 | 0.956 | 0.660 | 0.948 | 0.622 | 0.932 | 0.564 | 0.953
ZXing | 0.816 | 0.816 | 0.834 | 0.789 | 0.908 | 0.729 | 0.967 | 0.690 | 0.987 | 0.673 | 0.981 | 0.632 | 0.976
EQ | 0.897 | 0.876 | 0.910 | 0.823 | 0.940 | 0.738 | 0.956 | 0.679 | 0.978 | 0.642 | 0.960 | 0.598 | 0.948
JDT | 0.887 | 0.844 | 0.892 | 0.807 | 0.972 | 0.752 | 0.962 | 0.686 | 0.952 | 0.654 | 0.932 | 0.614 | 0.947
LC | 0.925 | 0.897 | 0.950 | 0.834 | 0.934 | 0.753 | 0.944 | 0.682 | 0.951 | 0.626 | 0.945 | 0.595 | 0.948
ML | 0.872 | 0.880 | 0.903 | 0.836 | 0.939 | 0.734 | 0.924 | 0.682 | 0.919 | 0.638 | 0.876 | 0.602 | 0.912
PDE | 0.880 | 0.863 | 0.966 | 0.797 | 0.969 | 0.731 | 0.950 | 0.676 | 0.947 | 0.657 | 0.952 | 0.609 | 0.961
ant1.3 | 0.873 | 0.861 | 0.901 | 0.807 | 0.959 | 0.722 | 0.952 | 0.674 | 0.955 | 0.644 | 0.953 | 0.640 | 0.945
arc | 0.858 | 0.846 | 0.860 | 0.790 | 0.911 | 0.730 | 0.943 | 0.677 | 0.973 | 0.647 | 0.958 | 0.636 | 0.957
camel1.0 | 0.816 | 0.841 | 0.824 | 0.803 | 0.931 | 0.741 | 0.944 | 0.695 | 0.963 | 0.666 | 0.953 | 0.657 | 0.942
poi1.5 | 0.893 | 0.857 | 0.937 | 0.801 | 0.927 | 0.721 | 0.936 | 0.682 | 0.951 | 0.645 | 0.954 | 0.630 | 0.956
redaktor | 0.887 | 0.856 | 0.966 | 0.829 | 0.904 | 0.718 | 0.890 | 0.672 | 0.900 | 0.639 | 0.869 | 0.628 | 0.821
skarbonka | 0.914 | 0.894 | 0.915 | 0.822 | 0.924 | 0.757 | 0.941 | 0.627 | 0.892 | 0.569 | 0.982 | 0.556 | 0.956
tomcat | 0.887 | 0.822 | 0.937 | 0.803 | 0.959 | 0.711 | 0.928 | 0.668 | 0.935 | 0.635 | 0.962 | 0.625 | 0.935
Velocity1.4 | 0.910 | 0.890 | 0.934 | 0.845 | 0.947 | 0.741 | 0.949 | 0.661 | 0.989 | 0.638 | 0.955 | 0.613 | 0.955
Xalan2.4 | 0.889 | 0.837 | 0.928 | 0.789 | 0.934 | 0.727 | 0.928 | 0.675 | 0.959 | 0.641 | 0.965 | 0.633 | 0.965
Xerces1.2 | 0.825 | 0.818 | 0.871 | 0.758 | 0.955 | 0.706 | 0.945 | 0.675 | 0.969 | 0.644 | 0.945 | 0.630 | 0.941
Ave. | 0.886 | 0.866 | 0.916 | 0.816 | 0.932 | 0.733 | 0.935 | 0.670 | 0.938 | 0.632 | 0.933 | 0.601 | 0.937
Std. | 0.036 | 0.033 | 0.038 | 0.029 | 0.028 | 0.014 | 0.025 | 0.017 | 0.030 | 0.024 | 0.048 | 0.033 | 0.034
Table A2. Results for Pf under a varying number of source projects n between two strategies in MSHKM.

Target | n=1 Str.1&2 | n=2 Str.1 | n=2 Str.2 | n=4 Str.1 | n=4 Str.2 | n=8 Str.1 | n=8 Str.2 | n=12 Str.1 | n=12 Str.2 | n=16 Str.1 | n=16 Str.2 | n=23/18 Str.1 | n=23/18 Str.2
CM1 | 0.115 | 0.123 | 0.094 | 0.201 | 0.037 | 0.253 | 0.078 | 0.310 | 0.084 | 0.339 | 0.081 | 0.376 | 0.078
MW1 | 0.099 | 0.119 | 0.086 | 0.174 | 0.012 | 0.259 | 0.062 | 0.308 | 0.069 | 0.346 | 0.074 | 0.377 | 0.068
PC1 | 0.092 | 0.126 | 0.108 | 0.175 | 0.075 | 0.258 | 0.096 | 0.314 | 0.092 | 0.352 | 0.089 | 0.394 | 0.094
PC3 | 0.116 | 0.124 | 0.089 | 0.186 | 0.029 | 0.257 | 0.064 | 0.319 | 0.059 | 0.348 | 0.060 | 0.383 | 0.062
PC4 | 0.109 | 0.126 | 0.112 | 0.183 | 0.056 | 0.268 | 0.065 | 0.322 | 0.062 | 0.358 | 0.059 | 0.388 | 0.078
AR1 | 0.100 | 0.123 | 0.092 | 0.186 | 0.091 | 0.289 | 0.108 | 0.341 | 0.098 | 0.365 | 0.103 | 0.412 | 0.110
AR3 | 0.066 | 0.096 | 0.078 | 0.140 | 0.084 | 0.258 | 0.058 | 0.353 | 0.054 | 0.395 | 0.059 | 0.428 | 0.062
AR4 | 0.103 | 0.118 | 0.122 | 0.175 | 0.073 | 0.265 | 0.109 | 0.331 | 0.118 | 0.363 | 0.117 | 0.406 | 0.108
AR5 | 0.064 | 0.061 | 0.087 | 0.099 | 0.130 | 0.241 | 0.084 | 0.379 | 0.076 | 0.426 | 0.102 | 0.488 | 0.086
AR6 | 0.111 | 0.115 | 0.077 | 0.186 | 0.235 | 0.287 | 0.098 | 0.341 | 0.101 | 0.377 | 0.075 | 0.415 | 0.098
Apache | 0.114 | 0.142 | 0.101 | 0.183 | 0.109 | 0.264 | 0.096 | 0.300 | 0.091 | 0.339 | 0.102 | 0.380 | 0.084
Safe | 0.087 | 0.134 | 0.098 | 0.168 | 0.069 | 0.285 | 0.092 | 0.355 | 0.096 | 0.400 | 0.084 | 0.441 | 0.087
ZXing | 0.166 | 0.166 | 0.095 | 0.219 | 0.112 | 0.252 | 0.137 | 0.302 | 0.119 | 0.314 | 0.106 | 0.353 | 0.095
EQ | 0.086 | 0.111 | 0.076 | 0.167 | 0.050 | 0.252 | 0.070 | 0.322 | 0.066 | 0.356 | 0.068 | 0.397 | 0.072
JDT | 0.129 | 0.143 | 0.094 | 0.183 | 0.062 | 0.249 | 0.059 | 0.319 | 0.068 | 0.344 | 0.059 | 0.374 | 0.063
LC | 0.115 | 0.129 | 0.111 | 0.183 | 0.062 | 0.274 | 0.087 | 0.331 | 0.093 | 0.365 | 0.091 | 0.407 | 0.100
ML | 0.100 | 0.121 | 0.081 | 0.167 | 0.037 | 0.262 | 0.063 | 0.319 | 0.067 | 0.365 | 0.069 | 0.404 | 0.068
PDE | 0.118 | 0.151 | 0.116 | 0.196 | 0.035 | 0.273 | 0.089 | 0.307 | 0.093 | 0.338 | 0.079 | 0.379 | 0.086
ant1.3 | 0.138 | 0.146 | 0.073 | 0.181 | 0.051 | 0.250 | 0.070 | 0.296 | 0.062 | 0.325 | 0.068 | 0.342 | 0.067
arc | 0.112 | 0.124 | 0.085 | 0.171 | 0.042 | 0.248 | 0.075 | 0.307 | 0.074 | 0.327 | 0.078 | 0.345 | 0.071
camel1.0 | 0.160 | 0.159 | 0.094 | 0.188 | 0.118 | 0.250 | 0.079 | 0.281 | 0.088 | 0.310 | 0.087 | 0.319 | 0.075
poi1.5 | 0.150 | 0.170 | 0.124 | 0.184 | 0.002 | 0.263 | 0.078 | 0.312 | 0.079 | 0.343 | 0.084 | 0.352 | 0.078
redaktor | 0.103 | 0.101 | 0.114 | 0.158 | 0.031 | 0.250 | 0.110 | 0.320 | 0.099 | 0.351 | 0.094 | 0.367 | 0.106
skarbonka | 0.091 | 0.104 | 0.069 | 0.182 | 0.060 | 0.298 | 0.067 | 0.359 | 0.068 | 0.405 | 0.059 | 0.418 | 0.069
tomcat | 0.093 | 0.135 | 0.101 | 0.171 | 0.051 | 0.249 | 0.079 | 0.301 | 0.073 | 0.335 | 0.073 | 0.342 | 0.078
Velocity1.4 | 0.119 | 0.154 | 0.082 | 0.193 | 0.124 | 0.283 | 0.069 | 0.341 | 0.068 | 0.380 | 0.074 | 0.397 | 0.078
Xalan2.4 | 0.113 | 0.142 | 0.096 | 0.178 | 0.015 | 0.246 | 0.068 | 0.293 | 0.074 | 0.325 | 0.076 | 0.337 | 0.074
Xerces1.2 | 0.141 | 0.132 | 0.127 | 0.186 | 0.019 | 0.242 | 0.112 | 0.291 | 0.109 | 0.323 | 0.116 | 0.330 | 0.109
Ave. | 0.111 | 0.128 | 0.096 | 0.177 | 0.067 | 0.262 | 0.083 | 0.321 | 0.082 | 0.354 | 0.082 | 0.384 | 0.082
Std. | 0.024 | 0.023 | 0.016 | 0.021 | 0.019 | 0.015 | 0.020 | 0.023 | 0.018 | 0.028 | 0.017 | 0.037 | 0.015
Table A3. Results for GM under a varying number of source projects n between two strategies in MSHKM.

Target | n=1 Str.1&2 | n=2 Str.1 | n=2 Str.2 | n=4 Str.1 | n=4 Str.2 | n=8 Str.1 | n=8 Str.2 | n=12 Str.1 | n=12 Str.2 | n=16 Str.1 | n=16 Str.2 | n=23/18 Str.1 | n=23/18 Str.2
CM1 | 0.873 | 0.867 | 0.937 | 0.804 | 0.949 | 0.741 | 0.956 | 0.679 | 0.945 | 0.650 | 0.937 | 0.614 | 0.958
MW1 | 0.878 | 0.865 | 0.918 | 0.812 | 0.955 | 0.731 | 0.939 | 0.674 | 0.951 | 0.646 | 0.956 | 0.607 | 0.953
PC1 | 0.890 | 0.858 | 0.892 | 0.811 | 0.920 | 0.730 | 0.927 | 0.673 | 0.926 | 0.633 | 0.924 | 0.593 | 0.922
PC3 | 0.881 | 0.872 | 0.915 | 0.804 | 0.954 | 0.741 | 0.959 | 0.673 | 0.947 | 0.643 | 0.943 | 0.611 | 0.943
PC4 | 0.893 | 0.880 | 0.913 | 0.811 | 0.947 | 0.716 | 0.939 | 0.668 | 0.942 | 0.627 | 0.927 | 0.600 | 0.934
AR1 | 0.917 | 0.901 | 0.892 | 0.828 | 0.892 | 0.722 | 0.892 | 0.656 | 0.956 | 0.610 | 0.914 | 0.573 | 0.924
AR3 | 0.908 | 0.882 | 0.922 | 0.840 | 0.893 | 0.741 | 0.884 | 0.652 | 0.899 | 0.603 | 0.867 | 0.577 | 0.886
AR4 | 0.893 | 0.872 | 0.921 | 0.826 | 0.935 | 0.732 | 0.913 | 0.669 | 0.929 | 0.619 | 0.913 | 0.577 | 0.908
AR5 | 0.951 | 0.944 | 0.912 | 0.903 | 0.867 | 0.748 | 0.846 | 0.612 | 0.829 | 0.565 | 0.849 | 0.493 | 0.848
AR6 | 0.914 | 0.897 | 0.937 | 0.822 | 0.859 | 0.732 | 0.891 | 0.661 | 0.901 | 0.620 | 0.852 | 0.569 | 0.863
Apache | 0.866 | 0.840 | 0.898 | 0.802 | 0.924 | 0.732 | 0.921 | 0.690 | 0.924 | 0.653 | 0.921 | 0.609 | 0.928
Safe | 0.920 | 0.879 | 0.923 | 0.847 | 0.933 | 0.720 | 0.935 | 0.650 | 0.912 | 0.608 | 0.922 | 0.558 | 0.923
ZXing | 0.824 | 0.824 | 0.869 | 0.784 | 0.898 | 0.738 | 0.883 | 0.693 | 0.908 | 0.678 | 0.910 | 0.638 | 0.902
EQ | 0.905 | 0.882 | 0.917 | 0.827 | 0.945 | 0.742 | 0.937 | 0.678 | 0.945 | 0.643 | 0.957 | 0.600 | 0.955
JDT | 0.878 | 0.850 | 0.899 | 0.811 | 0.955 | 0.751 | 0.956 | 0.683 | 0.956 | 0.654 | 0.964 | 0.619 | 0.963
LC | 0.905 | 0.883 | 0.919 | 0.825 | 0.936 | 0.739 | 0.934 | 0.675 | 0.937 | 0.630 | 0.945 | 0.593 | 0.939
ML | 0.885 | 0.879 | 0.911 | 0.834 | 0.951 | 0.735 | 0.928 | 0.681 | 0.939 | 0.636 | 0.919 | 0.598 | 0.925
PDE | 0.880 | 0.855 | 0.924 | 0.800 | 0.967 | 0.728 | 0.963 | 0.684 | 0.964 | 0.659 | 0.966 | 0.615 | 0.961
ant1.3 | 0.867 | 0.857 | 0.914 | 0.812 | 0.954 | 0.735 | 0.956 | 0.688 | 0.952 | 0.658 | 0.948 | 0.648 | 0.936
arc | 0.872 | 0.859 | 0.887 | 0.808 | 0.934 | 0.740 | 0.936 | 0.684 | 0.928 | 0.659 | 0.939 | 0.644 | 0.945
camel1.0 | 0.826 | 0.840 | 0.864 | 0.806 | 0.906 | 0.745 | 0.904 | 0.705 | 0.906 | 0.677 | 0.906 | 0.668 | 0.908
poi1.5 | 0.871 | 0.843 | 0.906 | 0.808 | 0.962 | 0.728 | 0.964 | 0.683 | 0.958 | 0.650 | 0.963 | 0.637 | 0.972
redaktor | 0.891 | 0.876 | 0.925 | 0.835 | 0.936 | 0.732 | 0.933 | 0.675 | 0.910 | 0.642 | 0.936 | 0.629 | 0.927
skarbonka | 0.910 | 0.894 | 0.923 | 0.818 | 0.932 | 0.727 | 0.929 | 0.629 | 0.878 | 0.574 | 0.881 | 0.562 | 0.855
tomcat | 0.896 | 0.843 | 0.918 | 0.815 | 0.954 | 0.730 | 0.956 | 0.683 | 0.941 | 0.649 | 0.968 | 0.641 | 0.958
Velocity1.4 | 0.894 | 0.867 | 0.926 | 0.824 | 0.911 | 0.728 | 0.897 | 0.659 | 0.911 | 0.627 | 0.905 | 0.606 | 0.880
Xalan2.4 | 0.887 | 0.847 | 0.916 | 0.804 | 0.959 | 0.739 | 0.957 | 0.690 | 0.969 | 0.657 | 0.969 | 0.647 | 0.969
Xerces1.2 | 0.841 | 0.842 | 0.872 | 0.785 | 0.968 | 0.730 | 0.973 | 0.690 | 0.981 | 0.659 | 0.978 | 0.649 | 0.973
Ave. | 0.886 | 0.868 | 0.910 | 0.818 | 0.932 | 0.734 | 0.929 | 0.673 | 0.930 | 0.637 | 0.928 | 0.606 | 0.927
Std. | 0.027 | 0.024 | 0.019 | 0.022 | 0.029 | 0.008 | 0.031 | 0.020 | 0.031 | 0.027 | 0.035 | 0.037 | 0.035

References

  1. Fenton, N.E.; Neil, M. A critique of software defect prediction models. IEEE Trans. Softw. Eng. 1999, 25, 675–689. [Google Scholar] [CrossRef]
  2. Shao, Y.; Liu, B.; Wang, S.; Li, G. A novel software defect prediction based on atomic class-association rule mining. Expert Syst. Appl. 2018, 114, 237–254. [Google Scholar] [CrossRef]
  3. Shao, Y.; Liu, B.; Wang, S.; Li, G. Software defect prediction based on correlation weighted class association rule mining. Knowl.-Based Syst. 2020, 196, 105742. [Google Scholar] [CrossRef]
  4. Tong, H.; Liu, B.; Wang, S. Software defect prediction using stacked denoising autoencoders and two-stage ensemble learning. Inf. Softw. Technol. 2018, 96, 94–111. [Google Scholar] [CrossRef]
  5. Zimmermann, T.; Nagappan, N.; Gall, H.; Giger, E.; Murphy, B. Cross-project defect prediction: A large scale experiment on data vs. domain vs. process. In Proceedings of the Joint 12th European Software Engineering Conference and 17th ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE’09), Amsterdam, The Netherlands, 24–29 August 2009; pp. 91–100. [Google Scholar]
  6. Qiu, S.; Lu, L.; Cai, Z.; Jiang, S. Cross-project defect prediction via transferable deep learning-generated and handcrafted features. In Proceedings of the 31st International Conference on Software Engineering and Knowledge Engineering (SEKE 2019), Lisbon, Portugal, 10–12 July 2019; Knowledge Systems Institute Graduate School: Skokie, IL, USA, 2019; pp. 431–436. [Google Scholar]
  7. Herbold, S.; Trautsch, A.; Grabowski, J. Global vs. local models for cross-project defect prediction. Empir. Softw. Eng. 2016, 22, 1866–1902. [Google Scholar] [CrossRef]
  8. Xiao, P.; Liu, B.; Wang, S. Feedback-based integrated prediction: Defect prediction based on feedback from software testing process. J. Syst. Softw. 2018, 143, 159–171. [Google Scholar] [CrossRef]
  9. Nam, J.; Kim, S. Heterogeneous Defect Prediction. IEEE Trans. Softw. Eng. 2018, 44, 874–896. [Google Scholar] [CrossRef]
  10. Jing, X.; Wu, F.; Dong, X.; Qi, F.; Xu, B. Heterogeneous cross-company defect prediction by unified metric representation and CCA-based transfer learning. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, Bergamo, Italy, 30 August–4 September 2015; pp. 496–507. [Google Scholar]
  11. Cheng, M.; Wu, G.; Jiang, M.; Wan, H.; You, G.; Yuan, M. Heterogeneous Defect Prediction via Exploiting Correlation Subspace. In Proceedings of the SEKE, Redwood City, CA, USA, 1–3 July 2016; pp. 171–176. [Google Scholar]
  12. Ma, Y.; Zhu, S.; Chen, Y.; Li, J. Kernel CCA based transfer learning for software defect prediction. IEICE Trans. Inf. Syst. 2017, 100, 1903–1906. [Google Scholar] [CrossRef]
  13. Li, Z.; Jing, X.Y.; Wu, F.; Zhu, X.; Xu, B.; Ying, S. Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction. Autom. Softw. Eng. 2018, 25, 201–245. [Google Scholar] [CrossRef]
  14. Yu, Q.; Jiang, S.; Zhang, Y. A feature matching and transfer approach for cross-company defect prediction. J. Syst. Softw. 2017, 132, 366–378. [Google Scholar] [CrossRef]
  15. Li, Z.; Jing, X.Y.; Zhu, X.; Zhang, H. Heterogeneous defect prediction through multiple kernel learning and ensemble learning. In Proceedings of the IEEE International Conference on Software Maintenance and Evolution (ICSME), Shanghai, China, 17–22 September 2017; pp. 91–102. [Google Scholar]
  16. Li, Z.; Jing, X.Y.; Zhu, X.; Zhang, H.; Xu, B.; Ying, S. Heterogeneous defect prediction with two-stage ensemble learning. Autom. Softw. Eng. 2019, 26, 599–651. [Google Scholar] [CrossRef]
  17. Tong, H.; Liu, B.; Wang, S. Kernel Spectral Embedding Transfer Ensemble for Heterogeneous Defect Prediction. IEEE Trans. Softw. Eng. 2021, 47, 1886–1906. [Google Scholar] [CrossRef]
  18. Xu, Z.; Yuan, P.; Zhang, T.; Tang, Y.; Li, S.; Xia, Z. HDA: Cross-project defect prediction via heterogeneous domain adaptation with dictionary learning. IEEE Access 2018, 6, 57597–57613. [Google Scholar] [CrossRef]
  19. Xu, Z.; Ye, S.; Zhang, T.; Xia, Z.; Pang, S.; Wang, Y.; Tang, Y. Mvse: Effort-aware heterogeneous defect prediction via multiple-view spectral embedding. In Proceedings of the IEEE 19th International Conference on Software Quality, Reliability and Security (QRS), Sofia, Bulgaria, 22–26 July 2019; pp. 10–17. [Google Scholar]
  20. Gong, L.; Jiang, S.; Yu, Q.; Jiang, L. Unsupervised deep domain adaptation for heterogeneous defect prediction. IEICE Trans. Inf. Syst. 2019, 102, 537–549. [Google Scholar] [CrossRef]
  21. Wu, J.; Wu, Y.; Niu, N.; Zhou, M. MHCPDP: Multi-source heterogeneous cross-project defect prediction via multi-source transfer learning and autoencoder. Softw. Qual. J. 2021, 29, 405–430. [Google Scholar] [CrossRef]
  22. Wang, A.; Zhang, Y.; Wu, H.; Jiang, K.; Wang, M. Few-shot learning based balanced distribution adaptation for heterogeneous defect prediction. IEEE Access 2020, 8, 32989–33001. [Google Scholar] [CrossRef]
  23. Zong, X.; Li, G.; Zheng, S.; Zou, H.; Yu, H.; Gao, S. Heterogeneous cross-project defect prediction via optimal transport. IEEE Access 2023, 11, 12015–12030. [Google Scholar] [CrossRef]
  24. Shi, X.; Liu, Q.; Fan, W.; Yu, P.S. Transfer across completely different feature spaces via spectral embedding. IEEE Trans. Knowl. Data Eng. 2011, 25, 906–918. [Google Scholar]
  25. Shepperd, M.; Song, Q.; Sun, Z.; Mair, C. Data quality: Some comments on the NASA software defect datasets. IEEE Trans. Softw. Eng. 2013, 39, 1208–1215. [Google Scholar] [CrossRef]
  26. Turhan, B.; Menzies, T.; Bener, A.B.; Di Stefano, J. On the relative value of cross-company and within-company data for defect prediction. Empir. Softw. Eng. 2009, 14, 540–578. [Google Scholar] [CrossRef]
  27. Wu, R.; Zhang, H.; Kim, S.; Cheung, S.C. Relink: Recovering links between bugs and changes. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, Szeged, Hungary, 5–9 September 2011; pp. 15–25. [Google Scholar]
  28. D’Ambros, M.; Lanza, M.; Robbes, R. Evaluating defect prediction approaches: A benchmark and an extensive comparison. Empir. Softw. Eng. 2012, 17, 531–577. [Google Scholar] [CrossRef]
  29. Peters, F.; Menzies, T. Privacy and utility for defect prediction: Experiments with morph. In Proceedings of the 2012 34th International conference on software engineering (ICSE), Zurich, Switzerland, 2–9 June 2012; IEEE: New York, NY, USA, 2012; pp. 189–199. [Google Scholar]
  30. Wu, H.; Ng, M.K. Multiple graphs and low-rank embedding for multi-source heterogeneous domain adaptation. ACM Trans. Knowl. Discov. Data 2022, 16, 1–25. [Google Scholar] [CrossRef]
  31. Chai, L.; Xu, H.; Luo, Z.; Li, S. A multi-source heterogeneous data analytic method for future price fluctuation prediction. Neurocomputing 2020, 418, 11–20. [Google Scholar] [CrossRef]
  32. Zhao, W.; Fu, Z.; Fan, T.; Wang, J. Ontology construction and mapping of multi-source heterogeneous data based on hybrid neural network and autoencoder. Neural Comput. Appl. 2023, 1–11. [Google Scholar] [CrossRef]
  33. Chen, H.; Jing, X.Y.; Li, Z.; Wu, D.; Peng, Y.; Huang, Z. An empirical study on heterogeneous defect prediction approaches. IEEE Trans. Softw. Eng. 2020, 47, 2803–2822. [Google Scholar] [CrossRef]
  34. Liu, X.; Li, Z.; Zou, J.; Tong, H. An Empirical Study on Multi-Source Cross-Project Defect Prediction Models. In Proceedings of the 2022 29th Asia-Pacific Software Engineering Conference (APSEC), Virtual Event, 6–9 December 2022; IEEE: New York, NY, USA; pp. 318–327. [Google Scholar]
  35. Zhang, H.; Zhu, X.; Li, Z.; Xu, B.; Jing, X.Y.; Ying, S. On the Multiple Sources and Privacy Preservation Issues for Heterogeneous Defect Prediction. IEEE Trans. Softw. Eng. 2019, 45, 391–411. [Google Scholar]
  36. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  37. Karush, W. Minima of functions of several variables with inequalities as side conditions. In Traces and Emergence of Nonlinear Programming; Springer: Basel, Switzerland, 2013; pp. 217–245. [Google Scholar]
  38. Bhatia, R. Matrix analysis. Grad. Texts Math. 1997, 169, 1–17. [Google Scholar]
  39. McCabe, T.J. A Complexity Measure. IEEE Trans. Softw. Eng. 1976, SE-2, 308–320. [Google Scholar] [CrossRef]
  40. Halstead, M.H.; Halstead, M. Elements of Software Science. In Advances in Computers; Elsevier: Amsterdam, The Netherlands, 1977. [Google Scholar]
  41. Chidamber, S.R.; Kemerer, C.F. A Metrics Suite for Object Oriented Design. IEEE Trans. Softw. Eng. 1994, 20, 476–493. [Google Scholar] [CrossRef]
  42. Powers, D.M. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. J. Mach. Learn. Technol. 2011, 2, 37–63. [Google Scholar]
  43. Tantithamthavorn, C.; McIntosh, S.; Hassan, A.E.; Matsumoto, K. The impact of automated parameter optimization on defect prediction models. IEEE Trans. Softw. Eng. 2018, 45, 683–711. [Google Scholar] [CrossRef]
  44. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
  45. Cliff, N. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychol. Bull. 1993, 114, 494. [Google Scholar] [CrossRef]
  46. Macbeth, G.; Razumiejczyk, E.; Ledesma, R.D. Cliff’s Delta Calculator: A non-parametric effect size program for two groups of observations. Univ. Psychol. 2011, 10, 545–555. [Google Scholar] [CrossRef]
  47. Yao, J.; Shepperd, M. The impact of using biased performance metrics on software defect prediction research. Inf. Softw. Technol. 2021, 139, 106664. [Google Scholar] [CrossRef]
  48. Yao, J.; Shepperd, M. Assessing software defection prediction performance: Why using the Matthews correlation coefficient matters. In Proceedings of the Evaluation and Assessment in Software Engineering, Trondheim, Norway, 15–17 April 2020; pp. 120–129. [Google Scholar]
  49. Menzies, T.; Greenwald, J.; Frank, A. Data mining static code attributes to learn defect predictors. IEEE Trans. Softw. Eng. 2006, 33, 2–13. [Google Scholar] [CrossRef]
  50. Hall, T.; Beecham, S.; Bowes, D.; Gray, D.; Counsell, S. A systematic literature review on fault prediction performance in software engineering. IEEE Trans. Softw. Eng. 2011, 38, 1276–1304. [Google Scholar] [CrossRef]
Figure 1. Strategy 1 of MSHKM for multi-source HDP.
Figure 2. Strategy 2 of MSHKM for multi-source HDP.
Figure 3. Mapping principle of MSHKM.
Figure 4. Flowchart of the MSHKM algorithm.
Figure 5. Tendency of Pd under different n between two strategies in MSHKM.
Figure 6. Tendency of Pf under different n between two strategies in MSHKM.
Figure 7. Tendency of GM under different n between two strategies in MSHKM.
Figure 8. Results of different methods, in terms of Pd. The square represents mean values and “x” represents outlier values.
Figure 9. Results of different methods, in terms of Pf. The square represents mean values and “x” represents outlier values.
Figure 10. Results of different methods, in terms of GM. The square represents mean values and “x” represents outlier values.
Figure 11. Results of the Nemenyi test for six methods, in terms of Pd.
Figure 12. Results of the Nemenyi test for six methods, in terms of Pf.
Figure 13. Results of the Nemenyi test for six methods, in terms of GM.
Table 1. Details of the 28 data sets from five groups.

Project | Data Set | # of Metrics | # of Modules | # of Buggy Modules (%)
NASA | CM1 | 37 | 327 | 42 (12.84%)
NASA | MW1 | 37 | 253 | 27 (10.67%)
NASA | PC1 | 37 | 705 | 61 (8.65%)
NASA | PC3 | 37 | 1077 | 134 (12.44%)
NASA | PC4 | 37 | 1458 | 178 (12.21%)
SOFTLAB | AR1 | 29 | 121 | 9 (7.44%)
SOFTLAB | AR3 | 29 | 63 | 8 (12.70%)
SOFTLAB | AR4 | 29 | 107 | 20 (18.69%)
SOFTLAB | AR5 | 29 | 36 | 8 (22.22%)
SOFTLAB | AR6 | 29 | 101 | 15 (14.85%)
Relink | Apache | 26 | 194 | 98 (50.52%)
Relink | Safe | 26 | 56 | 22 (39.29%)
Relink | ZXing | 26 | 399 | 118 (29.57%)
AEEEM | EQ | 61 | 324 | 129 (39.81%)
AEEEM | JDT | 61 | 997 | 206 (20.66%)
AEEEM | LC | 61 | 691 | 64 (9.26%)
AEEEM | ML | 61 | 1862 | 245 (13.16%)
AEEEM | PDE | 61 | 1497 | 209 (13.96%)
MORPH | ant1.3 | 20 | 125 | 20 (16.00%)
MORPH | arc | 20 | 234 | 27 (11.54%)
MORPH | camel1.0 | 20 | 339 | 13 (3.83%)
MORPH | poi1.5 | 20 | 237 | 141 (59.49%)
MORPH | redaktor | 20 | 176 | 27 (15.34%)
MORPH | skarbonka | 20 | 45 | 9 (20.00%)
MORPH | tomcat | 20 | 858 | 77 (8.97%)
MORPH | velocity1.4 | 20 | 196 | 147 (75.00%)
MORPH | Xalan2.4 | 20 | 723 | 110 (15.21%)
MORPH | Xerces1.2 | 20 | 440 | 71 (16.14%)
Table 2. Confusion matrix.

 | Actual Buggy | Actual Clean
Predict Buggy (Positive) | TP | FP
Predict Clean (Negative) | FN | TN
Table 3. Evaluation measures.

Measure | Definition | Description | Better
Pd | TP / (TP + FN) | The proportion of buggy modules that were correctly predicted as buggy | High
Pf | FP / (FP + TN) | Probability of false alarm: the proportion of clean modules that were incorrectly predicted as buggy | Low
GM | sqrt(Pd × (1 − Pf)) | The geometric mean of Pd and 1 − Pf | High
Table 4. Results of MSHKM and baselines for each target, in terms of Pd.

Target | WPDP | CCA+ | HDP-KS | CTKCCA | MHCPDP | MSHKM
CM1 | 0.818 | 0.549 | 0.528 | 0.967 | 0.507 | 0.935
MW1 | 0.800 | 0.667 | 0.672 | 0.965 | 0.712 | 0.923
PC1 | 0.876 | 0.582 | 0.584 | 0.795 | 0.547 | 0.915
PC3 | 0.824 | 0.544 | 0.643 | 0.462 | 0.593 | 0.937
PC4 | 0.885 | 0.631 | 0.553 | 0.369 | 0.569 | 0.950
AR1 | 0.999 | 0.583 | 0.513 | 0.959 | 0.533 | 0.875
AR3 | 0.910 | 0.709 | 0.756 | 0.904 | 0.728 | 0.871
AR4 | 0.881 | 0.715 | 0.647 | 0.918 | 0.755 | 0.943
AR5 | 0.990 | 0.798 | 0.884 | 0.710 | 0.742 | 0.864
AR6 | 0.830 | 0.601 | 0.529 | 0.952 | 0.552 | 0.965
Apache | 0.721 | 0.511 | 0.514 | 0.662 | 0.527 | 0.958
Safe | 0.877 | 0.648 | 0.593 | 0.894 | 0.636 | 0.935
ZXing | 0.539 | 0.460 | 0.423 | 0.561 | 0.454 | 0.908
EQ | 0.792 | 0.572 | 0.493 | 0.513 | 0.521 | 0.940
JDT | 0.762 | 0.585 | 0.571 | 0.320 | 0.615 | 0.972
LC | 0.746 | 0.554 | 0.531 | 0.927 | 0.597 | 0.934
ML | 0.754 | 0.455 | 0.519 | 0.324 | 0.528 | 0.939
PDE | 0.613 | 0.570 | 0.570 | 0.315 | 0.533 | 0.969
ant1.3 | 0.681 | 0.680 | 0.733 | 0.989 | 0.745 | 0.959
arc | 0.794 | 0.635 | 0.543 | 0.832 | 0.638 | 0.911
camel1.0 | 0.549 | 0.636 | 0.657 | 0.994 | 0.604 | 0.931
poi1.5 | 0.787 | 0.623 | 0.544 | 0.533 | 0.740 | 0.927
redaktor | 0.813 | 0.665 | 0.478 | 0.809 | 0.736 | 0.904
skarbonka | 0.997 | 0.730 | 0.764 | 0.977 | 0.685 | 0.924
tomcat | 0.732 | 0.645 | 0.720 | 0.854 | 0.706 | 0.959
Velocity1.4 | 0.897 | 0.660 | 0.271 | 0.496 | 0.588 | 0.947
Xalan2.4 | 0.732 | 0.596 | 0.599 | 0.582 | 0.587 | 0.934
Xerces1.2 | 0.692 | 0.671 | 0.307 | 0.746 | 0.667 | 0.955
Ave. | 0.796 | 0.617 | 0.576 | 0.726 | 0.619 | 0.932
Std. | 0.118 | 0.078 | 0.130 | 0.234 | 0.088 | 0.028
Improved | 17.09% | 51.05% | 61.81% | 28.37% | 50.57% | —
A cell with a dark gray background indicates that MSHKM provides a considerable (0.474 ≤ |δ|) and significant (p < 0.05) improvement over the corresponding method; a medium gray background indicates a moderate (0.330 ≤ |δ| < 0.474) significant improvement; and a light gray background indicates a small (0.147 ≤ |δ| < 0.330) significant improvement.
Table 5. Results of MSHKM and baselines for each target, in terms of Pf.

Target | WPDP | CCA+ | HDP-KS | CTKCCA | MHCPDP | MSHKM
CM1 | 0.253 | 0.301 | 0.267 | 0.088 | 0.440 | 0.037
MW1 | 0.188 | 0.358 | 0.330 | 0.185 | 0.322 | 0.012
PC1 | 0.234 | 0.354 | 0.293 | 0.034 | 0.381 | 0.075
PC3 | 0.256 | 0.308 | 0.297 | 0.001 | 0.377 | 0.029
PC4 | 0.195 | 0.289 | 0.274 | 0.001 | 0.329 | 0.056
AR1 | 0.075 | 0.318 | 0.359 | 0.598 | 0.367 | 0.091
AR3 | 0.065 | 0.249 | 0.297 | 0.735 | 0.194 | 0.084
AR4 | 0.161 | 0.282 | 0.253 | 0.543 | 0.279 | 0.073
AR5 | 0.024 | 0.208 | 0.207 | 0.660 | 0.148 | 0.130
AR6 | 0.136 | 0.359 | 0.319 | 0.661 | 0.249 | 0.235
Apache | 0.235 | 0.178 | 0.195 | 0.046 | 0.153 | 0.109
Safe | 0.053 | 0.197 | 0.211 | 0.697 | 0.203 | 0.069
ZXing | 0.299 | 0.296 | 0.264 | 0.011 | 0.372 | 0.112
EQ | 0.164 | 0.267 | 0.186 | 0.001 | 0.174 | 0.050
JDT | 0.148 | 0.334 | 0.227 | 0.001 | 0.336 | 0.062
LC | 0.215 | 0.336 | 0.281 | 0.008 | 0.429 | 0.062
ML | 0.233 | 0.277 | 0.288 | 0.006 | 0.253 | 0.037
PDE | 0.268 | 0.346 | 0.302 | 0.001 | 0.384 | 0.035
ant1.3 | 0.218 | 0.395 | 0.324 | 0.468 | 0.293 | 0.051
arc | 0.203 | 0.337 | 0.287 | 0.308 | 0.332 | 0.042
camel1.0 | 0.317 | 0.329 | 0.348 | 0.190 | 0.332 | 0.118
poi1.5 | 0.277 | 0.415 | 0.232 | 0.078 | 0.180 | 0.002
redaktor | 0.118 | 0.465 | 0.398 | 0.357 | 0.232 | 0.031
skarbonka | 0.082 | 0.426 | 0.318 | 0.815 | 0.406 | 0.060
tomcat | 0.231 | 0.429 | 0.305 | 0.031 | 0.341 | 0.051
Velocity1.4 | 0.195 | 0.418 | 0.561 | 0.021 | 0.449 | 0.124
Xalan2.4 | 0.225 | 0.379 | 0.317 | 0.005 | 0.372 | 0.015
Xerces1.2 | 0.377 | 0.436 | 0.290 | 0.131 | 0.506 | 0.019
Ave. | 0.194 | 0.332 | 0.294 | 0.239 | 0.315 | 0.067
Std. | 0.085 | 0.074 | 0.072 | 0.283 | 0.097 | 0.048
Improved | 65.46% | 79.82% | 77.21% | 71.97% | 78.73% | —
The cell shading follows the same scheme as in Table 4.
Table 6. Results of MSHKM and baselines for each target, in terms of GM.

Target | WPDP | CCA+ | HDP-KS | CTKCCA | MHCPDP | MSHKM
CM1 | 0.781 | 0.620 | 0.622 | 0.939 | 0.533 | 0.949
MW1 | 0.805 | 0.654 | 0.671 | 0.887 | 0.695 | 0.955
PC1 | 0.818 | 0.613 | 0.642 | 0.876 | 0.582 | 0.920
PC3 | 0.782 | 0.614 | 0.672 | 0.679 | 0.608 | 0.954
PC4 | 0.843 | 0.670 | 0.633 | 0.607 | 0.618 | 0.947
AR1 | 0.961 | 0.631 | 0.574 | 0.621 | 0.581 | 0.892
AR3 | 0.922 | 0.730 | 0.729 | 0.489 | 0.766 | 0.893
AR4 | 0.859 | 0.717 | 0.695 | 0.648 | 0.738 | 0.935
AR5 | 0.983 | 0.795 | 0.837 | 0.492 | 0.795 | 0.867
AR6 | 0.846 | 0.621 | 0.600 | 0.568 | 0.644 | 0.859
Apache | 0.741 | 0.648 | 0.643 | 0.795 | 0.668 | 0.924
Safe | 0.910 | 0.721 | 0.684 | 0.521 | 0.712 | 0.933
ZXing | 0.611 | 0.569 | 0.558 | 0.745 | 0.534 | 0.898
EQ | 0.813 | 0.647 | 0.634 | 0.716 | 0.656 | 0.945
JDT | 0.805 | 0.624 | 0.665 | 0.565 | 0.639 | 0.955
LC | 0.763 | 0.606 | 0.618 | 0.959 | 0.584 | 0.936
ML | 0.759 | 0.573 | 0.608 | 0.567 | 0.628 | 0.951
PDE | 0.668 | 0.611 | 0.631 | 0.561 | 0.573 | 0.967
ant1.3 | 0.728 | 0.641 | 0.704 | 0.725 | 0.726 | 0.954
arc | 0.795 | 0.649 | 0.622 | 0.759 | 0.653 | 0.934
camel1.0 | 0.610 | 0.654 | 0.655 | 0.898 | 0.635 | 0.906
poi1.5 | 0.753 | 0.603 | 0.646 | 0.701 | 0.779 | 0.962
redaktor | 0.846 | 0.597 | 0.536 | 0.721 | 0.752 | 0.936
skarbonka | 0.956 | 0.647 | 0.722 | 0.425 | 0.638 | 0.932
tomcat | 0.749 | 0.606 | 0.707 | 0.910 | 0.682 | 0.954
Velocity1.4 | 0.849 | 0.620 | 0.345 | 0.697 | 0.569 | 0.911
Xalan2.4 | 0.752 | 0.608 | 0.640 | 0.761 | 0.607 | 0.959
Xerces1.2 | 0.654 | 0.615 | 0.466 | 0.805 | 0.574 | 0.968
Ave. | 0.799 | 0.640 | 0.634 | 0.701 | 0.649 | 0.932
Std. | 0.097 | 0.050 | 0.089 | 0.148 | 0.074 | 0.029
Improved | 16.65% | 45.63% | 47% | 32.95% | 43.61% | —
A cell with a dark gray background indicates that MSHKM provides a considerable (0.474 ≤ |δ|) and significant (p < 0.05) improvement over the corresponding method; a medium gray background indicates a moderate (0.330 ≤ |δ| < 0.474) significant improvement.
Table 7. Counts and medians of the best results in terms of Pd, Pf, and GM, using different methods.

Method | Count (Pd) | Median (Pd) | Count (Pf) | Median (Pf) | Count (GM) | Median (GM)
WPDP | 4 | 0.990 | 5 | 0.065 | 4 | 0.959
CCA+ | 0 | / | 0 | / | 0 | /
HDP-KS | 0 | / | 0 | / | 0 | /
CTKCCA | 4 | 0.965 | 13 | 0.010 | 1 | 0.959
MHCPDP | 0 | / | 0 | / | 0 | /
MSHKM | 20 | 0.940 | 10 | 0.079 | 23 | 0.946
Table 8. Friedman test results.

Method | Mean Rank (Pd) | Mean Rank (Pf) | Mean Rank (GM)
WPDP | 4.536 | 2.750 | 4.643
CCA+ | 2.500 | 4.929 | 2.429
HDP-KS | 2.214 | 4.250 | 2.393
CTKCCA | 3.679 | 2.821 | 3.179
MHCPDP | 2.536 | 4.607 | 2.571
MSHKM | 5.571 | 1.643 | 5.821
p-value | ≪0.001 | ≪0.001 | ≪0.001
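For readers who wish to reproduce these statistics, the following is a minimal sketch of the Friedman test with the Nemenyi critical difference. The score matrix here is synthetic; the real inputs are the per-target results of Tables 4–6, and q_0.05 = 2.850 for k = 6 methods follows Demšar [44].

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# scores[m, d]: performance of method m on data set d (6 methods x 28 sets).
# Synthetic stand-in values only.
rng = np.random.default_rng(2)
scores = rng.uniform(0.5, 1.0, size=(6, 28))

stat, p = friedmanchisquare(*scores)  # omnibus test over the 6 methods
print(f"Friedman chi2 = {stat:.3f}, p = {p:.4g}")

# Nemenyi post hoc: rank methods on each data set (here, higher score gets a
# higher rank), then compare mean-rank gaps against the critical difference
# CD = q_alpha * sqrt(k * (k + 1) / (6 * N)).
mean_ranks = rankdata(scores, axis=0).mean(axis=1)
k, n = scores.shape
q_005 = 2.850  # Studentized-range constant for k = 6, alpha = 0.05 (Demsar, 2006)
cd = q_005 * np.sqrt(k * (k + 1) / (6 * n))
print("mean ranks:", np.round(mean_ranks, 3), "CD =", round(cd, 3))
```

Two methods are judged significantly different when their mean ranks differ by more than CD, which is the comparison visualized in Figures 11–13.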
Table 9. Evaluation standard of Cliff’s delta.

Cliff’s Delta (δ) | Effectiveness Levels
−1 ≤ δ < 0.147 | Negligible (N)
0.147 ≤ δ < 0.330 | Small (S)
0.330 ≤ δ < 0.474 | Medium (M)
0.474 ≤ δ ≤ 1 | Large (L)
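A small self-contained sketch of Cliff’s delta and the level mapping of Table 9 follows; the helper names and toy values are illustrative only.

```python
import numpy as np

def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all pairs (a_i, b_j)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    diff = a[:, None] - b[None, :]
    return (np.sum(diff > 0) - np.sum(diff < 0)) / (a.size * b.size)

def effectiveness_level(delta):
    """Map |delta| to the effectiveness levels of Table 9."""
    d = abs(delta)
    if d < 0.147:
        return "Negligible (N)"
    if d < 0.330:
        return "Small (S)"
    if d < 0.474:
        return "Medium (M)"
    return "Large (L)"

mshkm = [0.93, 0.95, 0.91, 0.94]  # toy GM values for MSHKM
base = [0.64, 0.66, 0.63, 0.65]   # toy GM values for a baseline
d = cliffs_delta(mshkm, base)
print(d, effectiveness_level(d))   # 1.0 -> Large (L)
```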
Table 10. The counts of effectiveness levels for MSHKM, with respect to Cliff’s delta test.

Measure | Level | WPDP | CCA+ | HDP-KS | CTKCCA | MHCPDP
Pd | N | 3 | 0 | 0 | 6 | 0
Pd | S | 0 | 0 | 1 | 1 | 0
Pd | M | 4 | 0 | 0 | 3 | 0
Pd | L | 21 | 28 | 27 | 18 | 28
Pf | N | 10 | 0 | 0 | 13 | 0
Pf | S | 2 | 0 | 0 | 0 | 0
Pf | M | 3 | 0 | 0 | 1 | 0
Pf | L | 22 | 28 | 28 | 14 | 28
GM | N | 4 | 0 | 0 | 1 | 0
GM | S | 0 | 0 | 0 | 0 | 0
GM | M | 2 | 0 | 1 | 2 | 0
GM | L | 22 | 28 | 27 | 25 | 28
