5.4. Results
This section presents the results for each RQ.
5.4.1. Results for RQ1
In prediction tasks across various domains, many learning methods have been proposed to predict outcomes, including the CNN, DeepFM, DeepPerf, and HINNPerf. In addition, traditional machine learning methods such as FM can also be used for prediction tasks. DeepFM [23] is a representative model in the field of click-through rate prediction, while DeepPerf [20] demonstrates superior prediction accuracy in configurable systems. HINNPerf [31], proposed in 2023 as an improved version of DeepPerf, has similarly achieved high prediction precision. The CNN leverages parameter sharing and localized connectivity, which enable efficient feature extraction from limited datasets while improving predictive accuracy; as a well-established deep learning architecture, it remains one of the most widely adopted models in the field. In this experiment, we verify the effectiveness of our proposed method by comparing our MovePerf model with the CNN, FM, DeepFM, DeepPerf, and HINNPerf.
We evaluate the six models across the four projects shown in Table 4. Note that the datasets used to evaluate HINNPerf [31] are not considered for this comparison, as they primarily target configurable systems, where the features are strongly tied to specific configuration options. For instance, in the case of x264, the configuration options include parameters such as the number of reference frames, enabling or disabling the default entropy encoder, and the number of frames for rate control and lookahead.
The hyperparameter settings in DeepFM follow [23], those for DeepPerf follow [20], and those for HINNPerf follow [31]. The hyperparameter settings in FM follow the FM component in DeepFM. The number of epochs is 100 for both the CNN and MovePerf, and the batch size is 64.
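For concreteness, the sketch below mirrors this shared training configuration in Keras; the network body and the random data are placeholders rather than the actual CNN or MovePerf implementation.

```python
import numpy as np
import tensorflow as tf

# Minimal sketch of the shared training configuration (epochs = 100,
# batch size = 64). The layers and data below are illustrative
# stand-ins, not the actual CNN or MovePerf architecture.
EPOCHS, BATCH_SIZE = 100, 64

x_train = np.random.rand(192, 9).astype("float32")  # placeholder features
y_train = np.random.rand(192, 1).astype("float32")  # placeholder labels

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(9,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE, verbose=0)
```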
Table 4 presents the performance prediction results for the six methods across the datasets. Overall, MovePerf strikes the best balance between prediction accuracy and stability across the four projects.
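Throughout this section, the mean denotes the average MRE over repeated runs and the margin quantifies the spread of those runs. As a point of reference, the sketch below shows one standard way to compute both quantities, under the assumption that the margin is the half-width of a 95% confidence interval over repeated runs; it may differ in detail from the definition used in our tables.

```python
import numpy as np
from scipy import stats

def mre(y_true, y_pred):
    """Mean relative error (%) of one run's predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(np.abs(y_pred - y_true) / np.abs(y_true))

def mean_and_margin(run_mres, confidence=0.95):
    """Mean MRE over repeated runs and the half-width of its
    confidence interval (our assumed reading of the margin column)."""
    run_mres = np.asarray(run_mres, dtype=float)
    margin = stats.sem(run_mres) * stats.t.ppf(
        (1.0 + confidence) / 2.0, len(run_mres) - 1)
    return run_mres.mean(), margin

# Example: five hypothetical repeated runs on one dataset.
print(mean_and_margin([5.2, 5.6, 5.3, 5.5, 5.4]))
```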
For JUnit, the CNN exhibits a relatively high mean of 15.83% with a margin of 4.94, indicating that the model’s predictions are consistently inaccurate. In contrast, DeepFM significantly reduces the mean to 13.53%, although its margin increases to 10.58. This suggests that, while DeepFM provides more accurate predictions in most cases, it has larger prediction MREs in some instances. HINNPerf shows a mean of 5.96% and a smaller margin of 2.00, indicating improved prediction stability, although a notable prediction MRE remains. Finally, MovePerf performs exceptionally well in this project, achieving a mean of 5.41% and a margin of 0.03, indicating low prediction MREs and high stability. Therefore, for this dataset with a small number of samples and large label variation, MovePerf performs best, reducing the mean by 0.55% versus HINNPerf.
For LC-problems, the CNN shows a mean of 6.44% and a margin of 5.64, reflecting relatively low prediction accuracy and some instability. DeepFM reduces the mean to 1.77% with a margin of 3.56, providing high-accuracy predictions with moderate stability. HINNPerf performs better still, with a mean of 0.35% and a margin of 0.10, demonstrating smaller prediction MREs and higher stability, and thus stronger predictive capabilities. MovePerf gives slightly better results than HINNPerf, with a mean of 0.28% and a margin of only 0.01; although the accuracy gain over HINNPerf is small, the prediction MRE remains minimal, and MovePerf excels in terms of stability. Therefore, when dealing with datasets with a very small number of samples and minimal label variation, MovePerf performs even better, reducing the mean by 0.07% compared to HINNPerf.
For Concurrency, the CNN has a mean of 12.47% and a margin of 4.39, suggesting moderate prediction MREs with reasonable stability. DeepFM shows a mean of 9.87% and a margin of 3.62, which represents an improvement over the CNN, although the prediction MRE remains. HINNPerf has a mean of 8.99% and a smaller margin of 2.06; while the margin is smaller, the model’s prediction MRE is still significant, resulting in lower prediction accuracy in this project. By contrast, MovePerf performs exceptionally well, with a mean of 4.92% and a margin of 0.02, demonstrating minimal prediction MREs and high stability. Therefore, on datasets with a moderate number of samples and minimal label variation, MovePerf outperforms HINNPerf, reducing the mean by 4.07% compared to HINNPerf.
For Kevin, the CNN achieves a mean of 30.45% and a margin of 6.34, indicating highly unreliable predictions with large prediction MREs. In contrast, DeepFM shows a mean of 23.67% and a margin of 8.46, a significant improvement over the CNN, although its prediction accuracy remains relatively low. HINNPerf achieves a mean of 22.78% and a margin of 10.94, a further improvement over DeepFM. Our MovePerf model performs best, with a mean of 20.16% and a margin of 8.53, further validating the effectiveness of our approach. On datasets with a larger number of samples and significant label variation, the prediction accuracy of MovePerf decreases somewhat; nevertheless, compared to HINNPerf, the prediction MRE is still reduced by 2.62%.
In summary, MovePerf outperforms the CNN, DeepFM, and HINNPerf across multiple test projects, particularly demonstrating higher accuracy and stability on complex and feature-rich datasets.
5.4.2. Results for RQ2
To answer RQ2, we evaluate whether feeding the order-2 feature cross-terms to the DFNN truly improves the performance prediction accuracy. To verify this, we conduct a comparative analysis between MovePerf, which includes the order-2 features, and a variant that does not. In the absence of the order-2 features, both the DFNN and FM use the same input data, i.e., the standardized dataset.
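As an illustration of what these cross-terms look like (a sketch under our own assumptions, not the exact MovePerf implementation), the per-factor order-2 interactions of an FM with latent matrix V can be computed in closed form and concatenated onto the standardized features before they enter the DFNN:

```python
import numpy as np

def order2_cross_terms(X, V):
    """Per-factor FM order-2 interaction terms,
    0.5 * ((X V)^2 - X^2 V^2), kept per latent dimension f so that
    they can be concatenated with the raw features for the DFNN.

    X: (n_samples, n_features) standardized inputs
    V: (n_features, k) latent factor matrix learned by the FM component
    """
    linear = X @ V                 # (n, k): sum_i v_{i,f} * x_i
    squared = (X ** 2) @ (V ** 2)  # (n, k): sum_i v_{i,f}^2 * x_i^2
    return 0.5 * (linear ** 2 - squared)

# The variant with order-2 terms feeds [X, cross_terms] to the DFNN;
# the variant without them feeds X alone.
X = np.random.randn(4, 9)          # toy standardized features
V = np.random.randn(9, 5)          # toy latent factors
dfnn_input = np.concatenate([X, order2_cross_terms(X, V)], axis=1)
print(dfnn_input.shape)            # (4, 14)
```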
For both models, we use the same grid search strategy to tune the hyperparameters, ensuring that the process is consistent across both models. The search space and procedure are identical throughout the evaluation, so the only difference between the two models is whether the order-2 feature cross-terms are used.
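A minimal sketch of such a shared grid search is given below; the grid values and the train_and_validate helper are hypothetical, chosen only to illustrate the procedure.

```python
from itertools import product

# Illustrative hyperparameter grid; both model variants are searched
# over the same grid so that only the architecture differs.
grid = {"learning_rate": [1e-2, 1e-3, 1e-4], "l2": [0.0, 1e-4, 1e-2]}

best_cfg, best_mre = None, float("inf")
for lr, l2 in product(grid["learning_rate"], grid["l2"]):
    val_mre = train_and_validate(lr=lr, l2=l2)  # hypothetical helper
    if val_mre < best_mre:
        best_cfg, best_mre = (lr, l2), val_mre
```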
Table 5 presents the prediction results for MovePerf with order-2 and MovePerf without order-2. On the whole, the model incorporating order-2 features achieves an average error value of 7.69%, representing a 2.79% reduction compared to the model without order-2 features. Furthermore, it demonstrates greater stability with a margin value of 2.15, which is 1.20 lower than that of the baseline model.
For JUnit, when MovePerf without order-2 is used, there is a significant drop in accuracy. Specifically, the mean increases by 3.67%, and the margin also shows an increase, indicating a decrease in the overall prediction accuracy and stability. This suggests that the order-2 feature interactions play a crucial role in maintaining the model’s performance for the JUnit project.
For LC-problems, the inclusion of order-2 feature cross-terms leads to a slight increase in the mean by 0.02%, which is a minimal change. However, there is a significant decrease in the margin value, suggesting that the model becomes more stable in its predictions. This indicates that, while the impact on the accuracy is marginal, the order-2 features help to improve the model’s consistency and reduce the variance of the predictions.
For Concurrency, the absence of order-2 feature cross-terms causes a notable drop in accuracy, with the mean increasing by 4.07% and the margin rising by 2.05. This results in a significant decrease in model consistency. In this project, where the complexity of the task is higher, the order-2 feature interactions are particularly important in maintaining stable and accurate predictions.
For Kevin, the mean of MovePerf without order-2 is 3.41% higher than that of MovePerf; although its margin value is somewhat lower, the clear increase in prediction MRE still demonstrates the necessity of the order-2 feature interactions.
Overall, the introduction of order-2 feature cross-terms proves to be beneficial in improving the stability and accuracy of the model, particularly in more complex scenarios like the Concurrency project. The results suggest that order-2 feature interactions are essential in ensuring that the model performs well across a variety of tasks, especially when dealing with higher complexity. Thus, the order-2 feature cross-term in MovePerf is not only necessary overall but also crucial in maintaining the model’s predictive power and consistency.
5.4.3. Results for RQ3
Two components are involved in our approach, FM and the DFNN. To evaluate the contribution of each component in our proposed method, we conduct ablation experiments by comparing their performance.
Our method still adheres to the previously defined parameter settings, with the FM epoch value set to 100. The primary objective of this experimental setup is to ensure that the selected parameter values or the grid search range remain consistent. As a result, the differences observed in the model’s predictions are most likely attributable to architectural differences.
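The protocol can be summarized as the following sketch, in which the builder functions and the evaluate routine are hypothetical placeholders:

```python
# Hedged sketch of the ablation protocol: each variant is built and
# evaluated under identical settings (FM epochs fixed at 100, same
# grid search range), so differences in prediction MRE are
# attributable to the architecture. build_fm, build_dfnn,
# build_moveperf, and evaluate are hypothetical placeholders.
variants = {
    "FM": lambda: build_fm(epochs=100),
    "DFNN": lambda: build_dfnn(),
    "MovePerf": lambda: build_moveperf(fm_epochs=100),
}
results = {name: evaluate(make()) for name, make in variants.items()}
```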
Table 6 presents the prediction results for MovePerf and its two components. On the whole, MovePerf demonstrates superior performance compared to FM and the DFNN across the four datasets, achieving a lower average mean value of 7.69%, representing reductions of 4.98% and 2.45% over FM and the DFNN, respectively. Additionally, its average margin value of 2.15 is significantly better, being 6.90 and 0.78 lower than those of FM and the DFNN. These results show that MovePerf significantly improves the prediction accuracy and stability by combining the strengths of FM and the DFNN.
For JUnit, FM exhibits a relatively high mean of 7.82% and a large margin of 15.32, indicating that the model’s predictions are prone to significant fluctuations, with relatively low accuracy and high instability. In contrast, the DFNN reduces the mean to 6.03%, and the margin decreases significantly to 2.26, demonstrating better stability and consistency. While prediction MREs still exist, the DFNN exhibits smaller prediction fluctuations and more reliable results. MovePerf outperforms both FM and the DFNN with a mean of 5.41% and an extremely small margin of 0.03, indicating high precision and remarkable stability and consistency. Clearly, MovePerf provides superior performance to both FM and the DFNN, showing that combining FM and the DFNN improves the model’s overall performance.
For LC-problems, FM shows a mean of 0.87%, indicating relatively small prediction MREs, but the margin is 5.24, suggesting significant prediction fluctuations. This implies that, while the model provides accurate predictions in certain instances, its stability is poor, with the results varying considerably across different data points. Compared to FM, the DFNN performs even better, with the mean dropping to 0.28% and the margin significantly reduced to 0.08. This demonstrates not only a significant improvement in accuracy but also stronger stability and consistency, with the predictions showing almost no fluctuation. MovePerf, which integrates both FM and the DFNN, further enhances the prediction performance, achieving a mean of 0.28% and an almost zero margin of 0.01, indicating extremely high accuracy with virtually no fluctuation. This confirms that, by combining FM and the DFNN, MovePerf significantly improves the prediction accuracy and consistency, overcoming the limitations of individual models and demonstrating exceptional performance for the LC-problems project.
For Concurrency, FM performs poorly, with a mean of 13.24% and a margin of 8.76, indicating large prediction MREs and considerable fluctuations in the results. The DFNN improves upon FM, reducing the mean to 8.99% and the margin to 2.07. While the prediction MRE remains, the DFNN exhibits more stability and smaller fluctuations. MovePerf delivers the best performance, with a mean of 4.92% and a very small margin of 0.02, showing significant improvements in both its prediction accuracy and stability. Compared to FM and the DFNN, MovePerf achieves a notable enhancement in its overall performance.
For Kevin, the prediction MREs of the two components are quite significant, at 28.74% and 25.21%, respectively. Although the prediction MRE of MovePerf has not yet reached the ideal level, its predictive accuracy is superior to that of the individual components. Additionally, MovePerf also demonstrates a notable advantage in terms of stability.
In summary, MovePerf consistently outperforms FM and the DFNN in terms of both accuracy and stability across multiple datasets. Integrating the two components significantly improves the prediction precision, reduces fluctuations, and results in more consistent performance, making MovePerf a superior approach compared to its individual components.
5.4.4. Results for RQ4
To answer RQ4, we perform performance predictions for three different refactoring results under the same environment and parameter settings, with the results shown in Table 7.
For JUnit, the three refactoring results are as follows: (1) getClassName() is moved to a new class in a new package under the parent directory, getAllFields() is moved to a new class under the parent directory, and getStatedClassName() is moved to AnnotationCondition; (2) getClassName() remains unchanged, getAllFields() is moved to the inner class WindowsCompileProcess within CompileProcess, and getStatedClassName() is moved to Test2Benchmark; (3) both getClassName() and getStatedClassName() are moved to Test2BenchmarkAgent, while getAllFields() is moved to a new class in a new package within the same package.
For LC-problems, the three refactoring results are as follows: (1) generateUnsortedArray() is moved to ArrayTo2dArrayTest, printMatrix(int[][]) is moved to TestUtils, and printMatrix(List) is moved to a new class in the same package; (2) all three methods are moved to TestUtils; (3) generateUnsortedArray() and printMatrix(List) are moved to a new class in a new package, while printMatrix(int[][]) is moved to another new class.
For Concurrency, the three refactoring results are as follows: (1) readFile() and sleepMils() are moved to RunAsyncDemo, sleepSecond() is moved to a new class, and printThreadlog() is moved to SupplyAsyncDemo02; (2) readFile() is moved to RunAsyncDemo02, while sleepMils(), sleepSecond(), and printThreadlog() are moved to CommonUtils; (3) all three methods (readFile(), sleepMils(), and sleepSecond()) are moved to CommonUtils, and printThreadlog() is moved to RunAsyncDemo.
For Kevin, since the project contains two benchmark() methods, one in NorbitBenchmark and one in OrbitBenchmark, we rename these two methods benchmarkN() and benchmarkO(), respectively. In this experiment, the three refactoring results are as follows: (1) benchmarkN() is moved to Benchmarks, prepare() is moved to OrbitEvent, and benchmarkO() remains in place; (2) benchmarkN() and benchmarkO() remain in place, and prepare() is moved to a new class within the same package; (3) benchmarkN() and prepare() are moved to OrbitListener, and benchmarkO() remains in place.
From the results in Table 7, it can be observed that MovePerf consistently maintains high accuracy in its predictions across these datasets. It is capable of effectively predicting the execution time of any of the refactored results, with its predicted values providing meaningful reference points.
5.4.5. Results for RQ5
To evaluate the practicality and feasibility of our approach, it is essential to measure the time consumed by the training and testing processes of MovePerf. All experiments were conducted on a Dell workstation with a 1.19 GHz Intel Core i5 CPU and 8 GB of RAM, running 64-bit Windows 10. Due to differences in the TensorFlow versions used by the models, GPU acceleration may not have functioned properly or been fully utilized, which could have made the experiments unstable and the results inconsistent or non-reproducible. Therefore, to ensure the reliability and consistency of the experimental outcomes, we ran all methods in a CPU-only environment.
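One common way to enforce such a CPU-only setup in TensorFlow (a sketch; our scripts may use a different mechanism) is to hide the CUDA devices before the framework is imported:

```python
import os

# Hide all CUDA devices before TensorFlow is imported, so that every
# model trains on the CPU regardless of the installed TensorFlow version.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf

assert not tf.config.list_physical_devices("GPU")  # CPU-only from here on
```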
We compared DeepFM, DeepPerf, HINNPerf, and our approach. Although the CNN and FM models have simpler architectures and much shorter computation times, the experimental results indicate that their predictive accuracy is lower; thus, these two methods are excluded from the comparison.
From the perspective of the model architecture, DeepPerf, DeepFM, and our approach share certain similarities. Since the deep neural network (DNN) component used in DeepFM has a relatively simple architecture, it requires less training time than DeepPerf and MovePerf. When used individually, the FM component of MovePerf has a small time cost, and its contribution to the overall time is negligible; therefore, the time cost of MovePerf should be roughly equivalent to that of DeepPerf. HINNPerf, which consists of multiple FNN blocks and must train many parameters, takes considerably longer.
Furthermore, the models' training times measured in the experiments are shown in Figure 9. For all project datasets with MMR, DeepFM has the shortest time cost for searching for optimal hyperparameters and training a model, ranging from 5 to 80 min. DeepPerf and our MovePerf take 3–126 min for model training and hyperparameter searching, while HINNPerf takes 27–297 min.
These time costs are somewhat larger because we used a CPU for the experiments. For comparison, under the GPU setting in [31], despite the differences in the number of features (ranging from 9 to 60) and the sample sizes (ranging from 192 to …), the time cost of model training and hyperparameter searching for DeepPerf is from 5 to 13 min, while HINNPerf takes 14 to 16 min. Since the time cost of MovePerf is roughly equivalent to that of DeepPerf, its maximum time cost under the same GPU conditions would be about 13 min, which makes its time cost acceptable.
In summary, MovePerf demonstrates higher predictive accuracy than other methods, and its time cost in searching for optimal hyperparameters and training a model is reasonable.