Article

Predicting a Program’s Execution Time After Move Method Refactoring Based on Deep Learning and Feature Interaction

School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050000, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2025, 15(8), 4270; https://doi.org/10.3390/app15084270
Submission received: 19 March 2025 / Revised: 7 April 2025 / Accepted: 8 April 2025 / Published: 12 April 2025

Abstract

Move method refactoring (MMR) is one of the most commonly used software maintenance techniques to mitigate feature envy. Existing works focus on how to identify and recommend MMR. However, little is known about how MMR impacts program performance, leaving a gap in knowledge regarding MMR and its performance impact. To address this gap, this paper proposes MovePerf, a novel approach to predicting performance for MMR based on deep learning and feature interaction. On the one hand, MovePerf selects 32 features based on observations from real-world projects. Furthermore, MovePerf obtains the execution time for each project after MMR as the performance label by employing a performance profiling tool, JMH. On the other hand, MovePerf builds a hybrid model to learn features from low-order and high-order interactions by composing a deep feedforward neural network and a factorization machine. With this model, it predicts the performance for these projects after MMR. We evaluate MovePerf on real-world projects including JUnit, LC-problems, Kevin, and Concurrency. The experimental results show that MovePerf obtains an average MRE of 7.69%, illustrating that the predicted value is close to the real value. Furthermore, MovePerf improves the MRE by 1.83% to 8.61% compared to existing approaches, including a CNN, DeepFM, DeepPerf, and HINNPerf, demonstrating its effectiveness.

1. Introduction

Refactoring is a software evolution technique used to improve a program’s internal structure without altering its external behavior. It has been widely used to enhance source code’s readability, maintainability, and extensibility [1,2]. Move method refactoring (MMR) is one of the most commonly used operations to mitigate feature envy by transferring a method from one class to a more suitable class [3]. MMR aims to enhance the source code’s structure and optimize class responsibilities, addressing issues including unclear code organization, low code reuse, and high coupling between classes.
Many works have been conducted on MMR [3,4,5,6]. Some works focus on MMR recommendations. JMove [4] provides refactoring recommendations by comparing the similarity of the dependencies established by a method with the dependencies established by the methods in possible target classes. QMove [5] relies on the measurements of the Quality Model for Object Oriented Design (QMOOD) to recommend MMR that improves the software quality. RMove [6] automatically learns and concatenates structural and semantic representations of code fragments and then trains machine learning classifiers to guide the movement of methods to suitable classes. Some works evaluate whether MMR can enhance software quality attributes in real-world applications [7,8]. For instance, Alshayeb et al. [7] argued that refactoring does not necessarily improve software’s external quality attributes. Agnihotri et al. [8] studied the influences of various refactoring operations, including the move method, on system quality. Furthermore, some researchers investigate the impact of MMR on software performance [9,10]. For example, Demeyer et al. [9] found that not all refactoring operations effectively improve the program execution performance based on comparisons of the execution times of four benchmark programs. Luca et al. [10] explored this by comparing the execution times of code before and after refactoring.
Although many studies have been conducted on MMR, there is a gap in knowledge regarding MMR and its performance impact. Existing works demonstrate that the impact of refactoring on performance can be bidirectional, which means that it may either improve or degrade the performance depending on the context [8,10]. MMR’s performance impact varies across different projects. The current research has not yet quantitatively characterized refactoring’s influence on performance. As a result, researchers face challenges in accurately predicting whether a particular refactoring operation will lead to a substantial performance gain or loss. While deep learning models for performance prediction have demonstrated strong capabilities, these technologies have not been widely applied to code refactoring. Specifically, performance prediction models for refactoring are still underexplored and represent a critical gap in the field.
To bridge this gap, one needs to explore the specific degree and scale of the performance impact of refactoring. Two key issues need to be considered. Firstly, there is a scarcity of suitable datasets for this research. An ideal dataset would include many different MMR results and reliable execution times. Secondly, it is necessary to construct a prediction model tailored to the dataset’s characteristics. A well-designed model would enable accurate performance prediction for MMR.
To address these challenges, this paper proposes a novel approach named MovePerf to predict the performance impact of MMR. On the one hand, this method identifies 32 features, including 31 metric-based and 1 semantic-based features, based on observations of actual projects. After this, it uses JMH to measure the execution times of the project under specific configurations, with these times serving as the labels to build the dataset. On the other hand, we build a model composed of a factorization machine (FM) and a deep feedforward neural network (DFNN) to predict performance. MovePerf is evaluated on large-scale real-world projects including JUnit, LC-problems, Kevin, and Concurrency. The experimental results demonstrate that it achieves a 7.69% MRE on average, improving the MRE by 1.83% to 8.61% compared to other approaches, including a CNN, DeepFM, DeepPerf, and HINNPerf.
The main contributions of this paper can be summarized as follows:
  • We select those features relevant to MMR and provide a tool to construct a dataset automatically;
  • We propose MovePerf, a performance prediction approach for MMR. To our knowledge, this is the first work to predict the performance impact of code refactoring;
  • MovePerf is evaluated on real-world projects and compared against existing approaches, demonstrating its effectiveness.
The rest of this paper is structured as follows. Section 2 examines related work. Section 3 introduces the motivation. Section 4 elaborates on the experimental design, including dataset construction, the implementation of an automatic tool, and model development. Section 5 presents the experimental results, and Section 6 concludes this paper.

2. Related Works

In this section, we examine related work on MMR.

2.1. Recommended Approach for Move Method Refactoring

Researchers have extensively studied the timing of MMR, and various methods have been proposed to enhance its effectiveness. Some approaches utilize code smells. For example, RAIDE [11] is a semi-automated tool that uses the Assertion Roulette and Duplicate Assert smells to identify and refactor test smells in Java test code. Pizzini [12] proposed a method for the automatic refactoring of unit tests that analyzes the behavior of the test and, consequently, the behavior of the system under test (SUT), to identify and refactor the Eager Test and Lazy Test smells.
Other methods use code information. For example, Jehad [3] proposed a metric-based approach to determine whether a class contains methods requiring refactoring. This method considers prerequisites related to MMR, adapts existing metrics related to method cohesion, and merges them to predict the need for MMR. Zhang et al. [13] proposed the MoveRec approach, a recommendation technique for move method refactoring based on deep learning and LLM-generated information. In this approach, textual features are generated with an LLM to obtain code summaries, and a pre-trained model is used to produce word vectors. By calculating the similarity between the original and target classes to extract semantic features, they constructed a dataset aimed at predicting move method refactoring. In addition, there are also RMove and QMove, as mentioned above.
Alternatively, other researchers have used commit messages to guide refactoring. For instance, Rebai et al. [14] developed a context-driven method that analyzes commit messages and assesses the quality improvements in changed files to complement static and dynamic analysis. Nyamawe et al. [15] proposed a machine learning approach that combines a traditional refactoring detector with commit message analysis. This method employs binary and multi-label classifiers to predict and recommend refactoring based on the previous refactoring timing and location.

2.2. Impact of Move Method Refactoring on Software Quality and Performance

Some scholars have noted that the impact of refactoring on software quality and performance is still unclear [16]. Regarding the effect of MMR on software quality, Chavez et al. [17] found that over 94% of refactorings were applied to code with at least one key internal quality attribute, and 65% of these improved the attribute. Additionally, pure refactorings tend to improve, or at least not worsen, the internal quality, and 55% of refactorings improve internal quality attributes while achieving other specific goals. Agnihotri et al. [8] compared “root canal refactoring” (pure refactoring) with “floss refactoring” (refactoring accompanied by other program modifications) through different versions of two open-source projects. It was found that developers performed floss refactoring more often than root canal refactoring, and root canal refactoring had a more positive effect on the software quality. Kaur et al. [18] conducted an empirical study on object-oriented code refactoring, showing that refactoring positively impacts quality in academic settings more than in industry and does not always improve all quality attributes.
Regarding the impact of MMR on software performance, Demeyer [9] examined the performance trade-offs related to introducing virtual functions in C++ programs by comparing the execution times of four benchmark programs. His study found that refactored programs were executed faster than non-refactored ones, depending on the compiler and optimization settings. Chen et al. [19] analyzed function granularity to identify the performance variation between every two versions of a program, indicating that refactoring could sometimes degrade the performance in Python. Similarly, Luca et al. [10] explored how different refactoring types impacted the execution time, concluding that the effects vary and no refactoring type guarantees no performance degradation. For the move method specifically, the impact on performance can be both positive and negative depending on the context.

2.3. Prediction Model

There is currently no specific performance prediction model tailored to MMR, but research on prediction models in other fields has achieved good progress. For example, in configurable systems’ performance prediction, Ha et al. [20] proposed DeepPerf, a novel approach utilizing a deep feedforward neural network combined with a sparse regularization technique. This method is designed to predict system performance under new configurations, even with small sample sizes. An advantage of DeepPerf is the practical search strategy that they developed, which can efficiently and automatically tune the network’s hyperparameters, optimizing the model for better performance predictions. Subsequently, Ha et al. [21] introduced PerLasso, a performance modeling and prediction method that integrates Fourier learning with the Lasso (Least Absolute Shrinkage and Selection Operator) regression technique. Additionally, PerLasso includes a new dimension reduction method, allowing it to more effectively and accurately predict system performance under varying configurations.
In click-through rate (CTR) prediction, Cheng et al. [22] proposed Wide & Deep, a model that combines a wide linear model and a deep neural network. The wide model uses cross-product feature transformations to effectively capture sparse feature interactions, while the deep model generalizes to unseen feature interactions through low-dimensional embeddings. In this way, the model achieves both memorization and generalization. Additionally, Guo et al. [23] introduced DeepFM, a neural network based on factorization machines. This model combines the recommendation capabilities of factorization machines with the feature learning strength of deep learning, enabling it to capture both low-order and high-order feature interactions.

3. Motivation

This section discusses MMR’s impact on program performance. When evaluating program performance, we choose the execution time as a metric. Due to the variability in the results of a single test, we adopt the average execution time over five independent tests as the final result. Our research reveals that MMR has the potential to improve or degrade the performance, with the specific effects varying depending on the project. We choose Kevin, a Java project hosted on GitHub, as our research subject. Kevin is a Kotlin-based event system, and our research focuses on commit ad97f90 in version 1.0 of the project. In this commit, the developer moved the prepare() method from the OrbitBenchmark class to the Benchmarks class. The prepare() method is responsible for creating and configuring a map of multiple listeners that is critical to the functional implementation of numerous test classes in the project. With this refactoring, the modularity of the code was improved, significantly enhancing its maintainability and readability.
To comprehensively evaluate the impact of MMR on program performance, we need to consider moving multiple methods, thereby generating various refactoring results. Moving the prepare() method alone is not sufficient, and in actual code maintenance it is rarely the case that only a single method is moved. Through an analysis of the code structure, we find that the benchmark() method in the OrbitBenchmark class and the method with the same name in the NorbitBenchmark class are potential candidates for MMR. The benchmark() method publishes an OrbitEvent event containing a Blackhole object to the event bus for performance testing. In this section, we select the prepare() and benchmark() methods in the OrbitBenchmark class as examples to study the impact of MMR on program performance.
After conducting tests on the execution time of the original project before refactoring in version 1.0, we obtained a result of 101,758.47 ms. Table 1 shows that we performed two refactoring operations and tested their execution times. When the prepare() and benchmark() methods are moved to the Benchmarks class, the execution time is 51,456.93 ms, a decrease of 50,301.54 ms. This refactoring effectively optimizes the program’s execution efficiency. However, when only the prepare() method is moved to the NorbitBenchmark class while the benchmark() method remains in place, the execution time is 102,263.37 ms, an increase of 504.9 ms. In summary, the impact of MMR on program performance is complex and variable. It can lead to both performance improvements and degradation, with the specific effects strongly influenced by the details of the refactoring operation and its context. Therefore, there is an urgent need to study the performance of programs after MMR.

4. Design

An overview of MovePerf is illustrated in Figure 1. To construct the dataset, we selected several projects from GitHub that have experienced at least one MMR. To create feature sets related to MMR, we systematically analyzed these projects and obtained semantic and metric features. We developed an automated refactoring tool to alleviate the workload associated with dataset construction. Subsequently, we used Word2Vec and CK to extract semantic and metric features for each refactoring result. In the labeling process for performance predictions, we used the execution time detected by JMH as the label for each sample. Finally, we built a deep learning model that begins with normalizing the input data. This model comprises two sub-models, FM and DFNN, which are tasked with learning the interactions between low- and high-order features. Working in concert, the FM and DFNN components generate the final predictions.

4.1. Dataset

The dataset construction process includes project selection, feature extraction, feature interactions, and label measurement. An overview of dataset construction is illustrated in Figure 2.

4.1.1. Projects

To study the quantitative prediction of program performance based on the move method, we established two criteria for the selected projects. Firstly, the projects must have experienced at least one instance of MMR. Secondly, the execution time of these projects had to be tested. Only those projects that fulfilled both of these conditions could be included in the dataset.
To obtain accurate execution time measurements, we collected Java microbenchmark projects from GitHub that were specifically designed by developers for performance evaluation. Then, RefactoringMiner [24] was used to detect whether they involved MMR. The tool can identify a total of 55 refactoring types, including MMR, by analyzing the differences between two code versions (typically pre- and post-commit Git versions) and matching them against predefined refactoring patterns. This process identifies specific code changes and outputs the results in a structured data format. RefactoringMiner’s detection report contains multiple dimensions of information: refactoring types, affected code elements, locations, contextual information, and relevant commit details. We first identified projects that had applied MMR based on the reported refactoring type. Second, we determined the details of the refactoring operation from the method name and its location information in the affected code elements. This information helps us to reproduce the developers’ refactoring operations as faithfully as possible and to eliminate the interference of other code changes. Thus, we were able to focus on the impact of move method refactoring on program performance.
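As a rough sketch of this filtering step, the snippet below collects MMR instances from a RefactoringMiner JSON report. The field names ("commits", "refactorings", "type", "description") and the file name refactorings.json reflect our assumptions about the report layout and may differ across tool versions.

```python
import json

def extract_move_methods(report_path):
    """Collect Move Method entries from a RefactoringMiner JSON report.

    The schema used here (commits -> refactorings -> type/description)
    is an assumption; adjust the keys to match the actual report.
    """
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)

    moves = []
    for commit in report.get("commits", []):
        for ref in commit.get("refactorings", []):
            if ref.get("type") == "Move Method":
                moves.append({"sha1": commit.get("sha1"),
                              "description": ref.get("description")})
    return moves

if __name__ == "__main__":
    for move in extract_move_methods("refactorings.json"):
        print(move["sha1"], move["description"])
```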
Using the RefactoringMiner tool, we identified 15 projects encompassing 36 movable methods. The details of these projects, including commits, URLs, and the methods involved in MMR operations, are publicly available at https://github.com/Foile145/research/blob/main/details.html (accessed on 1 March 2025). It is worth noting that these projects may encompass other refactoring-type operations or changes to the code logic, in addition to MMR. To exclude the influence of these extraneous code changes on program performance, we performed a manual review to determine the baseline version of each project that was not refactored. Subsequently, we performed the same MMR operations as the original project on this baseline version, ensuring that the sole change within the project was MMR. This step is crucial in precisely identifying the metrics affected by MMR, as it effectively minimizes interference, thereby ensuring the accuracy and reliability of the selected features.

4.1.2. Features Related to Move Method Refactoring

In the exploration of MMR, a central challenge is identifying the features that are most likely to be affected by refactoring operations. After careful evaluation, the identified feature set includes two critical dimensions: metric-based features and semantic-based features.
To accurately identify metrics, we analyzed project versions before and after the refactoring process, aiming to identify metrics that exhibited changes. However, it is important to note that not all observed changes can be solely attributed to refactoring, as some may stem from alterations in the underlying code logic.
Through analyzing the source code before and after refactoring, complemented by domain expertise, we identified 31 metric-based features influenced by MMR. The selected metrics are presented in Table 2, including 23 class-level metrics and 8 method-level metrics, where T indicates whether the metric is a class- or method-level metric. The selection of these metrics was based on three primary considerations. Firstly, they had to remain numerical regardless of the target class’s location, so that the model could effectively learn and predict. Secondly, their values changed before and after the refactoring, and these changes were due to the refactoring operation and not other code logic changes. Thirdly, the selected metrics had to have a certain correlation with the program performance and affect the execution time of the program. The reasons for each metric’s selection are presented in Table 3. Here, CBO, CBO Modified, FAN-IN, FAN-OUT, and RFC exist at both the class and method levels. CBO Modified is very similar to CBO; however, this metric considers the dependency of a class in terms of both the references that it makes to others and the references that it receives from other types. The metric LCOM* represents a standardized version of LCOM, with values ranging between 0 and 1. Both metrics are retained because LCOM* is not simply a scaled form of LCOM. Instead, it incorporates the concept of a connected graph to provide a more accurate and consistent assessment of the cohesion among methods within a class. Additionally, due to the different calculation methods employed by the two metrics, they may exhibit distinct relationships with other features and exert varying influences on the model.
In addition to these metrics, we further integrate the consideration of semantic information to achieve a more nuanced understanding of the target class’s attributes. Within the context of software code, a class name is not only a unique identifier that differentiates among various classes, but it also contains rich semantic information about the function and purpose of the class. Therefore, we project the class name of each target class into a three-dimensional vector space, representing its coordinates by Embedding_1, Embedding_2, and Embedding_3.
To summarize, the features comprise two main components: metrics that change due to the move method’s refactoring operation and the word vector representation of the target class name.

4.1.3. Feature Interaction

Feature interaction, which encompasses the relationships and combined effects among various features within a dataset, is pivotal in capturing complex patterns and deep relationships. This capability is instrumental in enhancing the model’s learning efficacy. By evaluating the degree of correlation between the variables in the dataset constructed in this work, we find that the complexity of feature interactions is reflected at multiple levels: order-1 interactions, denoting the direct relationships between individual features; order-2 interactions, involving the interaction between feature pairs; and high-order interactions, which cover interactions between three or more features. The interactions between these features are explored in more detail next.
We first analyze the order-1 interactions of the features, as shown in Figure 3, which presents a heatmap of several feature pairs. A heatmap, as a representation of correlation coefficients, can map correlation values to a color scale. The correlation coefficient reflects the strength and direction of the linear relationship between two features. When the correlation coefficient is close to +1, it signifies a strong positive correlation, i.e., they tend to increase synchronously, and the heatmap color tends to be red. Conversely, when the correlation coefficient is close to −1, it indicates that there is a strong negative correlation, i.e., one feature decreases as the other increases, and the heatmap color tends to be blue. In addition, if the correlation coefficient is close to 0, it means that there is hardly any linear relationship between the two features, and the color in the heatmap is white or pale.
Four large red areas are prominent in the figure. The correlations among seven metrics (CBO, CBO Modified, FAN-IN, FAN-OUT, WMC, RFC, and LCOM) range between 0.7 and 1 in the first region. For example, the correlation coefficients between CBO and FAN-IN, FAN-OUT, WMC, and RFC are 0.96, 1.00, 0.82, and 0.97, respectively. This indicates that, as CBO increases, FAN-IN, FAN-OUT, WMC, and RFC tend to increase as well, suggesting the potential existence of some form of synergy or a common growth trend among these features. The correlation coefficients between FAN-IN and FAN-OUT and between WMC and RFC are 0.96 and 0.73, respectively, both showing strong positive correlations. This implies that an increase in FAN-IN is accompanied by an increase in FAN-OUT, WMC, and RFC, further supporting the monotonic increasing relationship between these features. Notably, the correlation coefficients between CBO, CBO Modified, and FAN-OUT are 1.00, indicating that their changes are perfectly aligned. This suggests the possibility of a close relationship or a shared driving factor among these three features.
The same analysis can be performed in the second and third regions, showing that these seven metrics, along with privateMethodsQty, defaultMethodsQty, visibleMethodsQty, NOSI, LOC, and assignmentsQty, have correlation values ranging from 0.72 to 1. In the fourth region, we observe strong correlations among six metrics—privateMethodsQty, defaultMethodsQty, visibleMethodsQty, NOSI, LOC, and assignmentsQty—with correlation values ranging from 0.79 to 1.
Additionally, two distinct regions of importance appear in the graph: one near the partition line and the other along the edge. The region along the partition line displays the correlations between totalMethodsQty, publicMethodsQty, and the 13 metrics mentioned above, with values between 0.73 and 0.99. The edge region highlights a strong negative correlation between modifiers, totalMethodsQty, and these 13 metrics, with values from −0.72 to −1. For example, the correlation coefficients between modifiers and CBO, CBO Modified, FAN-IN, and FAN-OUT are −0.96, −0.97, −1.00, and −0.96, respectively, indicating a strong negative correlation. This suggests that as the number of modifiers increases, CBO, CBO Modified, FAN-IN, and FAN-OUT tend to decrease.
We investigate the order-2 interactions among the features, as shown in Figure 4. The order-2 interaction terms are manually constructed by selecting a subset of features from the first region of Figure 3 and then multiplying the features together. This subset is chosen because the number of interaction terms would increase significantly after crossing, while the available space is limited. Figure 4 divides the heatmap into four regions. The first region shows the correlations among the original features, which have been discussed previously. The second and third regions show the correlations between the original features and their cross terms, with a minimum correlation of 0.72, indicating a significant relationship between the features and their order-2 interaction terms. Finally, the fourth region shows the correlations between cross-terms, with a minimum of 0.93, suggesting a strong relationship among these order-2 cross-terms. The FM component is employed to capture and learn the lower-order feature interactions.
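As an illustration of how these cross-terms and the correlations behind the heatmaps can be computed, the following pandas sketch builds the pairwise products for a subset of metrics and derives the Pearson correlation matrix. The column names and the file name dataset.csv are assumptions about how the extracted features are stored.

```python
import pandas as pd

# Subset of metrics from the first heatmap region (column names assumed).
SELECTED = ["cbo", "cboModified", "fanin", "fanout", "wmc", "rfc", "lcom"]

def add_order2_terms(df: pd.DataFrame, cols=SELECTED) -> pd.DataFrame:
    """Append the pairwise products (order-2 cross-terms) of the chosen metrics."""
    out = df.copy()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            out[f"{a}*{b}"] = df[a] * df[b]
    return out

# df = pd.read_csv("dataset.csv")          # one row per refactoring result
# crossed = add_order2_terms(df)
# corr = crossed.corr(method="pearson")    # correlation matrix behind the heatmaps
```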
While lower-order interactions provide valuable insights, there are also complex nonlinear relationships between the features that cannot be adequately represented by simple pairwise interactions. These high-order interactions play a crucial role in modeling the intricate dependencies among features, and it is essential to explore these interactions to improve the model’s performance.
In exploring complex feature interactions, we have employed a decision tree methodology to reveal high-order interactions that are not readily apparent in the correlation heatmap. While a heatmap is adept at illustrating linear relationships, decision trees excel at identifying nonlinearities and capturing high-order interactions within datasets. Figure 5 presents a segment of the decision tree, highlighting the significant high-order interactions among the features. Features carrying the suffix “.1”, such as Embedding_1.1 and publicMethodsQty.1, represent the features of the second method involved in MMR, while those without the suffix represent the features of the first method. Specifically, within the left subtree of the decision tree, it is evident that Embedding_1.1 exerts a notable predictive influence at the threshold of 0.008. This suggests that, when the value of Embedding_1.1 exceeds this threshold, the variation in publicMethodsQty.1 significantly impacts the model’s prediction for the target variable. Furthermore, when publicMethodsQty.1 is split, it is found that cboModified has a more pronounced effect on the model’s predictions under the condition that its value is more than 2.5. Based on the architecture of the decision tree and the selection of decision nodes, we can assert with confidence the presence of substantial high-order interactions among Embedding_1.1, publicMethodsQty.1, and cboModified under specific conditions. In this work, we have integrated the DFNN component, which is particularly effective in learning complex, nonlinear interactions between features by transforming the original inputs into high-order features.
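A minimal sketch of this analysis with scikit-learn is shown below; X, y, and feature_names are placeholders for the assembled feature matrix, the execution-time labels, and the corresponding column names.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

def inspect_high_order_interactions(X, y, feature_names, max_depth=4):
    """Fit a shallow regression tree and print its split structure.

    Chains of splits over different features (e.g., Embedding_1.1 ->
    publicMethodsQty.1 -> cboModified) hint at high-order interactions
    that a linear correlation analysis would miss.
    """
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    tree.fit(X, y)
    print(export_text(tree, feature_names=list(feature_names)))
    return tree
```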

4.1.4. Label

MovePerf uses the execution time as the label. However, the accurate measurement of the code execution time is a complex task that is easily affected by many factors, such as compilation optimizations, noise interference, and inlining problems. To ensure that the execution time is as accurate as possible, the Java Microbenchmark Harness (JMH) [10] is employed. JMH is a specialized suite for Java microbenchmarks that improves the accuracy by using warm-up phases and multiple iterations before measurement. Other microbenchmarking tools, such as Caliper [25], Japex [26], and the now discontinued JUnitPerf [27], either lacked automation or did not achieve the same level of popularity and community support as JMH. Consequently, JMH is selected as the primary tool to measure the execution time.
The parameters, including the iteration settings, the number of iterations, and the number of measurements in JMH, are defined by the benchmark developers to ensure accurate execution time measurement. These parameters help to control the benchmarking process, ensuring that the performance metrics reflect accurate and consistent results.

4.2. Automatic Tool

This paper presents an automated feature extraction tool to simplify the process of collecting datasets. The tool functions through a three-stage process. Firstly, it automates the MMR operation, streamlining this crucial step to enhance the efficiency. Secondly, it leverages CK and Word2Vec to automatically extract features from all of the refactored outcomes, thereby extracting valuable information precisely. Thirdly, it systematically organizes the extracted features, making them more accessible and conducive to subsequent analysis or application.
The move method refactoring operation involves moving a method from one class to another. The position of the method within the target class has a negligible effect on the execution time, so the designed tool defaults to placing the method at the end of the target class. The automation process focuses on three primary considerations: identifying which methods are suitable for refactoring, determining potential target classes, and adjusting relevant statements as needed during the move. Through manual refactoring, we identify methods eligible for relocation, determine appropriate target classes, and record any necessary changes to the corresponding statements during the relocation process, designing specific handling mechanisms for such cases.
We use CK [28] to perform a static analysis to calculate the metrics at the class and method levels. Taking the methods and classes as text, Word2Vec is used to transform target classes into word vectors. The word vector dimension and window width are set to 3 [29,30]. We modify CK and Word2Vec to extract both the metrics and semantics of each refactoring result automatically. The extraction results for the same class are stored in a folder to prepare for subsequent data collation.
The implementation consolidates features scattered across various files into a single dataset file. We automatically select relevant features from method-level and class-level metrics based on the method and target class names. Subsequently, we eliminate unnecessary metrics according to the chosen features. Finally, we concatenate the selected metrics with the word vectors chosen based on the target class names and store the related information in a structured format, thereby completing the feature extraction for the dataset.
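The sketch below, assuming gensim 4.x and a simple tokenization of class names (a detail not fixed above), shows how target class names can be embedded into the three-dimensional vectors Embedding_1 to Embedding_3 and concatenated with the metric features; the variable names are illustrative only.

```python
from gensim.models import Word2Vec
import pandas as pd

def embed_class_names(tokenized_names, dim=3, window=3):
    """Train a small Word2Vec model over tokenized class names.

    dim=3 and window=3 mirror the settings described above; min_count=1
    keeps rare class names in the vocabulary.
    """
    return Word2Vec(sentences=tokenized_names, vector_size=dim,
                    window=window, min_count=1, seed=0)

# Hypothetical usage: each "sentence" is the token sequence of one class name.
# tokens = [["orbit", "benchmark"], ["benchmarks"], ["norbit", "benchmark"]]
# w2v = embed_class_names(tokens)
# emb = pd.Series(w2v.wv["benchmarks"],
#                 index=["Embedding_1", "Embedding_2", "Embedding_3"])
# sample = pd.concat([metric_row, emb])   # one dataset row: metrics + embedding
```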

4.3. Model

The datasets exhibit characteristics such as a limited number of samples and complex interactions among features. Based on these considerations, MovePerf must satisfy the following constraints. First, the model should effectively handle the limited sample size present in the dataset. Second, it should effectively capture and fully leverage the complex interactions among the features.
To deal with the interactions between features, the model needs to address both low- and high-order interactions. Considering that high-order interactions often build upon low-order interactions [31], our approach uses FM to capture low-order interaction information. The order-2 cross-terms of the features are then fed into a component designed to learn high-order interactions. For learning high-order interaction information, we choose the DFNN, which can achieve high accuracy in the case of small samples. Furthermore, we preprocess the input data by normalizing the metric features by their maximum values and converting the semantic features into word vector representations.
Based on the aforementioned considerations, the model of MovePerf is composed of two sub-models, including the FM and DFNN components, as illustrated in Figure 6. The input is the processed features, and the data are transformed and processed through the embedding layer to facilitate more efficient learning and processing by the model. The processed data are then entered into the FM component. FM learns low-order feature interactions and uses the resulting order-2 feature cross-terms as inputs to the DFNN component to capture and learn high-order interactions between the features. The final execution time prediction is obtained by combining the outputs from FM and the DFNN. The calculation is shown in Equation (1).
y = \mathrm{sigmoid}(y_{fm} + y_{dfnn}) \quad (1)
where $y_{fm}$ is the output result of FM and $y_{dfnn}$ is the output result of the DFNN.

4.3.1. Factorization Machine

Based on the characteristics of the datasets, it is essential to first focus on capturing low-order interactions between the features. To achieve this, we compare several regression models capable of learning these interactions, including linear regression, generalized linear models (GLMs), FM, and polynomial regression. These models are evaluated for their effectiveness in modeling low-order feature relationships. Among them, FM stands out due to its ability to handle large-scale sparse data, automatically learn feature interaction information, and offer strong generalization abilities. In contrast, the other models may perform less effectively on small sample datasets and large-scale sparse data. Consequently, FM is selected as the component for learning low-order interaction information.
FM [32] extends the linear model to capture low-order interactions in sparse datasets [33]. Structurally, FM consists of two components: order-1 and order-2. The order-1 component is the LR linear combination, which is used to learn the linear weight $w_i$ of each feature $x_i$. The order-2 component is the cross-term to address feature combination by using the hidden vectors $v_i$ to measure the interaction effects between features. To avoid dimension explosion when dealing with feature interactions, FM expresses these interactions as the inner products of implicit vectors, rather than resorting to a comprehensive parameter matrix, reducing the model’s parameter count. Moreover, FM capitalizes on data sparsity, computing the inner products of implicit vectors solely when the feature values are non-zero. This approach not only captures the interactions between feature pairs but also enhances the model’s ability to represent complex relationships, thus improving the prediction accuracy. The function of FM is depicted in Equation (2), which illustrates how these components work together to achieve accurate predictions.
f(x) = \omega_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (v_i^{T} v_j) x_i x_j \quad (2)
where $\omega_0$ is the bias term and $w_i$ is the weight of a single feature. $(v_i^{T} v_j)$ represents the inner product of the feature hidden vectors, defining the cross-weight of features $x_i$ and $x_j$. $\omega_0 + \sum_{i=1}^{n} w_i x_i$ represents the expression in order-1, and $\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (v_i^{T} v_j) x_i x_j$ represents the cross-term of features in order-2.
In Figure 7, the calculation process of the order-2 cross-terms for a total of n features is illustrated. The inputs are features from the dataset; they are then processed by the embedding layer, which transforms them into hidden vectors. The order-2 cross-terms are derived from the hidden vector and the features. Then, these order-2 cross-terms are used as inputs to the DFNN. The DFNN learns high-order feature interactions by processing these cross-terms and further refining the predictions.
This process allows the model to integrate both low- and high-order feature interactions, enhancing its predictive performance.
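To make the computation concrete, the NumPy sketch below evaluates Equation (2) for a single sample and returns both the FM output and the vector of order-2 cross-terms forwarded to the DFNN; treating each pairwise product as one entry of that vector is our reading of the architecture rather than a detail stated above.

```python
import numpy as np

def fm_forward(x, w0, w, V):
    """Evaluate the FM component of Equation (2) for one sample.

    x : (n,) feature vector; w0 : scalar bias; w : (n,) linear weights;
    V : (n, k) matrix whose i-th row plays the role of the hidden vector v_i.
    Returns y_fm and the order-2 cross-terms passed on to the DFNN.
    """
    n = x.shape[0]
    linear = w0 + float(np.dot(w, x))                      # order-1 part
    cross_terms = np.array([np.dot(V[i], V[j]) * x[i] * x[j]
                            for i in range(n - 1)
                            for j in range(i + 1, n)])     # order-2 part
    y_fm = linear + cross_terms.sum()
    return y_fm, cross_terms

# Toy example with 3 features and hidden vectors of length 2:
# y_fm, cross = fm_forward(np.array([1.0, 2.0, 0.5]), 0.1,
#                          np.array([0.2, -0.1, 0.3]), np.random.rand(3, 2))
```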

4.3.2. Deep Feedforward Neural Network

Neural networks demonstrate significant efficacy in capturing high-order interactions between features. Specifically, each layer of the network processes input data through forward propagation, gradually integrating raw features into more abstract and high-order feature representations. By stacking multiple layers, the network can learn increasingly complex feature combinations, capturing intricate interactions that might be overlooked. However, these networks typically rely on a large volume of samples to achieve accurate predictions. This is due to the vast number of parameters that they contain, which, without proper regularization or a sufficiently large dataset, may lead to overfitting.
In the multiple datasets that we constructed, some datasets have limited sample numbers, so the overfitting problem cannot be ignored. Therefore, we adopt the improved DFNN used in DeepPerf, which can maintain high prediction accuracy even in the case of small samples. This is largely attributed to DeepPerf’s adoption of techniques such as L1 regularization, which is particularly beneficial when dealing with small datasets. L1 regularization promotes sparsity in the model parameters, reduces the risk of overfitting, and improves the generalization capabilities. Furthermore, the hyperparameter training mechanism in DeepPerf enhances the training efficiency by optimizing the training process, making it feasible to train deep networks effectively even when the available data are limited.
MovePerf borrows DeepPerf’s strategy of preventing overfitting and improving the efficiency of hyperparameter searching to apply the DFNN. The architecture of the component is shown in Figure 8. The input of this part is the order-2 cross-terms produced by FM. The loss function is the mean square error between the real output and the predicted output, the regression loss most commonly used in machine learning. The output is the predicted value of the component, which can be obtained recursively via Equation (3).
h_1 = \mathrm{ReLU}(h_0 W_1 + b_1), \quad h_2 = \mathrm{ReLU}(h_1 W_2 + b_2), \quad \ldots, \quad h_j = \mathrm{ReLU}(h_{j-1} W_j + b_j), \quad y_{dfnn} = h_j W_f + b_f \quad (3)
where $h_j$ is the output vector of the jth hidden layer, $W_j$ and $b_j$ are the weights and biases of the jth hidden layer, and ReLU is the activation function of the DFNN. $y_{dfnn}$ is the result predicted by the component, and $W_f$ and $b_f$ are the weights and biases used to predict the performance values.
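As a rough illustration of this component, the Keras sketch below stacks ReLU hidden layers with L1-regularized weights and a linear output, trained with the mean square error loss; the layer sizes and regularization strength are placeholder values, since MovePerf chooses them via hyperparameter search.

```python
import tensorflow as tf

def build_dfnn(input_dim, hidden_units=(128, 128), l1=1e-3):
    """DFNN sketch in the spirit of DeepPerf (Equation (3)).

    hidden_units and l1 are placeholders; the actual values would come
    from the hyperparameter search rather than being fixed here.
    """
    reg = tf.keras.regularizers.l1(l1)
    inputs = tf.keras.Input(shape=(input_dim,))
    h = inputs
    for units in hidden_units:                     # hidden layers h_1 ... h_j
        h = tf.keras.layers.Dense(units, activation="relu",
                                  kernel_regularizer=reg)(h)
    y_dfnn = tf.keras.layers.Dense(1, kernel_regularizer=reg)(h)
    model = tf.keras.Model(inputs, y_dfnn)
    model.compile(optimizer="adam", loss="mse")    # mean square error loss
    return model

# Combining both components as in Equation (1):
#   y = sigmoid(y_fm + dfnn(cross_terms))
# with the prediction mapped back to the execution-time scale for reporting.
```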

5. Evaluation

5.1. Projects

Our experiments consider four real-world projects from different domains, including software testing, algorithms and data structures, concurrency and multithreading, and software development and event-driven architectures. These projects come from the 15 projects used to determine the features; the remaining projects were excluded for three reasons. Firstly, the project’s performance benchmarks did not cover methods affected by refactoring, preventing reliable label acquisition. Secondly, the project was “unstable”, with frequent bugs that caused unexpected errors and crashes at runtime, making benchmark execution impossible. Thirdly, the lack of unit tests for the affected methods in the project may lead to undetected changes in functionality after refactoring, reducing the effectiveness and significance of the refactoring process.
To predict the program performance after MMR for a specific project, it is necessary to construct a corresponding dataset for that project. Each sample in the dataset represents the result of MMR within that project. These results may involve moving the method to another class within the same package as the source class, to a class in a different package, or even to a new, arbitrary class, to cover as many refactoring results as possible. In MMR operations, the number of methods and potential target classes generally fails to meet the threshold necessary for generating tens of thousands of combinations, so the datasets for MMR tend to be comparatively small compared with those in other domains.
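To illustrate why these datasets stay small, the sketch below enumerates candidate refactoring results as assignments of movable methods to target classes; the method and class names are borrowed from the Kevin example in Section 3, and the "<stay>" marker as well as the candidate list are hypothetical.

```python
from itertools import product

# Hypothetical candidates for one project: each dataset sample corresponds to
# one assignment of movable methods to target classes (or leaving them in place).
movable_methods = ["prepare()", "benchmarkO()", "benchmarkN()"]
target_classes = ["<stay>", "Benchmarks", "OrbitEvent", "NewClassSamePackage"]

refactoring_results = [
    dict(zip(movable_methods, assignment))
    for assignment in product(target_classes, repeat=len(movable_methods))
]
print(len(refactoring_results))   # 4^3 = 64 candidate refactoring results
```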

5.2. Research Questions

We evaluate the effectiveness of MovePerf by answering the following research questions.
RQ1: How effective is MovePerf in predicting the performance for MMR?
RQ2: How effective is MovePerf in handling order-2 feature cross-terms?
RQ3: How effective is MovePerf compared to a single FM or DFNN model?
RQ4: How well does MovePerf generalize?
RQ5: How effective is MovePerf when considering its time cost?

5.3. Evaluation Metric

In general, to compare the prediction accuracy of the learning methods, we use two-thirds of the samples in each dataset to train each model, and we then use the trained model to predict the performance values on the testing dataset. To evaluate the prediction accuracy, we use the mean relative error (MRE), which is calculated as shown in Equation (4).
\mathrm{MRE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{y_i} \times 100 \quad (4)
where $n$ is the number of samples in the validation set, $y_i$ is the true value, and $\hat{y}_i$ is the predicted value.
Mean represents the mean of the MREs in 30 experiments. The margin represents the 95% confidence interval of the MREs in 30 experiments, which is calculated as in Equation (5):
\mathrm{Margin} = Z \times \frac{\sigma}{\sqrt{n}} \quad (5)
where $Z$ is the critical value of the standard normal distribution, $\sigma$ is the standard deviation of the population, and $n$ is the sample size. After examining related works, the value of $Z$ is set to 1.96 [34].
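The two metrics can be computed as in the following sketch; the 30 repetitions mirror the experimental setup described above, and np.std uses the population standard deviation as in Equation (5).

```python
import numpy as np

def mre(y_true, y_pred):
    """Mean relative error in percent, as in Equation (4)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred) / y_true) * 100.0

def margin(mres, z=1.96):
    """Half-width of the 95% confidence interval over repeated runs (Equation (5))."""
    mres = np.asarray(mres, float)
    return z * mres.std() / np.sqrt(len(mres))

# mres = [mre(y_test, model.predict(x_test)) for _ in range(30)]  # 30 runs
# print(np.mean(mres), margin(mres))
```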

5.4. Results

This section presents the results for each RQ.

5.4.1. Results for RQ1

In prediction tasks across various domains, many learning methods have been proposed to predict outcomes, including the CNN, DeepFM, DeepPerf, and HINNPerf. In addition, traditional machine learning methods such as FM can also be used for prediction tasks. DeepFM [23] is a representative model in the field of click-through rate prediction, while DeepPerf [20] demonstrates superior prediction accuracy in configurable systems. HINNPerf [31], proposed in 2023 as an improved version of DeepPerf, has similarly achieved high prediction precision. The CNN leverages parameter sharing and localized connectivity. These properties enable efficient feature extraction from limited datasets while improving the predictive accuracy. As a well-established deep learning architecture, the CNN currently represents one of the most widely adopted models in the field. In this experiment, we verify the effectiveness of our proposed method by comparing our MovePerf model with the CNN, FM, DeepFM, DeepPerf, and HINNPerf.
We evaluate the six models across the four projects shown in Table 4. Note that the datasets for HINNPerf are not considered for this comparison, as they primarily focus on configurable systems, where the features are strongly related to the specific configuration options. For instance, in the case of x264, the configuration options include parameters such as the number of reference frames, enabling or disabling the default entropy encoder, and the number of frames for ratecontrol and lookahead.
The hyperparameter settings in DeepFM follow [23], those for DeepPerf follow [20], and those for HINNPerf follow [31]. The hyperparameter settings in FM follow the FM component in DeepFM. The number of epochs for the CNN is 100, the number of epochs for MovePerf is 100, and the batch size is 64.
Table 4 presents the performance prediction results for the six methods across various datasets. Overall, MovePerf demonstrates a more balanced and effective approach across the datasets.
For JUnit, the CNN exhibits a relatively high mean of 15.83% and a margin of 4.94, indicating that the model’s predictions are inaccurate and somewhat unstable. In contrast, DeepFM significantly reduces the mean to 13.53%, although its margin increases to 10.58. This suggests that, while DeepFM provides more accurate predictions in most cases, it has larger prediction MREs in some instances. HINNPerf shows a mean of 5.96% and a smaller margin of 2.00, indicating improved prediction stability, although a notable prediction MRE remains. Finally, MovePerf performs exceptionally well in this project, achieving a mean of 5.41% and a margin of 0.03, indicating low prediction MREs and high stability. Therefore, for this dataset with a small number of samples and large label variation, MovePerf performs better, reducing the mean by 0.55% versus HINNPerf.
For LC-problems, the CNN shows a mean of 6.44% and a margin of 5.64, reflecting relatively low prediction accuracy and some instability. DeepFM shows a mean of 1.77% and achieves a margin of 3.56, suggesting that the prediction MRE of the model is reduced and it can provide a high-accuracy prediction, although its stability is moderate. In comparison, the performance of HINNPerf is superior, with a mean of 0.35% and a margin of 0.10, demonstrating smaller prediction MREs and higher stability, thus showing stronger predictive capabilities. MovePerf gives slightly better results than HINNPerf, with a mean of 0.28% and a margin of only 0.01. Although its accuracy advantage over HINNPerf is slight, the prediction MRE remains small, and it excels in terms of stability. Therefore, when dealing with datasets with a very small number of samples and minimal label variation, MovePerf performs even better, reducing the mean by 0.07% compared to HINNPerf.
For Concurrency, the CNN has a mean of 12.47% and a margin of 4.39, suggesting moderate prediction MREs with reasonable stability. DeepFM shows a mean of 9.87% and a margin of 3.62, which represents an improvement over the CNN, although the prediction MRE remains. HINNPerf has a mean of 8.99% and a smaller margin of 2.06; while the margin is smaller, the model’s prediction MRE is still significant, resulting in lower prediction accuracy in this project. By contrast, MovePerf performs exceptionally well, with a mean of 4.92% and a margin of 0.02, demonstrating minimal prediction MREs and high stability. Therefore, on datasets with a moderate number of samples and minimal label variation, MovePerf outperforms HINNPerf, reducing the mean by 4.07% compared to HINNPerf.
For Kevin, the CNN achieves a 30.45% mean and 6.34 margin, indicating that the prediction results are highly unreliable, with large prediction MREs. In contrast, DeepFM shows a mean of 23.67% and a margin value of 8.46, demonstrating a significant improvement over the CNN, although its prediction accuracy remains relatively low. HINNPerf achieves a mean of 22.78% and a margin value of 10.94, showing a further improvement compared to DeepFM. Our MovePerf model excels, with a mean of 20.16% and a margin value of 8.53, further validating the effectiveness of our approach. In datasets with a larger number of samples and significant label variation, the prediction accuracy of MovePerf slightly decreases. However, compared to HINNPerf, the prediction MRE is still reduced by 2.62%.
In summary, MovePerf outperforms the CNN, DeepFM, and HINNPerf across multiple test projects, particularly demonstrating higher accuracy and stability on complex and feature-rich datasets.

5.4.2. Results for RQ2

To answer RQ2, we evaluate whether the transmission of order-2 feature cross-terms to the DFNN can truly improve the performance prediction accuracy. To verify this, we conduct a comparative analysis between MovePerf that includes order-2 features and a model that does not. In the absence of order-2 features, both the DFNN and FM use the same input data, i.e., the dataset that has undergone standardization processing.
For both models, we use the same grid search strategy to tune the hyperparameters, ensuring that the process is consistent across both models. This means that the hyperparameter settings are held constant throughout the evaluation so that the only difference between the two models is whether the order-2 feature cross-term is used.
Table 5 presents the prediction results for MovePerf with order-2 and MovePerf without order-2. On the whole, the model incorporating order-2 features achieves an average error value of 7.69%, representing a 2.79% reduction compared to the model without order-2 features. Furthermore, it demonstrates greater stability with a margin value of 2.15, which is 1.20 lower than that of the baseline model.
For JUnit, when MovePerf without order-2 is used, there is a significant drop in accuracy. Specifically, the mean increases by 3.67%, and the margin also shows an increase, indicating a decrease in the overall prediction accuracy and stability. This suggests that the order-2 feature interactions play a crucial role in maintaining the model’s performance for the JUnit project.
For LC-problems, the inclusion of order-2 feature cross-terms leads to a slight increase in the mean by 0.02%, which is a minimal change. However, there is a significant decrease in the margin value, suggesting that the model becomes more stable in its predictions. This indicates that, while the impact on the accuracy is marginal, the order-2 features help to improve the model’s consistency and reduce the variance of the predictions.
For Concurrency, the absence of order-2 feature cross-terms causes a notable drop in accuracy, with the mean increasing by 4.07% and the margin rising by 2.05. This results in a significant decrease in model consistency. In this project, where the complexity of the task is higher, the order-2 feature interactions are particularly important in maintaining stable and accurate predictions.
For Kevin, the mean of MovePerf without order-2 is 3.41% higher than that of MovePerf, and the margin value is somewhat lower, but this still demonstrates the necessity of order-2 feature interactions.
Overall, the introduction of order-2 feature cross-terms proves to be beneficial in improving the stability and accuracy of the model, particularly in more complex scenarios like the Concurrency project. The results suggest that order-2 feature interactions are essential in ensuring that the model performs well across a variety of tasks, especially when dealing with higher complexity. Thus, the order-2 feature cross-term in MovePerf is not only necessary overall but also crucial in maintaining the model’s predictive power and consistency.

5.4.3. Results for RQ3

Two components are involved in our approach, FM and the DFNN. To evaluate the contribution of each component in our proposed method, we conduct ablation experiments by comparing their performance.
Our method still adheres to the previously defined parameter settings, with the FM epoch value set to 100. The primary objective of this experimental setup is to ensure that the selected parameter values or the grid search range remain consistent. As a result, the differences observed in the model’s predictions are most likely attributable to architectural differences.
Table 6 presents the prediction results for MovePerf and its two components. On the whole, MovePerf demonstrates superior performance compared to FM and the DFNN across the four datasets, achieving a lower average mean value of 7.69%, representing reductions of 4.98% and 2.45% over FM and the DFNN, respectively. Additionally, its average margin value of 2.15 is significantly better, being 6.90 and 0.78 lower than those of FM and the DFNN. These results show that MovePerf significantly improves the prediction accuracy and stability by combining the strengths of FM and the DFNN.
For JUnit, FM exhibits a relatively high mean of 7.82% and a large margin of 15.32, indicating that the model’s predictions are prone to significant fluctuations, with relatively low accuracy and high instability. In contrast, the DFNN reduces the mean to 6.03%, and the margin decreases significantly to 2.26, demonstrating better stability and consistency. While prediction MREs still exist, the DFNN exhibits smaller prediction fluctuations and more reliable results. MovePerf outperforms both FM and the DFNN with a mean of 5.41% and an extremely small margin of 0.03, indicating high precision and remarkable stability and consistency. Clearly, MovePerf provides superior performance to both FM and the DFNN, showing that combining FM and the DFNN improves the model’s overall performance.
For LC-problems, FM shows a mean of 0.87%, indicating relatively small prediction MREs, but the margin is 5.24, suggesting significant prediction fluctuations. This implies that, while the model provides accurate predictions in certain instances, its stability is poor, with the results varying considerably across different data points. Compared to FM, the DFNN performs even better, with the mean dropping to 0.28% and the margin significantly reduced to 0.08. This demonstrates not only a significant improvement in accuracy but also stronger stability and consistency, with the predictions showing almost no fluctuation. MovePerf, which integrates both FM and the DFNN, further enhances the prediction performance, achieving a mean of 0.28% and an almost zero margin of 0.01, indicating extremely high accuracy with virtually no fluctuation. This confirms that, by combining FM and the DFNN, MovePerf significantly improves the prediction accuracy and consistency, overcoming the limitations of individual models and demonstrating exceptional performance for the LC-problems project.
For Concurrency, FM performs poorly, with a mean of 13.24% and a margin of 8.76, indicating large prediction MREs and considerable fluctuations in the results. The DFNN improves upon FM, reducing the mean to 8.99% and the margin to 2.07. While the prediction MRE remains, the DFNN exhibits more stability and smaller fluctuations. MovePerf delivers the best performance, with a mean of 4.92% and a very small margin of 0.02, showing significant improvements in both its prediction accuracy and stability. Compared to FM and the DFNN, MovePerf achieves a notable enhancement in its overall performance.
For Kevin, the prediction MREs of the two components are quite significant, at 28.74% and 25.21%, respectively. Although the prediction MRE of MovePerf has not yet reached the ideal level, its predictive accuracy is superior to that of the individual components. Additionally, MovePerf also demonstrates a notable advantage in terms of stability.
In summary, MovePerf consistently outperforms FM and the DFNN in terms of both accuracy and stability across multiple datasets. Integrating the two components significantly improves the prediction precision, reduces fluctuations, and results in more consistent performance, making MovePerf a superior approach compared to its individual components.

5.4.4. Results for RQ4

To answer RQ4, we perform performance predictions for three different refactoring outcomes of each project under the same environment and parameter settings, with the results shown in Table 7.
For JUnit, the three refactoring outcomes are as follows: (1) getClassName() is moved to a new class in a new package under the parent directory, getAllFields() is moved to a new class under the parent directory, and getStatedClassName() is moved to AnnotationCondition; (2) getClassName() remains unchanged, getAllFields() is moved to the inner class WindowsCompileProcess within CompileProcess, and getStatedClassName() is moved to Test2Benchmark; (3) both getClassName() and getStatedClassName() are moved to Test2BenchmarkAgent, while getAllFields() is moved to a new class in a new package within the same package.
For LC-problems, the three refactoring outcomes are as follows: (1) generateUnsortedArray() is moved to ArrayTo2dArrayTest, printMatrix(int[][]) is moved to TestUtils, and printMatrix(List<List<T>>) is moved to a new class in the same package; (2) all three methods are moved to TestUtils; (3) generateUnsortedArray() and printMatrix(List<List<T>>) are moved to a new class in a new package, while printMatrix(int[][]) is moved to another new class.
For Concurrency, the three refactoring outcomes are as follows: (1) readFile() and sleepMils() are moved to RunAsyncDemo, sleepSecond() is moved to a new class, and printThreadlog() is moved to SupplyAsyncDemo02; (2) readFile() is moved to RunAsyncDemo02, while sleepMils(), sleepSecond(), and printThreadlog() are moved to CommonUtils; (3) readFile(), sleepMils(), and sleepSecond() are moved to CommonUtils, and printThreadlog() is moved to RunAsyncDemo.
For Kevin, since the project contains two benchmark() methods, one in NorbitBenchmark and one in OrbitBenchmark, we rename them benchmarkN() and benchmarkO(), respectively. The three refactoring outcomes are as follows: (1) benchmarkN() is moved to Benchmarks, prepare() is moved to OrbitEvent, and benchmarkO() remains in place; (2) benchmarkN() and benchmarkO() remain in place, and prepare() is moved to a new class within the same package; (3) benchmarkN() and prepare() are moved to OrbitListener, and benchmarkO() remains in place.
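To make these move operations concrete, the following is a minimal, hypothetical Java sketch of one such MMR: a helper analogous to printMatrix(int[][]) is moved from a test class into a shared utility class such as TestUtils, and the call site is redirected to the new owner. The class and method bodies are illustrative only and are not taken from the LC-problems source.

```java
// Before MMR: the helper lives in the test class itself (hypothetical body).
class ArrayTo2dArrayTest {
    void printMatrix(int[][] matrix) {            // uses no fields of its own class
        for (int[] row : matrix) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}

// After MMR: the method is moved to the target class, and callers are updated.
class TestUtils {
    static void printMatrix(int[][] matrix) {
        for (int[] row : matrix) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}

class ArrayTo2dArrayTestAfterMove {
    void show(int[][] matrix) {
        TestUtils.printMatrix(matrix);            // call site redirected to the target class
    }
}
```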
From the results in Table 7, it can be observed that MovePerf maintains high accuracy across these datasets. It effectively predicts the execution time for each refactoring outcome, and its predicted values provide meaningful reference points for refactoring decisions.
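For reference, the MRE values in Table 7 follow the standard relative-error definition; taking the first JUnit row as a worked example, with measured execution time $y$ and predicted execution time $\hat{y}$:

$$\mathrm{MRE} = \frac{|\hat{y} - y|}{y} \times 100\%, \qquad \frac{|501.80 - 477.36|}{477.36} \times 100\% \approx 5.12\%.$$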

5.4.5. Results for RQ5

To evaluate the practicality and feasibility of our approach, it is essential to measure the time consumed by the training and testing processes of MovePerf. All experiments were conducted on a Dell workstation with a 1.19 GHz Intel Core i5 CPU and 8 GB of RAM, running 64-bit Windows 10. Because the models use different TensorFlow versions, GPU acceleration might not have functioned properly or been fully utilized, which could have made the experiments unstable and the results inconsistent or non-reproducible. Therefore, to ensure the reliability and consistency of the experimental outcomes, we ran all methods in a CPU-only environment.
We compare the time costs of DeepFM, DeepPerf, HINNPerf, and our approach. The CNN and FM models have simpler architectures and much shorter computation times but, as the experimental results indicate, lower predictive accuracy; they are therefore excluded from this comparison.
From the perspective of the model architecture, DeepPerf, DeepFM, and our approach share certain similarities. Since the deep neural network (DNN) component used in DeepFM has a relatively simple architecture, it requires less training time than DeepPerf and MovePerf. The FM component used in MovePerf has a small time cost on its own, and its contribution to the overall time is negligible; therefore, the time cost of MovePerf should be roughly equivalent to that of DeepPerf. HINNPerf, which consists of multiple FNN blocks and requires training many parameters, takes longer.
Furthermore, based on the experimental results, the models’ training times are shown in Figure 9. For all project datasets with MMR, DeepFM has the shortest time cost for searching for optimal hyperparameters and training a model, ranging from 5 to 80 min. DeepPerf and our MovePerf take 3–126 min for model training and hyperparameter searching, while HINNPerf takes 27–297 min.
The time costs are somewhat larger because we used a CPU for the experiments. For comparison, with the GPU setup reported in [31], despite differences in the number of features (9 to 60) and sample sizes (192 to 10^31), the time cost of model training and hyperparameter searching for DeepPerf is 5 to 13 min, while HINNPerf takes 14 to 16 min. Thus, under comparable GPU conditions, we would expect the maximum time cost for MovePerf to be around 13 min, which is acceptable.
In summary, MovePerf demonstrates higher predictive accuracy than other methods, and its time cost in searching for optimal hyperparameters and training a model is reasonable.

5.5. Threats to Validity

This section outlines the threats to the validity of the experimental results. The first threat is the limited number of selected projects. Since we manually collected the execution-time results for each project, constructing the dataset was time-consuming. To mitigate this, we carefully chose projects from diverse areas, including software testing and performance evaluation, algorithm design and optimization, and event-driven programming, to ensure data diversity. Future work will expand the set of projects to address this limitation.
The second threat to validity concerns the accuracy of the execution times in the dataset. To address this, we selected Java projects that use microbenchmarks for performance evaluation and employed JMH to measure the execution times. JMH is designed for precise performance measurement, featuring isolated test runs, multiple measurement iterations, and nanosecond-level accuracy, which keeps the dataset labels as accurate as possible and the model’s predictions reliable.
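As an illustration of how such labels can be collected, the sketch below shows a JMH microbenchmark in the usual style; the benchmark class, workload, and warmup/measurement settings are hypothetical and are not the exact configuration used in our experiments, while the annotations and runner API are standard JMH.

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)          // report the mean execution time per invocation
@OutputTimeUnit(TimeUnit.MILLISECONDS)    // execution-time labels are recorded in milliseconds
@State(Scope.Benchmark)
@Warmup(iterations = 5)                   // warm-up iterations are discarded (JIT compilation)
@Measurement(iterations = 10)             // measured iterations are averaged into the label
@Fork(1)                                  // run in an isolated JVM fork
public class MovedMethodBenchmark {

    private int[][] matrix;

    @Setup
    public void setUp() {
        matrix = new int[200][200];       // hypothetical workload standing in for real inputs
    }

    @Benchmark
    public long measureMovedMethod() {
        // Hypothetical stand-in for invoking the method after refactoring;
        // returning the result prevents dead-code elimination.
        long sum = 0;
        for (int[] row : matrix) {
            for (int v : row) {
                sum += v;
            }
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        Options opts = new OptionsBuilder()
                .include(MovedMethodBenchmark.class.getSimpleName())
                .build();
        new Runner(opts).run();
    }
}
```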

6. Conclusions

This paper proposes a novel approach, MovePerf, to predict the performance impact of MMR. Firstly, we manually applied MMR to 15 Java projects on GitHub, ensuring that only the move method operation was performed. By analyzing the projects before and after refactoring, we identified the 23 class-level, 8 method-level, and 1 semantic-level features most relevant to the refactoring operation. Secondly, JMH was used to measure the execution time with high precision, ensuring the accuracy of the performance labels in the dataset. Finally, we constructed a prediction model consisting of FM and a DFNN, where FM captures low-order interactions between the features and the DFNN models high-order interactions to provide more accurate predictions.

MovePerf was evaluated on four real-world projects. The experimental results show that MovePerf achieves higher prediction accuracy, with an average MRE of 7.69%, outperforming the CNN, FM, DeepFM, DeepPerf, and HINNPerf by 1.83% to 8.61%. Additionally, the model exhibits greater stability in most projects.

Our approach provides developers with a predictive framework to assess the performance impact of a specific refactoring operation before it is implemented, thereby mitigating potential performance degradation in production environments. The operational workflow comprises three key phases: (1) feature extraction for the target classes and methods after refactoring, (2) model inference to predict the execution time, and (3) data-driven refactoring decisions based on the prediction outcomes. It should be noted that, while pre-trained models are available for the established project datasets, new projects require dataset construction and model fine-tuning before deploying MovePerf. Future research directions include exploring candidate target classes to which a method can be moved, expanding the dataset to enhance the model’s robustness and generalizability, and continuing to select suitable projects to further enrich the diversity and quantity of the data used to build the model, allowing for the broader application and refinement of the MovePerf framework.

Author Contributions

Y.Y.: experimental analysis, writing—original draft, review and editing; Y.L.: experimentation, writing—original draft, review and editing; S.L.: experimentation, writing—original draft, review and editing; X.Z.: writing—review and editing; L.Z.: writing—review and editing; Y.B.: proposal of idea, writing—review and editing; Y.Z.: proposal of idea, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Natural Science Foundation of Hebei Province under Grant No. F2023208001 and the Overseas High-Level Talent Foundation of Hebei Province under Grant No. C20230358.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

The authors gratefully acknowledge the insightful comments and suggestions of the reviewers, which improved the presentation of this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Fernandes, E.; Chávez, A.; Garcia, A.; Ferreira, I.; Cedrim, D.; Sousa, L.; Oizumi, W. Refactoring effect on internal quality attributes: What haven’t they told you yet? Inf. Softw. Technol. 2020, 126, 106347.
2. Li, T.; Zhang, Y. Multilingual code refactoring detection based on deep learning. Expert Syst. Appl. 2024, 258, 125164.
3. Al Dallal, J. Predicting move method refactoring opportunities in object-oriented code. Inf. Softw. Technol. 2017, 92, 105–120.
4. Terra, R.; Valente, M.T.; Miranda, S.; Sales, V. JMove: A novel heuristic and tool to detect move method refactoring opportunities. J. Syst. Softw. 2018, 138, 19–36.
5. Couto, C.M.S.; Rocha, H.; Terra, R. A quality-oriented approach to recommend move method refactorings. In Proceedings of the XVII Brazilian Symposium on Software Quality, Curitiba, Brazil, 17–19 October 2018; pp. 11–20.
6. Cui, D.; Wang, S.; Luo, Y.; Li, X.; Dai, J.; Wang, L.; Li, Q. RMove: Recommending move method refactoring opportunities using structural and semantic representations of code. In Proceedings of the 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME), Limassol, Cyprus, 3–7 October 2022; pp. 281–292.
7. Alshayeb, M. Empirical investigation of refactoring effect on software quality. Inf. Softw. Technol. 2009, 51, 1319–1326.
8. Agnihotri, M.; Chug, A. Understanding refactoring tactics and their effect on software quality. In Proceedings of the 2022 12th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 27–28 January 2022; pp. 41–46.
9. Demeyer, S. Refactor conditionals into polymorphism: What’s the performance cost of introducing virtual calls? In Proceedings of the 21st IEEE International Conference on Software Maintenance (ICSM’05), Budapest, Hungary, 26–29 September 2005; pp. 627–630.
10. Traini, L.; Di Pompeo, D.; Tucci, M.; Lin, B.; Scalabrino, S.; Bavota, G.; Lanza, M.; Oliveto, R.; Cortellessa, V. How software refactoring impacts execution time. ACM Trans. Softw. Eng. Methodol. 2021, 31, 1–23.
11. Santana, R.; Martins, L.; Virgnio, T.; Rocha, L.; Costa, H.; Machado, I. An empirical evaluation of RAIDE: A semi-automated approach for test smells detection and refactoring. Sci. Comput. Program. 2024, 231, 103013.
12. Pizzini, A. Behavior-based test smells refactoring: Toward an automatic approach to refactoring Eager Test and Lazy Test smells. In Proceedings of the 2022 IEEE/ACM 44th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), Pittsburgh, PA, USA, 21–29 May 2022.
13. Zhang, Y.; Li, Y.; Meredith, G.; Zheng, K.; Li, X. Move method refactoring recommendation based on deep learning and LLM-generated information. Inf. Sci. 2025, 697, 121753.
14. Rebai, S.; Kessentini, M.; Alizadeh, V.; Sghaier, O.B.; Kazman, R. Recommending refactorings via commit message analysis. Inf. Softw. Technol. 2020, 126, 106332.
15. Nyamawe, A.S. Mining commit messages to enhance software refactorings recommendation: A machine learning approach. Mach. Learn. Appl. 2022, 9, 100316.
16. Alharbi, M.; Alshayeb, M. A comparative study of automated refactoring tools. IEEE Access 2024, 12, 18764–18781.
17. Chávez, A.; Ferreira, I.; Fernandes, E.; Cedrim, D.; Garcia, A. How does refactoring affect internal quality attributes? A multi-project study. In Proceedings of the XXXI Brazilian Symposium on Software Engineering, Fortaleza, Brazil, 20–22 September 2017; pp. 74–83.
18. Kaur, S.; Singh, P. How does object-oriented code refactoring influence software quality? Research landscape and challenges. J. Syst. Softw. 2019, 157, 110394.
19. Chen, J.; Yu, D.; Hu, H.; Li, Z.; Hu, H. Analyzing performance-aware code changes in software development process. In Proceedings of the 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), Montreal, QC, Canada, 25–26 May 2019; pp. 300–310.
20. Ha, H.; Zhang, H. DeepPerf: Performance prediction for configurable software with deep sparse neural network. In Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), Montreal, QC, Canada, 25–31 May 2019; pp. 1095–1106.
21. Siegmund, N.; Grebhahn, A.; Apel, S.; Kästner, C. Performance-influence models for highly configurable systems. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, Bergamo, Italy, 30 August–4 September 2015; pp. 284–294.
22. Cheng, H.T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, Boston, MA, USA, 15 September 2016; pp. 7–10.
23. Guo, H.; Tang, R.; Ye, Y.; Li, Z.; He, X. DeepFM: A factorization-machine based neural network for CTR prediction. arXiv 2017, arXiv:1703.04247.
24. Laaber, C.; Würsten, S.; Gall, H.C.; Leitner, P. Dynamically reconfiguring software microbenchmarks: Reducing execution time without sacrificing result quality. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual, 8–13 November 2020; pp. 989–1001.
25. Povirk, C. Caliper. 2024. Available online: https://github.com/google/caliper (accessed on 12 March 2023).
26. Kawaguchi, K. Japex. 2011. Available online: https://github.com/kohsuke/japex (accessed on 26 February 2023).
27. noconnor. JUnitPerf. 2024. Available online: https://github.com/noconnor/JUnitPerf (accessed on 3 March 2023).
28. Aniche, M. CK. 2022. Available online: https://github.com/mauricioaniche/ck (accessed on 6 July 2023).
29. Goldberg, Y. Word2vec Explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv 2014, arXiv:1402.3722.
30. Zhang, Y.; Ge, C.; Hong, S.; Tian, R.; Dong, C.; Liu, J. DeleSmell: Code smell detection based on deep learning and latent semantic analysis. Knowl.-Based Syst. 2022, 255, 109737.
31. Cheng, J.; Gao, C.; Zheng, Z. HINNPerf: Hierarchical Interaction Neural Network for Performance Prediction of Configurable Systems. ACM Trans. Softw. Eng. Methodol. 2022, 32, 1–30.
32. Rendle, S. Factorization machines. In Proceedings of the 2010 IEEE International Conference on Data Mining, Sydney, NSW, Australia, 13–17 December 2010; pp. 995–1000.
33. Chanaa, A.; El Faddouli, E.-E. An analysis of learners’ affective and cognitive traits in Context-Aware Recommender Systems (CARS) using feature interactions and Factorization Machines (FMs). J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 4796–4809.
34. Mohammad, A.; Aljarrah, F.F.; Lee, C. A new generalized normal distribution: Properties and applications. Commun. Stat.-Theory Methods 2019, 48, 4474–4491.
Figure 1. An overview of MovePerf.
Figure 2. Dataset construction.
Figure 3. Heatmap of order-1 feature interactions.
Figure 4. Heatmap of order-2 feature interactions.
Figure 5. Decision tree fragment demonstrating high-order feature interactions.
Figure 6. Architecture of MovePerf’s prediction model.
Figure 7. Calculation and application of order-2 cross-terms.
Figure 8. Architecture of DFNN component.
Figure 9. Time costs of the four models on the four datasets.
Table 1. Performance changes when conducting MMR in project Kevin.

| Project | Method | Source Class | Target Class | Execution Time Before (ms) | Execution Time After (ms) | Difference (ms) |
|---|---|---|---|---|---|---|
| Kevin | prepare() and benchmark() | OrbitBenchmark | Benchmarks | 101,758.47 | 51,456.9 | −50,301.54 |
| Kevin | prepare() | OrbitBenchmark | NorbitBenchmark | 101,758.47 | 102,263.37 | +504.9 |
Table 2. Selected metrics. (A “T” marks the level, class or method, at which the metric is collected.)

| Metric | Class | Method | Description |
|---|---|---|---|
| CBO | T | T | Coupling between objects |
| CBO Modified | T | T | Coupling between objects |
| FAN-IN | T | T | Input dependencies |
| FAN-OUT | T | T | Output dependencies |
| WMC | T |  | Weighted methods per class |
| RFC | T | T | Response for a class |
| DIT | T |  | Depth of inheritance tree |
| NOC | T |  | Number of children |
| LOC | T |  | Lines of code |
| NOSI | T |  | Number of static invocations |
| LCOM | T |  | Lack of cohesion of methods |
| LCOM* | T |  | Lack of cohesion of methods |
| modifiers | T |  | Public/abstract/private/protected/native |
| totalMethodsQty | T |  | Total number of methods |
| publicMethodsQty | T |  | Number of public methods |
| protectedMethodsQty | T |  | Number of protected methods |
| defaultMethodsQty | T |  | Number of default methods |
| privateMethodsQty | T |  | Number of private methods |
| staticMethodsQty | T |  | Number of static methods |
| finalMethodsQty | T |  | Number of final methods |
| abstractMethodsQty | T |  | Number of abstract methods |
| visibleMethodsQty | T |  | Number of visible methods |
| assignmentsQty | T |  | Number of assignments |
| methodsInvokedQty |  | T | Directly invoked methods |
| methodsInvokedLocalQty |  | T | Locally invoked methods |
| methodsInvokedIndirectLocalQty |  | T | Indirectly invoked local methods |
Table 3. Metric selection rationale.

| Metric | Impact on Execution Time |
|---|---|
| CBO | Reduces method invocation overhead by decreasing inter-class coupling |
| CBO Modified | Extended coupling metric incorporating bidirectional dependencies |
| FAN-IN | Directly affects execution path through call frequency changes |
| FAN-OUT | Influences call chain length via external method invocations |
| WMC | Control flow complexity positively correlates with execution time |
| RFC | Response set complexity affects method call overhead |
| DIT | Inheritance depth increases vtable lookup overhead |
| NOC | Number of subclasses impacts polymorphic call resolution cost |
| LOC | Code size directly correlates with instruction execution time |
| NOSI | Excessive static calls increase class loading time |
| LCOM | Cohesion changes affect cache locality |
| LCOM* | Graph-based cohesion measurement better reflects execution efficiency |
| modifiers | Access modifiers determine invocation mechanism (static/dynamic binding) |
| totalMethodsQty | Total methods positively correlate with invocation overhead |
| publicMethodsQty | Number of public interfaces affects vtable size |
| protectedMethodsQty | Protected methods increase dynamic binding overhead |
| defaultMethodsQty | Package-private calls avoid access control checks |
| privateMethodsQty | Higher JIT inlining probability for private methods |
| staticMethodsQty | Static binding eliminates vtable lookup |
| finalMethodsQty | Final modifier increases JIT optimization probability |
| abstractMethodsQty | Abstract methods enforce virtual call indirection |
| visibleMethodsQty | Number of visible methods affects class loading time |
| assignmentsQty | Assignment operation density impacts instruction pipeline efficiency |
| methodsInvokedQty | Number of call sites directly affects execution time |
| methodsInvokedLocalQty | Local call frequency affects register allocation |
| methodsInvokedIndirectLocalQty | Indirect calls increase pointer jumping overhead |
Table 4. Prediction results for each model on the datasets.

| Model | Project | Mean MRE (%) | Margin |
|---|---|---|---|
| CNN | JUnit | 15.83 | 4.94 |
| CNN | LC-problems | 6.44 | 5.64 |
| CNN | Concurrency | 12.47 | 4.39 |
| CNN | Kevin | 30.45 | 6.34 |
| FM | JUnit | 7.82 | 15.32 |
| FM | LC-problems | 0.87 | 5.24 |
| FM | Concurrency | 13.24 | 8.76 |
| FM | Kevin | 28.74 | 6.89 |
| DeepFM | JUnit | 13.53 | 10.58 |
| DeepFM | LC-problems | 1.77 | 3.56 |
| DeepFM | Concurrency | 9.87 | 3.62 |
| DeepFM | Kevin | 23.67 | 8.46 |
| DeepPerf | JUnit | 6.03 | 2.26 |
| DeepPerf | LC-problems | 0.32 | 0.10 |
| DeepPerf | Concurrency | 8.99 | 2.07 |
| DeepPerf | Kevin | 25.21 | 7.29 |
| HINNPerf | JUnit | 5.96 | 2.00 |
| HINNPerf | LC-problems | 0.35 | 0.10 |
| HINNPerf | Concurrency | 8.99 | 2.06 |
| HINNPerf | Kevin | 22.78 | 10.94 |
| MovePerf | JUnit | 5.41 | 0.03 |
| MovePerf | LC-problems | 0.28 | 0.01 |
| MovePerf | Concurrency | 4.92 | 0.02 |
| MovePerf | Kevin | 20.16 | 8.53 |
Table 5. Comparison between cases with and without order-2 feature cross-terms.

| Project | With Order-2: Mean MRE (%) | With Order-2: Margin | Without Order-2: Mean MRE (%) | Without Order-2: Margin |
|---|---|---|---|---|
| JUnit | 5.41 | 0.03 | 9.08 | 2.82 |
| LC-problems | 0.28 | 0.01 | 0.26 | 0.07 |
| Concurrency | 4.92 | 0.02 | 8.99 | 2.07 |
| Kevin | 20.16 | 8.53 | 23.57 | 8.43 |
| Average | 7.69 | 2.15 | 10.48 | 3.35 |
Table 6. Ablation experiment.

| Project | FM: Mean MRE (%) | FM: Margin | DFNN: Mean MRE (%) | DFNN: Margin | MovePerf: Mean MRE (%) | MovePerf: Margin |
|---|---|---|---|---|---|---|
| JUnit | 7.82 | 15.32 | 6.03 | 2.26 | 5.41 | 0.03 |
| LC-problems | 0.87 | 5.24 | 0.32 | 0.10 | 0.28 | 0.01 |
| Concurrency | 13.24 | 8.76 | 8.99 | 2.07 | 4.92 | 0.02 |
| Kevin | 28.74 | 6.89 | 25.21 | 7.29 | 20.16 | 8.53 |
| Average | 12.67 | 9.05 | 10.14 | 2.93 | 7.69 | 2.15 |
Table 7. Generalization of the experimental results.

| Project | True Value (ms) | Predicted Value (ms) | MRE (%) |
|---|---|---|---|
| JUnit | 477.36 | 501.80 | 5.12 |
| JUnit | 469.76 | 442.19 | 5.87 |
| JUnit | 472.92 | 496.38 | 4.96 |
| LC-problems | 6117 | 6136.57 | 0.32 |
| LC-problems | 6105 | 6090.68 | 0.23 |
| LC-problems | 6100 | 6122.55 | 0.37 |
| Concurrency | 4026.15 | 4155.39 | 3.21 |
| Concurrency | 3700.09 | 3909.89 | 5.67 |
| Concurrency | 3155.02 | 2997.69 | 4.99 |
| Kevin | 51,621.00 | 60,824.84 | 17.83 |
| Kevin | 102,812.96 | 81,872.22 | 20.34 |
| Kevin | 51,697.67 | 62,088.90 | 20.10 |

