An Automatic Software Defect Repair Method Based on Multi-Objective Genetic Programming

Tiantian Han; Yonghe Chu; Fangzheng Liu

doi:10.3390/app14188550

Abstract

Due to the explosive growth in the amount of software and the uneven skill levels of software developers, a large number of software defects emerge during the later stages of software maintenance. Search methods based on genetic programming are among the most popular search-based repair approaches, but they also have some issues: validating and selecting offspring patches with a single objective, without considering other constraints, can reduce the efficiency of patch generation. To address this issue, this paper proposes an automatic software repair method based on Multi-objective Genetic Programming (MGPRepair). First, the method adopts a lightweight context analysis strategy to find suitable repair materials. Second, it decouples the replacement statements and insertion statements in the repair materials and uses a finer-grained patch representation to encode the patches in the search space. Third, automatic software defect repair is treated as a multi-objective search problem, and the NSGA-II multi-objective optimization algorithm is used to find simpler repair patches. Finally, a test case filtering technique is used to accelerate patch validation and generate correct patches. MGPRepair was experimentally evaluated on 395 real Java software defects from the Defects4J dataset. The experimental results show that MGPRepair can generate test-suite-passing patches for 51 defects, 35 of which are equivalent to manually written patches. Its repair efficiency and success rate are higher than those of other well-known automatic software defect repair methods such as jGenProg, RSRepair, ARJA, Nopol, CapGen, and SequenceR.

Keywords:

genetic programming; automatic software repair; multi-objective optimization; context analysis

1. Introduction

With the rapid development of technology worldwide, computer software has become an essential tool for building an information-based society, and this trend is set to continue. However, a large number of defects arise during the development and use of software, making defect repair one of the most important aspects of software development. Consequently, automatic software defect repair has emerged and become a hot topic in software engineering research [1]. Automatic defect repair technology aims to fix program failures quickly and accurately, reducing debugging costs and improving software quality [2].

In recent years, search-based repair methods have been favored by researchers. These methods primarily use heuristic search algorithms such as genetic programming to search for correct patches in a space of potential patches and then test the candidates to verify their correctness. Automatic program repair techniques based on heuristic search guide the generation of repair patches according to manually defined heuristic rules. GenProg (Genetic Program Repair) [3,4] is a well-known program repair method in this category. It breaks away from program semantic constraints and applies genetic programming to automatic program defect repair with good results, ushering in a new era of automatic program repair, although its fitness function has been controversial. During the search, GenProg defines crossover and mutation operations on code snippets to explore the patch search space by recombining existing code. Ghanbari et al. [5] reported PraPR (practical program repair), a program repair technique that operates on JVM bytecode and thereby mitigates some drawbacks of source-code-level repair, such as complex program structure and long compilation times in each iteration. Bytecode-level repair can effectively overcome these disadvantages, but it still offers no fundamental solution to the large computational cost of genetic programming. Recognizing the drawbacks of genetic programming, Qi et al. [6] proposed RSRepair (Random-Search-based Repair) in 2014. This method adopts the same repair framework as GenProg but replaces the genetic algorithm with random search. Their comparative experiments show that the genetic programming algorithm in GenProg does not play a significant role: random search not only correctly fixes the same bugs but also repairs 23 bugs more efficiently than the genetic algorithm.
To improve the search efficiency of genetic programming, Yuan [7] proposed ARJA (automated repair of Java), which not only enriches GenProg's search space but also improves search efficiency through test case filtering. This alleviates the heavy cost of repeatedly evaluating the fitness function and prevents the patch search process from deteriorating. However, while ARJA improves GenProg in many ways, it does not fundamentally alleviate the problem of patch overfitting. In 2020, Yuan et al. [8,9] proposed ARJA-e, an enhanced version of ARJA, to address this issue. By detecting overfitting patches, it alleviates patch overfitting, although its experimental repair results remain the same. The aforementioned methods, including GenProg, RSRepair, and ARJA, use a single objective to select offspring patches. This leaves the selection process incomplete, so higher-quality offspring patches cannot be selected during genetic iteration. As a result, more evolutionary iterations and more computation are needed, which slows down the entire defect repair process.

Sun et al. [10] introduced a repair method that mitigates the impact of fault localization results by considering multiple candidate fault locations simultaneously during population initialization, generating candidate patches to address inaccurate fault localization. Furthermore, when obtaining repair materials, methods such as code similarity are often used. However, without effective filtering and classification, a large number of ineffective repair materials are searched, resulting in slow search processes and low repair success rates. Kim [11] proposed an automatic repair tool, ConFix (context-based fix), which directly applies historical manual repair templates. It extracts relevant context information based on the location of the modified code on the code’s abstract syntax tree, reducing unnecessary code modifications by an average of 48% according to experimental results. Li et al. [12] proposed ARJANMT, which creates candidate repair statements by leveraging the redundancy assumption and sequence-to-sequence learning of correct patches, and inputs them into a multi-objective evolutionary search algorithm to find test-suite-adequate patches.

In addition to these heuristic-based approaches, program repair based on code similarity has also gained attention. For example, ssFix [13], proposed by Xin et al., searches a code repository for code similar to the defective code and uses it as raw material for repair patches. During patch generation, instead of directly reusing the similar code, ssFix uses ChangeDistiller to extract the differences between the defective code and the similar code and applies the changes one by one. This fine-grained code reuse effectively enhances ssFix's patch generation capability. Jiang et al. [14] argue that code modules with similar functionality exhibit highly similar code styles. Based on this idea, they proposed an automatic repair tool called SimFix, which searches for similar code within the project using both structural and semantic features of the code. It then compares the erroneous code with the similar code and extracts fine-grained code modification operations. Compared to ssFix, SimFix almost doubles the repair accuracy. Cao et al. [15] proposed RSCSRepair, which calculates similarity using custom and system identifiers. Hu et al. [16] proposed recommending repair patches for student-submitted homework code based on correct submissions and implemented a repair tool called Refactory. Unlike the approaches above, it incorporates a code refactoring component that makes semantically similar code as structurally similar as possible, which facilitates subsequent code matching. However, its application scenario is tailored to student submissions, where many different implementations of the same functionality exist. In industrial development environments, such functionally identical code does not always exist, which limits its applicability.

Although search-based automatic program repair methods [17,18] and code-similarity-based repair methods have improved the efficiency and accuracy of software defect repair to some extent, they still have limitations. First, search-based methods rely on single-objective optimization algorithms, which often overlook diversity when selecting repair materials and generating patches [6]. This can result in lower-quality patches and even overfitting. Second, although code similarity methods improve the accuracy of patch generation by searching existing code repositories for similar code, the search is often time-consuming and inefficient for complex code structures [5]. Moreover, both kinds of methods typically struggle to balance repair efficiency and patch quality across different types of software defects. On the one hand, the high computational complexity and many iterations of search-based methods often lead to lengthy repair processes; on the other hand, code similarity methods are sensitive to differences in code style and semantics when searching for matching code snippets, which slows the search and lowers repair success rates.

To address these issues, this paper proposes MGPRepair, an automatic defect repair method based on multi-objective genetic programming. First, the method employs a lightweight context analysis strategy to filter out unpromising repair materials. Second, it encodes patches in the search space with a finer-grained patch representation and applies classification rules to decouple the replacement and insertion statements in the repair materials. The method formulates automatic defect repair as a multi-objective search problem and uses the multi-objective optimization algorithm NSGA-II [19] to find simpler repairs. Then, to reduce computational cost and the search space, it introduces test filtering and patch overfitting detection, which accelerate the fitness evaluation of genetic programming and reduce patch verification time. Finally, experiments were conducted on 395 defects from six projects in Defects4J [20]. The experimental results show that MGPRepair can generate test-suite-passing patches for 51 defects, 35 of which are equivalent to manually written patches, a clear improvement over similar methods. The contributions of this paper are as follows:

(1): To address the issue of low-quality repair material selection, a lightweight context analysis strategy is employed to identify suitable repair materials. The decoupling of replacement and insertion statements within the repair materials is implemented to enhance the accuracy of defect matching.
(2): To address the inefficient search caused by traditional patch encoding methods, a finer-grained patch representation is proposed to encode patches in the search space.
(3): To address incomplete offspring patch selection caused by single-objective search, an improved fitness function based on the multi-objective optimization algorithm NSGA-II [19] is proposed for selecting offspring populations.

2. Related Concepts

2.1. Genetic Programming

Genetic Programming (GP) [21] is a stochastic search technique inspired by biological evolution and is generally regarded as an extension of Genetic Algorithms. It evolves computer programs toward specific functional or quality goals under given constraints. In GP, computer programs are encoded as genomes, which can be syntax trees, instruction sequences, or other linear or hierarchical data structures. Each genome is evaluated with a fitness function to select those that better satisfy the predefined goals. GP starts with a set of genomes, typically generated randomly, which evolve gradually through iteration. In each generation, GP first selects part of the current population based on fitness and then applies crossover and mutation to the selected individuals to generate new genomes, forming the next population.
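The generational loop just described can be made concrete with a toy sketch. It is illustrative only and not MGPRepair's search: the genome is a linear instruction sequence over three hypothetical operations (`inc`, `dbl`, `dec`) applied to the value 1, and fitness is the distance to a target number (lower is better). Truncation selection, one-point crossover, and point mutation stand in for the generic GP operators.

```python
import random

# Hypothetical instruction set: each genome is a list of these operation names.
OPS = {"inc": lambda v: v + 1, "dbl": lambda v: v * 2, "dec": lambda v: v - 1}

def run(genome, start=1):
    """Execute the instruction sequence on the start value."""
    value = start
    for op in genome:
        value = OPS[op](value)
    return value

def fitness(genome, target):
    """Distance to the target; lower is better."""
    return abs(run(genome) - target)

def evolve(target, pop_size=30, length=6, generations=40, p_mut=0.1, seed=0):
    rng = random.Random(seed)
    ops = list(OPS)
    # Initial population: randomly generated genomes.
    pop = [[rng.choice(ops) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness-based selection: the better half becomes the parent pool.
        pop.sort(key=lambda g: fitness(g, target))
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Point mutation: each gene is resampled with probability p_mut.
            child = [rng.choice(ops) if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = parents + children  # parents survive, so the best fitness never worsens
    return min(pop, key=lambda g: fitness(g, target))

best = evolve(target=20)
print(best, run(best), fitness(best, 20))
```

Because the parents are carried over unchanged, the best fitness in the population is non-increasing from one generation to the next.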

2.2. Multi-Objective Optimization Problem

Most previous genetic-programming-based repair approaches considered only the number of passed test cases when selecting offspring variants. However, real-world tasks often involve multiple competing objectives that must be optimized simultaneously, which can be formulated as a multi-objective optimization problem (MOP) [22]. Mathematically, a general MOP can be expressed as:

\min F (x) = {(f_{1} (x), f_{2} (x), \dots, f_{n} (x))}^{T}

(1)

F(x) consists of n objective functions, and x is the decision vector. The n objective functions in Equation (1) often conflict: improving one objective may worsen others. Hence, there is usually no single solution that optimizes all objectives simultaneously, and solving a MOP focuses on approximating the Pareto Front (PF), which represents the best trade-offs among the objectives. The PF is formally defined as follows:

Definition 1.

Pareto Dominance [23]. Given two vectors p = (p_{1}, p_{2}, \dots, p_{m})^{T} and q = (q_{1}, q_{2}, \dots, q_{m})^{T}, p dominates q, denoted p ≺ q, if and only if

\forall i \in \{1, 2, \dots, m\}: p_{i} \leq q_{i} \quad \text{and} \quad \exists j \in \{1, 2, \dots, m\}: p_{j} < q_{j}

Definition 2.

The PF is the set of objective vectors of the solutions that are not dominated by any other solution.
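For minimization problems such as Equation (1), Definitions 1 and 2 translate directly into code. The sketch below assumes each solution is summarized by a tuple of objective values; the sample points are illustrative, for example (F1, F2) pairs of the kind used in Section 3.3.

```python
from typing import List, Sequence

def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """Definition 1 (minimization): p is no worse than q everywhere and strictly better somewhere."""
    return all(pi <= qi for pi, qi in zip(p, q)) and any(pi < qi for pi, qi in zip(p, q))

def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Definition 2: keep the objective vectors that no other vector dominates."""
    # A vector never dominates itself, so it is safe to compare against the whole list.
    return [p for p in points if not any(dominates(q, p) for q in points)]

print(pareto_front([(0.0, 3), (0.5, 1), (0.5, 2), (1.0, 1)]))  # [(0.0, 3), (0.5, 1)]
```

Here (0.5, 2) and (1.0, 1) are both dominated by (0.5, 1), while (0.0, 3) and (0.5, 1) represent a genuine trade-off and form the front.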

Because Evolutionary Algorithms (EAs) are population-based, they can approximate the PF of a MOP in a single run by obtaining a set of non-dominated objective vectors, from which a decision-maker can select one or more to satisfy specific needs. Such algorithms are referred to as Multi-Objective Evolutionary Algorithms [24]. In a multi-objective setting, Multi-Objective Genetic Programming uses a Multi-Objective Evolutionary Algorithm to evolve a set of candidate patch programs with respect to multiple objectives.

2.3. Search Space

Automatic software defect repair based on genetic programming search can be regarded as the process of finding the correct patch for a defect in a vast patch search space [3,4]. Researchers have found that the design of the search space has a significant impact on repair effectiveness, so setting a reasonable patch search space has become a key research question for search-based methods. The patch search space should not be too large: an overly large space leads to excessive repair time or prevents the search algorithm from efficiently finding the correct patch, and it also produces many plausible but incorrect patches, greatly increasing the difficulty of verifying patches later. At the same time, the space should not be too small, because it may then not contain the correct patch, making it impossible to find the optimal solution. This paper adds constraints to the patch search space to control its scope effectively, while also expanding it according to certain rules so that it contains more correct patches.

3. Method and Implementation

This paper proposes MGPRepair, a multi-objective genetic programming-based method for automatic software defect repair. It is designed to address the inefficiency and test case overfitting inherent in traditional methods that validate and select offspring patches with a single objective. First, MGPRepair introduces a new fitness function that incorporates code similarity and positional information. This allows a more accurate assessment of patch effectiveness and helps avoid the local optima that traditional fitness functions often fall into when computational complexity is high. Additionally, MGPRepair integrates the multi-objective optimization algorithm NSGA-II [19], which considers both the ability of offspring patches to pass test cases and the effect of patch size on patch quality. This dual consideration reduces the dependency on test cases during repair. By selecting a high-quality initial population before the genetic search, MGPRepair accelerates the evolutionary process and reduces defect repair time. Figure 1 illustrates the technical framework of the proposed method.

Figure 1. Framework Diagram of Multi-Objective Genetic Programming-Based Automatic Software Defect Repair Method.

As shown in Figure 1, the proposed method consists of three stages. The first stage is defect localization. It takes the test suite and the defective source code as input, uses an established fault localization technique (such as Ochiai [25,26]) to detect possible defect locations, computes a suspiciousness score for each suspicious location, and ranks the locations by score. The second stage is patch generation. First, based on the defect locations, a large number of candidate repair statements are extracted from the source code, and lightweight context analysis is used to filter out repair materials unrelated to the defect locations. The remaining materials are then further filtered and classified according to established rules. A higher-quality initial population is generated from the defect locations and their repair materials. The multi-objective genetic algorithm NSGA-II is then used to find, more quickly, high-quality offspring that pass more test cases and have smaller patch sizes. During the validation of offspring patches, a test case filtering technique significantly reduces the number of executed test cases and accelerates the search. The third stage is patch validation. The generated candidate patches are validated against the complete test suite, and validated patches undergo overfitting detection. A patch that passes overfitting detection is considered credible. Finally, manual inspection determines whether the patch is truly a correct patch for the defective program.
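As an illustration of the first stage, the sketch below ranks statements with the standard Ochiai formula, susp(s) = ef(s) / sqrt(|failing| × (ef(s) + ep(s))), where ef and ep count the failing and passing tests that execute s. The coverage format (test name mapped to a set of statement ids) and the function name `ochiai_ranking` are assumptions for illustration; the default threshold of 0.2 mirrors the minimum suspiciousness used in Section 4.1.

```python
import math
from typing import Dict, List, Set, Tuple

def ochiai_ranking(coverage: Dict[str, Set[str]], failing: Set[str],
                   threshold: float = 0.2) -> List[Tuple[str, float]]:
    """Rank statements by Ochiai suspiciousness, dropping those below the threshold.

    coverage: hypothetical format mapping each test name to the statement ids it executes.
    failing:  names of the tests that fail on the defective program.
    """
    statements = set().union(*coverage.values())
    total_failed = len(failing)
    scores = {}
    for s in statements:
        ef = sum(1 for t, cov in coverage.items() if s in cov and t in failing)
        ep = sum(1 for t, cov in coverage.items() if s in cov and t not in failing)
        denom = math.sqrt(total_failed * (ef + ep))
        scores[s] = ef / denom if denom else 0.0
    # Highest suspiciousness first; ties broken by statement id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(s, score) for s, score in ranked if score >= threshold]

cov = {"t1": {"L1", "L2"}, "t2": {"L2"}, "t3": {"L3"}}
print(ochiai_ranking(cov, failing={"t1"}))  # L1 scores 1.0, L2 about 0.707; L3 is filtered out
```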

3.1. Repair Material

This paper adopts the redundancy assumption, namely that statements needed for a fix may already appear in other parts of the program. The method first obtains a set of potentially defective statements from the defect localization stage and identifies the packages in which they reside. It then scans all the code within these packages one by one. Figure 2 shows a real code snippet. Specifically, for each statement S, it first checks whether the variables used by S, including those used in its method calls, are within the scope of the suspicious statement. If not, S is discarded immediately. Next, it checks whether S conforms to the rules in Table 1. For example, the first rule in the table states that, because of Java's semantics, a Break or Continue statement can only be used for replacement or insertion inside a loop. Therefore, such a statement is useful as repair material only when the suspicious modification point lies inside a loop such as a For or While loop; otherwise, it is discarded. After that, S is checked against the rules in Table 2, three of which concern insertion and three of which concern replacement. If S passes the three insertion rules, it proceeds to the check for insertion material; if it passes the replacement rules, it proceeds to the check for replacement material. If S passes all the rules in Table 2, it is considered for both. When S is evaluated as a replacement statement, it should be sufficiently similar to the statement at the suspicious modification point; when it is evaluated as an insertion statement, it should be related to the context of the modification point.

Figure 2. Real Code Snippet.

Table 1. Rules for Filtering Components at Each Modification Point.

Table 2. Operations to Disable Certain Specific Rules.
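The first two filtering steps for a candidate statement S, the scope check and the Table 1 rule for Break/Continue, can be sketched as follows. The `Stmt` and `ModPoint` classes are a hypothetical, simplified model of what a real implementation would extract from the abstract syntax tree; the remaining rules of Tables 1 and 2 are not reproduced here.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

@dataclass(frozen=True)
class Stmt:
    """Hypothetical simplified statement model (not the paper's data structure)."""
    kind: str                  # e.g. "Break", "Continue", "Return", "Expression"
    used_vars: FrozenSet[str]  # variables and fields used, including those reached via method calls

@dataclass(frozen=True)
class ModPoint:
    """Hypothetical model of a suspicious modification point."""
    visible_vars: FrozenSet[str]  # variables in scope at the suspicious statement
    in_loop: bool                 # whether the suspicious statement lies inside a for/while loop

def passes_prefilter(s: Stmt, point: ModPoint) -> bool:
    # Scope check: every variable S uses must be visible at the modification point.
    if not s.used_vars <= point.visible_vars:
        return False
    # First rule of Table 1: Break/Continue are only valid inside a loop.
    if s.kind in ("Break", "Continue") and not point.in_loop:
        return False
    return True

def candidate_materials(stmts: List[Stmt], point: ModPoint) -> List[Stmt]:
    return [s for s in stmts if passes_prefilter(s, point)]

point = ModPoint(visible_vars=frozenset({"a", "b"}), in_loop=False)
stmts = [Stmt("Break", frozenset()), Stmt("Return", frozenset({"a"})), Stmt("Expression", frozenset({"c"}))]
print(candidate_materials(stmts, point))  # only the Return statement survives
```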

Assume that V_{S} and V_{LBS} are the sets of variables (including local variables and fields) used by statement S and by the suspicious statement LBS, respectively. The similarity between S and LBS is defined as the Jaccard similarity coefficient between V_{S} and V_{LBS}, as stated in Formula (2). When collecting the fields used by a statement, the fields accessed through method calls within the current class must also be considered. For example, if LBS is the statement return b + geta(), where geta() reads field a, then V_{LBS} = \{a, b\}.

\mathrm{Jaccard}_{1}(S, LBS) = \frac{|V_{S} \cap V_{LBS}|}{|V_{S} \cup V_{LBS}|}

(2)

To determine whether statement S can be inserted before statement LBS, this paper takes the n lines of code before LBS and the n lines after it (2n lines in total) and extracts the sets of variables used in them. Let V_{B} and V_{A} be the sets of variables used in the n lines before and after LBS, respectively. The relevance of S to the context of LBS is then defined by Formula (3).

\mathrm{Jaccard}_{2}(S, LBS) = \frac{1}{2} \left( \frac{|V_{S} \cap V_{B}|}{|V_{S}|} + \frac{|V_{S} \cap V_{A}|}{|V_{S}|} \right)

(3)

Two threshold parameters, β_{1} = 0.5 and β_{2} = 0.7, are predefined. Multiple experiments showed that these values balance the size of the repair search space against the quality of the repair materials, improving both the efficiency and the effectiveness of the repair process. If Jaccard_{1}(S, LBS) > β_{1}, statement S is added to the set of candidate replacement statements; if Jaccard_{2}(S, LBS) > β_{2}, S is added to the set of candidate insertion statements. Finally, the search space is determined, and the sets of insertion and replacement repair materials are returned. Experimental results demonstrate that this improved repair material selection significantly enhances the quality of the selected repair materials.
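Formulas (2) and (3) and the two thresholds can be sketched directly on variable sets. Returning 0.0 for empty sets is an assumption (the formulas are undefined there), and `classify` is a hypothetical helper that reports the roles S may play at a modification point. The example reuses V_{LBS} = {a, b} from the `return b + geta()` case.

```python
from typing import Set

BETA_1 = 0.5  # replacement threshold (beta_1 in the text)
BETA_2 = 0.7  # insertion threshold (beta_2 in the text)

def jaccard1(v_s: Set[str], v_lbs: Set[str]) -> float:
    """Formula (2): Jaccard similarity between the variables of S and of LBS."""
    union = v_s | v_lbs
    return len(v_s & v_lbs) / len(union) if union else 0.0  # assumption: 0.0 for empty sets

def jaccard2(v_s: Set[str], v_before: Set[str], v_after: Set[str]) -> float:
    """Formula (3): average share of S's variables found in the n lines before and after LBS."""
    if not v_s:
        return 0.0  # assumption: a statement without variables has no contextual relevance
    return 0.5 * (len(v_s & v_before) / len(v_s) + len(v_s & v_after) / len(v_s))

def classify(v_s: Set[str], v_lbs: Set[str], v_before: Set[str], v_after: Set[str]) -> Set[str]:
    """Hypothetical helper: the roles S may play at the modification point LBS."""
    roles = set()
    if jaccard1(v_s, v_lbs) > BETA_1:
        roles.add("replace")
    if jaccard2(v_s, v_before, v_after) > BETA_2:
        roles.add("insert")
    return roles

# LBS is `return b + geta()`, so V_LBS = {a, b}; S uses {a, b, c}.
print(classify({"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}))  # {'replace'}
```

In this example Jaccard_1 = 2/3 exceeds β_1, while Jaccard_2 = 2/3 falls just short of β_2, so S is kept only as replacement material.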

3.2. New Way to Represent Patches

Earlier repair methods used abstract syntax trees as the genetic representation for genetic programming. However, this representation often leads to unmanageable memory consumption for large programs, limiting scalability. A tuple-based representation usually occupies less memory, which matters for large-scale problems: automatic software defect repair must process many potential defect locations and repair materials, so using tuples reduces memory consumption and improves the scalability of the algorithm. This paper therefore proposes a new representation. Specifically, each solution is represented as a tuple x = (b, v) consisting of two parts, each a vector of size n, where n is the number of selected potentially defective locations. The first part, b = (b_{1}, b_{2}, \dots, b_{n}), is a binary vector with b_{j} \in \{0, 1\} (j = 1, 2, \dots, n) that marks which defect locations are edited. As shown in Figure 3, b_{j} = 0 indicates that the j-th potentially defective location in x is not modified, whereas b_{j} = 1 indicates that it is.

Figure 3. Explanation and Illustration of Defect Location b.

Like b, the second part, v = (v_{1}, v_{2}, \dots, v_{n}), is a vector of size n, where v_{j} (j = 1, 2, \dots, n) specifies which statement from the repair material set of the j-th modification point is used in x. As detailed in Figure 4, the repair material set consists of an insertion statement set and a replacement statement set. For example, if the j-th entry of v is I4, the 4th statement of the insertion set is selected and inserted at the j-th location.

Figure 4. Explanation and Illustration of Repair Material Set v.
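A minimal sketch of the tuple representation x = (b, v), assuming each material is labeled with a set prefix and a 1-based index: `I4` follows Figure 4, while the `R` prefix for the replacement set is a hypothetical label chosen here. Locations are 0-based in the code.

```python
from dataclasses import dataclass
from typing import List, Tuple

SET_NAMES = {"I": "insert", "R": "replace"}  # "R" prefix is a hypothetical label

@dataclass
class Patch:
    b: List[int]  # b[j] = 1 if the j-th suspicious location is edited, else 0
    v: List[str]  # v[j] = chosen material, e.g. "I4" = 4th statement of the insertion set

    def __post_init__(self) -> None:
        if len(self.b) != len(self.v):
            raise ValueError("b and v must both have length n")

    def edits(self) -> List[Tuple[int, str, int]]:
        """Decode x = (b, v) into (location, operation, 1-based material index) triples."""
        return [
            (j, SET_NAMES[self.v[j][0]], int(self.v[j][1:]))
            for j in range(len(self.b))
            if self.b[j] == 1  # locations with b_j = 0 are left unchanged
        ]

x = Patch(b=[0, 1, 1], v=["R1", "I4", "R2"])
print(x.edits())  # [(1, 'insert', 4), (2, 'replace', 2)]
```

Storing two flat vectors keeps each individual small, and the entry v_j is simply ignored while b_j = 0, so mutation can flip b_j without discarding the chosen material.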

3.3. The Use of NSGA-II

In the proposed method, defect localization first determines the first part b of x, and the values of b are assigned according to suspiciousness: positions with higher suspiciousness are more likely to be assigned 1, and positions with lower suspiciousness are more likely to be assigned 0. The second part v is then determined from b by randomly selecting a repair statement from the repair material set of each suspicious location. Once both parts are determined, an initial variant is obtained. This process is repeated until N variants form the initial population of the first generation.
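The initialization step can be sketched as follows. The text only states that higher suspiciousness makes b_j = 1 more likely; setting P(b_j = 1) to the suspiciousness divided by the maximum suspiciousness is an assumption made here for illustration, as is the string encoding of materials. Each location's material list is assumed to be non-empty.

```python
import random
from typing import List, Tuple

Variant = Tuple[List[int], List[str]]  # (b, v) as in Section 3.2

def init_population(susp: List[float], materials: List[List[str]], N: int, seed: int = 0) -> List[Variant]:
    """Build N initial variants from suspiciousness scores and per-location repair materials."""
    rng = random.Random(seed)
    top = max(susp) or 1.0  # guard against all-zero scores
    population: List[Variant] = []
    for _ in range(N):
        # Assumption: P(b_j = 1) = susp_j / max(susp), so the most suspicious location is always edited.
        b = [1 if rng.random() < s / top else 0 for s in susp]
        # Pick one random statement from each location's (non-empty) material set.
        v = [rng.choice(m) for m in materials]
        population.append((b, v))
    return population

pop = init_population([0.9, 0.0, 0.45], [["I1", "R1"], ["R2"], ["I3", "R3"]], N=40)
print(pop[0])
```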

This paper formulates the problem as a multi-objective search problem. The fitness of each variant in the population is assessed with two objective functions, F_{1} and F_{2}, which relate to the number of passed test cases and the patch size, respectively; both are to be minimized.

In Formula (4), T_{N} and T_{S} denote the sets of originally failing and originally passing test cases in the test suite, respectively. T_{n} is the subset of T_{N} that still fails on the patched program, T_{s} is the subset of T_{S} that fails on the patched program, and ω \in (0, 1] is a global parameter that weights the two terms; choosing ω < 1 biases the search toward fixing the originally failing tests. Formula (5) gives the size of patch x, i.e., the number of edit operations it applies.

F_{1}(x) = \frac{|T_{n}|}{|T_{N}|} + ω \times \frac{|T_{s}|}{|T_{S}|}

(4)

F_{2}(x) = \sum_{i = 1}^{n} b_{i}

(5)
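Formulas (4) and (5) can be computed from test outcomes as below. The function names and the set-based inputs are illustrative; ω defaults to 0.5, the value used in Section 4.1, and returning 0 for an empty test set is an assumption.

```python
from typing import Sequence, Set

def f1(T_N: Set[str], T_S: Set[str], failing_after: Set[str], omega: float = 0.5) -> float:
    """Formula (4).

    T_N: tests failing on the original program; T_S: tests passing on it;
    failing_after: tests that fail on the patched program.
    """
    if not 0.0 < omega <= 1.0:
        raise ValueError("omega must lie in (0, 1]")
    T_n = T_N & failing_after  # originally failing tests that still fail
    T_s = T_S & failing_after  # originally passing tests that the patch breaks
    neg = len(T_n) / len(T_N) if T_N else 0.0
    pos = len(T_s) / len(T_S) if T_S else 0.0
    return neg + omega * pos

def f2(b: Sequence[int]) -> int:
    """Formula (5): patch size = number of edited locations."""
    return sum(b)

print(f1({"n1", "n2"}, {"s1", "s2", "s3", "s4"}, {"n2", "s1"}))  # 0.625
print(f2([0, 1, 1, 0]))  # 2
```

In the example, one of two originally failing tests still fails (0.5) and one of four originally passing tests is broken (0.25, weighted by ω = 0.5), giving F1 = 0.625.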

Building on this formulation, NSGA-II [19] is used as the multi-objective genetic search algorithm. First, the population is initialized with the strategy described at the beginning of this section. The algorithm then iterates until the maximum number of generations is reached or the maximum repair time is exceeded. In each generation, a child population is generated using binary tournament selection followed by crossover and mutation, and the N best solutions are selected from the combined parent and child populations using fast non-dominated sorting and crowding distance comparison. These N solutions constitute the next generation. Finally, the non-dominated solutions with F_{1} = 0 are output, because only a patch that passes all test cases (F_{1} = 0) can be considered a correct patch. Solutions with F_{2} = 0 are discarded, because F_{2} = 0 means that the patch makes no change to the source code and is therefore meaningless.
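The environmental selection step of NSGA-II, together with the final output rule, can be sketched as follows. This shows only fast non-dominated sorting, crowding distance, and truncation of the combined 2N population to N; binary tournament selection, crossover, and mutation are omitted. Objectives are (F1, F2) tuples to be minimized, and the function names are illustrative.

```python
from typing import Dict, List, Sequence

Obj = Sequence[float]  # (F1, F2), both minimized

def dominates(p: Obj, q: Obj) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def fast_non_dominated_sort(objs: List[Obj]) -> List[List[int]]:
    """Return fronts as lists of indices; fronts[0] is the non-dominated set."""
    n = len(objs)
    dominated = [[] for _ in range(n)]  # indices each solution dominates
    counts = [0] * n                    # how many solutions dominate each one
    fronts: List[List[int]] = [[]]
    for p in range(n):
        for q in range(n):
            if dominates(objs[p], objs[q]):
                dominated[p].append(q)
            elif dominates(objs[q], objs[p]):
                counts[p] += 1
        if counts[p] == 0:
            fronts[0].append(p)
    i = 0
    while fronts[i]:
        nxt = []
        for p in fronts[i]:
            for q in dominated[p]:
                counts[q] -= 1
                if counts[q] == 0:
                    nxt.append(q)
        i += 1
        fronts.append(nxt)
    return fronts[:-1]  # drop the trailing empty front

def crowding_distance(objs: List[Obj], front: List[int]) -> Dict[int, float]:
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[front[0]])):
        ordered = sorted(front, key=lambda i: objs[i][m])
        lo, hi = objs[ordered[0]][m], objs[ordered[-1]][m]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")  # always keep boundary points
        if hi == lo:
            continue
        for k in range(1, len(ordered) - 1):
            dist[ordered[k]] += (objs[ordered[k + 1]][m] - objs[ordered[k - 1]][m]) / (hi - lo)
    return dist

def select_survivors(objs: List[Obj], N: int) -> List[int]:
    """Truncate the combined parent + child population to N indices."""
    chosen: List[int] = []
    for front in fast_non_dominated_sort(objs):
        if len(chosen) + len(front) <= N:
            chosen.extend(front)
        else:
            dist = crowding_distance(objs, front)  # prefer less crowded solutions
            chosen.extend(sorted(front, key=lambda i: -dist[i])[: N - len(chosen)])
            break
    return chosen

def final_patches(objs: List[Obj]) -> List[int]:
    """Output rule: non-dominated solutions with F1 = 0, excluding empty patches (F2 = 0)."""
    fronts = fast_non_dominated_sort(objs)
    return [i for i in (fronts[0] if fronts else []) if objs[i][0] == 0 and objs[i][1] > 0]

objs = [(0.0, 2), (0.5, 1), (0.0, 3), (1.0, 0), (0.6, 1)]
print(fast_non_dominated_sort(objs))  # [[0, 1, 3], [2, 4]]
print(final_patches(objs))  # [0]
```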

3.4. Test Case Filtering

The patch validation process requires executing a large number of test cases, which greatly increases the computational load of defect repair. Therefore, this paper introduces a test case filtering technique that removes most test cases before the generated patches are validated, significantly accelerating the validation of candidate patches [27]. The method is illustrated in Figure 5. Test cases are divided into successful and failed test cases: successful test cases pass on the defective program, whereas failed test cases do not. For each successful test case, all code lines covered during its execution are recorded. If none of these lines is related to the selected potentially defective statements, the test case is filtered out, because edits at those statements are unlikely to affect its outcome. This strategy significantly speeds up the validation of candidate patches.

Figure 5. Test Case Filtering Framework Diagram.
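The filtering rule for successful test cases reduces to a set intersection between each test's recorded coverage and the suspicious lines. In the sketch below, line identifiers such as `A:11` (file and line) and the function name `filter_tests` are assumptions; originally failing tests are always kept, since only successful test cases are candidates for filtering.

```python
from typing import Dict, Set

def filter_tests(passing_coverage: Dict[str, Set[str]], failing: Set[str], suspicious: Set[str]) -> Set[str]:
    """Tests to run when validating a candidate patch.

    passing_coverage maps each originally passing test to the lines it executed;
    a passing test is kept only if it covers at least one suspicious line.
    """
    kept = {t for t, lines in passing_coverage.items() if lines & suspicious}
    return kept | failing  # originally failing tests are never filtered out

cov = {"s1": {"A:10", "A:11"}, "s2": {"B:5"}, "s3": {"A:11", "B:7"}}
print(sorted(filter_tests(cov, {"n1"}, {"A:11"})))  # ['n1', 's1', 's3']
```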

This technique is applied when evaluating the fitness of offspring patches, filtering out most test cases to speed up patch evolution. However, test case filtering can also affect the accuracy of patch validation [28]. In the experimental evaluation, this paper analyzes the uncertainty introduced by test case filtering to demonstrate the reliability of the technique.

4. Experimental Evaluation

4.1. Experimental Setup

To make the experiments convincing, a high-quality defect dataset is needed to evaluate automated defect repair methods. Defects4J [20] is one of the most popular defect datasets for evaluating the repair of Java programs: it is built from real, large-scale open-source Java projects and covers a wide variety of defect types. Because it is easy to use, has a reliable framework, and provides corresponding test cases for each defect, Defects4J has become the benchmark dataset for the experimental evaluation of most automated defect repair methods. This section therefore uses Defects4J, which allows the repair results to be compared directly with those of other repair methods. As shown in Table 3, we selected six of the most frequently used projects, Chart, Closure, Lang, Math, Mockito, and Time, with a total of 395 defects. This section evaluates the repair method experimentally in terms of the number of repaired defects, the time required for repair, and other factors. We chose several popular repair methods, such as GenProg, SimFix, ARJA, and Nopol, for comparison with MGPRepair.

Table 3. Defects4J Dataset.

To keep the experiments within an acceptable budget, the initial population size N is set to 40, the maximum number of generations to 50, the crossover probability to 1, and the mutation probability to 0.1. The maximum number of modifications per variant is set to n = 100. Numerous experiments showed that a minimum suspiciousness threshold of 0.2 detects defects effectively while balancing false positives and false negatives. Lowering the threshold may produce more false positives, i.e., locations that are unlikely to be defective are also marked as suspicious, increasing the workload of later processing; raising it may cause real defects to be missed, hurting repair effectiveness. Setting ω to 0.5 balances the importance of the two terms in Formula (4). Changing ω changes the relative weights of these terms during optimization and thus affects the selection decisions made by NSGA-II.
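The settings above can be collected into a single configuration object. This `RepairConfig` dataclass is a hypothetical sketch, not part of MGPRepair; it also carries the thresholds β1 and β2 from Section 3.1 and validates the ranges stated in the text (ω ∈ (0, 1], probabilities in [0, 1]).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RepairConfig:
    """Hypothetical container for the experimental settings of Section 4.1."""
    population_size: int = 40       # N
    max_generations: int = 50
    crossover_prob: float = 1.0
    mutation_prob: float = 0.1
    max_modifications: int = 100    # n, maximum modifications per variant
    min_suspiciousness: float = 0.2
    omega: float = 0.5              # weight in Formula (4)
    beta1: float = 0.5              # replacement threshold (Section 3.1)
    beta2: float = 0.7              # insertion threshold (Section 3.1)

    def __post_init__(self) -> None:
        for name in ("crossover_prob", "mutation_prob", "min_suspiciousness"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must be in [0, 1]")
        if not 0.0 < self.omega <= 1.0:
            raise ValueError("omega must be in (0, 1]")

cfg = RepairConfig()
print(cfg.population_size, cfg.omega)
```

Freezing the dataclass keeps a run's settings immutable once the search starts.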

4.2. Experimental Results and Analysis

To evaluate MGPRepair experimentally, six automated defect repair methods were selected for comparison: jGenProg [29], RSRepair [6], ARJA [7], Nopol [30], CapGen [31], and SequenceR [32]. jGenProg is the Java version of the classic repair method GenProg; RSRepair, CapGen, and ARJA are among the most advanced search-based repair methods; Nopol is one of the most advanced methods based on semantic constraints; and SequenceR is one of the most popular methods based on machine learning. These methods were reproduced and run on the 395 defects of Defects4J, and the number of successfully repaired defects and the time used for each successful repair were recorded. Ablation experiments were then performed to demonstrate the improvements that the optimizations in MGPRepair bring to genetic-search-based repair. The specific details are as follows:

(1): Research Question 1: How much does MGPRepair improve the defect repair success rate compared with other state-of-the-art automated defect repair methods?

For this question, the six representative repair methods above were reproduced and compared with MGPRepair. The results are shown in Table 4.

Table 4. Comparison of Repair Outcomes between MGPRepair and Other Six Automated Defect Repair Methods. p (%) refers to the defect repair rate on Defects4J.

Table 4 presents the repair results of MGPRepair and the six other methods on Defects4J, listing the number of defects each method repaired in each project together with the total number and proportion repaired across all six projects. The 395 defects come from six widely used projects: Chart, Closure, Lang, Math, Time, and Mockito. MGPRepair generated credible (test-passing) patches for 8, 3, 10, 27, 1, and 2 defects in these projects, respectively, 51 in total, or 12.91% of all defects. This is 24, 16, 26, 30, 7, and 19 more defects than jGenProg, Nopol, Capgen, SequenceR, ARJA, and RSRepair, respectively. MGPRepair achieved the highest number and proportion of repaired defects overall, and the largest per-project count in Lang and Math; in Chart it ties with ARJA at 8. In Time and Mockito it matched the best of the other six methods, and only in Closure did it repair fewer defects (2 fewer than SequenceR). These results indicate that MGPRepair has a higher repair success rate than the other six methods. We attribute this to three factors. First, search-based repair is currently one of the most mature approaches in the field. Second, MGPRepair filters and classifies repair materials before the genetic programming search, which yields a higher-quality initial population and reduces the number of iterations required. Third, it uses NSGA-II multi-objective evolution to select offspring variants. Together, these give MGPRepair a higher success rate in patch generation than the other six methods.
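The repair rates in Table 4 and the margins quoted above follow directly from the per-project counts. The short script below recomputes them as an arithmetic check; the counts are transcribed from Table 4 and the variable names are illustrative. Recomputing ARJA's rate in this way gives 44/395 ≈ 11.14%.

```python
# Per-project counts transcribed from Table 4, in the order
# Chart, Closure, Lang, Math, Time, Mockito.
TOTAL_DEFECTS = 395
RESULTS = {
    "jGenProg":  [7, 0, 0, 18, 0, 2],
    "Nopol":     [6, 0, 7, 21, 0, 1],
    "Capgen":    [4, 0, 5, 16, 0, 0],
    "SequenceR": [3, 5, 2, 10, 1, 0],
    "ARJA":      [8, 2, 8, 23, 1, 2],
    "RSRepair":  [7, 1, 6, 17, 0, 1],
    "MGPRepair": [8, 3, 10, 27, 1, 2],
}


def repair_rate(counts, total=TOTAL_DEFECTS):
    """Percentage of benchmark defects with a credible (test-passing) patch."""
    return round(100 * sum(counts) / total, 2)


mgp_total = sum(RESULTS["MGPRepair"])
for tool, counts in RESULTS.items():
    # Margin = how many more defects MGPRepair repaired than this tool.
    print(f"{tool:<10} {sum(counts):>3} defects  p = {repair_rate(counts):.2f}%  "
          f"MGPRepair margin = {mgp_total - sum(counts)}")
```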

However, a credible patch that passes all test cases is not necessarily a correct patch. To decide correctness, each credible patch must be compared with the corresponding manually written patch, and it is considered correct only when the two are fully interchangeable, that is, equivalent in both syntax and semantics. We therefore manually compared the patches generated by MGPRepair with those generated by the similar genetic programming-based methods jGenProg and ARJA, to determine how many truly correct patches each method produces for the defects of each project.

Table 5 lists the IDs of the defects in the Chart, Closure, Lang, Math, Time, and Mockito projects that MGPRepair, jGenProg, and ARJA repaired correctly; every patch in the table was verified manually. MGPRepair generated correct patches for 35 defects, jGenProg for 17, and ARJA for 25. In every project where any of the three methods succeeded, MGPRepair repaired the largest number of defects. Its correct repair rate is 8.86%, approximately 2.06 times that of jGenProg and 1.4 times that of ARJA. It also correctly repaired one defect each in the Time and Closure projects (T4 and Clo5), which neither of the other two methods could repair. Moreover, the defects repaired by MGPRepair cover almost all of those repaired by the other two methods. None of the three methods correctly repaired any defect in the Mockito project. This may be due to the particular nature of the code in that project: the search space did not contain the repair materials needed for these defects, making it difficult to generate correct patches.

Table 5. Comparison of Repair Outcomes between MGPRepair, jGenProg, and ARJA.

Figure 6 presents a Venn diagram comparing the correct repairs of MGPRepair, jGenProg, and ARJA. MGPRepair repairs most of the defects that jGenProg and ARJA can repair, whereas jGenProg and ARJA each repair only a few defects that MGPRepair misses (3 and 4, respectively, according to Table 5). Only MGPRepair repaired defects in the Time and Closure projects, one in each. In total, 9 defects were repaired by MGPRepair alone, indicating that, on this benchmark, the other two methods cover a noticeably narrower set of defects than MGPRepair. In summary, MGPRepair achieves significant improvements in search-based automated defect repair, particularly in the genetic programming approach, by optimizing the initial population and the selection of offspring patches. To some extent, it alleviates the slow search caused by a poor-quality initial population, and replacing single-objective search with multi-objective search accelerates patch generation and makes better solutions easier to find, which gives MGPRepair an advantage in repairing Defects4J defects.
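The overlaps behind Figure 6 can be recomputed from the defect IDs in Table 5 with ordinary set operations. The sketch below does this in Python; the variable names are illustrative and the ID sets are transcribed from Table 5. It identifies the 9 defects repaired only by MGPRepair (L45, L46, L63, M64, M78, M80, M81, T4, and Clo5) and the defects that each of the other two methods repairs but MGPRepair does not.

```python
# Correctly repaired defect IDs transcribed from Table 5.
mgp = {"C1", "C3", "C5", "C13", "C15", "C19", "C25",
       "L7", "L43", "L45", "L46", "L51", "L61", "L63",
       "M2", "M8", "M20", "M28", "M32", "M40", "M49", "M50", "M53", "M60",
       "M64", "M73", "M74", "M78", "M80", "M81", "M82", "M84", "M95",
       "T4", "Clo5"}
jgenprog = {"C3", "C7", "C13", "C15", "C25",
            "M2", "M8", "M28", "M40", "M49", "M50", "M70", "M73", "M82",
            "M84", "M85", "M95"}
# Note: 26 IDs are listed for ARJA here, while Table 5's Total row gives 25.
arja = {"C1", "C5", "C13", "C15", "C19",
        "L7", "L43", "L51", "L55", "L59", "L61",
        "M2", "M20", "M28", "M32", "M39", "M40", "M49", "M50", "M53", "M60",
        "M73", "M74", "M82", "M84", "M85"}

only_mgp = mgp - jgenprog - arja     # repaired by MGPRepair alone
jgenprog_not_mgp = jgenprog - mgp    # jGenProg repairs that MGPRepair lacks
arja_not_mgp = arja - mgp            # ARJA repairs that MGPRepair lacks
shared_by_all = mgp & jgenprog & arja

print(sorted(only_mgp))
print(sorted(jgenprog_not_mgp), sorted(arja_not_mgp))
print(sorted(shared_by_all))
```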

Figure 6. Venn Diagram Showing the Repaired Patches by Three Different Repair Methods.

(2): Research Question 2: How does the repair time of MGPRepair compare with that of other state-of-the-art automatic defect repair methods?

Time is an important metric for evaluating an automatic defect repair method: the time needed to generate a correct patch should fall within a range that developers find acceptable, and excessively long or unpredictable repair times are of little practical use. This research question therefore analyzes the repair time of four methods: MGPRepair, ARJA, RSRepair, and jGenProg. jGenProg is the most classic genetic programming-based repair method, ARJA is a relatively advanced one, and RSRepair uses the genetic programming framework but omits fitness-based offspring selection. Repair methods of other types were excluded because repair time varies greatly across categories. For example, machine learning-based methods spend a significant amount of time training a model beforehand, while the repair step itself may be short, so a direct time comparison with them would not be like-for-like.

We measured the time each method required to successfully repair each defect. Each repairable defect was repaired 10 times, and the average was used as the comparison value. Table 6 reports, for each of the four methods, the minimum, median, maximum, and average repair time. Of these, the median and average best reflect typical performance. MGPRepair has a median repair time of 3.49 min and an average of 9.91 min, both the shortest of the four methods. Its minimum time to generate a patch is 0.71 min and its maximum is 58.35 min, the lowest maximum of the four. These statistics show that the time MGPRepair needs to repair defects is within an acceptable range.

Table 6. Time Consumption Comparison of MGPRepair and Other Genetic Programming-Based Repair Methods.

To show the distribution of repair times in more detail, the data are summarized as a box plot in Figure 7. The vertical axis shows the repair time and the horizontal axis the four methods; the small white square marks the mean, and the line inside each box marks the median. Besides the mean and median, the figure also shows the upper and lower quartiles of each method's repair time. Both quartiles of MGPRepair are lower than those of the other three methods, and its narrow yellow box indicates that its repair time fluctuates relatively little, that is, it is more stable.
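The statistics in Table 6 and the quartiles drawn in Figure 7 can be derived from raw timings along the following lines. This is a sketch under stated assumptions: it assumes per-run wall-clock times in minutes are available for every repaired defect, the function names and input layout are hypothetical, and the quartiles use Python's default "exclusive" interpolation, which may differ from the method used by the plotting tool behind Figure 7.

```python
import statistics
from typing import Dict, List


def per_defect_times(runs_by_defect: Dict[str, List[float]]) -> List[float]:
    """Average the repeated runs of each repaired defect (10 runs in the paper)."""
    return [statistics.mean(runs) for runs in runs_by_defect.values()]


def summarize(times: List[float]) -> Dict[str, float]:
    """Table 6 statistics plus the box-plot quartiles, in minutes."""
    q1, _, q3 = statistics.quantiles(times, n=4)  # default "exclusive" method
    return {
        "min": min(times),
        "q1": q1,
        "median": statistics.median(times),
        "q3": q3,
        "max": max(times),
        "mean": statistics.mean(times),
    }


# Hypothetical timings (two runs per defect for brevity).
runs = {"d1": [1.0, 3.0], "d2": [2.0, 2.0], "d3": [5.0, 7.0],
        "d4": [10.0, 10.0], "d5": [4.0, 4.0]}
print(summarize(per_defect_times(runs)))
# {'min': 2.0, 'q1': 2.0, 'median': 4.0, 'q3': 8.0, 'max': 10.0, 'mean': 4.8}
```

The mean (4.8) sits above the median (4.0) because one slow defect pulls it up; the same effect explains why MGPRepair's average in Table 6 (9.91 min) is far larger than its median (3.49 min).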

Figure 7. Box Plot of Time Consumption Comparison between MGPRepair and Other Genetic Programming-Based Repair Methods.

In summary, MGPRepair requires less repair time than both the classic and the most advanced methods in the same category. The main reason is that it optimizes the initial population and changes how offspring patches are selected, which speeds up the genetic search. In addition, test case filtering accelerates the validation of offspring and saves a considerable amount of time.
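The test case filtering framework (Figure 5) is not detailed in this section, so the sketch below shows one common way such filtering is done in generate-and-validate repair, offered as an assumption rather than as MGPRepair's actual procedure. The function name, parameters, and the `fake_run` stub are hypothetical; it assumes deterministic tests and per-test statement coverage recorded on the original program.

```python
from typing import Callable, Dict, Iterable, Set


def validate_patch(
    failing_tests: Iterable[str],
    passing_tests: Iterable[str],
    coverage: Dict[str, Set[str]],
    modified: Set[str],
    run_test: Callable[[str], bool],
) -> bool:
    """Check a candidate patch while skipping tests that cannot reach the edit.

    failing_tests: tests that failed on the original (buggy) program
    passing_tests: tests that passed on the original program
    coverage: test name -> statement ids it executed on the original program
    modified: statement ids changed by the patch
    run_test: runs one test against the patched program, True if it passes
    """
    # Run the originally failing tests first: most candidate patches fail here,
    # so the larger regression suite is never executed for them.
    for test in failing_tests:
        if not run_test(test):
            return False
    # Re-run only the passing tests whose coverage intersects the edited
    # statements; a test that never reaches an edit is assumed unaffected.
    for test in passing_tests:
        if coverage.get(test, set()) & modified and not run_test(test):
            return False
    return True


executed = []


def fake_run(test: str) -> bool:
    """Hypothetical test runner stub that records calls and always passes."""
    executed.append(test)
    return True


cov = {"t1": {"s1"}, "t2": {"s2"}, "t3": {"s1", "s3"}}
ok = validate_patch(["f1"], ["t1", "t2", "t3"], cov, {"s1"}, fake_run)
print(ok, executed)  # True ['f1', 't1', 't3']
```

Test `t2` is skipped because it never executes the modified statement `s1`; on a large suite such as Closure's (7927 tests), skipping unaffected tests is where most of the saving would come from.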

(3): Research Question 3: How much does each technique used in MGPRepair contribute to its automated defect repair performance?

MGPRepair uses a lightweight context filtering approach and further classifies and filters the repair materials to generate a higher-quality initial population. During the genetic search, NSGA-II is used to find patches that pass more test cases with fewer changes, and test case filtering is applied when selecting offspring patches to speed up selection. To investigate whether, and to what extent, each of these techniques contributes, we designed and implemented three variants of MGPRepair: MGPRepair-A, MGPRepair-B, and MGPRepair-C. The variants share MGPRepair's basic framework, but each omits one of its techniques. As detailed in Table 7, which lists the three most important techniques of MGPRepair and marks those used by each variant, MGPRepair-A does not use test case filtering, MGPRepair-B does not optimize the initial population, and MGPRepair-C does not use the multi-objective genetic algorithm.
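NSGA-II ranks candidate patches by Pareto dominance rather than by a single fitness value. The sketch below implements its two ranking steps as described by Deb et al. [19], fast non-dominated sorting and crowding distance, for objective vectors whose components are all minimized. Treating the objectives as (number of failed test cases, number of edits in the patch) is an assumption made for illustration; the paper's exact objectives, including the weight ω of Formula (4), are not reproduced here, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Objectives = Tuple[float, ...]  # every objective is minimized


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_sort(pop: Sequence[Objectives]) -> List[List[int]]:
    """Fast non-dominated sorting; returns Pareto fronts as lists of indices."""
    dominated: List[List[int]] = [[] for _ in pop]  # individuals that p dominates
    dom_count = [0] * len(pop)                      # individuals dominating p
    fronts: List[List[int]] = [[]]
    for p, fp in enumerate(pop):
        for q, fq in enumerate(pop):
            if dominates(fp, fq):
                dominated[p].append(q)
            elif dominates(fq, fp):
                dom_count[p] += 1
        if dom_count[p] == 0:
            fronts[0].append(p)
    i = 0
    while fronts[i]:
        next_front = []
        for p in fronts[i]:
            for q in dominated[p]:
                dom_count[q] -= 1
                if dom_count[q] == 0:
                    next_front.append(q)
        fronts.append(next_front)
        i += 1
    return fronts[:-1]  # the last front is always empty


def crowding_distance(pop: Sequence[Objectives], front: List[int]) -> Dict[int, float]:
    """Crowding distance of each index in one front; extreme points get infinity."""
    dist = {i: 0.0 for i in front}
    if not front:
        return dist
    for m in range(len(pop[front[0]])):
        ordered = sorted(front, key=lambda i: pop[i][m])
        lo, hi = pop[ordered[0]][m], pop[ordered[-1]][m]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for k in range(1, len(ordered) - 1):
            gap = pop[ordered[k + 1]][m] - pop[ordered[k - 1]][m]
            dist[ordered[k]] += gap / (hi - lo)
    return dist


# Hypothetical patches as (failed test cases, number of edits).
patches = [(0, 3), (0, 1), (2, 1), (1, 2), (3, 0)]
print(non_dominated_sort(patches))  # [[1, 4], [0, 2, 3]]
```

In NSGA-II, the next generation is filled front by front, and crowding distance breaks ties inside the last front that fits, favouring patches in less crowded regions of the trade-off between passing tests and patch size.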

Table 7. Techniques Used in MGPRepair and Its Three Variants.

Next, an ablation study was conducted on the Defects4J dataset using the three variants to verify the contribution of each technique. To reduce manual effort, only credible patches (those passing the test suite) were counted; they were not compared with the manual patches. As shown in Figure 8, MGPRepair-A, MGPRepair-B, and MGPRepair-C generated credible patches for 48, 39, and 45 defects, respectively, that is, 3, 12, and 6 fewer than MGPRepair. All three techniques therefore contribute to the repair success rate, with initial population optimization contributing the most. We also recorded the time taken for these successful repairs: the average time for MGPRepair-A, MGPRepair-B, and MGPRepair-C to generate a credible patch was 10.21 min, 13.59 min, and 12.81 min, respectively, compared with MGPRepair's 9.91 min average in Table 6. Each technique thus also reduces repair time: initial population optimization gives the largest reduction (3.68 min), the multi-objective genetic algorithm the second largest (2.90 min), and test case filtering the smallest (0.30 min). The ablation study shows that every technique used in MGPRepair improves both the success rate and the time efficiency of defect repair.
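The per-technique effects quoted above are simple differences between each variant and the full method. The script below computes them as an arithmetic check; the patch counts and times are those reported in this section, MGPRepair's 9.91 min is taken from Table 6 on the assumption that it is the appropriate baseline for Figure 8, and the names are illustrative.

```python
from typing import Dict, Tuple

# Credible patches and average minutes per credible patch.
FULL = {"patches": 51, "avg_min": 9.91}
VARIANTS = {
    "MGPRepair-A": {"patches": 48, "avg_min": 10.21},  # no test case filtering
    "MGPRepair-B": {"patches": 39, "avg_min": 13.59},  # no initial population optimization
    "MGPRepair-C": {"patches": 45, "avg_min": 12.81},  # no multi-objective search
}


def ablation_deltas(full, variants) -> Dict[str, Tuple[int, float]]:
    """(patches lost, extra minutes per patch) when one technique is removed."""
    return {
        name: (full["patches"] - v["patches"], round(v["avg_min"] - full["avg_min"], 2))
        for name, v in variants.items()
    }


for name, (lost, extra) in ablation_deltas(FULL, VARIANTS).items():
    print(f"{name}: {lost} fewer credible patches, +{extra:.2f} min per patch")
```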

Figure 8. Number of Defects Repaired and Time Consumed by MGPRepair and Its Three Variants.

5. Conclusions

In the search process of genetic programming, offspring patches are typically validated and selected using only the number of passing test cases. Without further constraints to guide the selection of better offspring, this lowers the efficiency of patch generation and leads to poor search performance. This paper therefore proposes MGPRepair, an automatic software defect repair method based on multi-objective genetic programming. First, the method applies a lightweight context analysis strategy to potential defect locations and uses classification rules to decouple the replacement and insertion statements in the repair materials, which filters out some low-quality repair materials. Second, a novel, lower-granularity encoding is used to re-encode the patches in the search space, allowing a more thorough genetic programming search. Automatic defect repair is formulated as a multi-objective search problem, and the multi-objective optimization algorithm NSGA-II is used to find repairs that pass more test cases while being simpler. Finally, to reduce computational cost, a test case filtering and patch overfitting detection process is introduced, which accelerates fitness evaluation in genetic programming and reduces validation time. Evaluated on the real-world defect benchmark Defects4J, MGPRepair generated test-passing patches for 51 defects, 35 of which are correct, and repaired defects faster than the compared genetic programming-based methods, which to some extent addresses the problem of offspring selection.

Author Contributions

Conceptualization, T.H.; writing—original draft preparation, T.H.; writing—review and editing, Y.C.; validation, F.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

N	Number of Variants
$T_{S}$	Successful Test Cases
$T_{N}$	Failed Test Cases
$V_{S}$	Set of Variables for Method S
$V_{L B S}$	Set of Variables for Method LBS
$V_{B}$	Set of Variables In the First n Lines of LBS Code
$V_{A}$	Set of Variables In the Last n Lines of LBS Code
p	vectors
S	Statement in program
q	vectors
n	The Number of Objective Functions

References

[1] Monperrus, M. Automatic software repair: A bibliography. ACM Comput. Surv. (CSUR) 2018, 51, 1–24.
[2] Britton, T.; Jeng, L.; Carver, G.; Cheak, P.; Katzenellenbogen, T. Reversible Debugging Software: Quantify the Time and Cost Saved Using Reversible Debuggers; University of Cambridge: Cambridge, UK, 2013.
[3] Weimer, W.; Forrest, S.; Le Goues, C.; Nguyen, T. Automatic program repair with evolutionary computation. Commun. ACM 2010, 53, 109–116.
[4] Le Goues, C.; Nguyen, T.; Forrest, S.; Weimer, W. GenProg: A generic method for automatic software repair. IEEE Trans. Softw. Eng. 2012, 38, 54–72.
[5] Ghanbari, A.; Zhang, L.M. PraPR: Practical program repair via bytecode mutation. In Proceedings of the 2019 34th IEEE/ACM International Conference on Automated Software Engineering, San Diego, CA, USA, 11–15 November 2019; pp. 1118–1121.
[6] Qi, Y.; Mao, X.; Lei, Y.; Dai, Z.; Wang, C. The strength of random search on automated program repair. In Proceedings of the 36th International Conference on Software Engineering (ICSE), Hyderabad, India, 31 May–7 June 2014; pp. 254–265.
[7] Yuan, Y.; Banzhaf, W. ARJA: Automated repair of Java programs via multi-objective genetic programming. IEEE Trans. Softw. Eng. 2018, 46, 1040–1067.
[8] Yuan, Y.; Banzhaf, W. A hybrid evolutionary system for automatic software repair. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), Prague, Czech Republic, 13–17 July 2019; pp. 1417–1425.
[9] Yuan, Y.; Banzhaf, W. Toward better evolutionary program repair: An integrated approach. ACM Trans. Softw. Eng. Methodol. 2020, 29, 1–53.
[10] Sun, S.; Guo, J.; Zhao, R.; Li, Z. Search-based efficient automated program repair using mutation and fault localization. In Proceedings of the 2018 IEEE 42nd Annual Computer Software and Applications Conference, Tokyo, Japan, 23–27 July 2018; pp. 174–183.
[11] Kim, J.; Kim, S. Automatic patch generation with context-based change application. Empir. Softw. Eng. 2019, 24, 4071–4106.
[12] Li, D.; Wong, W.E.; Jian, M.; Geng, Y.; Chau, M. Improving search-based automatic program repair with Neural Machine Translation. IEEE Access 2022, 10, 51167–51175.
[13] Xin, Q.; Reiss, S.P. Leveraging syntax-related code for automated program repair. In Proceedings of the 2017 32nd IEEE/ACM International Conference on Automated Software Engineering, Urbana, IL, USA, 30 October–3 November 2017; pp. 660–670.
[14] Jiang, J.; Xiong, Y.; Zhang, H.; Gao, Q.; Chen, X. Shaping program repair space with existing patches and similar code. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, Amsterdam, The Netherlands, 16–24 July 2018; pp. 298–309.
[15] Cao, H.; Liu, F.; Shi, J.; Chu, Y.; Deng, M. Random search and code similarity-based automatic program repair. J. Shanghai Jiaotong Univ. (Sci.) 2023, 28, 738–752.
[16] Hu, Y.; Ahmed, U.Z.; Mechtaev, S.; Leong, B.; Roychoudhury, A. Re-factoring based program repair applied to programming assignments. In Proceedings of the 2019 34th IEEE/ACM International Conference on Automated Software Engineering, San Diego, CA, USA, 11–15 November 2019; pp. 388–398.
[17] Trujillo, L.; Villanueva, O.M.; Hernandez, D.E. A novel approach for search-based program repair. IEEE Softw. 2021, 38, 36–42.
[18] Zhang, Q.; Fang, C.; Ma, Y.; Sun, W.; Chen, Z. A Survey of Learning-based Automated Program Repair. ACM Trans. Softw. Eng. Methodol. 2023, 33, 1–69.
[19] Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197.
[20] Just, R.; Jalali, D.; Ernst, M.D. Defects4J: A database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA), San Jose, CA, USA, 21–25 July 2014; pp. 437–440.
[21] Sánchez-García, A.; Loaiza-Meseguer, L.; Ocharán-Hernández, J.O.; Pérez-Arriaga, J.C. Genetic Programming in Software Engineering: A Systematic Literature Review. Int. J. Comb. Optim. Probl. Inform. 2023, 14, 61.
[22] Augusto, O.B.; Bennis, F.; Caro, S. A new method for decision making in multi-objective optimization problems. Pesqui. Oper. 2012, 32, 331–369.
[23] Brunelli, M.; Fedrizzi, M. Inconsistency indices for pairwise comparisons and the Pareto dominance principle. Eur. J. Oper. Res. 2024, 312, 273–282.
[24] Zhou, A.; Qu, B.Y.; Li, H.; Zhao, S.Z.; Suganthan, P.N.; Zhang, Q. Multiobjective evolutionary algorithms: A survey of the state of the art. Swarm Evol. Comput. 2011, 1, 32–49.
[25] Abreu, R.; Zoeteweij, P.; Van Gemund, A.J.C. An evaluation of similarity coefficients for software fault localization. In Proceedings of the 2006 12th Pacific Rim International Symposium on Dependable Computing (PRDC), Riverside, CA, USA, 18–20 December 2006; pp. 39–46.
[26] Abreu, R.; Zoeteweij, P.; Van Gemund, A.J.C. On the accuracy of spectrum-based fault localization. In Proceedings of the Testing: Academic and Industrial Conference Practice and Research Techniques-MUTATION (TAICPART-MUTATION), Windsor, UK, 10–14 September 2007; pp. 89–98.
[27] Gao, X.; Mechtaev, S.; Roychoudhury, A. Crash-avoiding program repair. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), Beijing, China, 15–19 July 2019; pp. 8–18.
[28] Wloka, J.; Hoest, E.; Ryder, B.G. Tool support for change-centric test development. IEEE Softw. 2009, 27, 66–71.
[29] Martinez, M.; Durieux, T.; Sommerard, R.; Xuan, J.; Monperrus, M. Automatic repair of real bugs in Java: A large-scale experiment on the Defects4J dataset. Empir. Softw. Eng. 2017, 22, 1936–1964.
[30] Xuan, J.; Martinez, M.; Demarco, F.; Clement, M.; Marcote, S.L.; Durieux, T.; Le Berre, D.; Monperrus, M. Nopol: Automatic repair of conditional statement bugs in Java programs. IEEE Trans. Softw. Eng. 2017, 43, 34–55.
[31] Wen, M.; Chen, J.; Wu, R.; Hao, D.; Cheung, S.C. Context-aware patch generation for better automated program repair. In Proceedings of the 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE), Gothenburg, Sweden, 27 May–3 June 2018; pp. 1–11.
[32] Chen, Z.; Kommrusch, S.; Tufano, M.; Pouchet, L.N.; Poshyvanyk, D.; Monperrus, M. SequenceR: Sequence-to-sequence learning for end-to-end program repair. IEEE Trans. Softw. Eng. 2019, 47, 1943–1959.

Figure 1. Framework Diagram of Multi-Objective Genetic Programming-Based Automatic Software Defect Repair Method.

Figure 2. Real Code Snippet.

Figure 3. Explanation and Illustration of Defect Location b.

Figure 4. Explanation and Illustration of Repair Material Set v.

Figure 5. Test Case Filtering Framework Diagram.

Table 1. Rules for Filtering Components at Each Modification Point.

Number	Rules
1	Break and Continue statements can be used only when the suspicious statement is inside a for, while, switch, or similar block.
2	A Case statement can be used only when the suspicious statement is inside a switch block.
3	A Return statement can be used only when the method containing the suspicious statement declares a return value.
4	A Return statement can be used when the suspicious statement is the last line of its code block.

Table 2. Operations to Disable Certain Specific Rules.

Operation Type	Rule
Insert	Do not insert a declaration statement before a declaration statement.
Insert	Do not insert a Return statement before a statement.
Insert	Do not insert an assignment statement with the same left-hand side before an assignment statement.
Replace	Do not replace a declaration statement with a statement of another type.
Replace	Do not replace the last return statement in a method with a statement of another type.
Replace	Do not use other types of statements to replace if-conditional statements.
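The rules in Table 2 reduce to simple predicates over statement kinds. The sketch below is a hypothetical illustration: the `Stmt` type, its kind labels, and both function names are invented here, and a real implementation would classify statements from the Java AST rather than from string labels. The context rules of Table 1 (break/continue, case, return) would be checked analogously against the enclosing block and method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stmt:
    """Simplified statement: kind is one of "decl", "assign", "return", "if", "expr"."""
    kind: str
    lhs: Optional[str] = None  # assigned variable, for "assign" statements only


def insert_allowed(ingredient: Stmt, target: Stmt) -> bool:
    """Table 2 Insert rules: may `ingredient` be inserted before `target`?"""
    if ingredient.kind == "decl" and target.kind == "decl":
        return False
    if ingredient.kind == "return":
        return False  # code after an inserted return would become unreachable
    if ingredient.kind == target.kind == "assign" and ingredient.lhs == target.lhs:
        return False  # the inserted value is typically overwritten right away
    return True


def replace_allowed(ingredient: Stmt, target: Stmt, is_last_return: bool = False) -> bool:
    """Table 2 Replace rules: may `ingredient` replace `target`?"""
    if target.kind == "decl" and ingredient.kind != "decl":
        return False
    if is_last_return and ingredient.kind != "return":
        return False
    if target.kind == "if" and ingredient.kind != "if":
        return False
    return True


print(insert_allowed(Stmt("assign", "x"), Stmt("assign", "x")))  # False
print(replace_allowed(Stmt("expr"), Stmt("if")))                 # False
```

Applying such checks before the search removes candidate edits that would fail to compile or be trivially redundant, which shrinks the search space without running any tests.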

Table 3. Defects4J Dataset.

Project Abbreviation	Project Name	Number of Defects	Lines of Code (KLOC)	Number of Test Cases
Chart	JFreeChart	26	96	2205
Closure	Closure Compiler	133	90	7927
Lang	Commons Lang	65	22	2245
Math	Commons Math	106	85	3602
Mockito	Mockito	38	11	1457
Time	Joda-Time	27	28	4130
Total	-	395	332	21,566

Table 4. Comparison of Repair Outcomes between MGPRepair and Other Six Automated Defect Repair Methods. p (%) refers to the defect repair rate on Defects4J.

Method	Chart (26)	Closure (133)	Lang (65)	Math (106)	Time (27)	Mockito (38)	Total (395)	p (%)
jGenProg	7	0	0	18	0	2	27	6.84
Nopol	6	0	7	21	0	1	35	8.86
Capgen	4	0	5	16	0	0	25	6.33
SequenceR	3	5	2	10	1	0	21	5.32
ARJA	8	2	8	23	1	2	44	11.14
RSRepair	7	1	6	17	0	1	32	8.10
MGPRepair	8	3	10	27	1	2	51	12.91

Table 5. Comparison of Repair Outcomes between MGPRepair, jGenProg, and ARJA.

Project	Number of Defects	Repaired Defect IDs (MGPRepair)	Repaired Defect IDs (jGenProg)	Repaired Defect IDs (ARJA)
Chart	26	C1, C3, C5, C13, C15, C19, C25	C3, C7, C13, C15, C25	C1, C5, C13, C15, C19
Lang	65	L7, L43, L45, L46, L51, L61, L63	-	L7, L43, L51, L55, L59, L61
Math	106	M2, M8, M20, M28, M32, M40, M49, M50, M53, M60, M64, M73, M74, M78, M80, M81, M82, M84, M95	M2, M8, M28, M40, M49, M50, M70, M73, M82, M84, M85, M95	M2, M20, M28, M32, M39, M40, M49, M50, M53, M60, M73, M74, M82, M84, M85
Time	27	T4	-	-
Closure	133	Clo5	-	-
Mockito	38	-	-	-
Total	395	35	17	25
Success Rate	-	8.86%	4.30%	6.33%

Table 6. Time Consumption Comparison of MGPRepair and Other Genetic Programming-Based Repair Methods.

Repair Method	Min Time (min)	Median Time (min)	Max Time (min)	Average Time (min)
MGPRepair	0.71	3.49	58.35	9.91
jGenProg	0.63	8.13	78.65	15.82
RSRepair	0.88	7.92	79.34	16.59
ARJA	0.73	4.91	63.73	11.62

Table 7. Techniques Used in MGPRepair and Its Three Variants.

Repair Method	Multi-Objective Genetic Search (NSGA-II)	Repair Material Optimization (Initial Population)	Test Case Filtering and Overfitting Detection
MGPRepair	√	√	√
MGPRepair-A	√	√	×
MGPRepair-B	√	×	√
MGPRepair-C	×	√	√

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
