Article

CodeTranFix: A Neural Machine Translation Approach for Context-Aware Java Program Repair with CodeBERT

1 School of Automation, Jiangsu University of Science and Technology, Zhenjiang 212100, China
2 Jiangsu Shipbuilding and Ocean Engineering Design and Research Institute, Zhenjiang 212100, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 3632; https://doi.org/10.3390/app15073632
Submission received: 2 March 2025 / Revised: 22 March 2025 / Accepted: 25 March 2025 / Published: 26 March 2025

Abstract
Automated program repair (APR) plays a vital role in enhancing software quality and reducing developer maintenance efforts. Neural machine translation (NMT)-based methods demonstrate notable potential by learning translation patterns from bug-fix code pairs. However, traditional approaches are constrained by limited model capacity and training data scale, leading to performance bottlenecks when generalizing to unseen defect patterns. In this paper, we propose CodeTransFix, a novel APR approach that synergistically combines NMT methods with large language models of code (LLMCs) such as CodeBERT. CodeTransFix learns contextual embeddings of bug-related code through CodeBERT and integrates these representations as supplementary inputs to the Transformer model, enabling context-aware patch generation. Repair performance is evaluated on the widely used Defects4J v1.2 benchmark. Our experimental results show that CodeTransFix achieves a 54.1% performance improvement over the best NMT-based baseline model and a 23.3% improvement over the best LLMC for bug fixing. In addition, CodeTransFix outperforms existing APR methods in the Defects4J v2.0 generalization test.

1. Introduction

Automatic program repair (APR) technology [1,2] plays a vital role in software quality assurance by automatically generating patches for defects discovered during debugging. This technology significantly reduces developer maintenance effort while improving software reliability. Among existing APR approaches, methods based on neural machine translation (NMT) models [3,4] have gained prominence due to their ability to learn translation patterns from bug-fix code pairs (BFPs). In recent years, the APR field has seen further breakthroughs with the application of large language models of code (LLMCs) [5,6]. In particular, fine-tuned LLMCs such as CodeT5 [7] and PLBART [8] have shown impressive performance in bug fixing [9,10].
Compared to natural language, source code exhibits strict syntactic features. Existing NMT-based APR approaches [3,11,12] learn source code semantics by incorporating knowledge from different domains to repair defective programs. For instance, CoCoNut [3] proposes a novel NMT architecture that employs two convolutional neural network encoders to separately encode contextual information (i.e., the immediate surrounding code within the same method or code block) and erroneous lines. DLFIX [11] utilizes tree-structured recurrent neural networks to learn contextual patterns and defect repair mechanisms. Recoder [12] integrates programming syntax knowledge to guide the generation of syntactically correct patches. However, due to the limited scale of model parameters and training data, NMT-based APR methods may fail to adequately learn strict syntactic features, consequently limiting their generalizability to unseen error patterns during training [9,10]. In contrast, large language models of code pre-trained on extensive code corpora demonstrate enhanced capabilities in bug repair tasks [13,14]. For example, VulRepair [14] leverages the CodeT5 model for vulnerability repair and generally outperforms NMT-based APR approaches. Nevertheless, this method only considers defective statements while neglecting the contextual dependencies (e.g., variable interactions, method dependencies) of error-prone code, despite the critical role of contextual code (defined as the immediate surrounding code within the same method or code block) in understanding erroneous behaviors.
To address the aforementioned limitations, we propose CodeTransFix, a novel approach that incorporates the CodeBERT model to more effectively repair erroneous code. Our method leverages the vast parameters of CodeBERT to comprehend the strict syntactic features of programming languages and integrates the surrounding context of erroneous statements to guide the repair process. The method operates in two phases:
First, CodeBERT learns contextual information by replacing bug lines in code segments with the special token <BUG>, generating modified context as model input. This design is motivated by the critical role of contextual code semantics (e.g., variable interactions, method dependencies) in understanding bug behavior. The <BUG> token explicitly marks the error location, allowing the model to focus on learning repair patterns from a structural context.
Second, a Transformer encoder-decoder architecture learns repair operations for buggy code. To fully utilize the knowledge learned by CodeBERT, we implement a Drop-net mechanism [15] to fuse contextual representations from CodeBERT’s output.
In this study, we evaluate two large language models of code based on the encoder-decoder architecture (CodeT5 [7] and PLBART [8]) and three state-of-the-art NMT-based APR approaches (CURE [16], RewardRepair [17], and Recoder [12]) on two benchmarks: Defects4J v1.2 [18] and Defects4J v2.0 [18].
In summary, this study makes the following contributions to the field:
(1)
We propose a method to construct bug code contexts and use the CodeBERT model to learn feature representations of these contexts.
(2)
We design an innovative APR architecture that integrates large language models of code with neural machine translation techniques.
(3)
We conducted an in-depth experimental study in which CodeTransFix was evaluated on 130 single-block bugs from Defects4J v1.2 [18] and 82 single-block bugs from Defects4J v2.0 [18]. The results show that CodeTransFix outperforms current state-of-the-art methods in both repair performance and generalization ability.
The remainder of this paper is organized as follows: Section 2 introduces related work; Section 3 presents our proposed CodeTransFix approach; Section 4 describes the experimental setup and results; Section 5 discusses our approach; lastly, Section 6 concludes the paper.

2. Related Work

The primary objective of research in the field of automated program repair (APR) is to automatically generate effective patches for buggy code, significantly reducing the time and expense of manual debugging. Current APR techniques fall into three main categories: traditional APR, NMT-based APR, and, more recently, bug fixing with large language models of code (LLMCs).
Traditional APR techniques are divided into three main categories: heuristic-based methods [19,20,21], constraint-based methods [22,23,24,25,26,27], and template-based methods [28,29,30,31]. Among these, template-based APR methods achieve the best performance; each template is predefined based on researcher experience and is designed to fix a specific type of bug. For example, TBar [31] systematically summarizes repair templates commonly used in the literature and applies them to fix bugs in programs. Although template-based APR demonstrates significant repair capability, it cannot fix bugs that fall outside the operational scope of its repair templates.
To address the limitations of template-based APR approaches, researchers have leveraged neural machine translation (NMT) techniques to develop APR methods [3,4,17]. These methods treat APR as a translation task, that is, translating code with bugs into fixed code. For example, NMT-based APR methods such as CoCoNut [3], SequenceR [4], and RewardRepair [17] employ supervised training of bug-fix code pairs (BFPs) and utilize an encoder-decoder architecture. In this architecture, the encoder is responsible for encoding the bug code and its context into intermediate vectors, whereas the decoder is responsible for decoding these intermediate vectors into a repair patch. However, they remain constrained by the dependency on bug-fix code pairs as training data. This indicates that NMT-based APR methods face challenges in repairing bug patterns not present in their training datasets.
Compared to NMT-based APR techniques, large language models of code achieve state-of-the-art bug-fixing performance. Large language models of code, which comprise billions of parameters trained on open-source code repositories, demonstrate exceptional proficiency in understanding programming languages [9,10]. These models can subsequently be transferred to downstream APR tasks through fine-tuning with limited training data. Some researchers have begun to explore the use of LLMCs for program repair [7,8,13], and these models outperform both traditional and NMT-based techniques. For example, Mashhadi et al. [13] made the first attempt to address the problem of single-line bug fixing using a fine-tuned CodeBERT. However, the fine-tuning process for applying LLMCs is straightforward and simplistic. Existing NMT-based APR techniques incorporate specific designs tailored for APR tasks: Recoder [12] leverages programming language knowledge, while CoCoNut [3] utilizes contextual information surrounding buggy statements. In contrast, current LLMCs lack such specialized architectural designs. CURE [16] enhances the repair capability of LLMCs by integrating them with NMT-based APR techniques.
In this paper, a new approach is proposed that utilizes the CodeBERT model to learn the buggy code context representation and then incorporates it into the NMT architecture to enhance repair capabilities.

3. The Proposed Approach

In this section, we discuss the design of CodeTransFix, an approach that combines a large language model of code, CodeBERT, with NMT technology.

3.1. Overall Workflow

The approach used in the present study is summarized in Figure 1. CodeTransFix comprises three phases: training, inference, and validation. In the training phase, CodeTransFix first preprocesses the data (as shown in step ①) by extracting the bug lines and their surrounding contexts from the buggy projects. After preprocessing, tokenized sequence data are obtained for the buggy code and its context. Next, these context token sequences are used to fine-tune the CodeBERT model so that it deeply understands the contextual features of the bug lines (as shown in step ②). Subsequently, CodeTransFix takes the context token sequences and buggy code token sequences obtained from the preprocessing stage as inputs to train an APR model that focuses on bug repair (as shown in step ③). This APR model is constructed by combining the fine-tuned CodeBERT model with the Transformer model to learn how to repair buggy code based on contextual information.
During the inference stage, the user provides our method with the projects containing bugs as well as the locations of the buggy lines. These factors are the inputs required by currently available automated bug-fixing methods [4,11]. CodeTransFix will then preprocess the acquired data (step ①). A collection of candidate patches will be produced by the patch generation component (step ④) using a beam search technique.
In the validation stage, CodeTransFix will validate the candidate patches by executing the test suite in the patch project. Developers can review the list of reasonable patches that CodeTransFix will generate (step ⑤).

3.2. Data Pre-Processing

The data pre-processing phase aims to convert the raw source code into a format that CodeTransFix can process efficiently.
CodeTransFix has two separate inputs: the bug line and its local context. In this work, we define the “context” as the immediate surrounding code within the same method or code block where the buggy line resides. To obtain these inputs, the bug lines and the surrounding contexts are first extracted from the bug project. Specifically, as shown in Figure 2, the third line in Figure 2a is the bug line that needs to be fixed, whereas the processed context is illustrated in Figure 2b, where the bug line is replaced by the special placeholder <BUG>. During the data pre-processing phase, the data are prepared for two different purposes: the bug lines are used directly for subsequent model training, in particular, to learn the transitions from bugs to fixes, whereas the processed context information is used to fine-tune the pre-trained model to enable the model to learn the context information of the bug code.
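As an illustration of this preprocessing step, the sketch below builds the two inputs from a toy Java method, replacing the buggy line with the <BUG> placeholder as in Figure 2. The helper name and the sample method are hypothetical; the paper does not publish its preprocessing code.

```python
def preprocess(method_lines, bug_line_no):
    """Split a buggy method into CodeTransFix's two inputs:
    the buggy line itself and the context with <BUG> in its place.
    `bug_line_no` is 1-based, following Figure 2."""
    bug_line = method_lines[bug_line_no - 1].strip()
    context = [
        "<BUG>" if i == bug_line_no - 1 else line.strip()
        for i, line in enumerate(method_lines)
    ]
    return bug_line, " ".join(context)

# Hypothetical buggy method; line 3 is the buggy line
method = [
    "public int scale(double value) {",
    "    double range = this.upperBound - this.lowerBound;",
    "    int g = (int) ((value - this.lowerBound) / range * 255);",
    "    return g;",
    "}",
]
bug, ctx = preprocess(method, 3)
```

The buggy line feeds the Transformer encoder directly, while the <BUG>-marked context is what CodeBERT is fine-tuned on.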
Following previous studies [3,13], CodeTransFix employs a subword-level tokenizer, byte-pair encoding (BPE) [32], to segment buggy lines, context lines, and fixed lines into subword token sequences. Starting from individual characters, BPE iteratively merges the most frequent symbol pairs to build a vocabulary that keeps common words intact while decomposing rare ones into frequent subword units (e.g., segmenting ‘ValueMarker’ into [‘Value’, ‘Marker’]). This frequency-driven merging lets the model handle unseen words without predefined mappings, mitigating the out-of-vocabulary issues of word-level tokenization.
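The BPE merging loop described above can be sketched in a few lines of Python. This is a toy trainer over a tiny corpus to show the frequency-driven merges, not the tokenizer actually used in the paper:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent
    adjacent symbol pair across the whole corpus."""
    words = [list(w) for w in corpus]  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        for w in words:  # apply the merge in place
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges, words

# Toy corpus: the shared stem "low" gets merged back into one token
merges, words = bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
```

After two merges, the frequent stem "low" is a single subword, while the rarer suffixes "er" and "est" remain decomposed, which is exactly the behavior that reduces out-of-vocabulary issues.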

3.3. Fine-Tuning the CodeBERT Model

Fine-tuning is performed by extending the CodeBERT model into an NMT architecture. Since CodeBERT is an encoder-only LLMC, a decoder is added, forming a Sequence-to-Sequence (Seq2Seq) architecture that is then fine-tuned in a supervised manner. After constructing the NMT model, multiple passes over the training dataset enable CodeBERT to learn the contextual representation of the buggy lines. Figure 3 shows an example of CodeBERT learning contextual representations.
The fine-tuning of the CodeBERT model takes as input the context of the preprocessed buggy lines and aims to learn the buggy code context representation. During fine-tuning, the CodeBERT model is trained to translate the modified context into the fixed code. The context is denoted as $x_c = (x_1, \ldots, x_n)$ and the correct fix as $y = (y_1, \ldots, y_m)$, where $x_i$ and $y_i$ represent tokens of the context and the correct fix, respectively, and $n$ and $m$ are their lengths. The weights of the Seq2Seq model are denoted as $\Phi$. By updating $\Phi$, the CodeBERT model is fine-tuned to maximize the average log-likelihood:
$$\mathcal{L}(x_c, y) = \frac{1}{m} \sum_{i=1}^{m} \log P(y_i \mid x_c, y_1, \ldots, y_{i-1}; \Phi)$$
$P(y_i \mid x_c, y_1, \ldots, y_{i-1}; \Phi)$ is the conditional probability, computed by the Seq2Seq architecture with weights $\Phi$, of token $y_i$ following the correct repair prefix $y_1, \ldots, y_{i-1}$ given the context $x_c$. After fine-tuning, only the hidden states of CodeBERT within the Seq2Seq architecture are retained; they serve as the context representation for bug fixing and are passed into the Context-Aware NMT Architecture.
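To make the objective concrete, the snippet below evaluates the average log-likelihood of Equation (1) for a toy three-token fix. The per-token probabilities are made up for illustration; in practice they come from the Seq2Seq model's softmax over the vocabulary.

```python
import math

def avg_log_likelihood(token_probs):
    """L(x_c, y) = (1/m) * sum_i log P(y_i | x_c, y_<i; Phi),
    where token_probs[i] is the model's probability for the i-th
    gold token given the context and the gold prefix."""
    m = len(token_probs)
    return sum(math.log(p) for p in token_probs) / m

# Hypothetical model probabilities for a 3-token repair sequence
probs = [0.9, 0.5, 0.8]
score = avg_log_likelihood(probs)
```

Maximizing this quantity pushes every per-token probability toward 1; the score is always non-positive and reaches 0 only when the model is certain of every gold token.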

3.4. Context-Aware NMT Architecture

CodeTransFix constructs an APR model by integrating a fine-tuned CodeBERT with an NMT model, as illustrated in Figure 4. In this architecture, the fine-tuned CodeBERT exclusively computes the context vectors around the buggy lines of code; in comparison, the Transformer-based encoder-decoder framework receives both the buggy lines of code themselves and their context vectors and learns the transition pattern from the buggy code to the repaired code. Notably, the model innovatively introduces a Drop-net mechanism [15] in the Transformer decoder to fuse the context vectors generated by CodeBERT with the hidden state information of the decoder itself.
As shown in Figure 4, to simplify the schematic, we only show the key structures in the model.
Since CodeBERT was fine-tuned in the previous step, its parameters must be kept fixed (frozen state) while training the APR model incorporating CodeBERT. In the training phase, the APR model takes the buggy code lines, contexts, and corresponding fix patches as the training data and learns how to generate the correct fix patches based on the buggy code lines with contexts. The model parameters are optimized through multiple training epochs to ultimately obtain the weight combination with the best performance.
In the inference phase, since the APR model can only access buggy lines and their contexts, the decoder generates patch sequences token-by-token using the start token <s> as a starting point. Specifically, the encoder-decoder multi-head attention module processes two sources of information in parallel: feature vectors from buggy code lines and their context vectors. The outputs of the encoder-decoder attention mechanisms are combined through a Drop-net mechanism to predict the current output token. This generated token is subsequently fed into the decoder as the input for the subsequent time step. The iterative generation process continues until the termination token </s> is produced, ultimately forming the complete repair patch.
When the user inputs the buggy statement “int g = (int) ((value − this.lowerBound)/(this.upperBound − this.lowerBound) * 255.0);” (shown as an orange square) and its context (shown as a purple square) to CodeTransFix, the buggy statement is tokenized and fed into the Transformer encoder, while the context is fed into the CodeBERT model. Subsequently, the hidden state $H_E$ output by the Transformer encoder and the context representation $H_C$ generated by CodeBERT are fed into the encoder-decoder multi-head attention layer of the Transformer decoder for joint modeling. In the patch generation phase, the decoder starts with the start token <s> and generates the patch sequence token by token in an autoregressive manner until the termination token </s> ends the generation process.
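The token-by-token generation described above is the standard autoregressive decoding loop. A minimal sketch follows, with a stub next-token function standing in for the real model (which scores tokens with the Transformer decoder and Drop-net fusion):

```python
def generate_patch(next_token_fn, max_len=64):
    """Autoregressive decoding: start from <s>, repeatedly ask the
    model for the next token, and stop at </s> or max_len."""
    tokens = ["<s>"]
    while len(tokens) < max_len:
        tok = next_token_fn(tokens)  # model conditions on the prefix
        if tok == "</s>":
            break
        tokens.append(tok)
    return tokens[1:]  # drop the start token

# Stub "model" that emits a fixed patch, for illustration only
patch_tokens = iter(["int", "g", "=", "0", ";", "</s>"])
result = generate_patch(lambda prefix: next(patch_tokens))
```

In the real system, `next_token_fn` would also receive the encoder outputs $H_E$ and the CodeBERT context $H_C$, and beam search would track several such prefixes in parallel instead of a single greedy one.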
In the following section, we detail the Drop-net mechanism [15] employed during decoding.
Specifically, a parameter $p_{net} \in [0, 1]$ is introduced for the decoder layers. During each training iteration, for the $l$-th layer of the Transformer decoder, a random variable $U_l$ is uniformly sampled from the interval $[0, 1]$. The attention output of the Transformer decoder is then computed as follows:
$$h = \mathbb{I}\left(U_l < \frac{p_{net}}{2}\right) \operatorname{attn}_C(s_l, H_C, H_C) + \mathbb{I}\left(U_l > 1 - \frac{p_{net}}{2}\right) \operatorname{attn}_E(s_l, H_E, H_E) + \frac{1}{2}\, \mathbb{I}\left(\frac{p_{net}}{2} \le U_l \le 1 - \frac{p_{net}}{2}\right) \left[\operatorname{attn}_C(s_l, H_C, H_C) + \operatorname{attn}_E(s_l, H_E, H_E)\right]$$
where $\mathbb{I}$ is the indicator function, $\operatorname{attn}_C$ and $\operatorname{attn}_E$ are the additional multi-head attention mechanism [33] and the original encoder-decoder attention mechanism, respectively, $s_l$ is the hidden variable of Transformer decoder layer $l$, and $H_C$ and $H_E$ are the context vector computed by CodeBERT and the buggy-code vector computed by the Transformer encoder, respectively. During the inference stage, the Transformer decoder attention output is as follows:
$$h = \frac{1}{2}\left[\operatorname{attn}_C(s_l, H_C, H_C) + \operatorname{attn}_E(s_l, H_E, H_E)\right]$$
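Equations (2) and (3) can be sketched directly. In this illustration scalar values stand in for the real attention output vectors; it demonstrates the Drop-net selection rule, not the paper's implementation:

```python
import random

def dropnet_combine(attn_c, attn_e, p_net, u=None, training=True):
    """Drop-net fusion of the two attention branches.
    Training: with prob. p_net/2 use only the CodeBERT branch,
    with prob. p_net/2 use only the encoder branch, otherwise
    average them. Inference: always average (Equation (3))."""
    if not training:
        return 0.5 * (attn_c + attn_e)
    if u is None:
        u = random.random()  # U_l ~ Uniform[0, 1]
    if u < p_net / 2:
        return attn_c
    if u > 1 - p_net / 2:
        return attn_e
    return 0.5 * (attn_c + attn_e)
```

Note that with $p_{net} = 0$ training always averages both branches, while $p_{net} = 1$ always drops one of them, which regularizes the decoder against over-relying on either source.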

3.5. Patch Generation and Validation

In the patch generation phase, we use the beam search strategy [3,4], which is widely used in NMT, to generate a large number of candidate patches. A recent study [10] shows that 93% of developers are only willing to review up to 10 patches. To be consistent with this practical limitation, we select the top 10 patches with the highest sequence probability from the candidate list produced by the beam search. In the experiments, we set up the following validation process: for each bug, we configure each APR method to generate 10 candidate patches (consistent with the practical limitation on patch review [10]). We then execute the full developer-written test suite on these candidates. The first patch that passes all test cases is labeled a plausible patch, and the true correctness of these plausible patches is then verified manually (excluding overfitting patches). The experiments ultimately measure the fixing capability of the different APR approaches using, as the core evaluation metric, the number of bugs for which each approach produces manually verified correct patches [10].
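The selection-and-validation pipeline above can be sketched as follows. The candidate list and the test-suite callback are stubs with hypothetical names; in the real system the candidates come from the beam search and validation shells out to the project's test suite:

```python
def select_and_validate(candidates, run_test_suite, k=10):
    """Keep the k candidates with the highest sequence probability,
    then return the first one that passes the full developer-written
    test suite (a 'plausible' patch, still subject to manual review)."""
    top_k = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    for patch, _prob in top_k:
        if run_test_suite(patch):
            return patch
    return None  # no plausible patch among the top k

# Toy candidates: (patch_text, sequence_probability)
cands = [("patch_a", 0.1), ("patch_b", 0.7), ("patch_c", 0.4)]
plausible = select_and_validate(cands, lambda p: p == "patch_c")
```

Only the plausible patch survives to manual correctness checking, which is what separates the "plausible" from the "correct" counts reported in Section 4.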

4. Experiments

In this section, to validate its repair performance, CodeTransFix is compared with state-of-the-art NMT-based automated repair methods and with large language models of code used for bug fixing. All methods employ perfect fault localization, in which the precise repair site of the defect is known, to guarantee fairness in the comparison. The best APR methods target Java single-block defects; therefore, such defects are the focus of this paper. The generalization of CodeTransFix was also tested on the Defects4J v2.0 dataset.

4.1. Implementation Details

CodeTransFix focuses on Java single-block bug fixes and therefore uses the dataset provided by Jiang et al. [10] for model training, split into 90% training data and 10% validation data. For CodeBERT, the hyperparameters of the pre-trained model are followed directly. We implemented our framework using Python 3.8 and PyTorch 1.10, leveraging the HuggingFace Transformers library (v4.24.0) to load the pre-trained CodeBERT model along with its dedicated tokenizer. Because CodeBERT acts only as an encoder, an additional Transformer decoder was added so that the model can learn the context vector of the buggy code. The chosen Transformer decoder consists of 12 attention heads, 6 layers, and 768-dimensional hidden states. The following ranges were used in a Tree-structured Parzen Estimator (TPE) search to tune the APR model’s hyperparameters: learning rate (10−5–10−3), dropout (0–0.5), number of attention heads (2–8), hidden layer dimension (128–768), and number of encoder and decoder layers (6–8). In addition, the Adam optimizer was employed to update the model parameters. In inference mode, a beam search with a beam width of 1000 was used. Model training and evaluation were performed on a server with a 20-core Intel Xeon Platinum 8457C CPU and 100 GB of RAM running Ubuntu 20.04 LTS, with an NVIDIA L20 GPU using CUDA 11.3 drivers. Consistent with previous methods [3,11,16], a 5 h end-to-end time constraint was applied to all tests when fixing the single-block defects.

4.2. Compared Techniques

The real-world bug benchmarks Defects4J v1.2 [18] and Defects4J v2.0 [18], widely recognized in the APR field, were selected for evaluation. Defects4J v1.2, the version most used for benchmarking, comprises 393 bugs, 130 of which are single-block bugs. Defects4J v2.0, the latest version, adds 438 bugs to v1.2, including 82 single-block bugs. Since CodeTransFix is intended to fix single-block bugs, it was tested on the 130 single-block bugs in Defects4J v1.2 and the 82 single-block bugs in Defects4J v2.0. Given that CodeTransFix combines NMT-based techniques and LLMCs, its performance was evaluated against open-source, state-of-the-art NMT-based methods and LLMCs for bug fixing. For NMT-based methods, CURE [16], RewardRepair [17], and Recoder [12] were chosen, as other APR methods fix fewer bugs [4,11]. For LLMCs, PLBART [8] and CodeT5 [7] were chosen, each in two sizes: PLBART-base (140 M parameters) and PLBART-large (400 M parameters), and CodeT5-small (60 M parameters) and CodeT5-base (220 M parameters). In addition, to verify the generalization of CodeTransFix on Defects4J v2.0, the two APR methods that previously performed best on Defects4J v1.2 were selected for further comparison: CodeT5-base, a large language model of code, and Recoder [12], an NMT-based APR method.

4.3. Results

CodeTransFix was first compared with state-of-the-art NMT-based APR methods and LLMCs for bug fixing under perfect fault localization. Table 1 displays the performance of CodeTransFix and the baselines that also employ perfect fault localization. Specifically, CodeTransFix generated 37 correct patches on the Defects4J v1.2 benchmark, fixing 13 more bugs than the best NMT-based APR method and 7 more bugs than the best-performing LLMC.
We further studied the extent to which CodeTransFix complements existing NMT-based methods and LLMC-based approaches to buggy code repair. In Figure 5, to explicitly demonstrate our method’s unique repair characteristics, the NMT-based APR approaches comprise three systems (CURE [16], Recoder [12], and RewardRepair [17]), while the LLMCs include CodeT5 [7] in its small and base configurations and PLBART [8] in its base and large configurations. Figure 5 illustrates the overlap among the bugs fixed by the different approaches. As shown in the figure, CodeTransFix fixes eight unique bugs, indicating that it can be used in conjunction with other methods to significantly increase the number of correct patches produced.
Below, we provide some examples of CodeTransFix fixing bugs. Unlike compiler diagnostics that detect lexical and syntactic errors (e.g., missing semicolons) or basic semantic violations (e.g., type mismatches), our approach specifically targets logic-driven defects that pass compilation but exhibit incorrect runtime behavior, such as improper API usage or flawed control-flow logic. Figure 6a shows how CodeTransFix fixes the Math 41 bug, which stems from an improper start value and termination condition in a loop statement. Based on fixes present in the historical training data, CodeTransFix was able to successfully repair this type of bug (as shown in Figure 6b). Figure 6c shows an example of CodeTransFix successfully fixing the Closure 123 bug, which other APR methods failed to fix. In the original buggy code, the assignment statement in the CodeGenerator class directly uses the enumerated member Context.OTHER for static state identification, which causes the test case to fail. The correct fix calls the getContextForNoInOperator method, also belonging to the CodeGenerator class, which dynamically returns the appropriate state by analyzing the current scope for the inclusion of the ‘in’ operator (indicated by the parameter context). CodeTransFix achieves the correct fix by analyzing the context of the defective code and recognizing that this function should be called instead of hard-coding the enumeration value.
To confirm that CodeTransFix can fix defects in other projects and does not simply overfit the Defects4J v1.2 bugs, it was tested on the 82 single-block bugs from the Defects4J v2.0 benchmark. Table 2 displays the comparative results against the baselines on Defects4J v2.0. As seen in Table 2, all three methods fix fewer bugs on Defects4J v2.0, indicating that its bugs are harder to fix. Nevertheless, CodeTransFix still outperforms the other baseline methods: it fixes 1.6 times more bugs than Recoder and fixes 2 more bugs than CodeT5-base among the LLMCs.

5. Discussion

As mentioned above, compared to directly generating patches with large language models of code (LLMCs) and to other neural machine translation (NMT)-based automatic program repair (APR) methods, the advantage of the CodeTransFix approach is that it significantly increases the number of correct fixes and generates many unique fixes.
However, the CodeTransFix approach is constrained by the following limitations. First, this method requires perfect fault location when identifying bug statements. If the fault location is not accurate enough, its repair performance will be significantly reduced. Second, while the current definition of context (i.e., the immediate surrounding code within the same method or code block) enables efficient patch generation, it does not cover broader semantic dependencies such as inter-class method invocations, external library calls, or API usage patterns. For example, certain bugs may depend on constraints defined in other modules or libraries, which cannot be captured by local code context alone. A concrete example of this limitation is illustrated in Figure 7, which demonstrates a failed repair case for the Math 32 defect in Defects4J v1.2. In Figure 7, the buggy line is highlighted in yellow, the ground-truth fix is marked in green, and the patch generated by CodeTransFix is shown in orange. Although this patch (the first candidate among 10 generated by CodeTransFix to pass all test cases in Defects4J v1.2) passed the automated validation, manual inspection revealed semantic discrepancies between the generated patch and the correct fix. Specifically, the failure stems from CodeTransFix’s inability to capture inter-class method dependencies. This case highlights how the current context definition—restricted to the immediate method scope—fails to address defects involving cross-class interactions.

6. Conclusions

In this paper, we present a novel program repair technique called CodeTransFix, which is based on neural machine translation (NMT) technology. Our approach incorporates large language models of code, CodeBERT, with the Transformer architecture and aims to significantly improve the ability to fix buggy code. The primary aims of our approach are as follows: (1) to adopt a modular architecture that separately learns the conversion from bugs to fixes and the context around the buggy code, thereby reducing the noise introduced by the surrounding context during the fixing process; (2) to utilize the enriched parameters learned from the CodeBERT model to improve the comprehension of the strict syntactic features and complex semantic dependencies of the code.
To validate the bug-fixing ability of CodeTransFix, we conducted experiments on the Defects4J v1.2 dataset and compared it with current state-of-the-art automated program repair (APR) methods targeting single-block bugs. The experimental results demonstrate that CodeTransFix achieves a 23.3% performance improvement (seven more bugs fixed). In addition, we performed generalization tests on the Defects4J v2.0 benchmark, and the results show that CodeTransFix outperforms other state-of-the-art APR methods.
However, challenges remain. First, the current context definition focuses on local code blocks, potentially missing cross-module dependencies. Second, while our evaluation is Java-centric, CodeTransFix’s architecture is language-agnostic. Future work will extend it to multi-block bugs and explore cross-language adaptation by leveraging multilingual LLMCs (e.g., CodeBERT) and domain-specific fine-tuning for languages like Python.

Author Contributions

Conceptualization, Y.L.; methodology, Y.L.; software, Y.L.; validation, Y.L.; formal analysis, L.Q.; writing—original draft preparation, Y.L.; writing—review and editing, S.Y.; funding acquisition, L.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the following two projects: “Research on the Application of Artificial Intelligence Technology in Software Precision Testing” (Grant Number 25422208) and “Research on Comprehensive Safeguard Technology for Complex Equipment of Large Ships Based on V-DT Driving and AI Enablement” (Grant Number JC2024021).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. An overview of the CodeTransFix architecture.
Figure 2. Example of a CodeTransFix processing context: (a) unprocessed bug code and (b) context of the bug code.
Figure 3. An example of CodeBERT learning contextual representations.
Figure 4. The architecture of the APR models used in CodeTransFix. The orange box highlights the buggy line, the purple box represents the context, and the blue box indicates the generated patch.
Figure 5. Correct patch Venn diagrams for Defects4J v1.2.
Figure 6. Examples of bugs fixed by CodeTransFix: (a) the fix for Math 41 in Defects4J v1.2; (b) a bug similar to Math 41 that appears in the training data; (c) Closure 123 in Defects4J v1.2, a bug fixed only by CodeTransFix.
Figure 7. CodeTransFix’s failed fix for Math 32 in Defects4J v1.2.
Table 1. Repair performance comparison between CodeTransFix and existing baselines under perfect fault localization settings.

| Project | CodeTransFix | PLBART-Base | PLBART-Large | CodeT5-Small | CodeT5-Base | CURE | Recoder | Reward |
|---------|--------------|-------------|--------------|--------------|-------------|------|---------|--------|
| Chart   | 3  | 3  | 4  | 2  | 4  | 0 | 6  | 2  |
| Closure | 15 | 6  | 5  | 2  | 6  | 2 | 5  | 4  |
| Lang    | 4  | 2  | 4  | 3  | 5  | 1 | 3  | 5  |
| Math    | 15 | 11 | 13 | 9  | 13 | 2 | 9  | 6  |
| Mockito | 0  | 2  | 2  | 2  | 1  | 0 | 1  | 2  |
| Time    | 1  | 1  | 2  | 1  | 1  | 1 | 0  | 1  |
| Total   | 37 | 25 | 30 | 19 | 30 | 6 | 24 | 20 |

CURE, Recoder, and Reward are NMT-based APR techniques.
Table 2. Comparisons of baselines on Defects4J v2.0.

| Project         | CodeTransFix | Recoder | CodeT5-Base |
|-----------------|--------------|---------|-------------|
| Cli             | 3  | 1  | 3  |
| Codec           | 3  | 2  | 2  |
| Collections     | 0  | 0  | 0  |
| Compress        | 1  | 1  | 1  |
| Csv             | 2  | 1  | 2  |
| Gson            | 0  | 0  | 0  |
| JacksonCore     | 2  | 2  | 3  |
| JacksonDatabind | 1  | 2  | 2  |
| JacksonXml      | 0  | 0  | 0  |
| Jsoup           | 6  | 2  | 3  |
| JxPath          | 0  | 0  | 0  |
| Total           | 18 | 11 | 16 |

Share and Cite

MDPI and ACS Style

Lu, Y.; Ye, S.; Qi, L. CodeTranFix: A Neural Machine Translation Approach for Context-Aware Java Program Repair with CodeBERT. Appl. Sci. 2025, 15, 3632. https://doi.org/10.3390/app15073632

