4.1. RQ1: Semantic Preserving Transformation
The goal is to update the code to a more modern style while preserving the original semantics, ensuring that the new implementation does not encounter compilation problems, i.e., errors that were not present before.
Table 5 illustrates the results of the automated refactoring process carried out by ChatGPT on the five open-source repositories. The table columns display the repository name, the total number of for-loops identified, the number of successfully refactored loops, and the resulting status of the refactored code—specifically, whether it was compilable or uncompilable.
From the data, we see that in all cases, ChatGPT attempted to refactor the majority of for-loops in each repository. In the smallest repository, i.e., iridium, all 36 for-loops were refactored, and just 16 were compilable, leaving 20 uncompilable. Similar trends are seen in other repositories, such as magpie and j2clmavenplugin, where the majority of the for-loops were refactored; however, a significant portion of the resulting code was uncompilable. This points to the complexity involved in fully automating the refactoring process, especially when converting legacy for-loops to streams.
The larger repositories, rapiddweller-benerator-ce and openrefine, demonstrate the challenges of scaling the refactoring process. In rapiddweller-benerator-ce, 725 out of 729 loops were refactored by ChatGPT; however, only 357 were compilable, showing that larger codebases present more obstacles to successful refactoring. The largest repository, openrefine, had 1200 loops refactored, but only 196 were compilable, leaving over 1000 loops uncompilable.
In total, ChatGPT refactored 2101 out of 2132 identified for-loops, achieving a high success rate in terms of completing the refactoring itself. Only 605 of the refactored loops were compilable, while 1496 were uncompilable. This suggests that while ChatGPT is effective in automating loop refactoring, there are limitations to the process when it comes to ensuring that the resulting code is complete and readily compilable. The reasons for uncompilable code included contextual issues, syntax challenges, or differences in how stream-based code interacts with the rest of the codebase.
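The proportions behind these counts can be made explicit with a few lines of arithmetic. The sketch below is only an illustration: the class and method names are hypothetical, and the inputs are the counts quoted in the text above.

```java
import java.util.Locale;

final class CompileRates {
    /** Share (in percent) of `part` within `total`. */
    static double percent(int part, int total) {
        return 100.0 * part / total;
    }

    public static void main(String[] args) {
        // Counts quoted in the text for Table 5 (compilable / refactored).
        System.out.printf(Locale.ROOT, "iridium:      %.1f%%%n", percent(16, 36));     // 44.4%
        System.out.printf(Locale.ROOT, "rapiddweller: %.1f%%%n", percent(357, 725));   // 49.2%
        System.out.printf(Locale.ROOT, "openrefine:   %.1f%%%n", percent(196, 1200));  // 16.3%
        System.out.printf(Locale.ROOT, "all repos:    %.1f%%%n", percent(605, 2101));  // 28.8%
        // Share of identified loops for which a refactoring was produced.
        System.out.printf(Locale.ROOT, "refactored:   %.1f%%%n", percent(2101, 2132)); // 98.5%
    }
}
```

In other words, a refactoring was produced for roughly 98.5% of the identified loops, but only about 28.8% of those refactorings compiled.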
The inability of ChatGPT to refactor 31 for-loops is primarily influenced by the complex dependencies and potential side effects inherent to traditional loops. Streams, being designed with a functional and side-effect-free paradigm, contrast with the imperative style of for-loops. Additionally, streams are not always optimal for scenarios involving multi-level iteration, particularly when loops are interdependent or when there is intricate state sharing.
We manually inspected around 25% of the snippets that compiled successfully. In most cases, the refactored code preserved the original semantics and the transformation was therefore correct; only around 10% of the inspected code was not semantically correct. For the cases that failed to compile, we analyzed the code to identify the causes of compilation errors, categorizing them according to the type of issue encountered. The classification, along with examples and detailed discussion, is presented in Section 4.3. This process allowed us to rigorously assess both the correctness of successful refactoring and the limitations of the automatic approach in handling more complex or unconventional code patterns.
The gathered data underscore the potential of AI-driven code refactoring tools and highlight the need for human intervention to review and fine-tune the results, especially in complex or large-scale projects. ChatGPT is a helpful tool for programmers; however, it is not yet reliable enough to be used without human oversight [8].
4.2. RQ2: Comparison with SOTA Approaches
State-of-the-art approaches for automated for-loop refactoring typically impose strict preconditions on the loops they can transform, limiting their applicability to only a subset of ‘simple’ or canonical loops. These preconditions often exclude loops with multiple return statements, break statements, local variables referenced from outside the loop, or throwing clauses, under the assumption that such constructs are too complex for reliable transformation. With this research question, we aim to challenge this perspective by showing that ChatGPT is capable of refactoring many loops that do not satisfy these traditional preconditions. Our results highlight that these ‘old’ limitations are overly conservative and that LLM-based approaches can handle a wider range of for-loop patterns, generating compilable stream-based code in scenarios that previous SOTA techniques would have deemed ineligible.
Table 6, Table 7 and Table 8 show the number of for-loop refactorings generated by ChatGPT, classified according to the categories discussed in Section 3.2. Each column is a category, and the two sub-columns ‘c’ and ‘u’ give the number of compilable and uncompilable refactorings generated, respectively.
Table 6 shows the number of for-loops classified according to their characteristics, i.e., having local variables (LV), having throwing clauses (TC), having break statements (BR), and having more than one return (MR), that do not satisfy the respective preconditions presented in Table 1. Despite this, ChatGPT proposed a corresponding stream-based code fragment, whose successful compilation showed that the refactoring was possible. Although some transformations were successful, others posed significant challenges. A loop that does not satisfy the Throw Clause (TC) precondition is less likely to have a corresponding successful stream-based code. Specifically, in the rapiddweller and openrefine repositories, fewer than 22% of the generated fragments were compilable. Similarly, loops not satisfying the multiple return (MR) precondition resulted in very few compilable generated fragments for both repositories: 100% and 50%, respectively. These findings suggest that certain code blocks, i.e., those that throw exceptions or have many return statements, are more complex or less suitable for automated refactoring into stream pipelines, since these constructs are closer to an imperative style of programming than to a functional one.
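To make the MR case concrete, the following is a minimal sketch of how a loop with several return statements can be expressed as a stream. It is not taken from the analyzed repositories; the class, method names, and data are hypothetical.

```java
import java.util.List;

final class MultipleReturnSketch {
    // Imperative form: two returns inside the loop plus a fallback after it.
    static String firstHitLoop(List<String> words, int minLength) {
        for (String w : words) {
            if (w.isEmpty()) {
                return "empty";
            }
            if (w.length() >= minLength) {
                return w;
            }
        }
        return "none";
    }

    // Stream form: the first element that would trigger either return is
    // selected, then mapped to the value the matching return would produce.
    static String firstHitStream(List<String> words, int minLength) {
        return words.stream()
                .filter(w -> w.isEmpty() || w.length() >= minLength)
                .findFirst()
                .map(w -> w.isEmpty() ? "empty" : w)
                .orElse("none");
    }
}
```

The stream version has to merge the exit conditions into one predicate and then re-distinguish them, which hints at why loops with many returns are harder to translate mechanically.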
The largest number of refactoring attempts, as well as the highest rate of successful refactorings, occurred in the openrefine and rapiddweller-benerator-ce repositories. In openrefine, 31 loops not satisfying the precondition local variables (LV) were refactored successfully; however, 43 failed to compile. This indicates a substantial attempt to convert complex for-loops that involved more than one reference to non-effectively final variables defined outside the loop, albeit with a significant failure rate. Similarly, this repository saw notable success in refactoring loops not satisfying the BR precondition, with 27 successful compilations, although again accompanied by a comparable number of failures (28).
In contrast, smaller repositories, such as iridium, magpie, and j2clmavenplugin, saw relatively few refactoring attempts, with mixed results. In iridium, e.g., the Throw Clause (TC) precondition resulted in a full success rate with four compilable loops and no failures. This indicates that simpler loop structures may be more amenable to automated refactoring despite containing exception-handling statements.
The refactoring of loops when the Break (BR) precondition was not satisfied consistently showed a higher success rate across the repositories, particularly in rapiddweller and openrefine, where the majority of transformed code snippets compiled successfully. This suggests that loops containing break statements can be effectively transformed into streams, even though state-of-the-art approaches have claimed that such cases are not feasible (we could more appropriately say not straightforward) for stream refactoring.
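As an illustration of why a break statement need not block a stream refactoring, the sketch below rewrites a loop that stops at the first negative value. The example is hypothetical and assumes Java 9 or later, where `Stream.takeWhile` is available.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class BreakSketch {
    // Imperative form: stop at the first negative value.
    static List<Integer> doubledPrefixLoop(List<Integer> values) {
        List<Integer> out = new ArrayList<>();
        for (int v : values) {
            if (v < 0) {
                break;
            }
            out.add(v * 2);
        }
        return out;
    }

    // Stream form: takeWhile keeps the prefix before the element that
    // would have triggered the break (requires Java 9+).
    static List<Integer> doubledPrefixStream(List<Integer> values) {
        return values.stream()
                .takeWhile(v -> v >= 0)
                .map(v -> v * 2)
                .collect(Collectors.toList());
    }
}
```

Loops whose break only signals that a match was found can often be expressed with short-circuiting operations such as `anyMatch` or `findFirst` instead.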
The overall success rates of compilable refactorings vary widely between repositories and preconditions, with certain combinations that prove more conducive to stream-based transformations. Notably, the openrefine repository shows both the highest volume of attempted refactorings and a reasonable number of successful compilations across multiple preconditions, indicating that it contains a broad variety of loops that are somewhat compatible with Java’s Stream API. However, the relatively high number of failures in all preconditions indicates that while ChatGPT can handle simpler cases, more intricate loop structures involving complex control flow or variable manipulation remain challenging.
Listing 5 shows an example of a for-loop that does not satisfy the precondition local variables (LV); however, it was successfully refactored by ChatGPT to an equivalent stream pipeline. In the original loop, the local variable columnIndex is computed within the loop and used in the conditional check and to compute the local variable cellIndex, which is then assigned to an array. In the refactored version, these local variables are handled implicitly: the stream maps the loop index i to columnIndex, filters according to the same logic, maps to cellIndex, and finally performs the side-effect update on keyedGroup.cellIndices. This transformation demonstrates that ChatGPT can correctly manage multiple intermediate variables by embedding their computation within the stream pipeline stages.
Listing 5. Example of a for-loop that does not satisfy the local variables (LV) precondition, as it contains more than one local variable, and was successfully refactored by ChatGPT.
// original for loop
for (int i = 0; i < group.columnSpan; i++) {
    int columnIndex = group.startColumnIndex + i;
    if (columnIndex != group.keyColumnIndex && columnIndex < columnModel.columns.size()) {
        int cellIndex = columnModel.columns.get(columnIndex).getCellIndex();
        keyedGroup.cellIndices[c++] = cellIndex;
    }
}

// refactored stream pipeline
IntStream.range(0, group.columnSpan)
    .mapToObj(i -> group.startColumnIndex + i)
    .filter(columnIndex -> columnIndex != group.keyColumnIndex &&
        columnIndex < columnModel.columns.size())
    .map(columnIndex -> columnModel.columns.get(columnIndex).getCellIndex())
    .forEach(cellIndex -> keyedGroup.cellIndices[c++] = cellIndex);
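The refactored pipeline in Listing 5 still writes into keyedGroup.cellIndices and advances the counter c from inside forEach. A possible side-effect-free variant, sketched below under simplifying assumptions, collects the selected cell indices into a fresh array instead. The class, method, and parameter names are hypothetical: cellIndexOfColumn.get(k) stands in for columnModel.columns.get(k).getCellIndex(), and, unlike the listing, the result is not written into an existing array at offset c.

```java
import java.util.List;
import java.util.stream.IntStream;

final class CellIndexSketch {
    static int[] cellIndices(int startColumnIndex, int columnSpan,
                             int keyColumnIndex, List<Integer> cellIndexOfColumn) {
        return IntStream.range(0, columnSpan)
                .map(i -> startColumnIndex + i)                        // columnIndex
                .filter(col -> col != keyColumnIndex
                        && col < cellIndexOfColumn.size())             // same guard as the loop
                .map(col -> cellIndexOfColumn.get(col))                // cellIndex
                .toArray();                                            // no shared counter needed
    }
}
```

Staying on `IntStream` also avoids the boxing introduced by `mapToObj` in the generated pipeline.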
Table 7 shows the results of the refactoring transformations attempted on loops that exhibit the characteristics shown in Table 1 (max statements, MS; nested loops, NL; switch statements, SW; and continue statements, CO) but do not satisfy the respective preconditions.
Transforming loops is less feasible when the loop body has more than five statements (such loops are classified as max statements (MS) loops). The results show that for repositories such as rapiddweller-benerator-ce and openrefine, ChatGPT struggles with this constraint, with 124 and 351 uncompilable cases, respectively. In contrast, iridium has more compilable (7) than uncompilable (2) cases. For magpie and j2clmavenplugin, ChatGPT produced a higher number of uncompilable cases, indicating that many loops exceed the complexity threshold for stream-based refactoring.
Nested loops are notoriously difficult to refactor using streams as they introduce more intricate control flows. This is reflected in the high number of uncompilable cases across the board. Repositories openrefine and rapiddweller-benerator-ce are particularly affected, with 68 and 32 uncompilable cases, respectively. Only iridium and magpie manage a more balanced ratio of compilable to uncompilable cases, but still, the presence of nested loops is a substantial barrier to stream refactoring.
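For reference, the simplest kind of nested loop, where the inner loop does not depend on state mutated by the outer one, maps onto `flatMap`. The sketch below is a hypothetical example of that easy case, not code from the analyzed repositories; loops with interdependent state do not translate this directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class NestedLoopSketch {
    // Imperative form: every row label combined with every column label.
    static List<String> pairsLoop(List<String> rows, List<String> cols) {
        List<String> out = new ArrayList<>();
        for (String r : rows) {
            for (String c : cols) {
                out.add(r + ":" + c);
            }
        }
        return out;
    }

    // Stream form: the inner loop becomes a stream produced per outer element.
    static List<String> pairsStream(List<String> rows, List<String> cols) {
        return rows.stream()
                .flatMap(r -> cols.stream().map(c -> r + ":" + c))
                .collect(Collectors.toList());
    }
}
```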
Switch statements pose another challenge for stream refactoring. However, the number of switch statements in loops is relatively small compared to other constraints. Repositories rapiddweller-benerator-ce and openrefine have a few compilable and uncompilable cases, but the impact is less severe overall. This suggests that switch statements are less common within loops in the analyzed repositories, or developers might be avoiding their usage in situations conducive to refactoring.
In the repositories, the number of statements in the loop (MS) and the presence of nested loops (NL) are the most significant barriers to refactoring. Particularly, openrefine and rapiddweller-benerator-ce repositories show that a large portion of their loops exceed the complexity threshold, indicating that functional programming styles may not be suitable for these parts of the codebase. The results also show that control flow constructs such as switch statements (SW) are relatively rare in compilable stream refactoring.
Regarding for-loops including continue statements (CO), only the j2clmavenplugin, rapiddweller-benerator-ce, and openrefine repositories have a few compilable cases, with openrefine standing out with 32 compilable cases. However, the large number of uncompilable cases in openrefine (48) and other repositories highlights that this remains a significant limitation.
While some repositories, such as iridium, have a more balanced number of compilable and uncompilable cases across multiple categories, repositories like openrefine and rapiddweller-benerator-ce present challenges in refactoring, primarily due to the complexity of loop bodies and control flow structures. These results suggest that while streams offer a more readable and functional style of programming, they are not always suitable for all types of loops, especially those with complex control flows or extensive nested structures.
Listing 6 shows an example of a for-loop that does not satisfy the precondition continue (CO); however, it was successfully refactored by ChatGPT to an equivalent stream pipeline. The original loop skips empty strings using a continue statement and applies a sequence of transformations to build a normalizedLocalName String. In the refactored version, the same logic was implemented using a stream pipeline: the filter step excludes empty strings, followed by two map operations to transform each path, and a reduce operation to concatenate the results. Despite the presence of the continue statement, which usually complicates refactoring, the transformation preserves the original behavior and produces an equivalent result using functional constructs.
Listing 6. An example of for-loop refactoring of the CO category where ChatGPT successfully refactors a for-loop containing a continue statement.
// original for loop
for (String p : paths) {
    if (p.equals("")) {
        continue;
    }
    p = currentFileSystem.toLegalFileName(p, '-');
    normalizedLocalName += String.format("%c%s", File.separatorChar, p);
}

// refactored stream pipeline
normalizedLocalName = paths.stream()
    .filter(p -> !p.equals(""))
    .map(p -> currentFileSystem.toLegalFileName(p, '-'))
    .map(p -> String.format("%c%s", File.separatorChar, p))
    .reduce("", (result, p) -> result + p);
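A common alternative to the string-concatenating reduce in Listing 6 is `Collectors.joining`. The sketch below shows that variant under stated assumptions: `toLegalFileName` is a hypothetical stand-in for currentFileSystem.toLegalFileName, and the `prefix` parameter makes explicit whatever normalizedLocalName held before the loop, since the original loop appends with `+=` while the pipeline starts from an empty string (the listing does not show the initial value).

```java
import java.io.File;
import java.util.List;
import java.util.stream.Collectors;

final class PathJoinSketch {
    // Hypothetical stand-in for currentFileSystem.toLegalFileName(p, '-').
    static String toLegalFileName(String p, char replacement) {
        return p.replace(':', replacement);
    }

    // Imperative form, mirroring the structure of the original loop.
    static String normalizeLoop(String prefix, List<String> paths) {
        String result = prefix;
        for (String p : paths) {
            if (p.equals("")) {
                continue;
            }
            p = toLegalFileName(p, '-');
            result += String.format("%c%s", File.separatorChar, p);
        }
        return result;
    }

    // Stream form: joining() builds the string without repeated concatenation.
    static String normalizeStream(String prefix, List<String> paths) {
        return prefix + paths.stream()
                .filter(p -> !p.isEmpty())
                .map(p -> toLegalFileName(p, '-'))
                .map(p -> File.separatorChar + p)
                .collect(Collectors.joining());
    }
}
```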
Table 8 presents the evaluation of loops transformed into Java stream-based code for loops with one return statement (OR), loops having fewer than five statements (SS), and the five templates presented in [12]. The results reveal variability in both the success and failure rates of stream-based refactoring depending on the repository and the type of for-loop template applied. Notably, the counts of compilable versus uncompilable transformations indicate that certain loop patterns are more challenging to refactor into streams, particularly template 1 (for-loops with a conditional statement and one or more return statements), in agreement with the results of Table 6, where loops with return statements are more challenging.
Loops satisfying the one return (OR) precondition resulted in very few compilable generated fragments for both repositories: less than 20% and 37%, respectively. In contrast, the small size (SS) column represents cases where the loop body has fewer than five statements. Streams are generally suitable for such loops, and we observe a higher number of compilable cases in most repositories. For instance, openrefine has 105 compilable cases, but it is also noteworthy that there are 653 uncompilable cases, emphasizing that many loops are still too complex for straightforward refactoring. A deeper analysis of the compilation errors is given in Section 4.3 to elucidate the main reasons behind the large number of uncompilable cases. The rapiddweller-benerator-ce repository shows a large number of cases (287 compilable vs. 244 uncompilable), suggesting a mixed suitability for streams in this repository.
The first template was found in two repositories: rapiddweller-benerator-ce and openrefine. Interestingly, openrefine achieved 19 compilable refactorings, but 33 instances failed to compile. This suggests that the first template, while somewhat effective, still poses significant challenges, particularly due to conditional structures that may not easily translate into a single stream operation. Similarly, rapiddweller-benerator-ce encountered several unsuccessful refactorings, with just 4 compilable for-loops and 18 uncompilable ones.
The second and fifth templates have few occurrences, indicating that they are less common for-loop structures. Nevertheless, the fifth template shows a higher success rate than the second, which highlights that the inclusion of a temp variable complicates the transformation, possibly due to scoping and reuse issues, as streams are typically expected to be stateless and work more effectively with immutable data.
The results for the third template vary, and it appears to be another challenging pattern, particularly in openrefine (14 compilable and 10 uncompilable) and rapiddweller-benerator-ce (3 compilable and 4 uncompilable). The combination of a temp variable and method calls adds complexity that might interfere with stream semantics, particularly since method calls inside loops may cause side effects, which are difficult to express in functional streams.
The fourth template is the most prevalent in the data, particularly in magpie (4 compilable and 36 uncompilable), rapiddweller-benerator-ce (34 compilable and 10 uncompilable), and openrefine (31 compilable and 12 uncompilable). Despite a high number of successful compilations in some repositories, the number of uncompilable transformations remains significant. This suggests that while this template fits well within the stream paradigm (as streams naturally support element-wise operations), edge cases, such as incorrect handling of the new value insertion logic, might still cause failure in the refactoring process.
The data suggest that stream-based refactoring is highly dependent on the specific structure of the loop and the repository’s coding patterns. While some templates (like the fourth) are frequently encountered and refactored with relative success, others (such as the second and third) are more error-prone. This variability highlights the need for further improvements in automated refactoring tools to handle edge cases, particularly where temp variables and conditional branches are involved. Additionally, individual codebases vary significantly in how amenable they are to stream-based refactoring, suggesting that the nature of the existing code plays a crucial role in determining the success rate of such transformations.
According to the results shown above, ChatGPT was able to correctly generate a refactored version of 414 for-loops that the SOTA approaches would have discarded because they violate the preconditions. This shows that ChatGPT overcomes the limits of these approaches, suggesting refactorings that were not previously proposed. A plausible explanation is the massive training process of the GPT model: exposure to billions of code snippets, unlike the fixed rules of deterministic approaches, helped the model to properly combine stream APIs into new stream pipelines. The remaining 903 uncompilable generated fragments can be further analyzed to understand the reasons behind these failures.
Listing 7 displays an example of a for-loop corresponding to the fifth template presented in [12] that was successfully refactored by ChatGPT. The original code iterates over the list sg.getSnaks() using an enhanced for-loop. Inside the loop, it conditionally selects between two visitors, referenceSnakPrinter or mainSnakPrinter, depending on the value of the boolean variable reference, and then writes the result of the accept method to the writer. The refactored version expresses the same logic using a stream pipeline. It transforms the collection into a stream, applies a map operation that performs the same conditional logic inline using the ternary operator, and finally consumes the result using forEach(writer::write).
Listing 7. An example of for-loop refactoring of the 5th template where ChatGPT successfully refactored a loop containing an if/else statement.
// original for loop
for (Snak s : sg.getSnaks()) {
    if (reference) {
        writer.write(s.accept(referenceSnakPrinter));
    } else {
        writer.write(s.accept(mainSnakPrinter));
    }
}

// refactored stream pipeline
sg.getSnaks().stream()
    .map(s -> reference ? s.accept(referenceSnakPrinter) :
        s.accept(mainSnakPrinter))
    .forEach(writer::write);
4.3. RQ3: Compilation Error Analysis
To address the third research question, we focus the study on the uncompilable refactorings generated by ChatGPT (see Section 4.2).
Table 9 highlights the metrics extracted from the compilation errors. The columns are labeled according to the tags introduced in Section 3.2, while rows represent the error name collected from the compilation report. A description of each error can be found in Table A1 in Appendix A.
As Table 9 shows, the ‘cannot find symbol’ error is the most prevalent across all categories. This suggests that the generated code in these categories often references variables or methods that are not defined within the scope. Possible causes include misspelled variable or method names, variables or methods that are declared outside the loop but not accessible within it, and missing imports or incorrect package references.
The ‘illegal character’ error is relatively rare, appearing only in SS (seven occurrences) and the fifth template (six occurrences). Illegal characters can be caused by typographical errors, by copy-pasting code from sources that include non-printable characters, or by incorrect encoding settings in the development environment.
The ‘symbol expected’ error is notable in SS (59 occurrences) and the fifth template (37 occurrences). It indicates syntax issues where the compiler expected a different token. Common causes include missing semicolons or braces, incorrect use of operators or delimiters, and incomplete statements or expressions.
Another common error is ‘unreported exception’, which occurs in SS (56 occurrences) and MS (15 occurrences). It suggests that these categories often involve operations that can throw exceptions not handled properly. This can happen when methods that throw checked exceptions are called without proper try–catch blocks and the throws clause is missing in the method signature.
‘Local variables referenced’ is often overlooked during refactoring. This error is relatively frequent in MS (34 occurrences) and SS (18 occurrences). It indicates issues with variable scope and usage, such as referencing variables outside their declared scope, modifying effectively final variables within lambda expressions, and conflicts between local and global variable names.
The ‘variable is already defined’ error is notable in SS (57 occurrences) and LV (13 occurrences). It suggests multiple declarations of the same variable within these categories. This can occur due to re-declaring variables within nested scopes, conflicts between parameter names and local variable names, and copy-pasting code without renaming variables.
The ‘incompatible types’ error is rare, with only four occurrences in SS. It indicates type mismatch issues, which can be caused by assigning values of incompatible types, incorrect method return types, and using raw types instead of parameterized types in generics.
The ‘non-static variable’ error appears in OR (four occurrences) and SS (three occurrences). It indicates issues with accessing non-static variables from static contexts. This can happen when trying to access instance variables from static methods, when the scope of static and non-static members is misunderstood, and when the static context is used incorrectly within lambda expressions.
The ‘no suitable method’ error is relatively rare, with the highest occurrence in SS (five occurrences). It indicates method signature mismatches, which can be caused by calling methods with incorrect parameters, overloading methods without proper parameter types, and using incorrect method references in lambda expressions.
The ‘illegal start of type’ error is very rare, with only one occurrence in SS. It indicates a syntax issue at the start of a type declaration, which can be caused by incorrectly placed annotations or modifiers, missing class or interface keywords, and typographical errors in type declarations.
The ‘try without’ error is extremely rare, with only one occurrence in SS. It indicates a missing try block, which can happen when forgetting to include the try keyword before a block of code that handles exceptions, incorrectly formatting the try–catch–finally structure, and overlooking exception handling during refactoring.
Finally, ‘parse errors’ is more common in SS (21 occurrences) and NL (11 occurrences). It indicates general syntax errors, which can be caused by incomplete or incorrect code statements, misplaced or missing punctuation, and errors in code structure or formatting.
The types of errors encountered can provide insights into both the severity of mistakes made by ChatGPT and potential underlying causes. Moreover, they offer an indication of the effort a developer may need to invest to resolve them.
Starting with the most common issue, the ‘cannot find symbol’ error often arises due to insufficient context provided to ChatGPT during code generation. Since the model was queried with a focus solely on refactoring a for-loop (see Section 3.3), it generated the corresponding stream-based code without including necessary imports and variable declarations, assuming that they are implicit. In these cases, the error can easily be rectified by adding the required imports or defining the missing variables. However, a more complex situation occurs when the missing symbol refers to a class or method that does not exist, having been ‘hallucinated’ by the model [48,49]. This could require more significant intervention by the developer.
Similarly, the ‘variable is already defined’ error is a result of inadequate context. In this case, the model unnecessarily redefines variables already declared outside the generated stream pipeline. The resolution here is straightforward: remove the redundant variable declaration and use the pre-existing one.
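One way this error can arise in a generated pipeline is a lambda parameter that reuses the name of a local variable already in scope, which Java does not allow. The sketch below is a hypothetical illustration (the names are not from the analyzed repositories): the rejected form is kept as a comment and the compilable form simply renames the parameter.

```java
import java.util.List;
import java.util.stream.Collectors;

final class RedefinedVariableSketch {
    static List<String> withDefault(List<String> names) {
        String name = "anonymous"; // local variable already in scope

        // Rejected by javac ("variable name is already defined"), because the
        // lambda parameter would redeclare the local variable above:
        // names.stream().map(name -> name.trim()).collect(Collectors.toList());

        // Compilable: use a fresh parameter name and reuse the existing local.
        return names.stream()
                .map(n -> n.trim().isEmpty() ? name : n.trim())
                .collect(Collectors.toList());
    }
}
```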
Errors such as ‘illegal character’, ‘symbol expected’, ‘illegal start of type’, and ‘parse errors’ fall under the category of syntactical mistakes. These errors result from the model improperly writing code, using incorrect symbols, or omitting required ones. These errors are typically simple to correct, requiring the developer to insert or replace the appropriate symbols.
Shifting to more complex errors, ‘unreported exception’ and ‘try without’ relate to exception handling in Java, specifically within the context of stream APIs. These errors occur when the model fails to properly handle exceptions, a task often better suited for traditional for-loops rather than stream pipelines. Fixing such errors may necessitate reverting to a more conventional approach or a detailed restructuring of the code to ensure proper exception handling, which diminishes the utility of stream-based refactoring.
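A typical repair for ‘unreported exception’ is to catch the checked exception inside the lambda and rethrow it unchecked. The sketch below illustrates that pattern with a hypothetical `load` helper; it is one possible fix, not the approach used in the study, and it shows how much boilerplate the stream version then carries.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;

final class CheckedExceptionSketch {
    // Hypothetical helper that declares a checked exception.
    static String load(String name) throws IOException {
        if (name.isEmpty()) {
            throw new IOException("empty name");
        }
        return "content of " + name;
    }

    // Calling load(n) directly in map(...) would not compile, since
    // Function.apply declares no checked exceptions; wrap it instead.
    static List<String> loadAll(List<String> names) {
        return names.stream()
                .map(n -> {
                    try {
                        return load(n);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                })
                .collect(Collectors.toList());
    }
}
```

Callers that previously handled IOException must now handle UncheckedIOException, so the change is not purely local to the loop.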
Although rarer, the ‘incompatible types’ and ‘non-static variable’ errors demand a deeper contextual analysis. These errors typically occur when there is a mismatch between the expected and actual types or when non-static variables are used incorrectly within static contexts. Resolving these issues may not always be feasible due to the constraints imposed by the semantic structure of stream pipelines.
Finally, the ‘local variable referenced from a lambda expression must be final or effectively final’ error is tightly connected to one of the preconditions discussed in state-of-the-art approaches (see Section 4.2). By design, lambda expressions passed to stream operations can only capture local variables that are final or effectively final, meaning that certain for-loops cannot be easily refactored into streams. However, as indicated in Table 6, ChatGPT is often able to successfully refactor these loops despite this limitation, highlighting its potential to handle such complex refactorings.
Listing 8 highlights a typical compilation error that can arise when refactoring a for-loop into a Java stream. The generated code attempts to reassign the local variable fileName inside a lambda expression used within the ifPresent method. However, Java requires that variables captured by lambdas be final or effectively final, and reassignment violates this rule. As shown in the Maven output, the compiler fails with the error ‘local variables referenced from a lambda expression must be final or effectively final’, preventing successful compilation of the refactored code. Such variable scope and mutability dependencies are often subtle and non-trivial, making them particularly difficult for LLMs to detect and handle correctly during automated refactoring.
Listing 8. Example of a compilation error encountered during the analysis of a refactored for-loop. The code snippet shows the stream generated by ChatGPT, followed by the Maven compilation failure indicating an error of type ‘local variables referenced’.
Arrays.stream(new String[] { ".gz", ".bz2" })
    .filter(ext -> fileName.endsWith(ext))
    .findFirst()
    .ifPresent(ext -> fileName = fileName.substring(0,
        fileName.length() - ext.length()));

....
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.13.0:compile (default-compile) on project main: Compilation failure: Compilation failure:
[ERROR] /C:/Users/Ale-m/Desktop/RepositoryMining/AnalyzedRepositories/openrefine/main/src/com/google/refine/importing/ImportingUtilities.java:[733,24] local variables referenced from a lambda expression must be final or effectively final
....
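One way a developer might repair the pipeline in Listing 8 is to stop assigning to fileName inside the lambda and instead let the pipeline return the new value. The sketch below assumes the intent is to strip a trailing ".gz" or ".bz2" suffix; the class and method names are hypothetical.

```java
import java.util.stream.Stream;

final class StripExtensionSketch {
    // fileName is never reassigned here, so it is effectively final and
    // may be captured by the lambdas below.
    static String stripCompressionSuffix(String fileName) {
        return Stream.of(".gz", ".bz2")
                .filter(fileName::endsWith)
                .findFirst()
                .map(ext -> fileName.substring(0, fileName.length() - ext.length()))
                .orElse(fileName);
    }
}
```

At the call site, the reassignment then happens outside any lambda, e.g. `fileName = StripExtensionSketch.stripCompressionSuffix(fileName);`.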