5.4. Experimental Results and Analysis
The experiments showed that combining the autoregressive Transformer decoder model with the 5-gram model, together with an appropriately chosen relative frequency confidence threshold, yielded better performance for API prediction.
Figure 7 shows the prediction results of the 5-gram, Transformer, and DLH-API models:
As can be seen from Figure 7, compared to using the autoregressive Transformer model or the n-gram-based API call sequence statistics method alone, DLH-API improved both the accuracy and MRR of API completion, demonstrating the effectiveness of combining deep learning and heuristics for API prediction.
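Both metrics are standard: accuracy counts a prediction as correct only when the top candidate matches the ground truth, while MRR (mean reciprocal rank) credits 1/k when the correct API appears at rank k of the candidate list. A minimal sketch of how the two metrics are computed (the function name is ours, not from the paper):

```python
def accuracy_and_mrr(predictions, targets):
    """Compute top-1 accuracy and mean reciprocal rank (MRR).

    predictions: list of ranked candidate lists (best candidate first).
    targets: list of ground-truth APIs, one per prediction.
    """
    hits, reciprocal_ranks = 0, []
    for ranked, target in zip(predictions, targets):
        if ranked and ranked[0] == target:
            hits += 1  # top-1 hit counts toward accuracy
        if target in ranked:
            reciprocal_ranks.append(1.0 / (ranked.index(target) + 1))
        else:
            reciprocal_ranks.append(0.0)  # target not in the candidate list
    n = len(targets)
    return hits / n, sum(reciprocal_ranks) / n

# Two queries: the first is a top-1 hit, the second a rank-2 hit.
acc, mrr = accuracy_and_mrr(
    [["eval()", "train()"], ["zero_grad()", "load_state_dict()"]],
    ["eval()", "load_state_dict()"],
)
print(acc, mrr)  # 0.5 0.75
```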
To gain a deeper understanding and evaluate our proposed approach, we conducted experiments to investigate the following four research questions:
To investigate the effectiveness of using the autoregressive Transformer decoder model alone for API completion, we evaluated its performance on four datasets: ML, Security, Web, and DL. The autoregressive Transformer decoder model can predict code tokens other than API tokens, but in this paper we evaluated it only on the API completion task. The results are shown in Table 2:
The experimental results in the table above show that, although the autoregressive-Transformer-decoder-based model achieved the best performance among five methods presented at recent top conferences [1], its API completion accuracy was still low on real datasets, leaving much room for improvement.
In addition, API usage patterns are widely applied in API completion, where the n-gram model is a representative and simple method that has achieved good results in various studies [19]. This model predicts the next API based on the previous n − 1 items of an API call sequence, where n is a hyperparameter of the n-gram model. Following the settings used by Xiao et al. [34], in this experiment n was set to 3, 4, and 5. Based on the API statistics database constructed from the training set, APIs in the test set were predicted by matching the corresponding API call sequence prefixes (i.e., the previous n − 1 items). If a matching prefix existed, the API with the highest frequency in the nth position was chosen as the prediction result. The results are shown in Table 3:
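The prefix-matching procedure described above can be sketched as follows; the database maps each (n − 1)-API prefix to a frequency table of the API that follows it (the training sequences below are illustrative, chosen to reproduce the 3-vs-1 counts used as an example later in this section):

```python
from collections import Counter, defaultdict

def build_ngram_db(sequences, n):
    """Map each (n-1)-API prefix to a frequency table of the next API."""
    db = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            prefix = tuple(seq[i:i + n - 1])
            db[prefix][seq[i + n - 1]] += 1
    return db

def predict(db, prefix):
    """Return the most frequent next API for a prefix, or None if unseen."""
    table = db.get(tuple(prefix))
    return table.most_common(1)[0][0] if table else None

train = [["net()", "load_state_dict()", "net()", "eval()"],
         ["net()", "load_state_dict()", "net()", "eval()"],
         ["net()", "load_state_dict()", "net()", "eval()"],
         ["net()", "load_state_dict()", "net()", "load_state_dict()"]]
db = build_ngram_db(train, n=4)
print(predict(db, ["net()", "load_state_dict()", "net()"]))  # eval()
```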
As shown in Table 3, compared to the autoregressive Transformer decoder model, the accuracy and MRR of the n-gram-based API call sequence statistics method were lower overall. Moreover, as the value of n increased, both accuracy and MRR decreased. This illustrates the superiority of deep learning methods.
The n-gram model for statistical analysis of API call sequences primarily serves as a supplementary strategy to deep learning, used for predicting APIs in specific usage patterns that the deep model may overlook or handle poorly. For this to work, firstly, there should be API usage patterns in the test set that match those in the API statistics database; the proportion that do is denoted the prefix matching rate (PMR). Secondly, among all matched API call sequence prefixes, the predictions should be as accurate as possible, which we measure as precision.
Table 4 below shows the experimental results for these two metrics across the four datasets:
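The two metrics can be computed in a few lines; the sketch below assumes the same prefix-to-frequency-table database described above (the API names and counts in the toy database are hypothetical):

```python
from collections import Counter

def pmr_and_precision(db, test_samples, n):
    """db maps (n-1)-API prefixes to Counters of next-API frequencies.
    test_samples is a list of (call_sequence, true_next_api) pairs."""
    matched = correct = 0
    for seq, truth in test_samples:
        table = db.get(tuple(seq[-(n - 1):]))  # last n-1 calls form the prefix
        if table:
            matched += 1  # prefix found in the statistics database
            if table.most_common(1)[0][0] == truth:
                correct += 1  # most frequent continuation was correct
    pmr = matched / len(test_samples)
    precision = correct / matched if matched else 0.0
    return pmr, precision

# Toy database and test set (hypothetical, n = 3)
db = {("open()", "read()"): Counter({"close()": 4, "seek()": 1})}
samples = [(["open()", "read()"], "close()"),    # matched, correct
           (["open()", "read()"], "seek()"),     # matched, wrong
           (["write()", "flush()"], "close()")]  # unmatched prefix
print(pmr_and_precision(db, samples, n=3))  # PMR = 2/3, precision = 1/2
```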
As can be seen from Table 4, as n increased, the prefix matching rate gradually decreased while the precision gradually increased. This suggests that, with a larger n, fewer API call sequence prefixes in the test set matched those in the training set, but the predictions made on matched prefixes were more often correct. Based on this observation, we combined the 5-gram-based API call sequence statistics method with the autoregressive Transformer decoder model for API prediction (the impact of different n values on the results is discussed in the subsequent questions). The results of the combinatorial strategy prediction are shown in Table 5:
From the above results, it can be seen that the combinatorial strategy improved both accuracy and MRR relative to the autoregressive-Transformer-decoder-based model and the n-gram model alone. This demonstrates the effectiveness of the combinatorial prediction strategy.
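The decision logic of the combinatorial strategy can be sketched as follows. Here `transformer_candidates` stands in for the deep model's ranked output, and the `threshold` parameter anticipates the confidence variants examined below; the exact fallback rules of DLH-API may differ in detail:

```python
def combined_predict(prefix_db, transformer_candidates, prefix, threshold=0.0):
    """Sketch of the combinatorial strategy: trust the n-gram statistics when
    the prefix is known (and, optionally, confident enough); otherwise fall
    back to the deep model's top-ranked candidate."""
    table = prefix_db.get(tuple(prefix))
    if table:
        api, count = max(table.items(), key=lambda kv: kv[1])
        if count / sum(table.values()) >= threshold:
            return api  # n-gram statistics decide
    return transformer_candidates[0]  # fall back to the Transformer

stats = {("net()", "load_state_dict()", "net()"):
         {"eval()": 3, "load_state_dict()": 1}}
# Known prefix: the n-gram statistics decide.
print(combined_predict(stats, ["train()"],
                       ("net()", "load_state_dict()", "net()")))  # eval()
# Unknown prefix: fall back to the Transformer's top candidate.
print(combined_predict(stats, ["train()"], ("zero_grad()",)))     # train()
```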
To enhance the prediction accuracy, we explored two heuristic methods for setting thresholds, based on absolute and relative frequencies. Absolute frequency refers to the actual count of API occurrences, for instance, (net(), load_state_dict(), net()): {eval(): 3, load_state_dict(): 1}, which indicates that in the training set, after the API call sequence prefix (net(), load_state_dict(), net()), eval() appears three times and load_state_dict() once. Relative frequency indicates the proportion of a specific API's occurrences among all APIs sharing the same call prefix. In the example above, the relative frequencies are (net(), load_state_dict(), net()): {eval(): 0.75, load_state_dict(): 0.25}, meaning that after the prefix (net(), load_state_dict(), net()), eval() appeared with a relative probability of 0.75 and load_state_dict() with 0.25. We set thresholds based on both absolute and relative frequencies: if the predicted API's prefix existed in the statistics database and its frequency exceeded the threshold, the most frequently appearing API was selected as the prediction. Taking the DL dataset with the 5-gram model as an example, we evaluated the performance of the combinatorial strategy under different thresholds for both absolute and relative frequencies. The results are shown in Figure 8:
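The two threshold variants can be made concrete with a short sketch; the counts reuse the example prefix above, while the threshold values are illustrative, not the tuned ones from Figure 8:

```python
def relative_frequencies(table):
    """Convert absolute next-API counts into relative frequencies."""
    total = sum(table.values())
    return {api: count / total for api, count in table.items()}

def confident_prediction(table, threshold, mode="relative"):
    """Return the most frequent API only if it clears the threshold.
    mode='absolute' compares the raw count; 'relative' its proportion."""
    api, count = max(table.items(), key=lambda kv: kv[1])
    score = count if mode == "absolute" else count / sum(table.values())
    return api if score >= threshold else None

counts = {"eval()": 3, "load_state_dict()": 1}  # the paper's example prefix
print(relative_frequencies(counts))       # {'eval()': 0.75, 'load_state_dict()': 0.25}
print(confident_prediction(counts, 0.5))  # eval()  (0.75 >= 0.5)
print(confident_prediction(counts, 0.8))  # None    (0.75 < 0.8)
```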
Figure 8 shows that, with absolute frequency, the highest accuracy occurred when the threshold was set to 1. With relative frequency, the peak accuracy was higher than that achieved under absolute frequency. Additionally, under both absolute and relative frequencies, accuracy first increased and then decreased as the threshold grew. Similar observations were made in the experiments on the other three datasets.
The data in Table 4 indicate that, as n increased, the precision of n-gram-based API predictions improved, but the prefix matching rate decreased: although a larger n enhanced precision, fewer API call prefixes in the test set matched the training set. Thus, increasing n did not necessarily enhance the overall performance of the combinatorial strategy. Additionally, the analysis for Question 3 revealed that setting the confidence threshold based on relative frequency was more effective than using absolute frequency. Based on these observations, this experiment examined the combinatorial prediction strategy at different relative frequency thresholds, and Figure 9 illustrates the prediction performance for different n values on the DL dataset.
From Figure 9, it can be seen that, for accuracy, with a confidence threshold of 0 the larger n value did indeed perform better than n = 5. However, as the threshold increased, the accuracy of the combinatorial strategy with the larger n improved only slowly, and its peak remained below the optimal result achieved with n = 5. This indicates that, while increasing n can improve the precision of prefix matching, the reduced number of matching prefixes lowered the overall number of correctly predicted APIs, thus limiting the combinatorial strategy's results. The behavior for MRR was similar, with the best results achieved at n = 5. Similar conclusions were obtained on the other three datasets, where the combinatorial strategy performed relatively better at n = 5.
To evaluate our API completion model, we ran the following baseline models for comparison.
PyART [13]: This model utilizes a predictive framework optimized with heuristic principles, incorporating data-flow details, token similarity, and token co-occurrence to enhance API completion. It effectively employs these heuristic features to achieve accurate predictions.
MPL [12]: This method enhances the AST characterization of source code by integrating multiple paths, and leverages LSTM to improve API completion. It focuses on extending the traditional AST approach to better capture diverse code features.
TravTrans [15]: This approach processes AST node sequences using a pre-order traversal, encoding them into a Transformer model to predict masked API nodes. It aims to accurately identify APIs in the source code by utilizing this sequence-based encoding.
The results are shown in Table 6, where the last row shows the outcomes of our method and the other rows show the baseline results. From these results, it is evident that our proposed method outperformed all the baseline methods across the four datasets. In particular, it excelled on the DL dataset, with an accuracy and MRR of 34.16% and 45.64%, respectively, an improvement of 3.99% and 4.90% over the state-of-the-art method (i.e., TravTrans). This success is attributed to our method employing a powerful backbone deep learning model and introducing API call sequence information as a heuristic.