Article

C2B: A Semantic Source Code Retrieval Model Using CodeT5 and Bi-LSTM

Department of Computer Software Engineering, National University of Sciences and Technology, Islamabad 44000, Pakistan
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5795; https://doi.org/10.3390/app14135795
Submission received: 13 May 2024 / Revised: 20 June 2024 / Accepted: 26 June 2024 / Published: 2 July 2024

Abstract

To enhance the software implementation process, developers frequently leverage preexisting code snippets by exploring an extensive codebase. Existing code search tools often rely on keyword- or syntactic-based methods and struggle to fully grasp the semantics and intent behind code snippets. In this paper, we propose a novel hybrid C2B model that combines CodeT5 and bidirectional long short-term memory (Bi-LSTM) for source code search and recommendation. Our proposed C2B hybrid model leverages CodeT5’s domain-specific pretraining and Bi-LSTM’s contextual understanding to improve code representation and capture sequential dependencies. As a proof-of-concept application, we implemented the proposed C2B hybrid model as a deep neural code search tool and empirically evaluated the model on the large-scale dataset of CodeSearchNet. The experimental findings showcase that our methodology proficiently retrieves pertinent code snippets and surpasses the performance of prior state-of-the-art techniques.

1. Introduction

Code searching is a common software development practice for increasing “software productivity”. Software productivity generally refers to the efficiency with which software is developed [1]. Code searching and reuse improve software productivity by enabling developers to efficiently find and reuse existing code snippets, thereby tackling new challenges more effectively. Developers often use free-text queries to search for and reuse previously created code across a large codebase for specific tasks [2,3]. Code reuse offers developers time efficiency, cost reduction, and more consistent code quality by applying existing solutions to new challenges, which expedites development and improves the overall development process. Reusing code can accelerate development, but finding suitable snippets remains a challenge: despite the abundance of reusable code in repositories, developers struggle to efficiently extract code that meets their requirements [4].
Information retrieval (IR) techniques are frequently applied in natural language processing (NLP) to implement classic code search algorithms (e.g., [5,6,7,8]). These algorithms often treat the source code as a plain-text document or a series of tokens and then compute the similarity between queries and code documents to deliver relevant code snippets [9]. Code search algorithms aim to find the code fragments in a huge code corpus that most closely match the user’s intent [10]. Existing code search techniques struggle to find relevant code snippets due to semantic gaps between the code in repositories and user queries. Recent academic and commercial efforts are applying deep learning for more sophisticated code search (e.g., NQE [11], DeepCS [12], UNIF [13], CAT [14], SeCoDeGrMa [10], JBLG [10]). These techniques leverage the power of NLP to understand and match the semantic intent of a developer’s query with relevant code. While these techniques (e.g., [10,11,12,13,14]) show significant improvement over traditional keyword-based search methods, they are not without limitations [12,15]. Existing techniques face two main challenges: (i) an inability to effectively understand the context within which code operates. This includes not just the broader application context but also the specific syntax and structure of programming languages. Traditional techniques fail to grasp the intricate dependencies and nuances within code structures, which can lead to irrelevant or incorrect search results [16]. (ii) Difficulty interpreting the semantics of both the natural language used in queries and the programming languages used in code. Effective code search requires a deep understanding of the intent behind a user’s query, which can be complex and nuanced. Current techniques lack the sophistication to accurately match these queries with appropriate code snippets, particularly when dealing with abstract concepts or less explicitly defined problems [10,11,12,13,14].
Deep learning techniques for code search, as discussed in [16], tend to overlook the knowledge gap between code snippets and user queries. Instead, they prioritize learning pattern matching based on the semantic relationship between the code and its corresponding description. Recently, many researchers have been working in this field to identify new ways to overcome the gaps in existing work [17]. To address the above-stated limitations of existing approaches, this research proposes a hybrid neural code search approach that leverages CodeT5 [18], a pretrained transformer-based model, and Bi-LSTM. (Transformer-based models revolutionized natural language processing; they rely on self-attention mechanisms to capture contextual relationships between words in a sequence without relying on recurrent or convolutional structures.) The proposed model is named the “CodeT5 and Bi-LSTM hybrid model” (C2B). The Bi-LSTM processes sequential data, capturing contextual information in both the forward and backward directions. This capability is essential for accurately parsing complex, context-rich queries. The C2B model employs CodeT5 [18], a specialized version of the T5 [19] model tailored for code, to comprehend the semantics of both natural language and code. CodeT5’s adaptability across various programming languages and frameworks significantly enhances the model’s generalization. By combining these strengths, the C2B model offers a comprehensive approach that addresses the challenges of existing code search systems, providing developers with enhanced productivity and reduced development time and ultimately leading to higher-quality software. This synergistic approach enhances the accuracy of code search, improves comprehension of complex queries, and facilitates code reuse, addressing key limitations of existing systems.
In this research, our main contributions can be outlined as follows:
  • A robust C2B model is proposed that efficiently integrates CodeT5 and Bi-LSTM, allowing seamless interaction between natural language and code representations.
  • Extensive assessments are conducted using the CodeSearchNet benchmark dataset to indicate that our C2B hybrid model performs better when compared with previous alternatives.
  • An attention mechanism is incorporated to enhance the model’s ability to focus on relevant parts of the code and queries, further improving search accuracy.
The rest of this paper is organized as follows: Section 2 presents a comprehensive analysis of existing studies, methodologies, and findings related to the research themes. Section 3 details the research methodology. Section 4 proposes a C2B hybrid model architecture. Section 5 presents empirical results. Section 6 outlines future research avenues and concludes this study.

2. Literature Review

This section provides an overview of traditional and deep learning techniques for code search. Traditional code search approaches are discussed in Section 2.1. In Section 2.2, we discuss neural code search (NCS) methods, exploring neural network-based approaches for code retrieval. Section 2.3 addresses pretraining on programming languages, emphasizing the relevance of adapting language models pretrained on natural language tasks to code-related applications.

2.1. Traditional Code Search Approaches

Traditional code search methods primarily view source code as either plain text or a sequence of tokens, disregarding its inherent structure and semantics. These models commonly involve manual feature extraction or NLP-based vectorization to identify clone pairs using text similarity algorithms. These approaches are prevalent in industry due to their simplicity, low complexity, and cross-platform applicability.
In their initial research, Mayrand et al. [20] identified distinctive sample features, including function names, expressions, and control flow, through manual extraction from the source code. The similarity between clone code pairs was assessed using 21 function-level metrics derived from these features. This method heavily depended on expert domain knowledge due to the manual generation of features. However, the model exhibited limited adaptability, particularly in handling the varying structural features across different programming languages.
A well-known Eclipse platform plug-in, SDD [21], utilizes token detection to find cloned code snippets in the source code. Through the implementation of inverted index and N-nearest neighbor algorithms, it efficiently minimizes detection time, enabling developers to swiftly identify clone code within the IDE environment. SDD excels in detecting code but falls short in recognizing clone pairs with significant syntactic dissimilarity that still serve the same function.
Deckard [22], a renowned model frequently employed for comparative experiments by subsequent researchers, analyzes the abstract syntax tree (AST) generated from the provided source code. The vectors in Deckard are manually generated using predefined rules during tree traversal. After processing the vectors with the locality-sensitive hashing (LSH) algorithm, the model determines code similarity based on Euclidean distance and provides the probability of clone pairs. Deckard is effective in detecting clone code pairs with statements in different orders, as the corresponding AST preserves the fundamental structure of the source code, and it is adaptable to various programming platforms with corresponding AST parsers. However, the model faces challenges in large-scale clone detection due to the intricacies of its clustering operations.

2.2. Deep Learning Approaches

There has been a notable increase in interest in implementing deep learning approaches in software engineering applications. Our research utilizes a deep learning algorithm for the task of code search [23,24]. Recent studies [25,26,27,28,29,30,31,32] explored the potential uses of deep learning techniques in source code with applications including code generation, code completion, and code clone detection. For instance, Mou et al. [33] introduced an RNN encoder–decoder model to generate code from natural language user intentions, demonstrating its effectiveness in a dataset of simple programming assignments. Gu et al. [34] applied deep learning for application programming interface (API) learning, generating API usage sequences based on natural language queries. Additionally, White et al. [35] used an RNN language model to predict software tokens in source code files for the task of code completion.
Our research focuses on exploring the application of deep learning specifically to code search. Prior research in NLP [36,37,38] created embeddings only for limited bilingual data. A particular challenge within the context of code search is the alignment of embeddings: the alignment should not focus solely on the word (i.e., token) level; instead, the embeddings should effectively aggregate and represent entire code snippets and queries. NLP research techniques are highly beneficial in tackling this issue, providing valuable insights into potential solutions and approaches. For instance, Allamanis et al. [39] presented a probabilistic model that can synthesize a program snippet based on a natural language query. Meanwhile, a Bayesian statistical approach, BAYOU [40], leverages deep neural networks (DNNs) to generate code programs from a combination of API calls and natural language input. The CODE-NN [41] approach uses long short-term memory (LSTM) networks to generate natural language descriptions from code snippets, enhancing software development by linking code and natural language. Our research assesses the performance of neural network designs in code search tasks, concentrating on diverse source code snippets and utilizing natural language queries as input.
Traditional approaches [42,43] relied heavily on information retrieval (IR) algorithms, including rule-based strategies [44,45,46], statistical language models [47], and statistical machine translation [48]. These methods often faced challenges in effectively capturing the complex and diverse nature of source code, especially in terms of structural and semantic representations. Deep learning approaches powered by neural networks offer the ability to automatically learn intricate patterns and representations from data, enabling more accurate and flexible modeling of source code. This shift is driven by the desire to enhance the representation and understanding of source code, addressing the shortcomings of the manual feature extraction and rule-based techniques prevalent in earlier approaches. For instance, Iyer et al. [41] used an RNN with attention to produce summaries directly from token embeddings. To generate function-name-like summaries, Allamanis et al. [25] presented a convolutional attention network for extracting features from token sequences. Additionally, structural information from abstract syntax trees (ASTs) [49] is taken into account. Hu et al. [50] serialized ASTs using a traversal technique called structure-based traversal (SBT) and then summarized the results using a conventional sequence-to-sequence (Seq2Seq) model. To identify pertinent paths during decoding, Alon et al. [51] expressed code as a set of AST paths and encoded it using an RNN. Zhang et al. [52] proposed a retrieval-based neural code summarization technique that uses the most comparable code snippets retrieved from the training set to improve the neural model.
Wang et al. [53] used graph neural networks (GNNs) to search similar code snippets. They proposed FA-AST (flow-augmented AST), a technique that leverages control and data flow graphs to enhance the ASTs. The authors employed GNNs on FA-AST to measure the similarity between the code pairs. Guo et al. [54] developed GraphCodeBERT to address the limitation of CodeBERT [55], which ignores the code’s inherent structure. They used the program’s data flow graph to encode semantic information about dependencies between variables. GraphCodeBERT is pretrained on three tasks: masked language modeling, predicting dependency edges in the code structure, and aligning representations between source code and code structure. The model improves CodeBERT results and achieves state-of-the-art performance on four downstream tasks.
Wang et al. [18] developed CodeT5, an encoder–decoder-based transformer model that uses user-defined identifiers to capture semantic properties in code. They extended a Seq2Seq-based model for code understanding and generation applications and proposed a novel identifier-aware pretraining objective. CodeT5 outperformed prior architectures on fourteen code-related subtasks. Liu et al. [56] proposed RoBERTa, a transformer-based language model designed to understand context and relationships in natural language and code. It utilizes a bidirectional self-attention [57] mechanism and large-scale pretraining on diverse data. RoBERTa has demonstrated strong performance in various NLP and source code-related tasks.
Wan et al. used a tree-based RNN model to represent the sequential content of code and integrate context vectors for summaries. Context vectors are high-dimensional numerical representations that capture the essential information and context of the input data, often used in machine learning models to summarize or encode features. They also used reinforcement learning to address exposure bias, the mismatch that arises when a model is trained on ground-truth tokens but must condition on its own predictions at inference time. LeClair et al. [58] and Hu et al. [59] shared a similar objective, combining the contexts of the symbolic bag-of-words tree and the token sequence of the code to produce summaries. Sachdev et al. [16] and Zhou et al. [60] used neural bag-of-words (NBoW), a classic approach in source code retrieval that represents code snippets as fixed-size vectors by summing the embeddings of individual tokens. It is known for its simplicity and efficiency but struggles to capture semantic relationships and context in code due to its bag-of-words nature. Hu et al. [61] utilized an API sequence as an additional input in their model, while Zhou et al. [62] suggested using a convolutional neural network to extract vector representations of AST nodes and learn an adaptable weight vector across various code representations.
Following a methodology similar to that presented in [52], Wei et al. [63] adopted a similar strategy by incorporating retrieved samples to enhance their task. However, they went beyond utilizing the retrieved code snippet alone, also exploiting its abstract syntax tree (AST) and the associated comment, termed as the exemplar. Concurrently, there has been a growing interest in the utilization of graph neural networks (GNNs). Fernandes et al. [64] used lexical information and AST for graph representation of code and then used GNN to encode this graph. To introduce sequential information into the GNN, they initialized a graph node with the corresponding output of the RNN. Expanding on this notion, Liu et al. [65] amalgamated diverse representations of source code, encompassing AST, control flow graph (CFG), and program dependency graph (PDG), into a unified code property graph, leveraging this joint representation for generating summaries. Similarly, building upon their earlier research [58], LeClair et al. [66] employed a GNN for the AST as a distinct input alongside the token sequence.
Chen et al.’s [67] bimodal variational autoencoder enhances code retrieval and summarization by mapping natural language and code into a common semantic space, thereby driving collective advancements in code-related tasks. Yao et al. [68] took a unique approach to code retrieval by exploring code annotation. They trained a code annotation model to generate code summaries, enabling a code retrieval model to better discern relevant codes and enhance the overall retrieval process. In contrast, Wei et al. [69] developed a dual learning framework that trains a code summarization and a code generation model simultaneously, enhancing performance in both tasks. Similarly, Ye et al. [70] employed dual learning to leverage code generation, with a specific focus on simultaneous improvement in both code summarization and retrieval. This approach leads to superior outcomes in both interconnected tasks.
Wang et al.’s [71] study on code retrieval utilized a hierarchical approach, using reinforcement learning and vanilla hierarchical attention networks (HANs) with multiple inputs. However, their research did not consider input dynamics, and their implementation of HAN did not improve on the conventional Seq2Seq model. The authors also did not describe a hierarchical input split for the abstract syntax tree (AST), which is a significant difference from our data processing methods. Additionally, ensemble learning approaches were not explored in Wang’s study.
This section has explored the evolution of code search methodologies, focusing on the transition from traditional to data-driven approaches, and discussed deep learning approaches that incorporate both source code and natural language aspects. Aligned code and natural language data are investigated as a means of enhancing software development.

2.3. Pretraining on Programming Language

Pretraining in the realm of programming languages is an emerging field, where recent efforts aim to adapt NLP pretraining techniques to source code. Two cutting-edge deep learning models, CuBERT [72] and CodeBERT [55], are commonly used in this area. CuBERT utilizes BERT’s potent masked language modeling objective to derive generic code-specific representations, while CodeBERT goes a step further by incorporating a replaced token detection task [73] to learn cross-modal representations between natural language and programming languages. Apart from the BERT-style models, Svyatkovskiy et al. (2020) [74] and Liu et al. (2020) [75] employed a generative pretrained transformer (GPT) and a unified pretrained language model (UniLM) [76], respectively, for the code completion task. Additionally, TransCoder [77] explores unsupervised programming language translation. In contrast, our approach involves encoder–decoder models based on T5 [19] for programming language pretraining, encompassing a more comprehensive set of tasks.
The recent literature includes some emerging studies [78,79,80] that examine the T5 framework applied to code. However, these works have limited scope, concentrating solely on specific generation tasks without accommodating understanding tasks as our approach does. Another encoder–decoder model, PLBART [81], built on BART, also supports both understanding and generation tasks. Nevertheless, prior to our proposal, all these works treated code in the same manner as natural language, largely overlooking the distinctive characteristics of code. In contrast, our approach focuses on leveraging identifier information in code for pretraining.
In recent developments, GraphCodeBERT [54] integrated data flow extracted from the code structure into CodeBERT, and Rozière et al. (2021) [82] introduced a deobfuscation objective to utilize the structural aspect of programming languages. These models are primarily concerned with enhancing the code-specific encoder during training. Conversely, Zügner et al. (2021) [83] proposed a method to capture relative distances between code tokens across the code structure. However, our specific focus lies in the identifiers that carry rich code semantics. We devised the C2B hybrid model, which uses identifier tagging and prediction tasks to incorporate this valuable information into a Seq2Seq model.

3. Research Methodology

This section outlines the research methodology employed in this study, which centers on developing a comprehensive framework, the C2B hybrid model, for source code retrieval. In light of the growing complexity of software systems and the extensive repositories of available code, the efficient and precise retrieval of pertinent code snippets has become increasingly vital for software developers. To address this challenge, we harness advances in deep learning and natural language processing. Our methodology fuses two powerful models: CodeT5, known for its strong code comprehension capabilities, and the bidirectional long short-term memory (Bi-LSTM) model, known for its strong sequential modeling capabilities. This integration aims to elevate the performance of source code retrieval by effectively capturing both the semantic relationships and the sequential dependencies inherent in natural language queries and code snippets. The methodology encompasses several key steps, illustrated in Figure 1, which depicts the workflow of the proposed C2B hybrid model:
  • Data collection: Use the CodeSearchNet dataset for training and evaluating the C2B hybrid model. The dataset consists of pairs of natural language comments and corresponding code snippets, along with labels indicating the relationship between the query and code (a minimal loading sketch appears after this list).
  • Data preprocessing: Perform preprocessing of the collected dataset, including tokenization and encoding of the queries/comments and code snippets, which are then fed into the C2B hybrid model. The queries/comments and code snippets are tokenized, breaking them down into individual words or tokens. These tokens are then encoded into numerical representations to be fed into the neural network models. A vocabulary is created by collecting all unique tokens from the dataset, enabling the mapping of tokens to their numerical representations during encoding.
  • C2B hybrid model architecture design: Develop a hybrid architecture that combines the strengths of CodeT5 and Bi-LSTM for source code retrieval. The architecture of the C2B hybrid model determines and exchanges information necessary for source code retrieval. The proposed architecture defines the specific layers, connections, and mechanisms for integrating the two models.
  • C2B hybrid model implementation: Implement the proposed C2B hybrid model architecture using a deep learning framework such as TensorFlow or PyTorch. Construct the necessary layers, connections, and mechanisms as defined in the model design phase. Incorporate appropriate attention mechanisms or other enhancements to improve the model’s performance.
  • Fine-tuning and training: Perform the fine-tuning process for both CodeT5 and Bi-LSTM components of the hybrid model. Initialize the CodeT5 component with pretrained weights from the pretraining phase, and train both components using the collected dataset. Determine the appropriate loss functions and optimization algorithms for training the hybrid model.
  • Evaluation metrics: Select suitable evaluation metrics for assessing the performance of the hybrid model. Common metrics for source code retrieval include precision, recall, mean average precision (MAP), and normalized discounted cumulative gain (NDCG). Determine the evaluation protocol, such as train–test splits or cross-validation, to measure the model’s effectiveness.
  • Experimental evaluation: Conduct experiments to evaluate the hybrid CodeT5 and Bi-LSTM model on the collected dataset. Compare its performance against baseline models or existing state-of-the-art approaches. Perform statistical analysis to assess the significance of the results and identify areas of improvement.
  • Result analysis and interpretation: Analyze the experimental results to gain insights into the strengths and weaknesses of the hybrid model. Identify patterns or trends in the model’s performance and understand the factors that contribute to its success or failure. Interpret the results to draw meaningful conclusions about the effectiveness of the hybrid approach for source code retrieval.
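As a concrete illustration of the data collection step, the following is a minimal loading sketch, assuming the Hugging Face datasets copy of CodeSearchNet and its func_documentation_string/func_code_string fields (the loader name and field names may differ across dataset versions):

from datasets import load_dataset

# Load the Python portion of CodeSearchNet; other language configurations
# can be loaded the same way.
dataset = load_dataset("code_search_net", "python", trust_remote_code=True)

train = dataset["train"]
example = train[0]
query = example["func_documentation_string"]  # natural language comment
code = example["func_code_string"]            # paired code snippet
print(query[:80], "->", code[:80])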

4. Proposed Architecture for C2B Hybrid Model

In this section, we present the architecture of our proposed C2B hybrid model for source code retrieval. The proposed C2B hybrid model leverages the strengths of both the CodeT5 (like language understanding, capturing semantics, programming languages adaptability) and Bi-LSTM (like context encoding, capturing sequential data dependency) models to enhance the code retrieval process. By integrating the strengths of these two (CodeT5 and Bi-LSTM) models, the hybrid approach takes advantage of their complementary features to improve the accuracy and effectiveness of source code retrieval, enabling developers to find relevant code snippets for their specific queries. We present and discuss the main building blocks of the proposed C2B hybrid model in Section 4.1 and we discuss our training procedure in Section 4.2.

4.1. The Main Building Blocks of the Proposed C2B Hybrid Model

As shown in Figure 2, the proposed architecture of the C2B hybrid model consists of three main blocks: data processing, word embedding, and the C2B layer. We discuss each of these blocks in Section 4.1.1, Section 4.1.2, and Section 4.1.3, respectively.

4.1.1. Data Preprocessing

This section describes the preprocessing steps applied to all the programs in the CodeSearchNet dataset. During inference, the preprocessing steps are applied to every query code to convert it into a standard form to be input to the architecture. Preprocessing is required to clean and normalize the source codes to improve the model performance. We employ the following steps to preprocess the codes.
  • Removal of unnecessary code: The irrelevant parts of the source code, such as unnecessary whitespace and redundant comments, are discarded. However, essential comments that aid in interpreting complex code are retained. This process leaves us with the crucial parts of the code, including function names, parameters, user-defined identifiers, variable names, values, and comments.
  • Tokenization: Tokenization is the process of splitting code fragments into smaller units called tokens. It is one of the crucial steps in improving model performance by handling out-of-vocabulary (OOV) words. Similar to BERT and GPT, we employ CodeT5’s pretrained byte-pair-encoding (BPE) tokenizer [84], which is trained as proposed by Radford et al. [85]. The BPE tokenizer reduces the sequence length and has been determined to work better on understanding and generation tasks [18].
  • Code-specific features: As proposed in CodeT5, we extract code-specific features by leveraging the token-type (identifier) information in the source code. Identifiers are code tokens that are common to many programming languages and capture rich code semantics. The CodeT5 tokenizer uses the identifier information and converts the code into features to be utilized during training.
  • Encoding programs: The original CodeT5 architecture is pretrained on natural language (NL) and programming language (PL) description pairs (NL-PL) or PL-PL pairs. It concatenates the tokens of the bimodal inputs by putting a [SEP] token between them; the [SEP] token acts as a separator between the PL-PL or NL-PL pairs. Since we modify the architecture to take a single code as input, we use [PAD] tokens after the code tokens to pad the sequence to its maximum length and ignore the [SEP] token. Our input sequence can be represented as ([CLS], c_1, c_2, c_3, …, c_n, [PAD], [PAD], …, [PAD]), where [CLS] represents the classification token and c_i and n represent the individual code tokens and the number of tokens in the program, respectively.
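To make these preprocessing steps concrete, the following is a minimal sketch, assuming the Hugging Face transformers library and the public Salesforce/codet5-base checkpoint (CodeT5 ships a pretrained BPE tokenizer compatible with RobertaTokenizer); the maximum length of 64 is an illustrative assumption:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")

code = "def add(a, b):\n    return a + b"
max_length = 64  # hypothetical maximum sequence length

# Tokenize a single code snippet, truncate/pad to max_length, and let the
# tokenizer prepend its classification token and append padding tokens.
encoded = tokenizer(
    code,
    padding="max_length",
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)    # torch.Size([1, 64])
print(encoded["attention_mask"][0])  # 1 for real tokens, 0 for padding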

4.1.2. Word Embeddings

There are two main parts of this block: CodeT5 word embeddings and Bi-LSTM word embeddings.
  • CodeT5 word embeddings: Our proposed C2B hybrid model is illustrated in Figure 2. Instead of taking two codes as input to the model, we generate embeddings for each code and each query (or code comment) individually. For the code search task, we compute raw embeddings for the code and query separately and then experiment with two different training strategies to learn these embeddings. Once the embeddings are trained, we compute and store them for every code in the dataset. Our idea is to use the code and query embeddings for comparison during code retrieval rather than using the whole architecture. This approach is based on the assumption that the embeddings capture the syntactic and semantic information present in the program and query. We conduct experiments to assess the efficacy of these embeddings. If the assumption holds true, then the code retrieval task reduces to comparing these embeddings, which is faster and more efficient than running the full architecture for pairwise comparison.
    CodeT5, an extension of the T5 (text-to-text transfer transformer) model, can generate query and code embeddings for source code retrieval tasks. Here is a detailed explanation of how CodeT5 generates query and code embeddings for our C2B hybrid model, along with the formulas involved:
    Input encoding: The first step is to tokenize the query and code into subword units using techniques like WordPieces or byte-pair encodings (BPEs). Let us denote the tokenized query as Q = (Q_1, Q_2, …, Q_n) and the tokenized code as C = (C_1, C_2, …, C_m).
    Embedding layer: The tokenized query and code are passed through an embedding layer, which maps each token to a continuous vector representation. The embeddings for the query tokens and code tokens are denoted as E_Q = (E_{Q_1}, E_{Q_2}, …, E_{Q_n}) and E_C = (E_{C_1}, E_{C_2}, …, E_{C_m}), respectively. These embeddings capture the semantic meaning of the tokens.
    Positional encoding: To incorporate positional information, positional encodings are added to the query and code embeddings. Positional encodings help the model understand the relative and absolute positions of tokens within the input sequences. The positional encodings for the query and code are denoted as P_Q = (P_{Q_1}, P_{Q_2}, …, P_{Q_n}) and P_C = (P_{C_1}, P_{C_2}, …, P_{C_m}). Each positional encoding is a vector that represents the position of the token within the sequence. The positional encoding vectors can be computed using sine and cosine functions, as shown in Equations (1)–(4).
    P_{Q_i,k} = sin(Q_i / 10000^(2k/d_model)), for even k        (1)
    P_{Q_i,k} = cos(Q_i / 10000^(2k/d_model)), for odd k        (2)
    P_{C_i,k} = sin(C_i / 10000^(2k/d_model)), for even k        (3)
    P_{C_i,k} = cos(C_i / 10000^(2k/d_model)), for odd k        (4)
    Here, Q_i and C_i represent the positions of the tokens within the query and code sequences, respectively; d_model refers to the dimensionality of the embedding space, and k is the dimension index of the positional encoding vector.
    Query embedding: The query embedding is obtained by summing the token embeddings and positional encodings for the query tokens. This can be represented using Equation (5).
    QueryEmbedding = sum(E_Q + P_Q)        (5)
    The sum operation combines the token embeddings and positional encodings for each token in the query, resulting in an overall representation of the query.
    Code embedding: Similar to the query embedding, the code embedding is obtained by summing the token embeddings and positional encodings for the code tokens. This can be represented using Equation (6).
    CodeEmbedding = sum(E_C + P_C)        (6)
    The sum operation aggregates the token embeddings and positional encodings for each token in the code, generating an overall representation of the code.
    In CodeT5, the transformer-based architecture plays a crucial role in capturing the contextual relationships and dependencies between the tokens. The model consists of multiple layers of self-attention and feedforward neural networks. After generating the query and code embeddings, CodeT5 can leverage these embeddings for various retrieval tasks, such as similarity matching or ranking of relevant code snippets based on the query. The embeddings capture both the semantic meaning of the tokens through the token embeddings and the positional information through the positional encodings. This enables the model to effectively understand and retrieve relevant source code in the context of the given query.
  • Bi-LSTM word embeddings: The generated word embeddings, as shown in Figure 2, are fed into the Bi-LSTM layer to capture the contextual information of the query and code as follows:
    Query embedding: Let us denote the query token at position i as Q_i. The embedding for the query token can be obtained using an embedding layer, which maps the token to a continuous vector representation. We can represent this using Equation (7):
    Q_i^{embedding} = Embedding(Q_i)        (7)
    The embedded query tokens are fed into the Bi-LSTM layer. The Bi-LSTM processes the tokens in both the forward and backward directions, capturing the contextual information. The forward hidden state at position i is denoted as H_i^{forward}, and the backward hidden state is denoted as H_i^{backward}, as shown in Equations (8a) and (8b):
    H_i^{forward} = LSTM(Q_i^{embedding}, H_{i-1}^{forward})        (8a)
    H_i^{backward} = LSTM(Q_i^{embedding}, H_{i+1}^{backward})        (8b)
    Code token embedding: Similar to the query token embedding, the code token at position i, C_i, can be embedded using the same embedding layer. The embedding for the code token is represented using Equation (9):
    C_i^{embedding} = Embedding(C_i)        (9)
    The embedded code tokens are fed into the Bi-LSTM layer, which processes them in both the forward and backward directions, generating forward hidden states, H_i^{forward}, and backward hidden states, H_i^{backward}, as shown in Equations (10a) and (10b):
    H_i^{forward} = LSTM(C_i^{embedding}, H_{i-1}^{forward})        (10a)
    H_i^{backward} = LSTM(C_i^{embedding}, H_{i+1}^{backward})        (10b)
    In Equations (10a) and (10b), Embedding(·) represents the embedding layer that maps tokens to continuous vector representations, and LSTM(·) denotes the long short-term memory layer, which processes the inputs (query tokens or code tokens) and the hidden states from previous time steps to generate the current hidden states.
    The Bi-LSTM processes the tokens sequentially, incorporating both the previous and future contextual information. The forward hidden state, H_i^{forward}, is updated using the token embedding and the previous forward hidden state. Similarly, the backward hidden state, H_i^{backward}, is updated using the token embedding and the subsequent backward hidden state.
    By iteratively applying these formulas to all the tokens in the query or code, one can obtain the query and code embeddings using the Bi-LSTM. These embeddings capture the contextual information and semantic meaning of the inputs, enabling effective source code retrieval based on query matching and similarity; a minimal implementation sketch follows below.
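The following is a minimal PyTorch sketch of the embedding pipeline described above: sinusoidal positional encodings (Equations (1)–(4)) are added to token embeddings (Equations (5) and (6)), and a Bi-LSTM then produces the forward and backward hidden states (Equations (8a)–(10b)). All dimensions are illustrative assumptions, not the model’s actual configuration:

import math
import torch
import torch.nn as nn

d_model, vocab_size, hidden = 768, 32000, 384  # assumed sizes

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    # P[i, 2k] = sin(i / 10000^(2k/d_model)); P[i, 2k+1] = cos(...)
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dimension indices
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimension indices
    return pe

embedding = nn.Embedding(vocab_size, d_model)
# bidirectional=True yields H_forward and H_backward concatenated per token
bilstm = nn.LSTM(d_model, hidden, batch_first=True, bidirectional=True)

tokens = torch.randint(0, vocab_size, (1, 32))             # one 32-token input
x = embedding(tokens) + sinusoidal_positions(32, d_model)  # E + P
context, _ = bilstm(x)                                     # (1, 32, 2 * hidden)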

4.1.3. C2B Layer

This layer performs the following operations:
  • Concatenation: To combine the query and code embedding vectors (Vec) from CodeT5 and Bi-LSTM, we concatenate the vectors along a specific axis. Let us assume the query embedding vectors obtained from CodeT5 and Bi-LSTM are represented as QEmb_CodeT5 and QEmb_BiLSTM, respectively. Similarly, the code embedding vectors from CodeT5 and Bi-LSTM are represented as CEmb_CodeT5 and CEmb_BiLSTM, respectively. The concatenation (Concat) operation can be represented using Equations (11a) and (11b).
    Concat(Q_Vec) = Concat[QEmb_CodeT5, QEmb_BiLSTM]        (11a)
    Concat(C_Vec) = Concat[CEmb_CodeT5, CEmb_BiLSTM]        (11b)
    Here, we concatenate the query embedding vectors from CodeT5 and Bi-LSTM to create a new joint representation, Concat(Q_Vec). Similarly, we concatenate the code embedding vectors from CodeT5 and Bi-LSTM to create a new joint representation, Concat(C_Vec). The resulting shapes of Concat(Q_Vec) and Concat(C_Vec) are (d_q + d_b) and (d_c + d_l), respectively, where d_q is the dimensionality of the query embeddings from CodeT5, d_b is the dimensionality of the query embeddings from Bi-LSTM, d_c is the dimensionality of the code embeddings from CodeT5, and d_l is the dimensionality of the code embeddings from Bi-LSTM. These concatenated representations can then be used for downstream tasks that require joint representations of the query and code information from both CodeT5 and Bi-LSTM.
  • Attention: The first step is to generate the query and code embeddings, typically by mapping each token in the query and code to a continuous vector representation using techniques like word embeddings or pretrained language models. The attention mechanism then calculates the relevance or similarity between the query and code tokens; a common approach is to use the dot product or cosine similarity between the embeddings of each query token and code token. We use cosine similarity attention, which computes the attention weights from the cosine similarity between the query and code embedding vectors, i.e., the cosine of the angle between the two vectors, capturing their directional similarity. The attention weights represent the importance or similarity of each code token with respect to the query and can be normalized using softmax to ensure a valid probability distribution over the code tokens. The weights are then used to aggregate the code embeddings, as described in the Aggregation step below (a code sketch follows this list). Let us consider a query token embedding Q_i and a code token embedding C_j. The cosine similarity attention is calculated from the cosine similarity between the query and code embeddings.
    Cosine similarity calculation: Cosine similarity calculates the cosine of the angle between two vectors, representing their directional similarity, as shown in Equation (12):
    CosineSimilarity(Q_i, C_j) = (Q_i · C_j) / (‖Q_i‖ ‖C_j‖)        (12)
    In Equation (12), Q_i · C_j denotes the dot product between the query embedding Q_i and the code embedding C_j, ‖Q_i‖ represents the Euclidean norm (magnitude) of the query embedding Q_i, and ‖C_j‖ represents the Euclidean norm (magnitude) of the code embedding C_j.
    Cosine similarity attention: After calculating the cosine similarity between the query and code embeddings, the attention weight for each code token with respect to the query token can be obtained using softmax normalization. The attention weight, A_{ij}, for the code token at position j given the query token at position i can be calculated using Equation (13):
    A_{ij} = softmax_j(CosineSimilarity(Q_i, C_j))        (13)
    The softmax function ensures that the attention weights sum to 1, creating a valid probability distribution over the code tokens for each query token.
    The cosine similarity attention formula measures the similarity between query and code embeddings based on the cosine similarity metric. It allows the model to assign higher attention weights to code tokens that are more similar to the query, enabling focused attention on relevant parts of the code during source code retrieval tasks.
  • Aggregation: After calculating the attention weights, they are used to weight the code embeddings. Each code embedding is multiplied by its corresponding attention weight. The resulting weighted code embeddings are then summed or averaged to obtain an aggregated code representation. The aggregation step combines the relevant parts of the code based on the attention weights, focusing the model’s attention on the code tokens that are most similar to the query.
    Given the attention weights A_{ij} calculated for each code token C_j with respect to the query token Q_i, we can proceed with the aggregation process. Here is a step-by-step explanation:
    Weighted code embeddings: Multiply each code embedding C_j by its corresponding attention weight A_{ij} to obtain the weighted code embeddings, as represented in Equation (14):
    WeightedEmbedding_j = A_{ij} · C_j        (14)
    Each weighted embedding captures the importance or relevance of the corresponding code token in relation to the query token.
    Aggregated code representation: Sum or average the weighted code embeddings to obtain an aggregated code representation. This step combines the information from different code tokens into a single representation, focusing on the relevant parts determined by the attention weights. The sum aggregation is computed using Equation (15a), and average aggregation is computed using Equation (15b).
    AggCode = Sum_j(WeightedEmbedding_j)        (15a)
    AggCode = Average_j(WeightedEmbedding_j)        (15b)
    The sum aggregation considers the total contribution of the weighted embeddings, while the average aggregation provides a normalized representation. The resulting AggCode representation captures the combined information of the code tokens that are most relevant to the query. It represents a condensed version of the code snippet, emphasizing the important aspects highlighted by the attention mechanism. By performing this aggregation step, the model effectively focuses its attention on the code tokens that have higher attention weights, indicating their relevance to the query. This aggregation enables the model to generate a comprehensive and contextually relevant code representation for source code retrieval tasks.
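The following is a minimal PyTorch sketch of the cosine similarity attention and aggregation described above (Equations (12)–(15b)), assuming query and code token embedding matrices of shapes (n, d) and (m, d):

import torch
import torch.nn.functional as F

def cosine_attention_aggregate(q_emb: torch.Tensor,
                               c_emb: torch.Tensor) -> torch.Tensor:
    # Cosine similarity between every query token and every code token
    # (Equation (12)): normalize each row, then take dot products.
    q_norm = F.normalize(q_emb, dim=-1)  # (n, d)
    c_norm = F.normalize(c_emb, dim=-1)  # (m, d)
    sim = q_norm @ c_norm.T              # (n, m)
    # Softmax over code tokens gives the attention weights A_ij (Equation (13)).
    attn = F.softmax(sim, dim=-1)        # each row sums to 1
    # Weight the code embeddings and sum over j (Equations (14) and (15a)).
    weighted_sum = attn @ c_emb          # (n, d)
    # Average over query tokens to obtain a single aggregated code vector.
    return weighted_sum.mean(dim=0)      # (d,)

q = torch.randn(10, 768)  # 10 query token embeddings (illustrative)
c = torch.randn(50, 768)  # 50 code token embeddings
agg_code = cosine_attention_aggregate(q, c)  # shape (768,)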
In our C2B hybrid model, attention is calculated between the query and the code before the cosine embedding loss (CEL) and binary cross-entropy (BCE) loss. The query and code embeddings are generated using token embeddings and positional encodings, as discussed previously, and are then passed through the attention mechanism to calculate the attention weights. The attention mechanism allows the model to focus on the most relevant parts of the code given the query: the attention weights are used to weight the code embeddings, aggregating the relevant information from the code based on the query’s attention distribution. The aggregated code representation is then used for prediction, and the CEL and BCE losses are calculated by comparing the predicted output with the ground truth labels. The details of how the cosine embedding loss (CEL) and binary cross-entropy (BCE) loss are computed are explained in Section 4.2.1 and Section 4.2.2.
After the aggregation step, the aggregated code representation can be used for prediction; for example, it can be fed into a classification or ranking model to predict the relevance or ranking of the code snippet. The loss is then calculated by comparing the predicted output (relevance score or ranking) with the ground truth labels. The CEL and BCE losses measure the dissimilarity between the predicted and ground truth outputs and are used to update the model’s parameters during training.

4.2. Training Procedure

In this section, we describe two different training strategies that we use for learning the parameters of the modified CodeT5 architecture as shown in Figure 3. One training objective is to minimize the cosine distance between the embeddings of code clones and maximize the distance between the embeddings of nonclones, referred to as the cosine embedding loss. Another training objective is to compute the binary cross-entropy loss in order to perform clone detection and automatically learn code embeddings during this process. These two strategies are described in the following subsections.
Note that we fine-tune the model instead of training it from scratch. Fine-tuning is a training strategy wherein we take a model already pretrained on a certain task and apply it to our task. The idea is that the pretrained model already understands the patterns in the data. Fine-tuning a pretrained model allows for generating better results while saving a significant amount of training time and cost.

4.2.1. Cosine Embedding Loss (CEL)

In this training strategy, we aim to bring the embeddings of the true clone pairs closer to the common embedding space and simultaneously push the embeddings of nonclone pairs farther from each other. We use a cosine similarity metric to determine the level of similarity between each code pair. Using the cosine similarity, we define a cosine embedding loss and plan to minimize the loss by updating the model parameters. In this way, the model learns to project the code into an embedding vector rich in code syntax and semantics.
  • Cosine similarity (CSim): It represents the level of likeness between two vectors. In our case, we project two codes into embedding vectors using the architecture to compute the cosine similarity. If the two vectors point in the same direction, they have a high level of similarity, and vice versa. CSim is essentially the dot product of V_1 and V_2 normalized by the product of their magnitudes. CSim outputs a floating-point value in the range [-1, 1], where lower values indicate greater dissimilarity and a value of 1 indicates the highest level of similarity between the vectors. Given two vectors V_1 and V_2, the cosine similarity is defined by Equation (16):
    CosineSimilarity(V_1, V_2) = cos(θ) = (V_1 · V_2) / (‖V_1‖ ‖V_2‖)        (16)
  • Cosine embedding (CE) loss: The CE loss is derived from the cosine similarity. This loss is used to measure whether the two vectors are similar or dissimilar. If the two vectors of true (false, resp.) clone pairs are predicted dissimilar (similar, resp.), then the loss penalizes the model parameters. In other words, if the prediction and the actual labels do not match, the loss function takes a high value. The CE loss for each sample is defined in Equation (17).
    l_CE(z_1, z_2, TL) = 1 - cos(z_1, z_2),              if TL = 1        (17)
    l_CE(z_1, z_2, TL) = max(0, cos(z_1, z_2) - γ),      otherwise
    where (z_1, z_2) are the two embedding vectors, TL is the true label, and γ is the margin, a tunable hyperparameter. We use the default value of the margin (γ = 0). The CE loss encourages the cosine angle between the embedding vectors to be small if the two vectors are similar [86].
  • Fine-tuning C2B hybrid model parameters on CE loss: Figure 4 depicts how the architecture is fine-tuned by minimizing the CE loss. We instantiate a single pretrained CodeT5 architecture and input the code-1 and code-2 tokens to extract their embeddings individually; thus, there is parameter sharing (a single CodeT5 instance), as illustrated in the figure. Once the code embeddings are generated, the CE loss is computed and the model parameters are updated using gradient descent (a training-step sketch appears at the end of this subsection). Gradient descent [87] is an optimization algorithm that adjusts the model parameters in order to minimize the loss function.
  • Inference on clone detection: During clone detection inference, we input the code and query tokens individually to the fine-tuned architecture to extract their embeddings. The cosine similarity between the embeddings is computed, which determines the level of similarity between the code and query. Finally, we compare the similarity score against a certain threshold: if the similarity lies above the threshold, the code is considered a true clone, and vice versa.
  • Inference on C2B hybrid model code retrieval: Once the model is fine-tuned, we precompute and store the embeddings corresponding to all the codes within the dataset. During code retrieval inference, the input is a single query. We use the architecture only once to extract the query embeddings and then compute the cosine similarity between the query embeddings and all the precomputed embeddings. The codes in the dataset are ranked according to their similarity with the query, and finally, we choose and retrieve the top-k codes as the clone pairs.
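The following is a minimal sketch of one CE-loss training step, assuming a shared encoder module (a hypothetical wrapper around the fine-tuned model) that maps token IDs to a single embedding vector; PyTorch’s CosineEmbeddingLoss implements the objective of Equation (17):

import torch
import torch.nn as nn

loss_fn = nn.CosineEmbeddingLoss(margin=0.0)  # default margin, gamma = 0

def training_step(encoder, optimizer, code1_ids, code2_ids, labels):
    # labels: +1 for true clone pairs, -1 for nonclone pairs
    z1 = encoder(code1_ids)  # parameter sharing: a single encoder instance
    z2 = encoder(code2_ids)
    loss = loss_fn(z1, z2, labels)  # 1 - cos(z1, z2) for positive pairs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # gradient descent update of the model parameters
    return loss.item()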

4.2.2. Binary Cross-Entropy Loss

In this strategy, we train the architecture to perform clone detection and automatically learn the parameters to generate the code embeddings during the process, as shown in Figure 5. This strategy differs from the previous one conceptually and also in the way the loss is computed on the embedding vectors. Let us first define and state the characteristics of the loss function.
  • Binary cross-entropy (BCE) loss: It is a common loss function used in binary classification problems. In our case, given a code pair, we aim to predict whether they are true clone pairs or not. The BCE loss penalizes the model by returning a high value for every wrong prediction. It can be represented by Equation (18).
    l_BCE(y, ŷ) = -(1/N) Σ_{i=1}^{N} [ y_i · log(ŷ_i) + (1 - y_i) · log(1 - ŷ_i) ]        (18)
    where N is the total number of samples, y is the true label (“1” if a true clone pair, otherwise “0”), and ŷ is the predicted label. We train the model to minimize the loss and automatically learn the parameters to produce code embeddings.
  • Fine-tuning C2B model parameters on BCE loss: Figure 5 illustrates how the architecture is fine-tuned by minimizing the BCE loss. The whole process is similar to fine-tuning the architecture on the CE loss up to the point where the embeddings are extracted individually. After that, the code embeddings are concatenated horizontally and given to a feedforward neural network (classifier) for classification. The output of the classifier is a probability indicating the level of similarity between the code pair.
  • Bi-LSTM network: A Bi-LSTM is used in our model architecture. In the original CodeT5 framework, a two-layered feedforward neural network was employed, which accepted a single 768-dimensional embedding as input. In our adapted architecture, we use a five-layer structure with nonlinear functions following each layer, similar to the original CodeT5 model. The key difference lies in the input dimensions: we concatenate the code and query embeddings, each comprising a 768-dimensional vector, resulting in a 1536-dimensional input vector for our Bi-LSTM model. Our five-layer classifier architecture, featuring the Bi-LSTM, is presented in Table 1. The Bi-LSTM operates differently from the original feedforward neural network, as it processes sequential data by considering the sequential dependencies within the concatenated code embeddings. The sigmoid function continues to play a role in transforming raw scores (logits) into probabilities within the range [0, 1]. Notably, this classifier takes the concatenated code embeddings as input and computes a weighted average over these embeddings, followed by the application of nonlinear activation functions (Tanh or Sigmoid). Finally, the softmax pooling layer takes the probability values from the sigmoid layer and normalizes them across the entire dataset to ensure that the final output is a probability distribution, which helps in ranking and retrieving the most relevant code snippets. In total, the classifier consists of 1,181,954 trainable parameters. It is important to emphasize that the entire architecture, including the classifier network, is trained in an end-to-end fashion to optimize performance. The detailed network diagram is shown in Figure 6.
    Inference on code search: We use the CodeT5 architecture to generate embeddings for the code and query individually. Our Bi-LSTM network then predicts a probability indicating the level of similarity between the code and the query. If the probability is greater than a threshold (p ≥ 0.5), we classify the code as a true match for the given query.
    Inference on code retrieval: We precompute and store the embeddings, using the fine-tuned architecture, for the codes present in the entire dataset. During inference, we fetch the query embeddings using the architecture and use cosine similarity as the comparison metric. The codes are ranked according to their similarity with the query embeddings, and the top-k codes are retrieved for the given query.
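The following is a minimal sketch of the retrieval inference described above: the corpus embeddings are precomputed, the query is embedded once, and the codes are ranked by cosine similarity to return the top-k results. The encode() wrapper around the fine-tuned C2B encoder is an assumption for illustration:

import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_top_k(encode, query_ids, code_matrix, k=10):
    # code_matrix: precomputed (N, d) embeddings for all codes in the corpus
    q = F.normalize(encode(query_ids), dim=-1)  # query embedding, shape (d,)
    c = F.normalize(code_matrix, dim=-1)        # (N, d)
    scores = c @ q                              # cosine similarity per code
    topk = torch.topk(scores, k)                # rank and keep the top-k
    return topk.indices.tolist(), topk.values.tolist()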

5. Experimental Results

This section describes the experiments conducted to assess the performance of the modified C2B hybrid model for clone detection and retrieval. The details of the dataset used in our experiments are presented in Section 5.1, while the evaluation metrics are presented in Section 5.2. The experimental setup is presented in Section 5.3, while the results of the experiments are analyzed and discussed in Section 5.4. Finally, in Section 5.7, we discuss potential issues with the validity of the results and some ways to mitigate them.

5.1. CodeSearchNet Dataset

The CodeSearchNet dataset is a comprehensive and large-scale dataset specifically created to support research and development in the field of source code retrieval. It addresses the challenge of matching natural language queries with relevant code snippets, enabling advancements in natural language understanding applied to code search. The dataset is designed to facilitate the training and evaluation of models that can effectively bridge the gap between textual queries and code solutions. With millions of natural language queries and their corresponding code snippets, CodeSearchNet provides a diverse and extensive collection of data. The dataset covers several programming languages, including Python, Java, JavaScript, Go, PHP, and Ruby, allowing researchers to explore code search tasks across multiple programming paradigms. This diversity in languages and query types makes the dataset highly versatile and suitable for various research scenarios.
The data for CodeSearchNet are mined from publicly available open-source repositories, pairing functions with their accompanying documentation. By drawing on real projects, the dataset captures real-world programming scenarios and reflects the challenges developers face in finding code solutions for their tasks. This grounding in production code supports the relevance and authenticity of the queries and code snippets present in the dataset.
To ensure quality and consistency, the CodeSearchNet dataset undergoes preprocessing. This includes tokenization of the queries and code snippets, removal of irrelevant or noisy data, and anonymization of certain sensitive information, such as usernames or project-specific details. These steps provide a clean and standardized dataset for researchers and practitioners to work with.
The dataset is divided into training, validation, and test sets, which facilitate the training, evaluation, and benchmarking of models for natural language understanding applied to code search. Table 2 shows the approximate dataset split for CodeSearchNet, including the number of instances in each split per language.
In addition to code retrieval, CodeSearchNet also supports related tasks such as code summarization and code documentation generation. This flexibility encourages researchers to explore various aspects of natural language understanding and code search, fostering innovation and advancement in the field. Overall, the dataset's size, diversity, and real-world relevance make it a key asset for developing and evaluating models that bridge the gap between textual queries and code snippets, improving the efficiency and effectiveness of code search systems.
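As an illustration of how the splits in Table 2 can be accessed, the sketch below loads one language portion of the corpus via the HuggingFace datasets library; the dataset identifier and field names follow the public code_search_net packaging on the HuggingFace hub and should be treated as assumptions about that particular distribution of the data:

```python
from datasets import load_dataset

# Load the Python portion of CodeSearchNet.
# Recent versions of datasets may require trust_remote_code=True here.
ds = load_dataset("code_search_net", "python")

train, valid, test = ds["train"], ds["validation"], ds["test"]
sample = train[0]
print(sample["func_documentation_string"][:80])  # the natural language description
print(sample["func_code_string"][:80])           # the paired code snippet
```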

5.2. Evaluation Metrics

The C2B hybrid model is evaluated against a set of standard metrics identified in the literature [3,10,88,89]. These metrics help determine the efficiency of the algorithm and allow the users to choose the system that best fits their use case. The metrics we use in our work for code retrieval are described below:
Precision: Precision is a classification metric that measures the proportion of predicted positive code snippets that are actually relevant, i.e., true positives (TPs) out of all predicted positives (TPs plus false positives, FPs). The precision value ranges from 0 to 1, where a value of 1 indicates no false positives. Precision is given by Equation (19).
Precision (P) = TP / (TP + FP)
Recall: Recall is a classification metric that measures the proportion of actually relevant code snippets that are correctly retrieved, i.e., true positives (TPs) out of all actual positives (TPs plus false negatives, FNs). The recall value ranges from 0 to 1, where a value of 1 indicates no false negatives. Recall is given by Equation (20).
Recall (R) = TP / (TP + FN)
Accuracy: Accuracy is a classification metric that measures the proportion of correct predictions, both true positives (TPs) and true negatives (TNs), out of all predictions made (TPs, TNs, false positives, FPs, and false negatives, FNs). The accuracy value ranges from 0 to 1, where a value of 1 indicates perfect prediction. Accuracy is given by Equation (21).
Accuracy (A) = (TP + TN) / (TP + TN + FP + FN)
F1-score: The F1-score combines precision and recall into a single value, providing a balance between the two. It is the harmonic mean of precision and recall, ranging from 0 to 1, where a value of 1 indicates perfect precision and recall. The F1 measure is given by Equation (22).
F1-Score = (2 × P × R) / (P + R)
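For reference, all four metrics can be computed directly from confusion-matrix counts; the helper below is a minimal sketch (the function name is ours):

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Compute precision, recall, accuracy, and F1 per Equations (19)-(22)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, accuracy, f1
```

For example, classification_metrics(tp=90, tn=80, fp=10, fn=20) yields a precision of 0.90, a recall of about 0.818, an accuracy of 0.85, and an F1-score of about 0.857.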

5.3. Experimental Setup and Implementation Details

Our work was carried out in the Python programming language (version 3.7.12). In particular, we used the PyTorch deep learning framework (version 1.11.0) for model implementation, training, and optimization. We used the CodeT5 implementation https://huggingface.co/Salesforce/codet5-small (accessed on 15 April 2024) provided in the HuggingFace transformers library (version 4.18). The model was fine-tuned for two epochs on both training strategies—cosine embedding loss (CE) and binary cross-entropy (BCE) loss. We used the Kaggle https://www.kaggle.com/ (accessed on 15 April 2024) platform for training the models, which provides NVIDIA Tesla P100 GPUs (NVIDIA, Santa Clara, CA, USA) for parallel computation. Training the model for two epochs took between 40 and 45 h on the P100 GPU accelerators.
We used AdamW [90] as the optimization algorithm with a learning rate of 5 × 10⁻⁵ and an epsilon value of 1 × 10⁻⁸, as given in the CodeT5 implementation https://github.com/salesforce/CodeT5 (accessed on 15 April 2024). The training and validation batch sizes were 16 and 128, respectively. The maximum code token length was kept at 128 to retain as much information as possible while staying within the computational limits of Kaggle.
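The following sketch shows how this configuration might be assembled with the HuggingFace transformers API. The hyperparameters are those listed above, while the pooling step and variable names are our own assumptions; note also that the encoder hidden size depends on the checkpoint (512 for codet5-small, 768 for codet5-base):

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

MODEL_NAME = "Salesforce/codet5-small"  # checkpoint used in Section 5.3
MAX_CODE_TOKENS = 128                   # truncation limit used in our experiments

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = T5EncoderModel.from_pretrained(MODEL_NAME)

# AdamW with the learning rate and epsilon given above.
optimizer = torch.optim.AdamW(encoder.parameters(), lr=5e-5, eps=1e-8)

batch = tokenizer(["def add(a, b):\n    return a + b"],
                  truncation=True, max_length=MAX_CODE_TOKENS,
                  padding="max_length", return_tensors="pt")
hidden = encoder(**batch).last_hidden_state  # [1, 128, hidden_size]
embedding = hidden.mean(dim=1)               # one pooled vector per snippet (our choice)
```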

5.4. Results and Discussions

In this section, we describe the experiments conducted and present the results. In Experiment 1, we conduct a comprehensive evaluation of our proposed C2B hybrid model for code search. Firstly, we compare two training strategies, CE and BCE losses, and subsequently compare our best model against existing works using clone detection metrics. The results highlight the strengths and weaknesses of the C2B model in comparison to the original CodeT5 architecture and the Bi-LSTM model. In Experiment 2, we focus on assessing the total time required by the original CodeT5 and our proposed C2B retrieval model to answer queries. Additionally, we compare the accuracy of both models in response to queries, considering factors such as truncation of code token length and the independence of code embeddings from natural language query embeddings. These experiments, collectively, provide insights into the performance, efficiency, and accuracy of the proposed C2B hybrid model in code search applications.

5.4.1. Experiment 1

We divide the experiment into two parts. First, we present and compare the results of two of our proposed training strategies—CE and BCE losses. Second, we compare the results of our best model with the existing works against the clone detection metrics defined in the previous subsection.
Experiment 1.1: Table 3 depicts the results of our proposed C2B model under the two training strategies, along with the original CodeT5 results on the clone detection task. The first row of the table reports the results of the original CodeT5 architecture (with concatenated code tokens, as described in Section 4.1.3) on the clone detection task. The second row reports the original Bi-LSTM architecture used directly, without any fine-tuning, on single code inputs. The third and fourth rows report the modified CodeT5 architecture fine-tuned with the CE and BCE losses, respectively.
As Table 3 shows, the original CodeT5 performs best on all metrics compared with the original Bi-LSTM and our proposed C2B model under the BCE and CE losses. One reason our fine-tuned C2B model lags behind the original CodeT5 architecture is that the CodeT5 authors use a maximum code token length of 400, whereas we truncate the maximum code token length to 128 to stay within our computational limits, thus losing useful information. Note that the original CodeT5 results were taken from the GitHub https://github.com/salesforce/CodeT5/issues/55#issuecomment-1178517051 (accessed on 15 April 2024) repository rather than from the paper [18], as the authors acknowledged that the result reported in the paper was incorrect.
Table 3 also shows that, in contrast to the CodeT5 model, the original Bi-LSTM model performs worse in terms of precision, recall, and F1-score, indicating that on its own it struggles to capture relevant information effectively. It can be clearly observed that the proposed C2B model trained with BCE loss achieves a higher F1-score (0.9023) than with CE loss (0.7485). We therefore conclude that minimizing the BCE loss tends to yield better code embeddings than the CE loss. To further examine this claim, we perform Experiment 2, which evaluates the learned embeddings in the retrieval setting.
Experiment 1.2: In this experiment, we compare the performance of our proposed C2B model with classical code search approaches from the literature. The results, all generated on the CodeSearchNet test set, are presented in Table 4, arranged in ascending order of F1-score. We can observe that the classical approaches NBoW [91] and CNN [92] have high precision but very low recall and thus a low F1-score. BiRNN [93] and selfAtt [94] are classical deep learning architectures that obtain better precision and recall than NBoW and CNN.
Recent deep learning architectures such as RoBERTa [56], CodeBERT [55], GraphCodeBERT [54], CodeT5 [18], and FA-AST [53] compete for top results on the task. Three architectures, FA-AST, GraphCodeBERT, and CodeT5, currently achieve state-of-the-art F1-scores (0.95) on the clone detection task. Our best model, the modified CodeT5 architecture fine-tuned with BCE loss, obtains an F1-score of 0.902. The main reason our best model falls short in F1-score is that we use a maximum code token length of 128, whereas all the other architectures use a maximum code sequence length of 400 or more. Our focus was never to achieve top results on clone detection but to assess whether the CodeT5 embeddings can capture essential semantic and syntactic code information, and, secondarily, whether the code embeddings are useful for clone detection and retrieval. We can therefore conclude that our best model achieves considerable performance on the clone detection task.
Table 5 compares our proposed C2B hybrid model with other models on the code search task using the mean reciprocal rank (MRR), i.e., the average over all queries of the reciprocal of the rank at which the first relevant code snippet is retrieved. The C2B model delivers competitive performance, particularly in specific programming languages. For Ruby, the C2B model achieves an MRR of 0.863, close to the 0.872 achieved by the top-performing CodeT5, suggesting strong effectiveness in retrieving Ruby code. For JavaScript queries, the C2B model performs impressively with an MRR of 0.730, closely trailing CodeT5 at 0.731, demonstrating that our proposed hybrid model is nearly as effective as CodeT5 in this language. For Go, the C2B model achieves an MRR of 0.901, with CodeT5 slightly ahead at 0.921; the C2B model nevertheless remains a strong contender. For Python, the C2B model delivers an MRR of 0.703, slightly below CodeT5's 0.712, yet still competitive. For Java, the C2B model achieves an MRR of 0.711, while CodeT5 leads with 0.723. In the case of PHP, the C2B model performs exceptionally well, with an MRR of 0.731, slightly surpassing CodeT5's 0.710, highlighting the C2B model's strength in retrieving PHP code.
Across all programming languages, the C2B model maintains a competitive position with an overall MRR of 0.799, closely following CodeT5, which leads with 0.801. It consistently ranks second-best or close behind CodeT5 across languages, showcasing its versatility in retrieving relevant code snippets. These results suggest that the C2B model is a valuable contender for code search applications.
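As a reference for how these MRR figures are computed, the sketch below gives a minimal implementation under the assumption that each query has exactly one relevant code snippet (the function and variable names are ours):

```python
def mean_reciprocal_rank(ranked_results, relevant_ids):
    """MRR: average over queries of 1 / rank of the first relevant result.

    ranked_results: one ranked list of candidate ids per query.
    relevant_ids: the single relevant id for each query.
    """
    total = 0.0
    for ranking, relevant in zip(ranked_results, relevant_ids):
        if relevant in ranking:
            total += 1.0 / (ranking.index(relevant) + 1)  # ranks are 1-based
    return total / len(ranked_results)
```

For instance, if the true snippets for two queries appear at ranks 2 and 1, the MRR is (1/2 + 1/1)/2 = 0.75.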
C2B hybrid model performance in comparison to other models: The C2B hybrid model is a novel approach to code search that combines natural language understanding with contextual code comprehension. It consistently ranks among the top performers across multiple programming languages, with an overall MRR of 0.799, second only to the highest-performing model, CodeT5. The model's architecture, with its linear and nonlinear transformations and a softmax pooling layer, supports the effective ranking and retrieval of relevant code snippets. This hybrid approach outperforms traditional models such as NBoW, CNN, and BiRNN, as well as more advanced models such as RoBERTa and GraphCodeBERT, offering a robust and versatile solution for code search tasks that can enhance developer productivity and code quality in practical applications.

5.4.2. Experiment 2

In this experiment, we are interested in finding out the total time required by both the original CodeT5 and our proposed C2B retrieval model to answer n queries. Further, we compare through visualization how accurate both systems are at answering the queries.
Code retrieval process: Retrieval with the fine-tuned model begins with the precomputation of code embeddings: vector representations capturing semantic information are generated and stored for all codes in the dataset. In the CodeSearchNet test set, code pairs are provided, and the first code of each pair is designated as the query for retrieval purposes. The fine-tuned Bi-LSTM architecture processes this query code and extracts embeddings that encapsulate its semantic content. Cosine similarity, which measures the cosine of the angle between two vectors, serves as the comparison metric: similarity scores are computed between the query embedding and the precomputed embeddings of all codes in the dataset. The retrieval process then identifies the top-k source codes with the highest cosine similarity scores; in our experiments, we retrieve the k = 20 code snippets most closely resembling the semantic content of the query, providing an efficient and effective method for code retrieval based on semantic similarity.
Benchmark queries: The query dataset used in this study was sourced from the CodeSearchNet challenge https://github.com/github/CodeSearchNet (accessed on 15 April 2024), which was developed by Husain et al. [95]. This dataset consists of 99 queries https://github.com/github/CodeSearchNet/blob/master/resources/queries.csv (accessed on 15 April 2024), representing commonly searched queries on Bing that are characterized by their high clickthrough rates to code snippets. These queries typically contain technical keywords [95].
Execution time: Figure 7 illustrates the time-based comparison of the original CodeT5 architecture [18] and our proposed C2B model. The original architecture takes a great deal of time as the number of queries increases, on the order of 10² min, whereas the proposed C2B model takes around 1 min to answer 50 queries. We can therefore conclude that the proposed C2B model is scalable.
For every query, the original CodeT5 architecture runs through all the existing code in the dataset, concatenating the query tokens with each existing code in turn and producing a similarity score; the codes are then ranked in descending order of similarity. Since the architecture runs n times per query (where n is the total number of codes in the corpus), it consumes a great deal of time. In our case, by contrast, we run the fine-tuned C2B model once per query to extract its embedding and then use cosine similarity to compare it against the precomputed code embeddings.
Accuracy: Figure 8 depicts the accuracy of both models in response to n queries. We define accuracy as the fraction of queries for which the true code appears in the top-k code snippets retrieved by the model. The figure shows that the performance of the two models is comparable, with a drop in the accuracy of our proposed C2B model relative to the CodeT5 model. This drop can be attributed to two factors: (i) we truncate the maximum code token length to 128, well below the 400-token limit used in the original CodeT5; and (ii) the original CodeT5 architecture receives both codes as input, which lets it capture the interrelation between them, whereas our fine-tuned model generates embeddings for each code independently of the natural language query embeddings.
Note that for the inference phase, code snippets corresponding to the provided natural language queries were selected from the dataset. As depicted in Figure 8, the average accuracy of our proposed C2B model for NL query and code snippet retrieval is approximately 0.93, while the average accuracy of the original CodeT5-based retrieval model is around 0.96. From this, we infer that our proposed C2B retrieval system exhibits comparable accuracy while retrieving relevant code snippets far more efficiently, substantially reducing retrieval time.
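A minimal sketch of this top-k accuracy measure, assuming one true code per query (the function name is ours), is given below:

```python
def retrieval_accuracy_at_k(ranked_results, relevant_ids, k: int = 20):
    """Fraction of queries whose true code appears among the top-k retrieved,
    matching the accuracy definition used for Figure 8."""
    hits = sum(relevant in ranking[:k]
               for ranking, relevant in zip(ranked_results, relevant_ids))
    return hits / len(ranked_results)
```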

5.5. Ablation Study

In this research, a comprehensive ablation study was conducted on the C2B hybrid model to evaluate the contributions of its key components. By systematically removing the CodeT5 component and the attention mechanism, we analyze their impact on the model's performance using the precision, recall, F1-score, and accuracy metrics. This study provides valuable insights into the model's architecture, guiding future improvements and optimizations for enhanced performance and robustness.

5.5.1. Removing CodeT5 Component from C2B Hybrid Model

The model's performance metrics changed drastically after the removal of CodeT5. As shown in Table 6, precision declined to 0.70, recall to 0.65, F1-score to 0.67, and accuracy to 0.75. These results highlight the importance of CodeT5 in understanding natural language queries and in the overall effectiveness of the C2B model.
These statistics clearly highlight the significant impact of the CodeT5 component on the overall performance of the C2B hybrid model, emphasizing its importance in achieving high accuracy and effectiveness in code search tasks.

5.5.2. Removing Attention Component from C2B Hybrid Model

We also evaluated the C2B hybrid model without the attention mechanism. The results demonstrated a substantial decline in performance, with precision dropping to 0.82, recall to 0.78, F1-score to 0.80, and accuracy to 0.85, as shown in Table 7. These results indicate the critical importance of the attention mechanism, which helps the model focus on the relevant portions of code and queries. Without it, the model cannot properly interpret or match code snippets with natural language queries. This emphasizes the need for an attention strategy to attain the best outcomes in code search tasks.

5.6. Practical Application of the C2B Hybrid Model

The C2B hybrid model offers significant practical application value in the domain of code search and retrieval. By combining the strengths of CodeT5 for natural language processing and Bi-LSTM for contextual code comprehension, the C2B hybrid model provides a robust and versatile solution capable of accurately understanding and matching complex queries with relevant code snippets. This dual approach ensures that the model not only grasps the semantic nuances of natural language queries but also captures the contextual dependencies within code, leading to superior performance across various programming languages and scenarios.
In practical terms, the C2B hybrid model enhances developer productivity by reducing the time and effort required to locate pertinent code snippets from large codebases. This is particularly beneficial in environments where quick access to specific code segments is critical for debugging, code review, or learning purposes. Additionally, the model’s ability to handle diverse code search tasks makes it a valuable tool in educational settings, where students and instructors can use it to find relevant examples and improve coding skills.
Furthermore, the high accuracy and effectiveness of the C2B hybrid model in retrieving relevant code can lead to improved code quality. By facilitating the reuse of well-tested and optimized code snippets, developers can minimize errors and redundancies, thereby enhancing the overall robustness of software projects. The model’s practical application extends to automated code documentation and intelligent code assistants, which can leverage its capabilities to provide context-aware suggestions and insights, ultimately streamlining the software development life cycle.

5.7. Threats to Validity

The proposed model may face various threats to validity that could impact the reliability and generalizability of its findings. Here are some potential threats to validity for this paper:
Data bias: The choice of the dataset used for evaluation could introduce bias into the results. If the dataset primarily consists of code snippets from certain programming languages or specific domains, the model’s performance may not generalize well to other languages or domains.
Limited dataset size: The dataset size could be relatively small, leading to potential overfitting issues. With limited training data, the model might not capture the full complexity and diversity of code retrieval scenarios.
Model selection: This paper might not explore a wide range of model architectures and configurations. This could limit the understanding of how other models perform in comparison to CodeT5 and Bi-LSTM or fail to identify the most suitable model for the specific task.
Evaluation metrics: The choice of evaluation metrics could influence the perceived performance of the proposed framework. If the metrics do not align well with the end users’ needs or real-world use cases, the reported results may not accurately reflect the system’s true effectiveness.
Implementation details: This paper may lack detailed information about hyperparameter tuning, learning rates, and other implementation specifics. These details are crucial for reproducibility and may affect the replicability of the experiments.
Overfitting: There is a risk that the reported performance metrics are inflated due to overfitting on the test set. Proper cross-validation techniques and evaluation on an external, unseen dataset could address this concern.
Lack of comparison with state of the art: Without comparing the proposed framework against state-of-the-art models or existing baseline approaches, this paper may not provide a comprehensive understanding of its advantages or limitations.
Domain-specific limitations: The framework may perform well in a specific domain but struggle with different types of code or queries. This paper should clearly state the limitations and applicability of the proposed method.
Interpretability: Deep learning models can be challenging to interpret. This paper should address how the model’s predictions are explainable or provide insights into the attention mechanisms’ behavior.
Hardware and computational constraints: The hardware and computational resources used in the experiments could influence the model’s performance and training time. A lack of information about these resources might hinder the reproducibility of the results.
To mitigate these threats to validity, the authors should provide comprehensive experimental details, use diverse datasets, compare their framework with relevant baselines, consider different evaluation metrics, and perform thorough cross-validation to ensure the robustness and generalizability of their findings.

6. Conclusions and Future Work

In this research, we introduced a unique strategy for source code search and recommendation based on the CodeT5 and Bi-LSTM models. Existing code search tools frequently struggle to comprehend the semantics and purpose underlying code snippets, and they may fail to capture the sequential patterns inherent in code. To overcome these difficulties, we developed the "CodeT5 and Bi-LSTM" (C2B) hybrid model, which combines the advantages of CodeT5's domain-specific pretraining with Bi-LSTM's contextual knowledge. Our goal was to improve code representation, code understanding and generation, and the capture of sequential dependencies, in order to perform better on code-related tasks. The empirical evaluation of our C2B hybrid model on the large CodeSearchNet dataset produced encouraging findings: our solution outperformed earlier approaches that relied on keyword- or syntactic-based strategies for retrieving pertinent code snippets. By better understanding the semantics of code and capturing sequential patterns, the C2B hybrid model significantly improved the code search and recommendation tasks.
Several possible directions for future work could further improve the proposed C2B hybrid model for source code search and recommendation. First, further fine-tuning and refining the C2B model may improve its code comprehension and generation capabilities. Extending the model to multilanguage code search would also increase its practical value for developers working across projects in different programming languages. Incorporating user interaction and feedback would improve the usability of the code search tool and enable personalized code suggestions. Finally, it would be valuable to test the C2B model on real projects and to conduct user studies with developers to learn how effective and practical it is.
Addressing code variability, so that retrieval remains resilient for code snippets with variations, could further improve the model in real-world circumstances. By following these suggestions, the C2B hybrid model can be refined into a more effective and convenient tool that helps developers reuse code successfully and enhance their software development workflows.

Author Contributions

Conceptualization, N.B.; Methodology, N.B.; Software, N.B.; Resources, F.A.; Data curation, A.M.; Writing—review & editing, A.M.; Supervision, T.R.; Project administration, T.R. and A.A.K.; Funding acquisition, F.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Dataset supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Azzeh, M.; Nassif, A.B.; Elsheikh, Y.; Angelis, L. On the value of project productivity for early effort estimation. Sci. Comput. Program. 2022, 219, 102819. [Google Scholar] [CrossRef]
  2. Ling, C.; Lin, Z.; Zou, Y.; Xie, B. Adaptive deep code search. In Proceedings of the 28th International Conference on Program Comprehension, Seoul, Republic of Korea, 13–15 July 2020; pp. 48–59. [Google Scholar]
  3. Sharma, T.; Kechagia, M.; Georgiou, S.; Tiwari, R.; Vats, I.; Moazen, H.; Sarro, F. A survey on machine learning techniques applied to source code. J. Syst. Softw. 2024, 209, 111934. [Google Scholar] [CrossRef]
  4. Bibi, N.; Rana, T.; Maqbool, A.; Alkhalifah, T.; Khan, W.Z.; Bashir, A.K.; Zikria, Y.B. Reusable Component Retrieval: A Semantic Search Approach for Low-Resource Languages. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 141. [Google Scholar] [CrossRef]
  5. Nie, L.; Jiang, H.; Ren, Z.; Sun, Z.; Li, X. Query expansion based on crowd knowledge for code search. IEEE Trans. Serv. Comput. 2016, 9, 771–783. [Google Scholar] [CrossRef]
  6. Stolee, K.T.; Elbaum, S.; Dobos, D. Solving the search for source code. ACM Trans. Softw. Eng. Methodol. (TOSEM) 2014, 23, 1–45. [Google Scholar] [CrossRef]
  7. Lv, F.; Zhang, H.; Lou, J.-g.; Wang, S.; Zhang, D.; Zhao, J. Codehow: Effective code search based on api understanding and extended boolean model (E). In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, USA, 9–13 November 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 260–270. [Google Scholar]
  8. McMillan, C.; Grechanik, M.; Poshyvanyk, D.; Fu, C.; Xie, Q. Exemplar: A source code search engine for finding highly relevant applications. IEEE Trans. Softw. Eng. 2011, 38, 1069–1087. [Google Scholar] [CrossRef]
  9. Liu, S.; Xie, X.; Siow, J.; Ma, L.; Meng, G.; Liu, Y. Graphsearchnet: Enhancing gnns via capturing global dependencies for semantic code search. IEEE Trans. Softw. Eng. 2023, 49, 2839–2855. [Google Scholar] [CrossRef]
  10. Bibi, N.; Maqbool, A.; Rana, T.; Afzal, F.; Akgül, A.; El Din, S.M. Enhancing Semantic Code Search with Deep Graph Matching. IEEE Access 2023, 11, 52392–52411. [Google Scholar] [CrossRef]
  11. Liu, J.; Kim, S.; Murali, V.; Chaudhuri, S.; Chandra, S. Neural query expansion for code search. In Proceedings of the 3rd ACM Sigplan International Workshop on Machine Learning and Programming Languages, Phoenix, AZ, USA, 22 June 2019; pp. 29–37. [Google Scholar]
  12. Gu, X.; Zhang, H.; Kim, S. Deep code search. In Proceedings of the 40th International Conference on Software Engineering, Gothenburg, Sweden, 27 May–3 June 2018; pp. 933–944. [Google Scholar]
  13. Cambronero, J.; Li, H.; Kim, S.; Sen, K.; Chandra, S. When deep learning met code search. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Tallinn, Estonia, 26–30 August 2019; pp. 964–974. [Google Scholar]
  14. Haldar, R.; Wu, L.; Xiong, J.; Hockenmaier, J. A multi-perspective architecture for semantic code search. arXiv 2020, arXiv:2005.06980. [Google Scholar]
  15. Gu, W.; Li, Z.; Gao, C.; Wang, C.; Zhang, H.; Xu, Z.; Lyu, M.R. CRaDLe: Deep code retrieval based on semantic dependency learning. Neural Netw. 2021, 141, 385–394. [Google Scholar] [CrossRef]
  16. Sachdev, S.; Li, H.; Luan, S.; Kim, S.; Sen, K.; Chandra, S. Retrieval on source code: A neural code search. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, Philadelphia, PA, USA, 18 June 2018; pp. 31–41. [Google Scholar]
  17. Ling, X.; Wu, L.; Wang, S.; Pan, G.; Ma, T.; Xu, F.; Liu, A.X.; Wu, C.; Ji, S. Deep graph matching and searching for semantic code retrieval. ACM Trans. Knowl. Discov. Data (TKDD) 2021, 15, 1–21. [Google Scholar] [CrossRef]
  18. Wang, Y.; Wang, W.; Joty, S.; Hoi, S.C. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv 2021, arXiv:2109.00859. [Google Scholar]
  19. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
  20. Mayrand, J.; Leblanc, C.; Merlo, E.M. Experiment on the automatic detection of function clones in a software system using metrics. In Proceedings of the 1996 International Conference on Software Maintenance, Monterey, CA, USA, 4–8 November 1996; pp. 244–253. [Google Scholar]
  21. Lee, S.; Jeong, I. SDD: High performance code clone detection system for large scale source code. In Proceedings of the Companion to the 20th annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, San Diego, CA, USA, 16–20 October 2005; pp. 140–141. [Google Scholar]
  22. Jiang, L.; Misherghi, G.; Su, Z.; Glondu, S. Deckard: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th International Conference on Software Engineering (ICSE’07), Washington, DC, USA, 20–26 May 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 96–105. [Google Scholar]
  23. Dam, H.K.; Pham, T.; Ng, S.W.; Tran, T.; Grundy, J.; Ghose, A.; Kim, T.; Kim, C.J. A deep tree-based model for software defect prediction. arXiv 2018, arXiv:1802.00921. [Google Scholar]
  24. Alon, U.; Zilberstein, M.; Levy, O.; Yahav, E. code2vec: Learning distributed representations of code. Proc. ACM Program. Lang. 2019, 3, 1–29. [Google Scholar] [CrossRef]
  25. Allamanis, M.; Peng, H.; Sutton, C. A convolutional attention network for extreme summarization of source code. In Proceedings of the International Conference on Machine Learning, New York City, NY, USA, 19–24 June 2016; pp. 2091–2100. [Google Scholar]
  26. Lam, A.N.; Nguyen, A.T.; Nguyen, H.A.; Nguyen, T.N. Combining deep learning with information retrieval to localize buggy files for bug reports (n). In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, USA, 9–13 November 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 476–481. [Google Scholar]
  27. Mou, L.; Li, G.; Zhang, L.; Wang, T.; Jin, Z. Convolutional neural networks over tree structures for programming language processing. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
  28. Nguyen, T.D.; Nguyen, A.T.; Phan, H.D.; Nguyen, T.N. Exploring API embedding for API usages and applications. In Proceedings of the 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), Buenos Aires, Argentina, 20–28 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 438–449. [Google Scholar]
  29. Peng, H.; Mou, L.; Li, G.; Liu, Y.; Zhang, L.; Jin, Z. Building program vector representations for deep learning. In Proceedings of the Knowledge Science, Engineering and Management: 8th International Conference, KSEM 2015, Proceedings 8, Chongqing, China, 28–30 October 2015; Springer: Cham, Switzerland, 2015; pp. 547–553. [Google Scholar]
  30. Raychev, V.; Vechev, M.; Yahav, E. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, New York, NY, USA, 9–11 June 2014; pp. 419–428. [Google Scholar]
  31. White, M.; Tufano, M.; Martinez, M.; Monperrus, M.; Poshyvanyk, D. Sorting and transforming program repair ingredients via deep learning code similarities. In Proceedings of the 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), Hangzhou, China, 24–27 February 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 479–490. [Google Scholar]
  32. White, M.; Tufano, M.; Vendome, C.; Poshyvanyk, D. Deep learning code fragments for code clone detection. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, Singapore, 3–7 September 2016; pp. 87–98. [Google Scholar]
  33. Mou, L.; Men, R.; Li, G.; Zhang, L.; Jin, Z. On end-to-end program generation from user intention by deep neural networks. arXiv 2015, arXiv:1510.07211. [Google Scholar]
  34. Gu, X.; Zhang, H.; Zhang, D.; Kim, S. Deep API learning. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Seattle, WA, USA, 13 November 2016; pp. 631–642. [Google Scholar]
  35. White, M.; Vendome, C.; Linares-Vásquez, M.; Poshyvanyk, D. Toward deep learning software repositories. In Proceedings of the 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, Florence, Italy, 16–17 May 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 334–345. [Google Scholar]
  36. Artetxe, M.; Labaka, G.; Agirre, E. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 451–462. [Google Scholar]
  37. Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; Jégou, H. Word translation without parallel data. arXiv 2017, arXiv:1710.04087. [Google Scholar]
  38. Grave, E.; Joulin, A.; Berthet, Q. Unsupervised alignment of embeddings with wasserstein procrustes. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Naha, Japan, 18 April 2019; pp. 1880–1890. [Google Scholar]
  39. Allamanis, M.; Tarlow, D.; Gordon, A.; Wei, Y. Bimodal modelling of source code and natural language. In Proceedings of the International Conference on Machine Learning, Lile, France, 6–11 July 2015; pp. 2123–2132. [Google Scholar]
  40. Murali, V.; Chaudhuri, S.; Jermaine, C. Bayesian sketch learning for program synthesis. arXiv 2017, arXiv:1703.05698v5. [Google Scholar]
  41. Iyer, S.; Konstas, I.; Cheung, A.; Zettlemoyer, L. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics 2016, Berlin, Germany, 7–12 August 2016; pp. 2073–2083. [Google Scholar]
  42. Zhou, Z.; Yu, H.; Fan, G.; Huang, Z.; Yang, X. Summarizing source code with hierarchical code representation. Inf. Softw. Technol. 2022, 143, 106761. [Google Scholar] [CrossRef]
  43. Haiduc, S.; Aponte, J.; Marcus, A. Supporting program comprehension with source code summarization. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 2, Cape Town, South Africa, 1–8 May 2010; pp. 223–226. [Google Scholar]
  44. de Rezende Martins, M.; Gerosa, M.A. CoNCRA: A Convolutional Neural Networks Code Retrieval Approach. In Proceedings of the XXXIV Brazilian Symposium on Software Engineering, Natal, Brazil, 19–23 October 2020; pp. 526–531. [Google Scholar]
  45. Sridhara, G.; Pollock, L.; Vijay-Shanker, K. Automatically detecting and describing high level actions within methods. In Proceedings of the 33rd International Conference on Software Engineering, Honolulu, HI, USA, 21–28 May 2011; pp. 101–110. [Google Scholar]
  46. McBurney, P.W.; McMillan, C. Automatic source code summarization of context for java methods. IEEE Trans. Softw. Eng. 2015, 42, 103–119. [Google Scholar] [CrossRef]
  47. Oda, Y.; Fudaba, H.; Neubig, G.; Hata, H.; Sakti, S.; Toda, T.; Nakamura, S. Learning to generate pseudo-code from source code using statistical machine translation. In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), Lincoln, NE, USA, 9–13 November 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 574–584. [Google Scholar]
  48. Movshovitz-Attias, D.; Cohen, W. Natural language models for predicting programming comments. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Sofia, Bulgaria, 4–9 August 2013; pp. 35–40. [Google Scholar]
  49. Fischer, G.; Lusiardi, J.; Von Gudenberg, J.W. Abstract syntax trees-and their role in model driven software development. In Proceedings of the International Conference on Software Engineering Advances (ICSEA 2007), Cap Esterel, France, 25–31 August 2007; p. 38. [Google Scholar]
  50. Hu, X.; Li, G.; Xia, X.; Lo, D.; Jin, Z. Deep code comment generation. In Proceedings of the 26th Conference on Program Comprehension, Gothenburg, Sweden, 27 May–3 June 2018; pp. 200–210. [Google Scholar]
  51. Alon, U.; Brody, S.; Levy, O.; Yahav, E. code2seq: Generating sequences from structured representations of code. arXiv 2018, arXiv:1808.01400. [Google Scholar]
  52. Zhang, J.; Wang, X.; Zhang, H.; Sun, H.; Liu, X. Retrieval-based neural source code summarization. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, Seoul, Republic of Korea, 27 June–19 July 2020; pp. 1385–1397. [Google Scholar]
  53. Wang, W.; Li, G.; Ma, B.; Xia, X.; Jin, Z. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. arXiv 2020, arXiv:2002.08653. [Google Scholar]
  54. Guo, D.; Ren, S.; Lu, S.; Feng, Z.; Tang, D.; Liu, S.; Zhou, L.; Duan, N.; Svyatkovskiy, A.; Fu, S.; et al. Graphcodebert: Pre-training code representations with data flow. arXiv 2020, arXiv:2009.08366. [Google Scholar]
  55. Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. Codebert: A pre-trained model for programming and natural languages. arXiv 2020, arXiv:2002.08155. [Google Scholar]
  56. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  57. Lee, J.; Lee, I.; Kang, J. Self-attention graph pooling. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 3734–3743. [Google Scholar]
  58. LeClair, A.; Jiang, S.; McMillan, C. A neural model for generating natural language summaries of program subroutines. In Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), Montreal, QC, Canada, 25–31 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 795–806. [Google Scholar]
  59. Hu, X.; Li, G.; Xia, X.; Lo, D.; Jin, Z. Deep code comment generation with hybrid lexical and syntactical information. Empir. Softw. Eng. 2020, 25, 2179–2217. [Google Scholar] [CrossRef]
  60. Zhou, Z.; Yu, H.; Fan, G.; Huang, Z.; Yang, K. Towards Retrieval-Based Neural Code Summarization: A Meta-Learning Approach. IEEE Trans. Softw. Eng. 2023, 49, 3008–3031. [Google Scholar] [CrossRef]
  61. Hu, X.; Li, G.; Xia, X.; Lo, D.; Lu, S.; Jin, Z. Summarizing Source Code with Transferred API Knowledge. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI-18), Stockholm, Sweden, 13–19 July 2018; pp. 2269–2275. [Google Scholar]
  62. Zhou, Z.; Yu, H.; Fan, G. Effective approaches to combining lexical and syntactical information for code summarization. Softw. Pract. Exp. 2020, 50, 2313–2336. [Google Scholar] [CrossRef]
  63. Wei, B.; Li, Y.; Li, G.; Xia, X.; Jin, Z. Retrieve and refine: Exemplar-based neural comment generation. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, Melbourne, Australia, 21–25 September 2020; pp. 349–360. [Google Scholar]
  64. Fernandes, P.; Allamanis, M.; Brockschmidt, M. Structured neural summarization. arXiv 2018, arXiv:1811.01824. [Google Scholar]
  65. Liu, S.; Chen, Y.; Xie, X.; Siow, J.K.; Liu, Y. Automatic code summarization via multi-dimensional semantic fusing in gnn. arXiv 2020, arXiv:2006.05405. [Google Scholar]
  66. LeClair, A.; Haque, S.; Wu, L.; McMillan, C. Improved code summarization via a graph neural network. In Proceedings of the 28th International Conference on Program Comprehension, Seoul, Republic of Korea, 13–15 July 2020; pp. 184–195. [Google Scholar]
  67. Chen, Q.; Zhou, M. A neural framework for retrieval and summarization of source code. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, Montpellier, France, 3–7 September 2018; pp. 826–831. [Google Scholar]
  68. Yao, Z.; Peddamail, J.R.; Sun, H. Coacor: Code annotation for code retrieval with reinforcement learning. In Proceedings of the The World Wide Web Conference, Austin, TX, USA, 13–17 May 2019; pp. 2203–2214. [Google Scholar]
  69. Wei, B.; Li, G.; Xia, X.; Fu, Z.; Jin, Z. Code generation as a dual task of code summarization. arXiv 2019, arXiv:1910.05923. [Google Scholar]
  70. Ye, W.; Xie, R.; Zhang, J.; Hu, T.; Wang, X.; Zhang, S. Leveraging code generation to improve code retrieval and summarization via dual learning. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; pp. 2309–2319. [Google Scholar]
  71. Wang, W.; Zhang, Y.; Sui, Y.; Wan, Y.; Zhao, Z.; Wu, J.; Philip, S.Y.; Xu, G. Reinforcement-learning-guided source code summarization using hierarchical attention. IEEE Trans. Softw. Eng. 2020, 48, 102–119. [Google Scholar] [CrossRef]
  72. Kanade, A.; Maniatis, P.; Balakrishnan, G.; Shi, K. Learning and evaluating contextual embedding of source code. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 5110–5121. [Google Scholar]
  73. Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. Electra: Pre-training text encoders as discriminators rather than generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
  74. Svyatkovskiy, A.; Deng, S.K.; Fu, S.; Sundaresan, N. Intellicode compose: Code generation using transformer. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual, 6–16 November 2020; pp. 1433–1443. [Google Scholar]
  75. Ren, S.; Guo, D.; Lu, S.; Zhou, L.; Liu, S.; Tang, D.; Sundaresan, N.; Zhou, M.; Blanco, A.; Ma, S. Codebleu: A method for automatic evaluation of code synthesis. arXiv 2020, arXiv:2009.10297. [Google Scholar]
  76. Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; Hon, H.W. Unified language model pre-training for natural language understanding and generation. Adv. Neural Inf. Process. Syst. 2019, 32, 1–13. [Google Scholar]
  77. Roziere, B.; Lachaux, M.A.; Chanussot, L.; Lample, G. Unsupervised translation of programming languages. Adv. Neural Inf. Process. Syst. 2020, 33, 20601–20611. [Google Scholar]
  78. Clement, C.B.; Drain, D.; Timcheck, J.; Svyatkovskiy, A.; Sundaresan, N. PyMT5: Multi-mode translation of natural language and Python code with transformers. arXiv 2020, arXiv:2010.03150. [Google Scholar]
  79. Mastropaolo, A.; Scalabrino, S.; Cooper, N.; Palacio, D.N.; Poshyvanyk, D.; Oliveto, R.; Bavota, G. Studying the usage of text-to-text transfer transformer to support code-related tasks. In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Madrid, Spain, 22–30 May 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 336–347. [Google Scholar]
  80. Elnaggar, A.; Ding, W.; Jones, L.; Gibbs, T.; Feher, T.; Angerer, C.; Severini, S.; Matthes, F.; Rost, B. CodeTrans: Towards Cracking the Language of Silicon’s Code Through Self-Supervised Deep Learning and High Performance Computing. arXiv 2021, arXiv:2104.02443. [Google Scholar]
  81. Ahmad, W.U.; Chakraborty, S.; Ray, B.; Chang, K.W. Unified pre-training for program understanding and generation. arXiv 2021, arXiv:2103.06333. [Google Scholar]
  82. Roziere, B.; Lachaux, M.A.; Szafraniec, M.; Lample, G. Dobf: A deobfuscation pre-training objective for programming languages. arXiv 2021, arXiv:2102.07492. [Google Scholar]
  83. Zügner, D.; Kirschstein, T.; Catasta, M.; Leskovec, J.; Günnemann, S. Language-agnostic representation learning of source code from structure and context. arXiv 2021, arXiv:2103.11318. [Google Scholar]
  84. Bostrom, K.; Durrett, G. Byte pair encoding is suboptimal for language model pretraining. arXiv 2020, arXiv:2004.03720. [Google Scholar]
  85. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  86. Sudholt, S.; Fink, G.A. Evaluating word string embeddings and loss functions for CNN-based word spotting. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; IEEE: Piscataway, NJ, USA, 2017; Volume 1, pp. 493–498. [Google Scholar]
  87. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747. [Google Scholar]
  88. Phan, H.; Jannesari, A. Leveraging Statistical Machine Translation for Code Search. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024), Salerno, Italy, 18–21 June 2024; ACM: New York, NY, USA, 2024; pp. 191–200. [Google Scholar]
  89. Bibi, N.; Rana, T.; Maqbool, A.; Afzal, F.; Akgül, A.; De la Sen, M. An Intelligent Platform for Software Component Mining and Retrieval. Sensors 2023, 23, 525. [Google Scholar] [CrossRef] [PubMed]
  90. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  91. Zhang, X.; Xin, J.; Yates, A.; Lin, J. Bag-of-Words Baselines for Semantic Code Search. In Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021), Bangkok, Thailand, 1–6 August 2021; pp. 88–94. [Google Scholar]
  92. Kim, Y. Convolutional neural networks for sentence classification. arXiv 2014, arXiv:1408.5882. [Google Scholar]
  93. Cho, K.; Van Merriënboer, B.; Bahdanau, D.; Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv 2014, arXiv:1409.1259. [Google Scholar]
  94. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 88–94. [Google Scholar]
  95. Husain, H.; Wu, H.H.; Gazit, T.; Allamanis, M.; Brockschmidt, M. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv 2019, arXiv:1909.09436. [Google Scholar]
Figure 1. Methodology for the proposed C2B hybrid model.
Figure 2. Proposed architecture for C2B hybrid model.
Figure 3. Training of C2B hybrid model.
Figure 4. Learning code embeddings by minimizing cosine embedding loss.
Figure 5. Learning code embeddings by minimizing binary cross-entropy loss.
Figure 6. C2B hybrid model network architecture.
Figure 7. Total time taken to answer n queries. (a) CodeT5 time for n queries; (b) C2B hybrid model time for n queries.
Figure 8. Accuracy of both models in response to n queries.
Table 1. C2B hybrid model classifier architecture.

Layer | Input Shape | Output Shape | # Parameters
CodeT5 | [1, 768] | [1, 768] | -
Bi-LSTM | [1, 768] | [1, 768] | -
Feature concatenation | [1, 1536] | [1, 1536] | -
Linear-1 | [1, 1536] | [1, 768] | 1,180,416
Tanh | [1, 768] | [1, 768] | -
Linear-2 | [1, 768] | [1, 1] | 1538
Sigmoid | [1, 1] | [1, 1] | -
Softmax pooling | [1, 1] | [1, 1] | -
Table 2. CodeSearchNet dataset splits.

Language | Training | Validation | Test
Java | 1,402,976 | 70,146 | 87,682
Python | 1,274,986 | 63,833 | 79,787
Javascript | 1,505,984 | 75,299 | 94,149
Ruby | 192,825 | 9,650 | 12,064
Go | 10,000 | 5,000 | 5,000
PHP | 99,986 | 4,999 | 4,999
Total | 4,576,757 | 228,927 | 283,681
Table 3. Results of the original CodeT5 and our C2B model on the clone detection task. Figures in bold refer to the best values, whereas underlined figures refer to the second-best values for the evaluation metrics.

Architecture | Precision | Recall | F1 Score
Original CodeT5 | 0.9526 | 0.9474 | 0.9500
Original Bi-LSTM | 0.5453 | 0.6102 | 0.5759
Training Strategies for C2B Model
Fine-tuned C2B model (CEL) | 0.6242 | 0.9346 | 0.7485
Fine-tuned C2B model (BCEL) | 0.8972 | 0.9074 | 0.9023
Table 4. Results on the clone detection task. Figures in bold refer to the best values, whereas underlined figures refer to the second-best values for the evaluation metrics.

Architecture | Precision | Recall | F1 Score
NBoW [91] | 0.90 | 0.01 | 0.01
CNN [92] | 0.93 | 0.02 | 0.03
BiRNN [93] | 0.92 | 0.74 | 0.82
selfAtt [94] | 0.92 | 0.94 | 0.93
RoBERTa [56] | 0.949 | 0.922 | 0.935
CodeBERT [55] | 0.947 | 0.934 | 0.941
FA-AST [53] | 0.96 | 0.94 | 0.95
GraphCodeBERT [54] | 0.948 | 0.952 | 0.95
CodeT5 [18] | 0.952 | 0.947 | 0.95
Proposed C2B hybrid model | 0.897 | 0.907 | 0.902
Table 5. Results on code search. Figures in bold refer to the best values, whereas the underlined figures refer to the second-best values for the MRR.

Model | Ruby | Javascript | Go | Python | Java | Php | Overall
NBoW | 0.162 | 0.157 | 0.330 | 0.161 | 0.171 | 0.152 | 0.189
CNN | 0.276 | 0.224 | 0.680 | 0.242 | 0.260 | 0.260 | 0.324
BiRNN | 0.213 | 0.193 | 0.688 | 0.290 | 0.338 | 0.338 | 0.338
selfAtt | 0.275 | 0.287 | 0.723 | 0.398 | 0.404 | 0.426 | 0.419
RoBERTa | 0.587 | 0.517 | 0.850 | 0.587 | 0.599 | 0.560 | 0.617
CodeBERT | 0.679 | 0.620 | 0.882 | 0.672 | 0.676 | 0.628 | 0.693
FA-AST | 0.628 | 0.562 | 0.859 | 0.610 | 0.620 | 0.579 | 0.643
GraphCodeBERT | 0.703 | 0.644 | 0.897 | 0.692 | 0.691 | 0.649 | 0.713
CodeT5 | 0.872 | 0.731 | 0.921 | 0.712 | 0.723 | 0.710 | 0.801
Proposed C2B hybrid model | 0.863 | 0.730 | 0.901 | 0.703 | 0.711 | 0.731 | 0.799
Table 6. Performance metrics of the C2B hybrid model with and without CodeT5.

Model | Precision | Recall | F1-Score | Accuracy
Original C2B model | 0.897 | 0.907 | 0.820 | 0.902
Without CodeT5 | 0.700 | 0.650 | 0.670 | 0.750
Table 7. Performance metrics of the C2B hybrid model with and without the attention mechanism.

Model | Precision | Recall | F1-Score | Accuracy
Original C2B model | 0.897 | 0.907 | 0.820 | 0.902
Without attention mechanism | 0.820 | 0.780 | 0.800 | 0.850
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
