Review

A Review of Deep Learning-Based Binary Code Similarity Analysis

1 School of Cyber Science and Engineering, Information Engineering University, Zhengzhou 450001, China
2 School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450002, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(22), 4671; https://doi.org/10.3390/electronics12224671
Submission received: 26 September 2023 / Revised: 27 October 2023 / Accepted: 14 November 2023 / Published: 16 November 2023
(This article belongs to the Special Issue AI in Cybersecurity, 2nd Edition)

Abstract

Against the backdrop of highly developed software engineering, code reuse has been widely recognized as an effective strategy that significantly alleviates the burden of development and enhances productivity. However, improper code reuse can introduce security risks and license issues. Because the source code of much software is difficult to obtain, binary code similarity analysis (BCSA) has been extensively applied in fields such as bug search, code clone detection, and patch analysis. This review selects 39 BCSA papers published between 2016 and 2022 at top-tier and emerging conferences in artificial intelligence, network security, and software engineering for in-depth analysis. The central focus lies on methods utilizing deep learning technologies, with a thorough summary and organization of the application and implementation details of the various deep learning techniques involved. Furthermore, this study summarizes the research patterns and development trends in this field and proposes potential directions for future research.

1. Introduction

With the progression of science and technology, electronic devices, software, and the Internet have become integral components of daily life. The continual improvement of Internet technology, the frequent updates and upgrades of software applications, and the ease of use of software have brought immense convenience to people, even as the network environment grows increasingly complex. However, these developments also present significant challenges to software developers in terms of development and maintenance. The utilization of open source software not only reduces the workload of developers but also transfers maintenance responsibilities to third-party developers. Despite its benefits, including improved development efficiency, the extensive use of open source software also entails several risks. For instance, incorporating open source software that contains vulnerabilities introduces those vulnerabilities into the engineering code. Additionally, the unauthorized use of open source software in project code may result in license compliance issues.
The “2022 Open Source Security and Analysis Report” [1] released by Synopsys highlights the prevalence of open source code across industries. Among the 17 industries studied, those related to computer hardware and semiconductors, network security, energy and clean technology, and the Internet of Things have codebases that are entirely composed of open source code. The remaining industries, whose open source usage ranges from 93% to 99%, still have significant portions of their codebases relying on open source software. The report also indicates that this extensive use of open source code has brought both benefits and risks. For example, in the Internet of Things sector, 100% of codebases use open source software, and 64% of those codebases are vulnerable. Similarly, in the aerospace, automotive, transportation, and logistics industries, 97% of codebases contain open source code, and 60% of those codebases have security vulnerabilities.
In late 2021, a zero-day vulnerability was identified in Apache Log4j, a widely used logging library. This vulnerability, known as Log4Shell (CVE-2021-44228) [2], enables an attacker to execute arbitrary code on an affected server. The first documented attacks exploiting it occurred on 9 December 2021, initially targeting Java Edition 1.18 of Microsoft’s Minecraft game. According to the attack cases documented in the GitHub repository YfryTchsGD/Log4jAttackSurface, this vulnerability affects a range of popular services and platforms, including Apple iCloud, QQ Mailbox, the Steam Store, Twitter, and Baidu search. This highlights the potentially far-reaching consequences of vulnerabilities in widely used open source codebases.
In terms of license security, works of innovation (including software) are protected by exclusive copyright by default. Any use, copying, distribution, or modification of the software without the express permission of the creator/author, in the form of an authorized license, is legally prohibited. Even the most lenient open source licenses impose obligations on users. License risk arises when the license of open source code present in a codebase conflicts with the overarching license of that codebase. For instance, the GNU General Public License (GPL) generally regulates the use of open source code in commercial software, but commercial software vendors may neglect the mandates of the GPL, resulting in license conflicts. By industry, computer hardware and semiconductors have the highest percentage of codebases with open source license conflicts at 93%, followed by the Internet of Things at 83%. Conversely, healthcare, health tech, and life sciences have the lowest percentage at 41%.
BCSA constitutes a strategic approach to addressing security vulnerabilities arising from code reuse when source code is unavailable. By measuring and comparing the similarity between a target binary function and known vulnerable functions, we can perform a preliminary assessment of the potential vulnerability of the target function. Such a comparison framework can be applied to a single binary function match and extended to multiple matches, that is, indexing the target function in a global vulnerability database. Analogously, this methodology can also help reveal covert code plagiarism and potential licensing risks.
The reuse of open source code and the associated licensing issues threaten both network security and copyright protection, yet source code is often unavailable during program analysis, and stable, adaptable dynamic analysis tools for embedded devices are limited. As a result, researchers have begun to investigate the detection of code reuse using BCSA techniques, inspired by technologies such as natural language processing (NLP) and graph neural networks (GNNs), and have achieved significant progress. However, there is a lack of comprehensive literature presenting these recent advancements. A literature review by Haq et al. [3] summarizes the development of BCSA technology in the two decades prior to 2019 and provides a systematic analysis of the technical details of BCSA methods. Kim et al. [4] analyzed 43 BCSA papers from 2014 to 2020, outlined the problems in current research, and offered solutions. Yu et al. [5] evaluated 34 works, focusing specifically on their performance in searching for vulnerabilities in embedded device firmware.
This article collects representative binary similarity analysis methods proposed from 2016 to 2022 and evaluates and classifies their technological features in detail, with a particular emphasis on how they incorporate deep learning technologies. These works appear not only in top conferences and selected secondary conferences in the fields of cyber security and software engineering but also in top conferences and journals in the fields of data mining and artificial intelligence, amounting to a total of 39 papers. In this article, binary code similarity analysis is taken to mean processing and comparing two segments of binary code; methods that require auxiliary information from source code or other external binary code are therefore not included.
The structure of this article is arranged as follows: Firstly, the necessity of binary similarity analysis is discussed (Section 1), along with its basic process and the challenges it faces (Section 2). Subsequently, through the analysis and classification of existing work, the developmental trends of BCSA technologies are summarized (Section 3). For studies that implement deep learning technologies, thorough analyses and comparisons of their application are conducted (Section 4). Lastly, a summary of the related existing technologies is provided, and potential future research directions are proposed (Section 5).

2. Basic Process of BCSA

The study of software similarity analysis encompasses both source code similarity analysis and BCSA. When the source code is readily accessible, source code similarity analysis is frequently conducted to examine the reuse of vulnerable code segments or the utilization of unlicensed code, for instance in the case of interpreted languages such as Java or Python. However, in most cases, the target program is in binary form, and it becomes challenging to obtain the source code. Hence, BCSA plays a significant role in the research of code similarity analysis. This section will provide an overview of BCSA from two perspectives: the transformation process from source code to binary code and the fundamental procedures involved in BCSA.

2.1. Compile Preprocessing

Binary code is the machine code that results from the compilation of source code and can be executed directly by the central processing unit. It comprises a sequence of binary digits (0s and 1s) and is not easily readable by humans. To facilitate the analysis of binary code, reverse engineering techniques are employed to translate the machine code into assembly language, and tools such as debuggers are utilized to simplify manual examination.
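As a concrete illustration, the following is a minimal sketch of this translation step using the Capstone disassembly framework; the byte string is a hypothetical x86-64 fragment chosen purely for demonstration.

```python
# A minimal sketch of translating machine code back into assembly with
# the Capstone disassembly framework. The byte string below is a
# hypothetical x86-64 fragment used only for illustration.
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

code = b"\x55\x48\x89\xe5\x89\x7d\xfc\x5d\xc3"  # push rbp; mov rbp, rsp; ...
md = Cs(CS_ARCH_X86, CS_MODE_64)
for insn in md.disasm(code, 0x1000):  # 0x1000 is an arbitrary base address
    print(f"0x{insn.address:x}\t{insn.mnemonic}\t{insn.op_str}")
```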
As shown in Figure 1, the typical process of transforming source code into binary code usually encompasses four stages: pre-compilation, compilation, assembly, and linking. The pre-compilation phase primarily manages operations such as the expansion of header files, substitution of macros, and the elimination of comments. The compilation stage carries out lexical, syntax, and semantic analysis on the code, optimizes it, and transforms it into assembly code. The assembly stage transforms assembly code into machine code. Finally, the linking stage integrates the compiled object files into a binary form to generate the final executable file.
The compilation process is responsible for accurately converting the source code into a binary format that the CPU can execute directly. However, the outcome of this process is not fixed, as various factors such as the choice of compiler, optimization options, target CPU architecture, and operating system can all have an impact on the final machine code produced. Consequently, the same source code can result in different binary code outputs through different compilation paths, presenting a challenge for BCSA.
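The following sketch, assuming GCC is installed and a source file hello.c exists, walks through the four stages and then shows that two compilation paths (-O0 versus -O2) generally yield binaries with different hashes:

```python
# A sketch of the four compilation stages and of how different
# optimization levels yield different binaries from the same source.
# Assumes GCC is installed and a file hello.c exists (illustrative names).
import hashlib
import subprocess

subprocess.run(["gcc", "-E", "hello.c", "-o", "hello.i"], check=True)  # pre-compilation
subprocess.run(["gcc", "-S", "hello.i", "-o", "hello.s"], check=True)  # compilation
subprocess.run(["gcc", "-c", "hello.s", "-o", "hello.o"], check=True)  # assembly
subprocess.run(["gcc", "hello.o", "-o", "hello"], check=True)          # linking

# Same source, two compilation paths: the resulting machine code differs.
for opt in ("-O0", "-O2"):
    subprocess.run(["gcc", opt, "hello.c", "-o", f"hello{opt}"], check=True)
    digest = hashlib.sha256(open(f"hello{opt}", "rb").read()).hexdigest()
    print(opt, digest)  # the two digests will generally not match
```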

2.2. Basic Process of BCSA

The central objective of BCSA is to determine whether two binary functions originate from the same source code by analyzing their similarity. In certain circumstances, the one-to-one comparison of binary functions can be expanded: in vulnerability search, it may be extended to one-to-many comparison, and in code clone detection, to many-to-many comparison.
This paper presents a clear depiction of the technical characteristics of BCSA technology by organizing the work into three stages, as depicted in Figure 2. These stages are the feature extraction stage, the feature representation stage, and the feature comparison stage.
Phase 1: Feature Extraction. The primary task in this stage is to obtain the inherent features of the binary function through the utilization of analysis tools such as IDA Pro [6], BAP [7], Angr [8], Valgrind [9], etc. These inherent features refer to those that are directly obtained from the analysis tools without any additional processing, such as program control flow graphs and call graphs. The input to this stage is a set of raw binary functions, such as binary files, and the output is the raw binary function features. These features are then subjected to further processing in the feature representation stage before being compared in the feature comparison stage. As an illustration, the work performed in the feature extraction stage in Gemini [10] involves the extraction of function control flow graphs (CFGs) and basic block information.
Phase 2: Feature Representation. The main objective of this stage is to process the inherent features obtained in the feature extraction stage according to the requirements of the approach. The input of this stage is the inherent features of the function produced by the feature extraction stage, and the output is a form of data that can be directly utilized for similarity calculation in the feature comparison stage. As an example, the work performed in the feature representation stage of Gemini [10] encompasses two main tasks: first, the basic block information is transformed into a numeric vector representation serving as nodes of the control flow graph (CFG), resulting in an attributed CFG (ACFG) carrying basic block attribute information. The ACFG is then represented as a vector through an end-to-end neural network, providing a representation of the function that encompasses both its structural and semantic information. This vector representation is used to directly calculate the similarity of functions in the feature comparison stage.
Phase 3: Feature Comparison. The primary task of this stage is to employ an appropriate method to calculate the similarity between pairs of functional features generated in the feature representation stage. The input of this stage is the representation of the functional features directly produced by the feature representation stage, and the output is the score of similarity between the two functions obtained through the similarity calculation. As an illustration, in the case of Gemini [10], the feature comparison stage employs the cosine distance method to determine the similarity between two feature vectors representing two functions.
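To make the three stages concrete, the following toy sketch mirrors the pipeline with stand-ins: the extraction step fakes simple byte statistics instead of calling a disassembler such as IDA Pro or Angr, and the representation step normalizes a vector instead of applying a trained network such as Gemini’s.

```python
# A toy sketch of the three-stage BCSA pipeline described above. All
# feature choices here are illustrative stand-ins, not any paper's method.
import numpy as np

def extract_features(binary_function: bytes) -> dict:
    # Stage 1 (stubbed): return inherent features; here, a byte histogram
    # stands in for CFGs and basic block statistics.
    hist = np.bincount(np.frombuffer(binary_function, np.uint8), minlength=256)
    return {"byte_histogram": hist}

def represent(features: dict) -> np.ndarray:
    # Stage 2: turn raw features into a fixed-length vector. A real system
    # would apply a neural network here instead of simple normalization.
    vec = features["byte_histogram"].astype(float)
    return vec / (np.linalg.norm(vec) + 1e-9)

def compare(v1: np.ndarray, v2: np.ndarray) -> float:
    # Stage 3: cosine similarity, as Gemini uses for its feature vectors.
    return float(np.dot(v1, v2))  # vectors are already L2-normalized

f1, f2 = b"\x55\x48\x89\xe5\xc3", b"\x55\x48\x89\xe5\x90\xc3"
print(compare(represent(extract_features(f1)), represent(extract_features(f2))))
```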

3. Classification of BCSA

In this section, the evolution of the BCSA field from 2016 to 2022 will be described through a preliminary categorization of the studies in this field. Table 1 presents a classification and comparison of BCSA works from the past seven years based on three criteria: the methods employed for BCSA, the types of features selected, and the availability of the project code and datasets for disclosure.

3.1. Analysis Methods

From the perspective of analysis methods, BCSA works can be divided into three categories, namely static analysis, dynamic analysis, and a combination of dynamic and static analysis.
Static analysis refers to the examination of a binary program or function without executing it. This method does not require the preparation of an operational environment or repetition due to coverage issues, as it is performed without running the target binary program or function. Compared to dynamic analysis, static analysis is faster and more straightforward, and typically utilizes the statically extracted information of the function structure, raw byte data, intermediate representation (IR) information, and function slices to create functional features for BCSA. For instance, the work carried out by Genius [16] involves the static extraction of function graph structure information and basic block-level statistical data, which are then combined to form an ACFG graph as a representation of the function. The MASES2018 [27] study utilizes the original byte information of functions, transforming binary files into pixels for feature representation. Oscar [42] uses IR information from binary functions and processes feature representation through NLP techniques. The study by Esh [15] also involves the extraction of program slices and the application of a program verification solver for semantic equivalence judgment.
However, the absence of dynamic execution in static analysis often leads to a higher false positive rate compared to dynamic analysis. In the past 7 years of research, 79.4% of studies have chosen static analysis as their method.
Dynamic analysis, as its name suggests, involves analyzing a binary program or function by executing it. During execution, information such as dynamic slices, inputs and outputs, and program behavior is collected and monitored. This analysis can be accomplished either through actual execution of the program or by simulation. Techniques such as fuzz testing and dynamic instrumentation are often used to obtain the runtime behavior of the program. For instance, IMF-sim [18] collects the dynamic behavior of binary functions through in-memory fuzzing and represents functions through running traces. MockingBird [13] employs dynamic instrumentation to gather the semantic features of a function, such as operand count and system call attributes, during program execution. BinSim [19] acquires dynamic function features through dynamic slicing and uses symbolic execution to determine function equivalence. WSB [28] obtains the dynamic control flow graph (DCFG) of a function by executing it and then converts the DCFG into a birthmark, which is used to calculate function similarity with an extended cosine algorithm. Compared to static analysis, dynamic analysis can provide more realistic and accurate function features, but it incurs significant time overhead, especially in large-scale analysis. In the past 7 years of research, only 10.3% of the work has utilized dynamic analysis.
Hybrid analysis is a combination of both static and dynamic analysis methods that seeks to harness the benefits of both. By utilizing fast static analysis to obtain preliminary results, followed by high-accuracy dynamic analysis, the approach is able to address the issue of high false positive rates in pure static analysis. For instance, BinMatch [26] employs simulation to extract semantic features for similarity analysis after initial results have been obtained through static analysis. Meanwhile, Patchecko [38] performs dynamic analysis on the basis of candidate functions acquired through static detection, utilizing runtime binary injection and remote debugging to capture the execution trajectories of two functions and determine their similarity. Over the past seven years of research, 10.3% of the work carried out has utilized hybrid analysis.

3.2. Feature Type

The type of feature selected is a crucial aspect of BCSA work as it determines the method used for processing and comparing features. Various types of features offer different strengths and weaknesses in representing binary functions. For instance, raw byte features can be quickly and easily obtained but are susceptible to cross-architecture, cross-optimization, and code obfuscation issues, while path features can accurately capture the execution information of functions but may result in coverage and overhead issues.
The original binary byte information refers to the unprocessed bytes read directly from the binary file. While such features can be extracted efficiently, the bytes produced from the same source code may vary significantly under different compilation processes, resulting in poor robustness. ACCESS2020 [40] directly extracts the raw binary bytes and transforms them into vectors and signals for similarity analysis.
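As a rough sketch of how raw-byte features can be obtained, the following treats the bytes of a binary as grayscale pixels, in the spirit of the pixel transformation used by MASES2018 [27]; the file path and image width are illustrative choices, not any paper’s exact settings.

```python
# A sketch of raw-byte feature extraction: the bytes of a binary are
# reshaped into a 2-D grayscale "image". File name and row width are
# illustrative assumptions.
import numpy as np

raw = np.frombuffer(open("target.bin", "rb").read(), dtype=np.uint8)
width = 64                                        # assumed row width
rows = len(raw) // width
image = raw[: rows * width].reshape(rows, width)  # 2-D "pixel" view
print(image.shape, image.dtype)
```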
The path information refers to a sequence of information that reflects a portion of the program control flow and serves as a representation of the program’s execution path. This information can be obtained through various methods such as fuzzing, dynamic instrumentation, random walk, or simulation. For instance, IMF-sim [18] uses fuzzing to dynamically execute binary functions and build function execution traces, while BinMatch [26] employs dynamic instrumentation to acquire semantic signatures and runtime information of functions. On the other hand, Asm2vec [33] and DeepBinDiff [37] obtain program execution sequences through random walk on the statically obtained function control flow graph. Lastly, Trex [41] generates micro-traces of functions under various instruction set architectures through simulation to train the function representation model.
Structural information refers to information that encompasses the structure of the function: the execution of a function is not linear but follows paths determined by a graph structure. The representation of functions through graph structures is therefore often more effective than the sequential representation provided by raw bytes. Function structure information is typically represented by the control flow graph (CFG) of the function, along with derivatives of the CFG such as the attributed control flow graph (ACFG), the inter-procedural control flow graph (ICFG), and the labeled semantic flow graph (LSFG). DiscovRE [17] employs graph matching algorithms to assess the similarity of functions through their CFGs. Kam1n0 [11] utilizes both CFGs and locality-sensitive hashing algorithms to perform binary function similarity analysis. Ordermatters [39] derives the execution order information of functions by applying a convolutional neural network (CNN) to their CFGs. Genius [16] concatenates eight attributes from the basic block information of a function into a vector that serves as a node of its ACFG. Patchecko [38] builds upon ACFGs and expands the 8 attributes to 48. VulSeeker [22] employs ACFGs and combines function CFG and call graph (CG) data flow information to construct LSFGs containing both structural and data flow information as semantic features. DeepBinDiff [37] uses a random walk approach on the ICFGs obtained by combining CFGs and CGs to retrieve the execution traces of functions.
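A minimal sketch of an ACFG, with made-up basic block names and feature vectors, is shown below: each node carries a numeric attribute vector, as in Genius [16] and Gemini [10].

```python
# A sketch of an attributed control flow graph (ACFG): each basic block
# becomes a node whose attribute is a numeric feature vector. Block
# names, edges, and features are invented for illustration.
import networkx as nx
import numpy as np

acfg = nx.DiGraph()
# e.g., features: [num_instructions, num_calls, num_arithmetic_ops]
acfg.add_node("bb0", features=np.array([5.0, 1.0, 2.0]))
acfg.add_node("bb1", features=np.array([3.0, 0.0, 1.0]))
acfg.add_node("bb2", features=np.array([7.0, 2.0, 4.0]))
acfg.add_edges_from([("bb0", "bb1"), ("bb0", "bb2"), ("bb1", "bb2")])

for node, data in acfg.nodes(data=True):
    print(node, data["features"])
```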
Strands refer to segments of code that are extracted from programs according to specific criteria. By identifying a specific variable and analyzing the program in reverse, it is possible to obtain an instruction sequence and partial data flow information related to that variable. Several methods for BCSA, such as Esh [15], GitZ [20], Zeek [25], and FirmUP [24], utilize the concept of strands to obtain comparable code fragments.
Intermediate representation (IR) is an intermediary expression form generated after a program undergoes lexical, syntax, and semantic analysis in the compiler front end; it is then optimized by the back end to generate the target code. IR is architecture-independent and can therefore be used for architecture-independent analysis of functions. In the context of BCSA, converting code from its binary form to an IR enables platform-independent analysis and the identification of similar functions across different architectures. Bingo [14] converts assembly language to the reverse engineering intermediate language (REIL) and performs constraint solving on this representation. Xmatch [21] transforms object code into the LLVM IR and extracts conditional statements from it. Oscar [42] employs NLP techniques to construct a language model over the IR of functions.
Information regarding function calls refers to the circumstances under which a function invokes other functions or is invoked by them. This information is typically represented in a program’s call graph. Functions of various types not only exhibit differences in their structural makeup but also demonstrate distinctive differences in their call graphs. For instance, the function calls in an image processing program are vastly different from those in a network communication program. Function calls can, to some extent, differentiate between binary programs and also offer a certain level of resistance to code obfuscation. Code obfuscation techniques, such as Obfuscator-LLVM, primarily complicate the intermediate representation generated by the compiler’s front end by introducing false control flows, meaningless instructions, and switch-case statements and by replacing instructions. Such obfuscation complicates the binary function’s control flow graph and hinders analysis but has little to no impact on function call information. αdiff [23] utilizes call graph information to reflect the inter-function semantic features of functions. FuncNet [34] incorporates basic block features and call interface information and employs the Structure2Vec [48] graph embedding neural network to convert binary functions into high-dimensional vector representations of function features. TIKNIB [4] creates a feature vector consisting of basic block information, function control flow graph (CFG) information, and numeric features derived from the call graph (CG), and employs a greedy algorithm to measure differences between multiple interpretable feature values. Codee [43] utilizes NLP technology to extract basic block semantic information from the inter-procedural control flow graph (ICFG) produced by combining the CFG and CG and to generate token embeddings. Bingo-E [29] employs function call information as one of the high-level semantic features representing functions and ultimately performs a weighted aggregation of features to obtain function similarity.
Word information refers to the textual information of the instructions within a function. With the advancement of NLP technology, several studies have sought to extract instruction semantics from the text of function instructions to obtain basic block or function semantics. To achieve this, a number of studies have employed word embedding methods from NLP, such as Word2Vec [49] or BERT [50]. For example, BinDNN [12] uses a neural network classifier to determine whether two functions were compiled from the same source code, while InnerEye [30], SAFE [31], InstrModel [32], GENN [35], and BinDeep [44] all utilize the Word2Vec method to map instructions into vector representations. Asm2Vec [33] employs the improved PV-DM [51] method based on Word2Vec to learn token-level embeddings, while jTrans [46], Oscar [42], and Ordermatters [39] utilize the BERT method to extract instruction sequence semantics.
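The following is a minimal sketch of learning instruction embeddings with Word2Vec via the gensim library; the tiny corpus of normalized instructions is invented for illustration, whereas real systems train on millions of disassembled functions.

```python
# A minimal sketch of learning instruction embeddings with Word2Vec, as
# done (with many refinements) by InnerEye, SAFE, and others. The corpus
# is a handful of hypothetical normalized instructions.
from gensim.models import Word2Vec

corpus = [
    ["push rbp", "mov rbp, rsp", "mov eax, 0", "pop rbp", "ret"],
    ["push rbp", "mov rbp, rsp", "xor eax, eax", "pop rbp", "ret"],
]
model = Word2Vec(sentences=corpus, vector_size=64, window=2,
                 min_count=1, sg=1)          # sg=1 selects skip-gram
vec = model.wv["mov rbp, rsp"]               # a 64-dimensional embedding
print(vec.shape, model.wv.most_similar("ret", topn=2))
```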
Embedding information refers to a representation of digital or semantic features as vectors produced through learning or combination techniques. For example, the vector representation of a basic block node in an ACFG can be created by concatenating its digital features into an 8-dimensional vector. Token or instruction embeddings, obtained through NLP-based methods such as Word2Vec or BERT, are high-dimensional vectors that capture the semantics of the token or instruction through self-supervised learning by neural networks. The use of embeddings as representations of functional features is motivated by the ability of NLP-based technologies to express words as high-dimensional vectors containing semantic information, as well as the efficiency of vector distance calculations for determining similarity. Gemini [10] utilizes Structure2Vec [48], an end-to-end graph embedding network, to convert a function into a vector representation after obtaining its ACFG. GMN [36] presents a novel attention-based cross-graph matching mechanism that exhibits more efficient information flow and improved feature extraction capability compared to graph embedding models. Asteria [45] processes the abstract syntax tree (AST), converts it into vectors via a tree-based embedding extraction method, and calculates the similarity. XBA [47] transforms binary files into a binary disassembled graph (BDG) that encompasses rich binary information and trains graph convolutional networks (GCNs) to generate entity embeddings.

3.3. The Evolution of BCSA Techniques

As BCSA technology evolves, the methods employed in BCSA research also change. As Table 1 illustrates, there has been a clear trend in the features selected to characterize functions over the years, marked by a decrease in the use of strands and IR, the widespread adoption of deep learning-based methods, and a movement towards open project code and datasets.
The code fragment known as a strand, which contains data flow information to some extent, was utilized in four studies prior to 2019 but has not been employed in any study since. Similarly, the architecture-independent feature representation, IR, played a crucial role in cross-architecture binary similarity analysis and was utilized in nine studies before 2019, yet it has been employed in only a single study since. This may be attributed to two factors. Firstly, both strands and IR require a non-negligible amount of time and space to extract from binary programs, particularly for offline operations. Secondly, the widespread adoption of learning-based approaches since 2019 has made their powerful characterization capabilities and superior performance increasingly popular among researchers.
The popularity of the embedding form as a feature in BCSA can be observed from its growth over the past seven years. Embedding features can be obtained through digital feature concatenation or learning-based techniques. The success of deep learning methods has made embedding features an attractive choice for researchers due to their lower cost of similarity calculation and more effective representation compared to graph isomorphism and tree matching methods. In fact, almost all studies after 2019 adopted the embedding form; only ACCESS2020 [40] opted not to, instead utilizing a signal-based method for function representation.
Since 2019, deep learning-based methods have demonstrated strong performance and become the preferred choice for nearly all researchers. These methods primarily fall into three categories: those utilizing GNNs [36], those using NLP-based word embedding techniques [42], and those combining both [37,43,47]. Research that uses GNNs converts the graph representation of binary functions into a vector representation through the GNN method and then assesses function similarity through either vector distance calculations or neural network classification. Approaches using word vectors transform opcodes or entire assembly instructions into vectors through language model techniques, yielding vector representations of basic blocks or functions. Methods that combine GNNs and word embeddings combine the instruction embeddings with the structural information of the function graph, resulting in a function-level embedding that incorporates not only the semantics of the instructions but also the structure of the function.
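The following toy sketch, with illustrative dimensions and an untrained aggregation rule, shows the shape of the combined approach: instruction embeddings are pooled into basic block features and then propagated over the CFG to yield a function-level embedding.

```python
# A toy sketch of the combined NLP + GNN approach. Real systems train
# these steps end to end; all sizes, edges, and the aggregation rule
# here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
# Per-basic-block instruction embeddings (pretend they come from Word2Vec).
block_instr_embs = [rng.normal(size=(n, 16)) for n in (4, 2, 3)]
node_feats = np.stack([e.mean(axis=0) for e in block_instr_embs])  # (3, 16)

adj = np.array([[0, 1, 1],     # CFG edges: bb0->bb1, bb0->bb2, bb1->bb2
                [0, 0, 1],
                [0, 0, 0]], dtype=float)
adj_sym = adj + adj.T + np.eye(3)  # undirected view with self-loops

h = node_feats
for _ in range(2):                 # two rounds of message passing
    h = np.tanh(adj_sym @ h)       # aggregate neighbor features

func_embedding = h.mean(axis=0)    # pool node states into one vector
print(func_embedding.shape)        # (16,)
```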
Additionally, researchers have become increasingly willing to disclose their project code and datasets. Using 2019 as a dividing line, of the 20 papers published prior to 2019, 35% of the project codes and 20% of the datasets were open source; among the 19 papers published after 2019, the percentages rise to 73.7% and 57.9%, respectively. Despite this trend towards transparency and sharing, the field of BCSA still lacks industry-recognized benchmark datasets, unlike other areas of artificial intelligence such as computer vision and NLP, which makes it difficult to compare across studies. To address this, Kim et al. introduced the BINKIT dataset [4], the first comprehensive and replicable benchmark dataset for BCSA. It comprises 243,128 binaries and 36,256,322 functions, encompassing diverse combinations of compilers, compilation options, and target architectures. Additionally, Marcelli et al. [52] released a benchmark dataset that includes different compilation paths, enabling the reproduction and analysis of numerous papers.
In summary, the trend of BCSA research has shifted towards the utilization of deep learning-based methods, while there have been attempts to standardize the field through benchmarking of existing studies. Deep learning methods have intrinsic potential and bring positive effects to BCSA. Compared to traditional techniques based on statistical instruction counts and syntactic feature extraction, deep learning-based techniques have showcased their advantages in obtaining effective semantic features. Moreover, these advantages are not limited to a single modality: they can leverage various means, such as assembly code or graph structure, to acquire semantic information. Concurrently, by utilizing methods like neural network classifiers or cosine distance, deep learning techniques enable rapid comparison, significantly enhancing computational efficiency during the comparison stage.
Although deep learning techniques have garnered substantial attention in the realm of BCSA, they are not without their limitations and challenges.
The first is the opacity issue, the so-called “black box” problem inherent to deep learning algorithms. The complex operating mechanisms of these models, embedded in large-scale network structures, can make it challenging for researchers to comprehend and correct the behavior of a model when erroneous match results occur.
Secondly, the sensitivity of deep learning models to noise presents a significant problem. When dealing with binary codes, noise may be introduced due to compiler optimizations or variations in compiler flags. Such noise can impact the generated binary codes, potentially diminishing the precision of code similarity analysis based on deep learning.
The scalability and computational time cost of deep learning models pose a prominent challenge, due to their high demands for computational and storage resources, as well as the need for efficiency and scalability in large-scale tasks. As the complexity of the model increases, the training time may increase significantly, posing an obstacle for environments that require rapid iteration and optimization.
Lastly, the issue of model generalization capacity cannot be overlooked. Deep learning models run the risk of overfitting, where their performance on unfamiliar data falls short, despite demonstrating impressive effectiveness on known training data. Such constraints may hinder the model’s capacity for conducting effective similarity analysis on new, unlabeled binary codes.

4. How Deep Learning Technology Is Applied to Existing Technologies

The trend of BCSA research has shifted towards the utilization of deep learning-based methods since 2019. These methods have demonstrated effective performance and have been widely adopted in the field. This study aims to provide a comprehensive overview of the application of deep learning in BCSA, focusing on both text semantic features and functional structure features.

4.1. Text Semantic Features

Text semantic features are features containing semantic information obtained from the original binary program data, assembly instructions, or operand tokens. Program language models are typically constructed with NLP techniques. Table 2 summarizes the research performed on BCSA using text semantic features. To facilitate presentation, only the datasets used for training and evaluating the deep learning models are displayed in the table; if the corpus dataset used for language model training differs from that used for evaluation, the corpus dataset is indicated separately (Corpus). It is imperative to consider the dissimilarity between programming languages and natural language when extracting program language models through NLP techniques. Unlike natural language, which follows a linear structure, a programming language has a graph-like structure with branching jump statements. Hence, the inclusion of supplementary information, such as call information or structural information, is crucial in enriching the program language model; a language model derived from plain text alone is unreliable because of varying compilation paths and ambiguity. How to combine word vectors with supplementary information for function representation, and how to determine the similarity of two functions, also require careful consideration. As such, the main focus of these studies involves the selection of appropriate language modeling techniques, the utilization of appropriate auxiliary information for semantic enrichment of functions, and the selection of appropriate comparison methods.
The “Score” column in Table 2 showcases the performance of this research. For clarity and succinctness, we selected a representative result among numerous experimental outcomes in the paper. For instance, in conducting comparative experiments across different optimization spans, we chose the mean value at the maximum optimization span. Please be aware that due to differences in evaluation datasets, methodologies, and metrics amongst various studies, these figures serve only as a reference point. Any comparative assessment must take into account these inherent variations.
Most of the studies summarized here utilize word vector language model technology from NLP to obtain the semantic information of functions. αdiff [23] takes a different approach, using a CNN to extract features from the raw binary bytes and calculate their distance, extracting inter-function features from the function call graph along with their distance, and converting the imported function set into a vector to calculate the overall similarity of the function. BinDNN [12], on the other hand, employs an assembly language vocabulary to encode each opcode, constructing function features and utilizing a deep learning network to train on and classify function samples.
Word2Vec is a widely used unsupervised learning model for obtaining semantic knowledge from a large corpus of text in NLP. Embedding refers to a mapping from the original space to a new multidimensional space, where semantically similar words are mapped to nearby vectors. Word2Vec accomplishes this by learning the central words and contextual relationships within the corpus. InnerEye [30], SAFE [31], InstrModel [32], GENN [35], DeepBinDiff [37], Codee [43], and BinDeep [44] all utilize the Word2Vec approach to generate token or instruction embeddings and construct word vectors. Specifically, InnerEye [30] converts assembly instructions into instruction embeddings, employs long short-term memory (LSTM) networks to represent the instruction embeddings of each basic block as basic block embeddings, and stores the basic block embeddings in a locality-sensitive hashing (LSH) database for efficient online searches. SAFE [31] begins by training an instruction embedding model using the skip-gram method in Word2Vec and then uses a Bi-RNN network to determine the function embedding vector from the sequence of instruction vectors; finally, the similarity between two function embedding vectors is evaluated using a Siamese architecture and cosine distance. InstrModel [32] leverages the continuous bag of words (CBOW) model within the Word2Vec framework to learn instruction models of the same architecture using a center word alignment technique. The model can predict the center word of an instruction belonging to a different architecture from the context of an instruction belonging to one architecture, enabling the learning of cross-architecture instruction models. GENN [35] obtains instruction embeddings as vertex features in the control flow graph (CFG) and aggregates these vectors; the Structure2Vec method is then applied to the CFG to convert it into a vector representation including node feature information, and the similarity is determined through cosine distance. DeepBinDiff [37] first conducts preprocessing to extract the inter-procedural control flow graph (ICFG) and then employs Word2Vec to learn a token embedding model for opcodes and operands. Sequences of instructions containing control flow context information are generated by a random walk algorithm, basic block embeddings are obtained through the token embedding model, and a k-hop greedy matching algorithm matches basic blocks to determine function similarity; the performance of the model was verified against real-world vulnerabilities in the experiments. Codee [43] uses semantic information extracted from ICFG basic blocks to generate token embeddings, utilizes network-based representation technology to generate basic block embeddings by combining token embeddings with CFG structural information, and computes function embeddings with the tensor singular value decomposition (tSVD) algorithm; it finally employs locality-sensitive hashing (LSH) to calculate function similarity. BinDeep [44] categorizes binary function pairs through an RNN classifier after obtaining instruction embeddings, submits them to the corresponding model for similarity assessment, and employs a network model comprising CNN, LSTM, and Siamese architectures to convert the instruction sequences of binary functions into feature vectors; finally, the vector distance is used to determine the similarity.

Asm2Vec [33] models the program’s CFG as a linear sequence of assembly instructions through selective inlining and random walk and then utilizes the PV-DM [51] model improved from Word2Vec to conduct representation learning on the assembly instructions, obtaining word vectors and function vectors. Finally, cosine similarity is used to generate the top-k candidate vectors as the output results.
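A simplified sketch of the Siamese comparison pattern shared by these SAFE-style methods is given below; the linear encoder is a stand-in for the Bi-RNNs or transformers used in practice, and the loss is an illustrative training signal rather than any paper’s exact objective.

```python
# A sketch of Siamese comparison: two function embeddings pass through
# the SAME encoder (shared weights), and cosine similarity is trained
# toward +1 for matching pairs and -1 for non-matching pairs.
import torch
import torch.nn.functional as F

encoder = torch.nn.Linear(64, 32)          # stand-in shared encoder

def similarity(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    return F.cosine_similarity(encoder(f1), encoder(f2), dim=-1)

f1, f2 = torch.randn(8, 64), torch.randn(8, 64)      # a batch of 8 pairs
labels = torch.randint(0, 2, (8,)).float() * 2 - 1   # +1 similar, -1 not
loss = F.mse_loss(similarity(f1, f2), labels)        # illustrative loss
loss.backward()
print(loss.item())
```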
BERT, or bidirectional encoder representations from transformers, is a state-of-the-art pre-trained language representation model. Unlike traditional one-way language models or concatenations of two one-way language models, BERT uses a masked language model (MLM) to generate rich bidirectional language representations. Upon its release, BERT achieved exceptional results in 11 NLP tasks, establishing a new benchmark in the field. Ordermatters [39], Oscar [42], and jTrans [46] build their solutions on BERT. Specifically, Ordermatters [39] employs BERT to extract basic block embeddings as the node features of the control flow graphs (CFGs) and then utilizes the MPNN [53] method from GNN research to convert the CFG containing node features into graph embeddings; this is combined with order information generated by CNN processing of the CFG adjacency matrix. Oscar [42] is the only study published after 2019 that uses intermediate representation (IR) features. It uses BERT to model the language of the IR and integrates the node order information of the function CFG into BERT’s position embedding, allowing the IR language model to learn function structure information during linear language modeling. jTrans [46] incorporates location embeddings into BERT by annotating the target addresses of jump instructions, which are combined with the token embeddings to produce the final embeddings. In addition, it replaces the unsupervised next sentence prediction (NSP) subtask in BERT with a jump target prediction (JTP) task. All of the studies that utilize BERT as a language model employ cosine distance as the final metric for measuring similarity. In a practical task of searching for known vulnerabilities, jTrans achieved a recall rate twice as high as those of the existing state-of-the-art baselines.
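To illustrate the MLM idea on assembly, the following sketch prepares masked training data from an instruction token sequence; the tokenization and mask rate are illustrative rather than taken from any specific paper.

```python
# A sketch of masked language model (MLM) data preparation for assembly,
# in the spirit of the BERT-based methods above: a fraction of tokens is
# replaced by [MASK], and the model must recover them.
import random

random.seed(0)
tokens = ["push", "rbp", "mov", "rbp", "rsp", "xor", "eax", "eax", "ret"]

masked, targets = [], {}
for i, tok in enumerate(tokens):
    if random.random() < 0.15:         # BERT's canonical 15% mask rate
        masked.append("[MASK]")
        targets[i] = tok               # ground truth for the MLM loss
    else:
        masked.append(tok)

print(masked)   # training input
print(targets)  # masked positions and their true tokens
```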
NLP techniques play a pivotal role in extracting semantic information from text. However, the “structural features” still remain a key differentiator between programming languages and natural languages. Relying purely on natural language models when dealing with programming languages often leads to the loss of structural information inherent in the programs. Therefore, nearly all works utilizing NLP techniques incorporate some auxiliary methods to compensate for this lack of structural information.

4.2. Functional Structural Features

Due to the nonlinear characteristics of programming languages, graph representations have been demonstrated to be an effective means of modeling functions. In early studies, graph isomorphism methods were utilized to evaluate the similarity of binary functions, as seen in works such as Multi-MH [54], discovRE [17], and Genius [16]. However, given the substantial computational demands of graph matching algorithms, traditional graph isomorphism techniques are being increasingly replaced by GNN-based methods. The GNN approach transforms the graph structure into a high-dimensional embedding vector through the training of a neural network, enabling the rapid calculation of similarity through measures such as vector distance or cosine similarity, and thereby constitutes a more efficient alternative to traditional graph isomorphism methods. Table 3 summarizes the studies that analyze binary code similarity using function structure features; the Dataset column indicates the dataset utilized for training and evaluating each model. Studies that model function structure features using deep learning center on four key elements: selecting an appropriate method for extracting graph structure information, determining which specific graph to extract it from, identifying which auxiliary information to incorporate, and selecting the most appropriate method for determining function similarity.
Although some studies have utilized both text semantic features and functional structure features, they have been consolidated in the section on text semantic features and are therefore not listed in this section. However, they are still reflected in Table 3 and include Asm2Vec [33], GENN [35], DeepBinDiff [37], Ordermatters [39], and Codee [43].
The study of Gemini [10] is a seminal work in this field. It builds upon the ACFG proposed by Genius [16] and utilizes Structure2Vec [48] to extract graph embeddings from the CFG that includes node attributes; the function similarity is finally calculated using a Siamese architecture and cosine distance. A real-world case study demonstrates that Gemini identifies substantially more vulnerable firmware images than the previous state-of-the-art method, Genius. VulSeeker [22] creates a labeled semantic flow graph (LSFG), which integrates information from both the CFG and the data flow graph of binary functions. The study uses eight manually selected features as basic block semantic features, converts them into numeric vectors, and embeds the LSFG carrying these basic block features using the Structure2Vec [48] method; the similarity of two functions is finally computed using cosine similarity. Among the top ten candidates, VulSeeker identifies 50.00% more vulnerabilities than previous approaches. FuncNet [34] employs basic block features as inputs, along with call interface information, and utilizes the Structure2Vec [48] method to convert the ACFG of a binary function into a high-dimensional vector. It then maps the high-dimensional vector into a grid space using a self-organizing map (SOM) model, reducing the computational cost of comparison. GMN [36] introduces a novel attention-based cross-graph matching mechanism that considers cross-information between the two CFGs in the node feature propagation process, thereby enhancing the identification of relationships between the vertices of the two graphs. The information flow in GMN is more efficient compared to graph embedding models, making it more effective at extracting graph similarity feature information. Asteria [45] employs an abstract syntax tree (AST) to depict binary functions, initially extracting the function AST and preprocessing it to digitize and convert its format. Then, using the tree-LSTM [55] technique, the AST is encoded into a vector, the similarity is calculated, and the function call relationship is utilized to adjust the AST similarity. In a real-world case study, Asteria carried out a vulnerability search on a firmware dataset and successfully detected 75 instances of vulnerabilities. XBA [47] first transforms binary files into binary disassembled graphs (BDGs) containing extensive binary information; graph convolutional networks (GCNs) are utilized to generate graph embeddings as functional representations, and the similarity is finally computed using vector distance methods.
The structural features of functions, compared to textual semantic features, provide greater representational capability. Most works integrating NLP and GNN utilize textual semantic features as the node representation in graph structures [35,37,39,43], while a portion of the research employs statistical features for the same purpose [10].

5. Summary and Prospects

The origin of BCSA can be traced back to 1999, when Baker and colleagues introduced a method for compressing the differences in executable code and developed a prototype tool, named EXEDIFF [56], for generating patches on DEC Alpha executables. Over the course of the past 23 years, a wealth of ideas has been put forth to address the challenge of BCSA: from byte-level comparison to graph isomorphism, incorporating calling information and dynamic traces, from fuzzy hash representations of entire files to high-dimensional vector-based feature representations, and transitioning from traditional methods to deep learning-based approaches. Some unconventional approaches, such as signal theory and game theory, have also been proposed. Presently, deep learning-based methods are the prevailing approach in BCSA. Through the examination and synthesis of current research in this field, this study identifies and summarizes the following recurring and emerging trends and characteristics in BCSA:
The utilization of strand and IR-based methods in BCSA has declined as learning-based methods have become more prominent. Before 2019, four papers utilized strands as code fragments containing data flow information, but no studies published after 2019 have employed this approach. The use of IR as an architecture-independent feature representation has also decreased, with only one paper in recent years utilizing this technique compared to nine papers before 2019. This shift away from strand and IR methods can be attributed to the overhead of extracting these features from binary programs and the success of learning-based methods in terms of their representation ability and performance. However, IR still has the advantage of being usable across multiple architectures, and there may be opportunities for future research to combine IR with deep learning techniques.
Pure NLP methods that do not incorporate other auxiliary information pose a challenge in achieving high-quality models, given the difficulty of direct application of linear NLP methods to the graph structure of programming languages. To overcome this challenge, recent research on using Word2Vec method for programming language modeling has incorporated program structural information and call information to improve the representation capability of the model. Similarly, research using BERT for programming language modeling aims to enhance the semantic richness of the model by adding program structure information, such as control flow graph (CFG) sequence information and jump information, to the BERT model’s position embedding.
The BERT language model has shown significant improvements over the Word2Vec model, but it is not widely used for modeling programming languages. This is due to the fact that BERT primarily addresses the issue of polysemy in natural languages, which is not present in programming languages. Additionally, the complex subtask training in BERT leads to difficulties in corpus preparation and model training, making the simple and efficient Word2Vec a preferred method for constructing program language models. In future research, the exploration of more suitable Word2Vec and BERT subtask design methods that can better adapt to the programming language may be a focal point.
The lack of a benchmark dataset is a challenge in the field of BCSA. As the trend shifts towards deep learning-based methods, the size of the dataset becomes increasingly crucial to the quality of the model training. However, there are limited unified benchmark datasets available, leading to a tendency for researchers to rely on their own datasets. This lack of a shared benchmark makes it difficult to make fair comparisons between studies. While efforts have been made to address this issue, such as the construction and open sourcing of the large-scale dataset BINKIT by Kim et al., it has not yet been widely adopted by the research community. In future research, the adoption of a shared benchmark dataset for both training and evaluation purposes could enhance comparability and foster learning across various studies.
The article systematically compiles the key studies on BCSA methods from 2016 to 2022, offers a comprehensive analysis of their technical aspects, and particularly focuses on comparing their effectiveness when combined with deep learning techniques. The article also summarizes the current progress in the field of BCSA and suggests prospective avenues for future research.
Ultimately, employing deep learning technology in binary code analysis is mainly a classification problem with relatively few associated ethical considerations. However, although the ethical issues raised by deep learning, such as data privacy, job displacement, and the attribution of responsibility for AI decisions, are not prominent at this stage, we still need to be vigilant about their possible emergence. As we promote technological innovation, we should consider its potential impacts; in particular, as we delve deeper into AI applications, we should treat ethical issues prudently in order to fulfill our social responsibilities.

Author Contributions

Conceptualization, J.D.; investigation, J.D. and X.S.; resources, J.D. and Y.W.; data curation, J.D. and X.S.; writing—original draft preparation, J.D.; writing—review and editing, J.D. and Y.W.; supervision, Q.W.; project administration, Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Key R & D Program of China under Grant No. 2020YFB2010900 and the Program for Innovation Leading Scientists and Technicians of Zhong Yuan under Grant No. 224200510002.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Synopsys. 2022 Open Source Security and Analysis Report. Available online: https://www.synopsys.com/software-integrity/resources/analyst-reports/open-source-security-risk-analysis.html (accessed on 16 June 2022).
  2. CVE-2021-44228. Available online: https://cve.mitre.org/cgi-bin/cvename.cgi?name=2021-44228 (accessed on 10 January 2022).
  3. Haq, I.U.; Caballero, J. A Survey of Binary Code Similarity. ACM Comput. Surv. 2022, 54, 1–38. [Google Scholar] [CrossRef]
  4. Kim, D.; Kim, E.; Cha, S.K.; Son, S.; Kim, Y. Revisiting BCSA Using Interpretable Feature Engineering and Lessons Learned. IEEE Trans. Softw. Eng. 2022, 49, 1661–1682. [Google Scholar] [CrossRef]
  5. Yu, Y.; Gan, S.; Qin, X.; Qiu, J.; Chen, Z. Research on the Technologies of BCSA and Their Applications on the Embedded Device Firmware Vulnerability Search. J. Softw. 2022, 33, 4137–4172. [Google Scholar] [CrossRef]
  6. Hex-Rays about IDA. Available online: https://www.hex-rays.com/products/ida/ (accessed on 10 January 2022).
  7. Brumley, D.; Jager, I.; Avgerinos, T.; Schwartz, E.J. BAP: A Binary Analysis Platform. In Computer Aided Verification, Proceedings of the 23rd International Conference, CAV 2011, Snowbird, UT, USA, 14–20 July 2011; Gopalakrishnan, G., Qadeer, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6806, pp. 463–469. [Google Scholar]
  8. Wang, F.; Shoshitaishvili, Y. Angr—The Next Generation of Binary Analysis. In Proceedings of the IEEE Cybersecurity Development, SecDev 2017, Cambridge, MA, USA, 24–26 September 2017; IEEE Computer Society: Washington, DC, USA, 2017; pp. 8–9. [Google Scholar]
  9. Nethercote, N.; Seward, J. Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation. In Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation, San Diego, CA, USA, 10–13 June 2007. [Google Scholar] [CrossRef]
  10. Xu, X.; Liu, C.; Feng, Q.; Yin, H.; Song, L.; Song, D. Neural Network-Based Graph Embedding for Cross-Platform Binary Code Similarity Detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Dallas, TX, USA, 30 October–3 November 2017; ACM: New York, NY, USA, 2017; pp. 363–376. [Google Scholar]
  11. Ding, S.H.H.; Fung, B.C.M.; Charland, P. Kam1n0: MapReduce-Based Assembly Clone Search for Reverse Engineering. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Krishnapuram, B., Shah, M., Smola, A.J., Aggarwal, C.C., Shen, D., Rastogi, R., Eds.; ACM: New York, NY, USA, 2016; pp. 461–470. [Google Scholar]
  12. Lageman, N.; Kilmer, E.D.; Walls, R.J.; McDaniel, P.D. BinDNN: Resilient Function Matching Using Deep Learning. In Proceedings of the Security and Privacy in Communication Networks—12th International Conference, SecureComm 2016, Guangzhou, China, 10–12 October 2016; Deng, R.H., Weng, J., Ren, K., Yegneswaran, V., Eds.; Springer: Berlin/Heidelberg, Germany, 2016; Volume 198, pp. 517–537. [Google Scholar]
  13. Hu, Y.; Zhang, Y.; Li, J.; Gu, D. Cross-Architecture Binary Semantics Understanding via Similar Code Comparison. In Proceedings of the IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering, SANER 2016, Suita, Japan, 14–18 March 2016; IEEE Computer Society: Washington, DC, USA, 2016; Volume 1, pp. 57–67. [Google Scholar]
  14. Chandramohan, M.; Xue, Y.; Xu, Z.; Liu, Y.; Cho, C.Y.; Tan, H.B.K. BinGo: Cross-Architecture Cross-OS Binary Search. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Seattle, WA, USA, 13–18 November 2016; ACM: Seattle, WA, USA, 2016; pp. 678–689. [Google Scholar]
  15. David, Y.; Partush, N.; Yahav, E. Statistical Similarity of Binaries. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2016, Santa Barbara, CA, USA, 13–17 June 2016; Krintz, C., Berger, E.D., Eds.; ACM: New York, NY, USA, 2016; pp. 266–280. [Google Scholar]
  16. Feng, Q.; Zhou, R.; Xu, C.; Cheng, Y.; Testa, B.; Yin, H. Scalable Graph-Based Bug Search for Firmware Images. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; ACM: Vienna, Austria, 2016; pp. 480–491. [Google Scholar]
  17. Eschweiler, S.; Yakdan, K.; Gerhards-Padilla, E. discovRE: Efficient Cross-Architecture Identification of Bugs in Binary Code. In Proceedings of the 2016 Network and Distributed System Security Symposium, San Diego, CA, USA, 21–24 February 2016. [Google Scholar]
  18. Wang, S.; Wu, D. In-Memory Fuzzing for Binary Code Similarity Analysis. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017, Urbana, IL, USA, 30 October–3 November 2017; Rosu, G., Penta, M.D., Nguyen, T.N., Eds.; IEEE Computer Society: Washington, DC, USA, 2017; pp. 319–330. [Google Scholar]
  19. Ming, J.; Xu, D.; Jiang, Y.; Wu, D. BinSim: Trace-Based Semantic Binary Diffing via System Call Sliced Segment Equivalence Checking. In Proceedings of the 26th USENIX Security Symposium, USENIX Security 2017, Vancouver, BC, Canada, 16–18 August 2017; Kirda, E., Ristenpart, T., Eds.; USENIX Association: Washington, DC, USA, 2017; pp. 253–270. [Google Scholar]
  20. David, Y.; Partush, N.; Yahav, E. Similarity of Binaries through Re-Optimization. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, 18–23 June 2017; Cohen, A., Vechev, M.T., Eds.; ACM: New York, NY, USA, 2017; pp. 79–94. [Google Scholar]
  21. Feng, Q.; Wang, M.; Zhang, M.; Zhou, R.; Henderson, A.; Yin, H. Extracting Conditional Formulas for Cross-Platform Bug Search. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, AsiaCCS 2017, Abu Dhabi, United Arab Emirates, 2–6 April 2017; Karri, R., Sinanoglu, O., Sadeghi, A.-R., Yi, X., Eds.; ACM: New York, NY, USA, 2017; pp. 346–359. [Google Scholar]
  22. Gao, J.; Yang, X.; Fu, Y.; Jiang, Y.; Sun, J. VulSeeker: A Semantic Learning Based Vulnerability Seeker for Cross-Platform Binary. In Proceedings of the 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE), Montpellier, France, 3–7 September 2018; pp. 896–899. [Google Scholar]
  23. Liu, B.; Huo, W.; Zhang, C.; Li, W.; Li, F.; Piao, A.; Zou, W. αDiff: Cross-Version Binary Code Similarity Detection with DNN. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, Montpellier, France, 3–7 September 2018; Huchard, M., Kästner, C., Fraser, G., Eds.; ACM: New York, NY, USA, 2018; pp. 667–678. [Google Scholar]
  24. David, Y.; Partush, N.; Yahav, E. FirmUp: Precise Static Detection of Common Vulnerabilities in Firmware. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2018, Williamsburg, VA, USA, 24–28 March 2018; Shen, X., Tuck, J., Bianchini, R., Sarkar, V., Eds.; ACM: New York, NY, USA, 2018; pp. 392–404. [Google Scholar]
  25. Shalev, N.; Partush, N. Binary Similarity Detection Using Machine Learning. In Proceedings of the 13th Workshop on Programming Languages and Analysis for Security, PLAS@CCS 2018, Toronto, ON, Canada, 15–19 October 2018; Alvim, M.S., Delaune, S., Eds.; ACM: New York, NY, USA, 2018; pp. 42–47. [Google Scholar]
  26. Hu, Y.; Zhang, Y.; Li, J.; Wang, H.; Li, B.; Gu, D. BinMatch: A Semantics-Based Hybrid Approach on Binary Code Clone Analysis. In Proceedings of the 2018 IEEE International Conference on Software Maintenance and Evolution, ICSME 2018, Madrid, Spain, 23–29 September 2018; IEEE Computer Society: Washington, DC, USA, 2018; pp. 104–114. [Google Scholar]
  27. Marastoni, N.; Giacobazzi, R.; Preda, M.D. A Deep Learning Approach to Program Similarity. In Proceedings of the 1st International Workshop on Machine Learning and Software Engineering in Symbiosis, MASES@ASE 2018, Montpellier, France, 3 September 2018; Perrouin, G., Acher, M., Cordy, M., Devroey, X., Eds.; ACM: New York, NY, USA, 2018; pp. 26–35. [Google Scholar]
  28. Yuan, B.; Wang, J.; Fang, Z.; Qi, L. A New Software Birthmark Based on Weight Sequences of Dynamic Control Flow Graph for Plagiarism Detection. Comput. J. 2018, 61, 1202–1215. [Google Scholar] [CrossRef]
  29. Xue, Y.; Xu, Z.; Chandramohan, M.; Liu, Y. Accurate and Scalable Cross-Architecture Cross-OS Binary Code Search with Emulation. IEEE Trans. Softw. Eng. 2019, 45, 1125–1149. [Google Scholar] [CrossRef]
  30. Zuo, F.; Li, X.; Young, P.; Luo, L.; Zeng, Q.; Zhang, Z. Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs. In Proceedings of the 26th Annual Network and Distributed System Security Symposium, NDSS 2019, San Diego, CA, USA, 24–27 February 2019; The Internet Society: Reston, VA, USA, 2019. [Google Scholar]
  31. Massarelli, L.; Luna, G.A.D.; Petroni, F.; Baldoni, R.; Querzoni, L. SAFE: Self-Attentive Function Embeddings for Binary Similarity. In Detection of Intrusions and Malware, and Vulnerability Assessment, Proceedings of the 16th International Conference, DIMVA 2019, Gothenburg, Sweden, 19–20 June 2019; Perdisci, R., Maurice, C., Giacinto, G., Almgren, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2019; Volume 11543, pp. 309–329. [Google Scholar]
  32. Redmond, K.; Luo, L.; Zeng, Q. A Cross-Architecture Instruction Embedding Model for Natural Language Processing-Inspired Binary Code Analysis. In Proceedings of the Workshop on Binary Analysis Research (BAR) 2019, San Diego, CA, USA, 24 February 2019. [Google Scholar]
  33. Ding, S.H.H.; Fung, B.C.M.; Charland, P. Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2019; pp. 472–489. [Google Scholar]
  34. Luo, M.; Yang, C.; Gong, X.; Yu, L. FuncNet: A Euclidean Embedding Approach for Lightweight Cross-Platform Binary Recognition. In Proceedings of the Security and Privacy in Communication Networks, Orlando, FL, USA, 23–25 October 2019; Chen, S., Choo, K.-K.R., Fu, X., Lou, W., Mohaisen, A., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 319–337. [Google Scholar]
  35. Massarelli, L.; Luna, G.; Petroni, F.; Querzoni, L. Investigating Graph Embedding Neural Networks with Unsupervised Features Extraction for Binary Analysis. In Proceedings of the Workshop on Binary Analysis Research (BAR) 2019, San Diego, CA, USA, 24 February 2019. [Google Scholar]
  36. Li, Y.; Gu, C.; Dullien, T.; Vinyals, O.; Kohli, P. Graph Matching Networks for Learning the Similarity of Graph Structured Objects. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, Long Beach, CA, USA, 9–15 June 2019; Volume 97, pp. 3835–3845. [Google Scholar]
  37. Duan, Y.; Li, X.; Wang, J.; Yin, H. DeepBinDiff: Learning Program-Wide Code Representations for Binary Diffing. In Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, USA, 1 January 2020. [Google Scholar]
  38. Sun, P.; Garcia, L.; Salles-Loustau, G.; Zonouz, S. Hybrid Firmware Analysis for Known Mobile and IoT Security Vulnerabilities. In Proceedings of the 50th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2020), Valencia, Spain, 29 June–2 July 2020. [Google Scholar]
  39. Yu, Z.; Cao, R.; Tang, Q.; Nie, S.; Huang, J.; Wu, S. Order Matters: Semantic-Aware Neural Networks for Binary Code Similarity Detection. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, 7–12 February 2020; AAAI Press: Menlo Park, CA, USA, 2020; pp. 1145–1152. [Google Scholar]
  40. Guo, H.; Huang, S.; Huang, C.; Zhang, M.; Pan, Z.; Shi, F.; Huang, H.; Hu, D.; Wang, X. A Lightweight Cross-Version Binary Code Similarity Detection Based on Similarity and Correlation Coefficient Features. IEEE Access 2020, 8, 120501–120512. [Google Scholar] [CrossRef]
  41. Pei, K.; Xuan, Z.; Yang, J.; Jana, S.; Ray, B. Trex: Learning Execution Semantics from Micro-Traces for Binary Similarity. arXiv 2020, arXiv:2012.08680. [Google Scholar]
  42. Peng, D.; Zheng, S.; Li, Y.; Ke, G.; He, D.; Liu, T.-Y. How Could Neural Networks Understand Programs? In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual Event, 18–24 July 2021; Volume 139, pp. 8476–8486. [Google Scholar]
  43. Yang, J.; Fu, C.; Liu, X.-Y.; Yin, H.; Zhou, P. Codee: A Tensor Embedding Scheme for Binary Code Search. IEEE Trans. Softw. Eng. 2022, 48, 2224–2244. [Google Scholar] [CrossRef]
  44. Tian, D.; Jia, X.; Ma, R.; Liu, S.; Liu, W.; Hu, C. BinDeep: A Deep Learning Approach to Binary Code Similarity Detection. Expert Syst. Appl. 2021, 168, 114348. [Google Scholar] [CrossRef]
  45. Yang, S.; Cheng, L.; Zeng, Y.; Lang, Z.; Zhu, H.; Shi, Z. Asteria: Deep Learning-Based AST-Encoding for Cross-Platform Binary Code Similarity Detection. In Proceedings of the 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2021, Taipei, Taiwan, 21–24 June 2021; pp. 224–236. [Google Scholar]
  46. Wang, H.; Qu, W.; Katz, G.; Zhu, W.; Gao, Z.; Qiu, H.; Zhuge, J.; Zhang, C. jTrans: Jump-Aware Transformer for Binary Code Similarity Detection. In Proceedings of the ISSTA’22: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, Republic of Korea, 18–22 July 2022; Ryu, S., Smaragdakis, Y., Eds.; ACM: New York, NY, USA, 2022; pp. 1–13. [Google Scholar]
  47. Kim, G.; Hong, S.; Franz, M.; Song, D. Improving Cross-Platform Binary Analysis Using Representation Learning via Graph Alignment. In Proceedings of the ISSTA’22: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, Republic of Korea, 18–22 July 2022; Ryu, S., Smaragdakis, Y., Eds.; ACM: New York, NY, USA, 2022; pp. 151–163. [Google Scholar]
  48. Dai, H.; Dai, B.; Song, L. Discriminative Embeddings of Latent Variable Models for Structured Data. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York, NY, USA, 19–24 June 2016; Balcan, M.-F., Weinberger, K.Q., Eds.; JMLR: New York, NY, USA, 2016; Volume 48, pp. 2702–2711. [Google Scholar]
  49. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, AZ, USA, 2–4 May 2013; Bengio, Y., LeCun, Y., Eds.; Workshop Track Proceedings: New York, NY, USA, 2013. [Google Scholar]
  50. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short Papers), pp. 4171–4186. [Google Scholar]
  51. Le, Q.V.; Mikolov, T. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014; JMLR: New York, NY, USA, 2014; Volume 32, pp. 1188–1196. [Google Scholar]
  52. Marcelli, A.; Graziano, M.; Xabier, U.-P.; Fratantonio, Y.; Mansouri, M.; Balzarotti, D. How Machine Learning Is Solving the Binary Function Similarity Problem. In Proceedings of the 31st USENIX Security Symposium (USENIX Security 22), Boston, MA, USA, 10–12 August 2022; USENIX Association: Boston, MA, USA, 2022. [Google Scholar]
  53. Gilmer, J.; Schoenholz, S.S.; Riley, P.F.; Vinyals, O.; Dahl, G.E. Neural Message Passing for Quantum Chemistry. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1263–1272. [Google Scholar]
  54. Pewny, J.; Garmany, B.; Gawlik, R.; Rossow, C.; Holz, T. Cross-Architecture Bug Search in Binary Executables. In Proceedings of the 2015 IEEE Symposium on Security and Privacy, SP 2015, San Jose, CA, USA, 17–21 May 2015; IEEE Computer Society: Washington, DC, USA, 2015; pp. 709–724. [Google Scholar]
  55. Tai, K.S.; Socher, R.; Manning, C.D. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, 26–31 July 2015; pp. 1556–1566. [Google Scholar]
  56. Baker, B.S. Compressing Difference of Executable Code. In Proceedings of the ACM SIGPLAN 1999 Workshop on Compiler Support for System Software (WCSSS’99), Atlanta, GA, USA, 1 May 1999. [Google Scholar]
Figure 1. Compilation Process from Source Code to Binary.
Figure 2. The overall process of the method.
Table 1. Comparison between research methods of BCSA. The surveyed systems are compared along three dimensions: analysis method (static, dynamic, or hybrid), feature type (raw bytes, trace, architecture, strand, IR, call, or word embedding), and open-source availability (code, dataset). Systems by year:

2016: Kam1n0 [11], BinDNN [12], MOCKINGBIRD [13], BinGo [14], Esh [15], Genius [16], discovRE [17]
2017: IMF-sim [18], BinSim [19], GitZ [20], Gemini [10], xmatch [21]
2018: VulSeeker [22], αdiff [23], FirmUP [24], Zeek [25], BinMatch [26], MASES2018 [27], WSB [28], Bingo-E [29]
2019: InnerEye [30], Safe [31], InstrModel [32], Asm2Vec [33], FuncNet [34], GENN [35], GMN [36]
2020: DeepBinDiff [37], Patchecko [38], Ordermatters [39], ACCESS2020 [40], Trex [41]
2021: Oscar [42], TIKNIB [4], Codee [43], BinDeep [44], Asteria [45]
2022: Jtrans [46], XBA [47]
Table 2. Research on BCSA using text semantic features. Dataset and Corpus can be k (kilo), m (million), F (functions), FP (function pairs), B (blocks), BP (block pairs), BF (binary files), or L (lines). Score can be R (Recall@1), r (Recall), or A (AUC).

| Name | Semantic | Supporting Info | Graph | Dataset | Corpus | Comparison | Machine Learning Technology | Score |
| BinDNN [12] | Wordlist | – | – | 13 kF | – | Classifier | LSTM, Fully Connected | – |
| αdiff [23] | CNN | Call Info | – | 2.49 mFP | – | Vector | CNN | R: 0.955 |
| InnerEye [30] | Word2Vec | LSTM Block Embedding | CFG | 830 kB | – | Manhattan | Word2Vec, LSTM, Siamese | A: 0.944 |
| Safe [31] | Word2Vec | Bi-RNN | – | 517 kF | 190 mL | Cosine | Word2Vec, Siamese, Bi-RNN | A: 0.992 |
| InstrModel [32] | Word2Vec | Instr Alignment | – | 202 kBP | – | Classifier | Word2Vec | A: 0.900 |
| Asm2Vec [33] | PV-DM | Random Walk | CFG | 140 kF | – | Vector | PV-DM | R: 0.809 |
| GENN [35] | Word2Vec | Struc2Vec | CFG | 96 kF | – | Cosine | Word2Vec, Struc2Vec | A: 0.964 |
| DeepBinDiff [37] | Word2Vec | Random Walk | ICFG | 113 BF | – | K-Hop | Word2Vec | r: 0.904 |
| Ordermatters [39] | BERT | CNN, MPNN | CFG | 63 kF | – | Cosine | BERT, CNN, MPNN | R: 0.742 |
| Oscar [42] | BERT | Jump | – | 110 kF | 500 kF | Cosine | MoCo, BERT | R: 0.884 |
| Codee [43] | Word2Vec | Random Walk | ICFG | 15 kF | – | LSH | Word2Vec | r: 0.851 |
| BinDeep [44] | Word2Vec | Classifier | – | 4.7 mFP | – | Vector | Word2Vec, RNN, CNN, LSTM, Siamese | r: 0.990 |
| Jtrans [46] | BERT | Jump | – | 26 mF | – | Cosine | BERT | R: 0.962 |
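The Comparison and Score columns in Tables 2 and 3 largely reduce to an embedding distance plus a ranking metric. The sketch below, using made-up embeddings, shows how cosine similarity and Recall@1 would typically be computed under those assumptions; it is not code from any surveyed system.

```python
# A minimal sketch (toy data): comparing function embeddings by cosine
# similarity and scoring Recall@1, i.e. the fraction of queries whose
# true match ranks first in the candidate pool.
import numpy as np

rng = np.random.default_rng(0)
queries = rng.normal(size=(5, 64))               # query function embeddings
pool = queries + 0.1 * rng.normal(size=(5, 64))  # noisy true matches, same order

def cosine_matrix(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

sims = cosine_matrix(queries, pool)
recall_at_1 = float(np.mean(sims.argmax(axis=1) == np.arange(len(queries))))
print(recall_at_1)   # 1.0 for this easy toy setup
```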
Table 3. Research on BCSA using function structure features. In the Dataset column: k (kilo), m (million), F (functions), BF (binary files). Score can be R (Recall@1), r (Recall), or A (AUC).

| Name | Struc Info | Graph | Semantic | Dataset | Cross Architecture | Comparison | Machine Learning Technology | Score |
| Gemini [10] | Struc2Vec | ACFG | – | 129 kF | Y | Cosine | Struc2Vec, Siamese | A: 0.971 |
| VulSeeker [22] | Struc2Vec | LSFG | – | 730 kF | Y | Cosine | Struc2Vec, Fully Connected | A: 0.885 |
| Asm2Vec [33] | Random Walk | CFG | PV-DM | 140 kF | N | Vector | PV-DM | R: 0.809 |
| FuncNet [34] | Struc2Vec | ACFG | – | 355 kF | Y | SOM | Struc2Vec, Fully Connected | A: 0.980 |
| GENN [35] | Struc2Vec | CFG | Word2Vec | 96 kF | Y | Cosine | Word2Vec, Struc2Vec | A: 0.964 |
| GMN [36] | GMN | CFG | – | 64 kF | N | Hamming | GMN | A: 0.993 |
| DeepBinDiff [37] | Random Walk | ICFG | Word2Vec | 113 BF | Y | K-Hop | Word2Vec | r: 0.904 |
| Ordermatters [39] | CNN, MPNN | CFG | BERT | 63 kF | Y | Cosine Distance | BERT, CNN, MPNN | R: 0.742 |
| Codee [43] | Random Walk | ICFG | Word2Vec | 15 kF | Y | LSH | Word2Vec | r: 0.851 |
| Asteria [45] | Tree-LSTM | AST | – | 7.56 mF | Y | Classifier | Tree-LSTM, Siamese | A: 0.969 |
| XBA [47] | GCN | BDG | – | – | Y | Vector Distance | Siamese, GCN | – |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
