Article

FUSION: Measuring Binary Function Similarity with Code-Specific Embedding and Order-Sensitive GNN

1 School of Cyber Science and Engineering, Wuhan University, Wuhan 430001, China
2 Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, Wuhan University, Wuhan 430001, China
3 School of Computer Science, Wuhan University, Wuhan 430001, China
* Author to whom correspondence should be addressed.
Symmetry 2022, 14(12), 2549; https://doi.org/10.3390/sym14122549
Submission received: 16 October 2022 / Revised: 18 November 2022 / Accepted: 25 November 2022 / Published: 2 December 2022
(This article belongs to the Section Computer)

Abstract

Binary code similarity measurement is a popular research area in binary analysis, especially with the recent development of deep learning-based models. Current state-of-the-art methods often use a pre-trained language model (PTLM) to embed instructions into basic blocks as representations of the nodes within a control flow graph (CFG). These methods then use a graph neural network (GNN) to embed the whole CFG and measure the binary similarities between these code embeddings. However, such methods largely treat the assembly code as natural language text and ignore its code-specific features when training the PTLM. Moreover, they barely consider the direction of edges in the CFG, or consider it only in a less efficient manner. These weaknesses may limit the performance of previous methods. In this paper, we propose a novel method that measures function similarity using code-specific pre-training tasks (PTTs) and an order-sensitive GNN (FUSION). Since the similarity of binary codes is a symmetric/asymmetric problem, we were guided by the ideas of symmetry and asymmetry in our research. FUSION measures binary function similarity with two code-specific PTLM training strategies and an order-sensitive GNN, which, respectively, alleviate the aforementioned weaknesses. FUSION outperforms the state-of-the-art binary similarity methods by up to 5.4% in accuracy, with significantly better overall performance.

1. Introduction

Binary code similarity measurement is a popular research area within binary analysis. It takes two binary codes as inputs, converts the input codes into vectors, and outputs the “similarity” between their functions. It plays a critical role in many areas of system security research, such as vulnerability detection [1], malware matching [2,3], etc.
However, it is “non-trivial” to measure the similarities between binary codes precisely, as much meaningful semantic information about the code functions, such as function and variable names, comments, and structure definitions, is lost after source codes are compiled into binary files. Given the wide applications and great challenges, many methods for binary code similarity measurement have emerged in the past two decades. Recently, with the development of deep learning (DL), many DL-based methods have shown significant advantages in this task.
A typical DL-based binary similarity measurement method involves performing DL-based graph matching [4,5,6,7]. This kind of method uses DL techniques to mine features in the control flow graphs (CFGs) of binary codes to represent the CFGs. The similarities between two binary codes are then measured with their CFG features [4,5]. Another approach is to directly represent the semantic data of the assembly code texts using DL-based textual embedding techniques. Methods following this idea use natural language processing (NLP) techniques to form embeddings for binary assembly code texts; the similarities are measured based on these embeddings [8,9]. Some methods further integrate the semantic data extracted from the assembly codes into their CFGs and use graph neural networks (GNNs) to aggregate the data in code texts and CFGs for more accurate similarity measurements [10,11,12,13]. These methods achieve state-of-the-art (SOTA) performances in binary code similarity measurements.
However, we found that previous methods still encounter two weaknesses in extracting textual semantic data and the CFG data. For textual semantic data, the previous methods ignored some specific features of assembly codes when formulating the embedding. Specifically, they used the advanced pre-training language model (PTLM) technique, which is powered by several pre-training tasks (PTTs) [14] to learn to formulate precise semantic embeddings for tokens and blocks in assembly codes. However, they almost directly borrow the PTTs for raw texts without adapting them by considering those specific features inside assembly codes, while simply regarding the binary instructions as text sentences [12]. As a result, the information in assembly codes may not be sufficiently mined. For CFG information, the methods have not efficiently leveraged the direction data of the edges in the CFG to represent the binary codes. They usually embed CFGs in an unordered manner [12]. This eliminates certain distinguishable information in the CFGs and limits the performance of similarity measurements. Yu et al. [10] considered the importance of direction information and used a dedicated independent neural network to model the direction between nodes in the CFG. This improved the accuracy. However, it also introduced many additional calculation burdens and made the whole method less efficient.
To tackle these two problems and obtain more precise binary code representations and similarity measurements, we propose a method called FUSION. FUSION measures the binary function similarity using code-specific PTTs and an order-sensitive GNN. Specifically, FUSION follows the previous approaches in adopting two modules, i.e., one PTLM-based textual assembly code semantic embedding module and one GNN-based CFG embedding module, to represent the binary codes. However, unlike the previous approaches, FUSION adapts its two PTTs to specifically consider the basic block as the complete semantic unit, so as to train the PTLM to better recognize the semantics of assembly codes. This helps the semantic embedding module learn to generate more accurate embeddings for the instruction tokens and blocks. Moreover, FUSION employs a novel GNN that is sensitive to the order of nodes to embed the CFG, which helps formulate a more accurate representation of the CFG. This GNN captures the order of nodes while embedding the CFG and, thus, does not introduce much extra burden when taking the node order into account. As a result, FUSION alleviates the aforementioned weaknesses and is expected to perform better when measuring the similarities between the given binary function codes.
In summary, we propose one novel method, FUSION, to measure the similarities of two binary function codes. It follows the previous methods in extracting the textual semantic data from assembly codes and aggregating such information with the features of the CFG. However, it introduces essential improvements in both steps to make the code representation more precise. The key contributions are as follows:
(1) By considering the features of the assembly codes, we adjusted the existing PTT, the masked language model (MLM), and designed a novel PTT, neighboring block prediction (NBP), to train the PTLM. These PTTs help the PTLM learn to represent the semantic information of the assembly codes more precisely and benefit the similarity measurement.
(2) We designed a novel GNN algorithm to capture the information in the CFG along the order of nodes. It allows the information on edge direction in the CFG to be simultaneously recorded by the embedding of CFG. This efficiently leverages the node order to refine the representation of CFG for similarity measurements.
(3) We evaluated FUSION with a widely-used binary function similarity measurement benchmark. The results demonstrate that FUSION can reach an AUC of over 95%, which improves the average AUC by 3.19% to 10.01% over the SOTA. Moreover, both proposed improvements were found to effectively benefit the precision and efficiency of FUSION.
The remainder of the paper is structured as follows. In Section 2, we present the current methods commonly used in the field of binary matching and the problems that exist. In Section 3, we describe FUSION’s workflow and present our proposed improvements and innovations. Then, in Section 4, we validate the model on a generic dataset and compare it with the most current work, proving that FUSION performs better. Section 5 provides the conclusion and suggestions for future work.

2. Background

2.1. Methods for Binary Similarity Measurements

Previous methods typically considered leveraging two types of information, i.e., the structural information from the CFG and the semantic information in the assembly codes, to form numerical representations, namely embeddings, for binary codes. Then, the similarities between binary codes were measured according to the embeddings. In this section, we show how the two types of information are leveraged by previous methods.

2.1.1. Methods Based on Graph Matching

To leverage the structural information in the CFG and measure the similarity between binary codes, some methods use graph-matching algorithms to calculate the similarity between the CFGs of given binary codes [4,5,6,7,15,16]. These methods embed the structure of the CFG using several custom algorithms specific to measuring binary similarity [15,16] or existing deep learning techniques for general usages, such as graph neural networks (GNNs) [4,7]. Recently, the deep learning community proposed some variants of GNN, such as the graph matching network (GMN) [17]. They were also adopted to generate more precise embeddings for CFGs and improve the performances of binary similarity measurements [18].

2.1.2. Methods Based on Semantic Embedding

A large body of work has embedded the semantic information of binary codes, which can be disassembled from binary files, to formulate the representations for binary similarity measurements [6,8,12,19]. These methods are powered by a specific or general textual semantic information extractor, which is trained to understand the semantics of binary codes and generate corresponding precise embeddings. For example, Trex [11] uses a hierarchical transformer model as the textual semantic extractor to generate embeddings that represent the execution semantics of binary codes and help measure the similarity between binary codes. In recent years, methods have also emerged that first extract semantic data from binary codes to enrich the CFG and then conduct graph matching to measure similarities with both the semantic and structural information. For example, PALMTREE [12] adopts a BERT-like PTLM trained with three specific PTTs to generate textual semantic embeddings for basic blocks. These embeddings are used to formulate an enhanced CFG with rich semantic information for later graph matching. Nevertheless, assembly code still differs from natural language in its more complex topology and larger basic language units. Based on our observations, current work that applies pre-training models to assembly statements simply migrates them from natural language processing to assembly code without adapting the pre-training tasks to the relevant properties of binary programs. As a result, the existing models lose some semantic information when dealing with assembly code and fail to capture relevant information about the structure of the code.

2.2. Graph Information Extraction

Graph matching is often conducted based on graph information extraction, which formulates one numerical representation for the graph; this is also known as graph embedding. The deep learning community has proposed numerous graph-embedding methods, such as structure2vec [4] and GNNs. Due to their large capacity, GNNs have achieved great performances in many fields, including binary similarity measurements. GNNs learn node classification, connection prediction, and graph classification to understand the data in graphs. When embedding, a GNN has each node aggregate the features of its neighbors to update its own feature. After several iterations of aggregation, the feature of one node can record the structural information within its k-hop neighbors. Next, the GNN uses an aggregation method to obtain the final graph representation, such as summing the aggregated features of all nodes. The aggregation method can also convert node feature matrices of different sizes into a uniform representation, so the model can easily embed graphs of any size. There are also many variants of GNNs, e.g., graph convolutional networks [20] and graph attention networks [21]. Moreover, typical GNN algorithms usually operate on undirected graphs.
Inspired by the performance of GNNs, many methods use GNNs to embed the CFGs of binary codes for graph matching-based binary similarity measurements [12]. As CFGs are directed graphs and the order relations between their nodes reflect the essential order of the control flow, a recent work dealt with the node order using a dedicated neural network in addition to the GNN to handle the order information [10].

2.3. Textual Semantic Information Extraction

It is an essential task in the NLP field to extract semantic information from the text. Recent developments of the PTLM have given researchers easy access to accurate semantic embeddings for given texts. One popular PTLM in NLP is BERT [14]. BERT leverages two PTTs, i.e., masked language model (MLM) and next sentence prediction (NSP), to pretrain itself in a self-supervised manner on several large-scale unlabeled natural language corpora. It takes two natural language sentences as input, where the special tokens [CLS], [SEP], and [EOS] are used to mark the start, junction, and end in an input. To teach BERT to extract the semantic information of one token (i.e., a word or word-piece for natural language) and its context in a sentence, MLM randomly masks a few tokens in the input and requires BERT to recover these tokens using a simple extra classification layer with their embeddings. To teach BERT to extract the semantic information from the whole sentence, NSP needs BERT to judge whether the two sentences in a given input are adjacent or not using another classification layer based on the embedding of [CLS].
Since assembly codes are also texts, PTLMs, such as BERT, have been widely used in binary analyses to extract semantic information from assembly instructions as well, including for binary similarity measurements [8,10,12,19]. However, the excessive complexity of graph-matching algorithms leads to high overheads when dealing with large binary programs.

2.4. Summary of Existing Approaches

As mentioned above, we summarize and compare the existing approaches with respect to which encoding scheme or algorithm is used, whether the internal instruction structure is considered, and what context is evaluated for learning. In summary, graph-based matching methods are too slow and unable to convey higher-level control flow graph information, whereas the existing learning-based encoding approaches cannot address the challenges of capturing instruction semantic information and internal structures.

3. Proposed Approach

3.1. Overview of FUSION

In this work, we propose a new method called FUSION to measure the binary function similarity with code-specific PTTs and the order-sensitive GNN. FUSION aims at more precise and efficient similarity measurements by alleviating the weaknesses of the previous methods in extracting the textual semantic data and the CFG information. Similar to the existing SOTA methods [10,11,12,13], FUSION employs two modules to first adopt a PTLM to embed the textual semantic information of the assembly code tokens and blocks. Next, it aggregates such information into the CFG and uses a GNN to formulate the final representation. The similarities between two binary functions are later measured with their representations. For more precise and efficient representation, FUSION makes some essential improvements in both modules.
Figure 1 illustrates the overall process of FUSION. Given one binary function, before using the two modules to embed this function, we first preprocess the function by disassembling it into its CFG and formalizing the assembly instructions. Moreover, the instructions are organized into basic blocks according to the CFG. Next, we leverage one novel BERT-like PTLM to generate the semantic embedding of every basic block; every basic block in the CFG is then represented as a numerical vector. After that, we use a novel order-sensitive GNN to embed the whole CFG into the final representation of the given binary function. Finally, a Siamese network takes the representations of multiple binary functions as inputs and computes their similarities.
During the above process, a novel BERT-like PTLM and a novel order-sensitive GNN are used to establish more precise representations of binary functions for similarity measurements. The BERT-like PTLM uses PTTs that specifically consider the features of assembly codes to learn how to generate more precise semantic embeddings for basic blocks. Moreover, the order-sensitive GNN implicitly records the order of nodes in the CFG in the final representation, so as to efficiently generate precise CFG representation with direction information in the CFG.
In the following sections, we will elaborate on the designs of FUSION. In Section 3.2, we introduce the method to preprocess the binary function via disassembling. In Section 3.3, we show the method of constructing the BERT-like PTLM using special PTTs. In Section 3.4, we present the order-sensitive GNN for generating the final representations for binary functions and measuring their similarities.

3.2. Preprocessing from Binary Files

In FUSION, we follow the previous approaches for binary similarity measurements, which first disassemble the given binary functions (in the form of binary files) into their respective CFGs, and the formalized assembly instructions for later information extraction. The operations in this step are generally similar to the existing methods. In this section, we briefly describe them for a complete and independent introduction to our method.
To obtain the CFG from the binary file, we adopt angr [22] to disassemble the binary file into the assembly codes and formulate the corresponding CFG. The nodes in the formulated CFG refer to the basic blocks in the assembly program; the edges are the control flow instructions, e.g., jumps, between the basic blocks. Since control flow instructions have directions, the CFG is a directed graph. Moreover, we follow previous methods to further optimize the CFG based on the hidden data flow features [6,23]. After that, the final CFG of the given binary file is obtained for later processing.
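For illustration, the following is a minimal sketch of how such a per-function CFG (with the raw instructions attached to each basic block) could be recovered with angr and networkx; the function name and per-block handling here are illustrative choices rather than FUSION's exact pipeline.

```python
import angr
import networkx as nx

def extract_function_cfgs(binary_path):
    """Disassemble a binary with angr and return a directed CFG of basic blocks per function."""
    proj = angr.Project(binary_path, auto_load_libs=False)
    proj.analyses.CFGFast()                        # static CFG recovery over the whole binary
    func_cfgs = {}
    for func in proj.kb.functions.values():
        g = nx.DiGraph()
        for node in func.graph.nodes():            # func.graph: basic blocks and control-flow edges
            block = proj.factory.block(node.addr)
            insns = [f"{i.mnemonic} {i.op_str}".strip() for i in block.capstone.insns]
            g.add_node(node.addr, insns=insns)     # keep raw instructions for later formalization
        g.add_edges_from((s.addr, d.addr) for s, d in func.graph.edges())
        func_cfgs[func.name] = g
    return func_cfgs
```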
To obtain assembly codes that are friendly to the textual semantic embedding module, we formalize the raw assembly codes obtained during disassembly. Specifically, FUSION follows existing methods in using natural language processing (NLP) techniques to extract data from assembly codes. NLP techniques usually only process a limited range of tokens in a pre-defined vocabulary. Thus, a severe problem in NLP is the out-of-vocabulary (OOV) problem. The OOV problem is even more severe for code, since variables, such as constants or strings, have countless values and cannot all be recorded in a pre-defined vocabulary. Thus, to best distinguish the operands and embed their semantics, we formalize the operands in the raw assembly codes. Inspired by previous methods [23,24], we perform the formalization in the following steps:
1. The original names of opcodes and general registers are all retained, since they are finite and each carries specific and meaningful information.
2. If an operand is a memory operand, we first determine whether it is based on a base address. If it is not based on a base address and removing the instruction address does not affect the analysis of the CFG, we replace it with MEM. Operands that are based on a base address, or that are not combined with indexed addresses, are replaced with [register name + IMM].
3. We follow SAFE [23] in replacing all immediate operands over 5000 with a unified token, IMM. The finite small immediate operands are kept to embed sufficient information, because malware can misplace the stack pointer by a small value when the function returns.
4. All strings are replaced with a unified token, STR.
After preprocessing the raw assembly codes with these rules, an instruction will be converted into its formalized version for later processing. For example, an instruction such as jmp short loc40150F will be formalized into jmp MEM; while another instruction lea esi, [ecx+b8h] will be formalized into lea esi, [ecx+IMM].
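The following Python sketch illustrates these rules; the regular expressions cover only the common operand forms shown above and are a simplification of our own, not the exact implementation used in FUSION.

```python
import re

IMM_THRESHOLD = 5000   # rule 3: immediates above this become the unified token IMM

def formalize_operand(op):
    """Apply a simplified version of the formalization rules to one operand string."""
    op = op.strip()
    if op.startswith(('"', "'")):
        return "STR"                                           # rule 4: string literals
    m = re.fullmatch(r"\[(\w+)\s*\+\s*(0x)?[0-9a-fA-F]+h?\]", op)
    if m:
        return f"[{m.group(1)}+IMM]"                           # rule 2: base register + offset
    if op.startswith("[") and op.endswith("]"):
        return "MEM"                                           # rule 2: other memory operands
    if re.fullmatch(r"(0x[0-9a-fA-F]+|[0-9a-fA-F]+h|\d+)", op):
        value = int(op[:-1], 16) if op.endswith("h") else int(op, 0)
        return op if value <= IMM_THRESHOLD else "IMM"         # rule 3: large immediates
    return op                                                  # rule 1: opcodes/registers kept

def formalize_instruction(insn):
    """e.g. 'lea esi, [ecx+b8h]' -> 'lea esi, [ecx+IMM]'."""
    parts = insn.split(None, 1)
    if len(parts) == 1:
        return insn
    opcode, operands = parts
    return opcode + " " + ", ".join(formalize_operand(o) for o in operands.split(","))
```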

3.3. Extraction of Textual Semantic Information

As mentioned above, we follow previous SOTA methods to first extract the semantic information of assembly instructions for each basic block with a PTLM before embedding the whole CFG of the given binary function [10,11,12,13]. We input all formalized instructions in each basic block. One semantic embedding vector will be returned by this PTLM to represent the basic block. Similar to previous approaches, FUSION also employs a PTLM with the architecture of BERT [14]. Moreover, this PTLM is trained from scratch over the assembly code corpus. However, FUSION uses special training strategies (also known as PTTs) on the features of assembly codes to guide the PTLM to learn how to generate more precise semantic embeddings.
Specifically, inspired by the impressive performance of BERT in embedding texts, some recent methods adopted BERT to embed the semantic information for assembly codes. However, they almost directly treat the assembly instructions as natural language sentences and use MLM or NSP to pretrain a BERT-like PTLM for embedding assembly instructions [10,12,13]. We do not believe it is wise to directly regard an assembly instruction as one sentence. First, natural language sentences usually provide rich contextual semantics for tokens (i.e., words or word pieces), whereas the context in one instruction usually only provides some lexical restrictions on its tokens (i.e., keywords or operands). Solely focusing on the context within the instructions may not guide the PTLM to learn to extract precise and meaningful representations for binary codes and, therefore, restricts the value of MLM. Secondly, the relations between multiple statements in a basic block do not imply much information about the CFG [9]. As a result, the PTLM may not learn how to extract meaningful representations for basic blocks by predicting the adjacency of two instructions. This limits the benefit of NSP. Some recent methods also adopted a PTT predicting the similarities between two instructions [11,25]. However, they still treat the individual instruction as the unit to mine semantic information, which may also be less effective.
Thus, in this work, we propose to use the basic block as the contextual unit. That is, we concatenate all instructions in one basic block and consider the concatenated result as one sentence. After this modification, MLM and NSP (the latter of which is modified into neighboring block prediction, NBP) are adapted to run as follows, to better extract semantic information for the given basic block.
For MLM in FUSION, we follow the steps of the original MLM for BERT. However, since the contextual unit, i.e., a sentence in the input, is the concatenation of all instructions in a basic block, the PTLM is now guided to learn precise embeddings using larger and more meaningful contexts, which may benefit the binary similarity measurement. Specifically, we randomly chose 15% of the tokens (i.e., keywords or operands) of each basic block in the input as the masked targets. Next, we replaced 80% of the selected tokens with [MASK], replaced 10% with random tokens, and left the other 10% unchanged. The context embeddings of the selected tokens were input into one simple fully connected layer to predict the original tokens. For example, as shown in Figure 2, given a basic block mov ebx, 0x1; mov rdx, rbx, we first make its masked version [CLS] mov [MASK], 0x1; mov rdx, jz, where ebx is masked and rbx is replaced with the random token jz. We input this masked version to the PTLM and guide the PTLM to generate embeddings that can restore the masked tokens ebx and rbx.
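The following sketch illustrates the adapted MLM masking procedure over a whole basic block; the toy vocabulary and function names are illustrative.

```python
import random

MASK_TOKEN = "[MASK]"
TOY_VOCAB = ["mov", "lea", "jmp", "ebx", "rdx", "rbx", "IMM", "MEM", "STR"]  # toy vocabulary for illustration

def mask_basic_block(tokens, mask_ratio=0.15, rng=random):
    """Apply BERT-style MLM masking to one basic block (the concatenation of its instructions)."""
    tokens = list(tokens)
    labels = [None] * len(tokens)                       # None = this position is not a prediction target
    n_targets = max(1, int(len(tokens) * mask_ratio))   # 15% of the block's tokens
    for idx in rng.sample(range(len(tokens)), n_targets):
        labels[idx] = tokens[idx]                       # the PTLM must recover this original token
        r = rng.random()
        if r < 0.8:
            tokens[idx] = MASK_TOKEN                    # 80%: replace with [MASK]
        elif r < 0.9:
            tokens[idx] = rng.choice(TOY_VOCAB)         # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return tokens, labels

# e.g. mask_basic_block("mov ebx , 0x1 ; mov rdx , rbx".split())
```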
For NBP in FUSION, we designed a new adjacency prediction task that considers the features of the basic blocks in the CFG to guide the PTLM to learn to generate more meaningful embeddings for the basic blocks. Specifically, considering that the relationships between basic blocks in the CFG can be far-reaching, for one target basic block, we regarded the basic blocks within two steps as its neighbors. Given an input, there was a 30% chance that its two basic blocks were one-step neighbors, another 30% chance that the two basic blocks were two-step neighbors, and a 40% chance that the two basic blocks were not adjacent. Next, following the original NSP for BERT, we used one simple fully connected layer that took the embedding of [CLS] as the input and trained it, together with the PTLM, to predict the adjacency of the two basic blocks in the input, so as to guide the PTLM to accurately embed the basic blocks. For example, as presented in Figure 3, block1 and block2 are directly adjacent in the CFG and, thus, are one-step neighbors; block2 and block5 are connected by a directed path of length two and, thus, are two-step neighbors. As a reminder, the direction of the edges is considered: although block1 and block5 share a common neighbor, such as block2, they are not considered two-step neighbors.
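The following sketch shows how NBP training pairs could be sampled with the 30%/30%/40% scheme. Whether the prediction head is binary or distinguishes one-step from two-step neighbors is not specified above, so this sketch simply labels a pair as neighboring (within two directed steps) or not; all names are illustrative.

```python
import random

def sample_nbp_pair(cfg, blocks, rng=random):
    """Sample one (block_a, block_b, label) NBP pair; assumes the CFG has at least two basic blocks.

    cfg    : a directed graph over basic-block ids (e.g., networkx.DiGraph); edge direction matters
    blocks : {block_id: formalized token sequence of that basic block}
    label  : 1 if block_b lies within two directed steps of block_a, else 0
    """
    a = rng.choice(list(cfg.nodes()))
    one_step = set(cfg.successors(a))                                  # directly adjacent blocks
    two_step = {w for u in one_step for w in cfg.successors(u)} - one_step - {a}
    non_neighbors = [n for n in cfg.nodes() if n not in one_step | two_step | {a}]
    r = rng.random()
    if r < 0.3 and one_step:
        b, label = rng.choice(list(one_step)), 1                       # 30%: one-step neighbor
    elif r < 0.6 and two_step:
        b, label = rng.choice(list(two_step)), 1                       # 30%: two-step neighbor
    elif non_neighbors:
        b, label = rng.choice(non_neighbors), 0                        # ~40%: not adjacent
    else:
        b, label = rng.choice(list(one_step | two_step)), 1            # tiny CFGs: fall back to a neighbor
    return blocks[a], blocks[b], label
```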
After training the PTLM with the above-adapted PTTs, we can easily obtain the semantic embedding for a basic block as the previous methods do. Specifically, following [12], we used the average pooling of the hidden state in the second last layer to represent the basic block. For one fair comparison, we also used a base-sized BERT-like PTLM as the previous works did [12].
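As an illustration, the block embedding can be obtained as follows, assuming a HuggingFace-style BERT model; the checkpoint name asm-bert-base is hypothetical, since the PTLM is trained from scratch on an assembly code corpus.

```python
import torch
from transformers import BertModel, BertTokenizerFast

# Hypothetical checkpoint name: the actual PTLM is a base-sized BERT trained from scratch on assembly code.
tokenizer = BertTokenizerFast.from_pretrained("asm-bert-base")
model = BertModel.from_pretrained("asm-bert-base", output_hidden_states=True)

block = "mov ebx , IMM ; mov rdx , rbx"                  # one formalized basic block as a single sentence
inputs = tokenizer(block, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states        # tuple: input embeddings + one tensor per layer
second_last = hidden_states[-2]                          # (1, seq_len, hidden) second-to-last layer
block_embedding = second_last.mean(dim=1).squeeze(0)     # average pooling over tokens -> block vector
```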

3.4. Extraction of Structural CFG Information

After obtaining the semantic embeddings for all basic blocks in the CFG using the PTLM, the original CFG is transformed into an enhanced CFG with rich semantic embeddings for its nodes. Afterward, previous approaches usually used an order-insensitive GNN to embed the CFG [10,12], which may not precisely represent the essential node order information in the CFG. Instead, FUSION adopts a novel order-sensitive GNN to formulate the CFG embedding in a precise and efficient way. This GNN is meant to excavate the CFG structure, node semantics, and node order information within the CFG of the binary function via three steps, i.e., node information sensing, attention-based graph embedding, and Siamese-based similarity scoring.

3.4.1. Order-Sensitive Node Information Sensing

To aggregate the information of each node (basic block) in the CFG to form the final CFG representation for a binary function, we adopted a common GNN, GraphSAGE [26], to embed the semantically enhanced CFG of the binary function. Furthermore, to embed the order information of the nodes, we enhanced the original GraphSAGE by making it sensitive to the direction of the edge.
We enhanced GraphSAGE by making it separately consider the in-edges and out-edges of a node. Specifically, the original GraphSAGE gives an embedding $h_v^t$ for node $v$ at time step $t$ as:
$$h_v^t = \sigma\left(W^t \cdot \left[h_v^{t-1} \,\|\, h_{N_v}^t\right]\right)$$
$$h_{N_v}^t = \alpha\left(\left\{h_u^{t-1}, \forall u \in N_v\right\}\right)$$
where $N_v$ denotes the set of the nodes adjacent to node $v$, $\alpha(\cdot)$ denotes the aggregation function in GraphSAGE, $W^t$ denotes a learnable weight, $\|$ denotes the concatenation operation, $\sigma(\cdot)$ denotes a nonlinear activation function, and $h_v^0$ is the semantic embedding of $v$ generated by the previously introduced PTLM. Moreover, to effectively recognize the direction of the edges in the CFG, the enhanced GraphSAGE in FUSION separately collects the out-degree and in-degree nodes of $v$ into $N_v^o$ and $N_v^i$, respectively. As a result, the order-sensitive embedding $h_v^t$ for node $v$ at time step $t$, which is finally used in FUSION, is calculated as:
$$h_v^t = \sigma\left(W^t \cdot \left[h_v^{t-1} \,\|\, h_{N_v}^t\right]\right)$$
$$h_{N_v}^t = \left[W_o \cdot \alpha\left(\left\{h_p^{t-1}\right\}\right) \,\|\, W_i \cdot \alpha\left(\left\{h_q^{t-1}\right\}\right)\right]$$
where $W_o$ and $W_i$ are two distinct learnable weight matrices, and $p \in N_v^o$, $q \in N_v^i$. Moreover, FUSION uses this enhanced GraphSAGE to aggregate the information for three steps to obtain the order-sensitive embedding $h_v^3$ for each node $v$ in the CFG, during which ReLU and mean are used as the activation and aggregation functions, respectively.
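A minimal PyTorch sketch of one such direction-aware aggregation step is given below; the dense-adjacency formulation and layer name are illustrative simplifications of the equations above, not the exact implementation.

```python
import torch
import torch.nn as nn

class OrderSensitiveSAGELayer(nn.Module):
    """One step of the direction-aware GraphSAGE update sketched above (dense-adjacency formulation)."""

    def __init__(self, dim):
        super().__init__()
        self.W_t = nn.Linear(3 * dim, dim)              # W^t : combines h_v with [out-msg || in-msg]
        self.W_o = nn.Linear(dim, dim, bias=False)      # W_o : successor (out-edge) branch
        self.W_i = nn.Linear(dim, dim, bias=False)      # W_i : predecessor (in-edge) branch

    def forward(self, h, adj):
        """h: (N, dim) node embeddings; adj: (N, N) with adj[u, v] = 1 for a directed edge u -> v."""
        out_adj, in_adj = adj, adj.t()                  # row u of adj lists successors; transpose lists predecessors
        deg_out = out_adj.sum(1, keepdim=True).clamp(min=1)
        deg_in = in_adj.sum(1, keepdim=True).clamp(min=1)
        msg_out = self.W_o(out_adj @ h / deg_out)       # mean-aggregate successor embeddings, then W_o
        msg_in = self.W_i(in_adj @ h / deg_in)          # mean-aggregate predecessor embeddings, then W_i
        h_nbr = torch.cat([msg_out, msg_in], dim=-1)    # h_N(v) = [W_o . mean(...) || W_i . mean(...)]
        return torch.relu(self.W_t(torch.cat([h, h_nbr], dim=-1)))   # sigma = ReLU, as in FUSION
```

Stacking three such layers yields the three-step, order-sensitive node embeddings described above.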

3.4.2. Attention-Based Graph Embedding

After obtaining the embedding of each node $v$ in the CFG, the next task is to combine them into one overall embedding $h$ for the whole CFG. There are two methods to combine the embeddings of all nodes, i.e., averaging and attention-weighted averaging [26]. Considering the intuition from reverse analysis that certain nodes in the CFG should carry more importance in representing the whole function [9], we used attention-based averaging to combine the embeddings of the nodes. The attention weights $a$ on a specific CFG are calculated based on the context $c$ as:
$$a_v = \mathrm{relu}\left(U_v^{T} \cdot c\right)$$
$$c = \mathrm{relu}\left(\frac{1}{|V|} \sum_{v=1}^{|V|} U_v \cdot W\right)$$
where $U \in \mathbb{R}^{N \times D}$ denotes the order-sensitive node embedding matrix (with $U_v$ the embedding of node $v$), $|V|$ denotes the number of nodes in the CFG, and $W$ is a learnable matrix. By learning the matrix $W$, the order-sensitive GNN can sense different CFGs and offer a specific $c$ to automatically place proper attention $a_v$ on the nodes in the CFG. Moreover, we obtain the final embedding $h$ for the whole CFG as:
$$h = \sum_{v=1}^{N} a_v \cdot U_v$$
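The attention-based pooling can be sketched as follows, assuming PyTorch; it follows the equations above, with names chosen for illustration.

```python
import torch
import torch.nn as nn

class AttentionGraphPooling(nn.Module):
    """Attention-weighted pooling of the order-sensitive node embeddings into one CFG embedding (sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # learnable matrix W used to build the context c

    def forward(self, U):
        """U: (|V|, dim) matrix of order-sensitive node embeddings for one CFG."""
        c = torch.relu(self.W(U).mean(dim=0))      # graph-specific context vector
        a = torch.relu(U @ c)                      # attention weight a_v = relu(U_v . c)
        return (a.unsqueeze(-1) * U).sum(dim=0)    # h = sum_v a_v * U_v
```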

3.4.3. Siamese-Based Similarity Scoring

The final step of the order-sensitive GNN in FUSION is to score the similarity of a pair of given binary functions $(i, j)$. To realize this goal, we use a neural network structure called the Siamese network to measure the similarity based on the CFG embeddings of $i$ and $j$, as the previous approaches do [10,11,12,13]. Moreover, this score is also used to guide the training of the GNN, with a loss defined as:
$$\mathrm{Loss}_{\mathrm{MSE}} = \frac{1}{|D_{\mathrm{train}}|} \sum_{(i,j) \in D_{\mathrm{train}}} \left(\hat{y}_{i,j} - y_{i,j}\right)^2$$
where $D_{\mathrm{train}}$ denotes the set of binary function pairs in the training set, and $\hat{y}_{i,j}$ and $y_{i,j}$ are the predicted and reference similarity scores, respectively. Moreover, this loss is optimized by a stochastic gradient descent (SGD) optimizer during training.
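A sketch of the Siamese scoring and the MSE loss is shown below, assuming PyTorch; the use of cosine similarity as the comparison function is an assumption for illustration, since the exact scoring head is not specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseScorer(nn.Module):
    """Score a pair of binary functions with a shared encoder; trained with the MSE loss above (sketch)."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder   # the shared PTLM + order-sensitive GNN pipeline producing CFG embeddings

    def forward(self, func_i, func_j):
        h_i, h_j = self.encoder(func_i), self.encoder(func_j)
        return F.cosine_similarity(h_i, h_j, dim=-1)   # predicted similarity score y_hat (assumed head)

def mse_loss(y_hat, y):
    """Loss_MSE = mean over the training pairs of (y_hat - y)^2."""
    return torch.mean((y_hat - y) ** 2)
```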

4. Evaluation

In this section, we evaluate the proposed FUSION in terms of its performance and its efficiency. We will first introduce the experimental setup. Next, we report the comparison results on the performance between FUSION and the baseline methods on the binary functions in a large-scale dataset. Afterward, we present the ablation results on the proposed special PTTs and novel order-sensitive GNN for extracting the textual and CFG data, respectively. Moreover, we discuss the efficiency of FUSION.

4.1. Experimental Setup

4.1.1. Datasets

We built our datasets based on a recent binary function set for similarity measurements established in [11]. This set consists of the binary functions of 14 common libraries or applications, with 1,577,688 binary functions in total. It provides binary functions in 4 optimization levels (O0, O1, O2, O3) compiled using GCC. It was also used in many other works [27]. Specifically, we compiled software projects based on “makefile” by specifying CFLAGS (to set the optimization flag), CC (to set the cross-compiler), and host (to set the cross-compilation target architecture). We compiled to dynamic shared objects but resorted to static linking when we encountered build errors. We compiled all projects with these treatments.
When preparing the function pairs for evaluation, to make the dataset more meaningful, we first removed duplicate functions in the function set, following previous studies [6,23]. The duplicated functions were identified based on their names and instruction hashes. After that, we followed [4,15] to create similar binary function pairs by collecting two functions originating from the same source code (but having unequal attributes, such as the optimization level). Dissimilar pairs were prepared from functions originating from different source codes. The ratio of these two types of binary function pairs was 1:5. Finally, we split these pairs into two separate subsets for training and testing at a ratio of 3:1.
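The pair construction described above can be sketched as follows; the data layout and function names are illustrative, assuming each source-level function maps to its (non-empty) list of compiled binary variants.

```python
import random

def build_pairs(functions, neg_ratio=5, train_frac=0.75, seed=0):
    """Build similar/dissimilar pairs (ratio 1:5) and a 3:1 train/test split (illustrative sketch).

    functions : {source_id: [binary variants compiled from that source function]}; assumes >= 2 sources.
    """
    rng = random.Random(seed)
    sources = list(functions)
    pairs = []
    for src, variants in functions.items():
        if len(variants) < 2:
            continue
        a, b = rng.sample(variants, 2)
        pairs.append((a, b, 1))                                   # same source code -> similar pair
        for _ in range(neg_ratio):
            other = rng.choice([s for s in sources if s != src])
            pairs.append((a, rng.choice(functions[other]), 0))    # different source code -> dissimilar pair
    rng.shuffle(pairs)
    cut = int(len(pairs) * train_frac)
    return pairs[:cut], pairs[cut:]                               # training subset, testing subset
```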
To make the evaluation systematic, we followed [11] to organize three datasets with the distinct setups used by existing works, as follows: In XO, for a source code, only two binary functions that are compiled with different optimization levels (but with the same compiler version and architecture) are seen as similar. In XC, two binary functions that are compiled with different compiler versions can be regarded as similar. In XM, any two binary functions compiled from the same source code are considered similar.

4.1.2. Performance Metric

Because the binary similarity measurement can be considered a binary classification problem, we used relevant metrics in our evaluation. In particular, since there is no general threshold to discretize the similarity score, the chosen threshold largely determines the model's predictive performance. To avoid the potential bias introduced by a specific threshold value, we considered the receiver operating characteristic (ROC) curve, which measures the model's false positives/true positives under different thresholds. The horizontal and vertical coordinates of the ROC are the false positive rate (FPR) and the true positive rate (TPR), which are calculated as:
$$TPR = \frac{TP}{TP + FN}$$
$$FPR = \frac{FP}{FP + TN}$$
where TP and FP are the numbers of true positives and false positives, and TN and FN are the numbers of true negatives and false negatives. Notably, we used the area under the curve (AUC) of the ROC curve to quantify the accuracy of FUSION and facilitate benchmarking: the higher the AUC score, the better the model's accuracy.
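For reference, the AUC can be computed from the raw similarity scores and labels as in the following sketch, assuming scikit-learn; the scores and labels shown are illustrative.

```python
from sklearn.metrics import roc_auc_score, roc_curve

labels = [1, 0, 1, 1, 0, 0]                        # 1 = similar pair, 0 = dissimilar pair
scores = [0.92, 0.35, 0.80, 0.66, 0.41, 0.15]      # model similarity scores for the pairs

fpr, tpr, thresholds = roc_curve(labels, scores)   # FPR/TPR at every possible threshold
print("AUC =", roc_auc_score(labels, scores))      # threshold-free summary of accuracy
```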

4.1.3. Baselines

In the evaluation, we introduce as baselines two state-of-the-art methods that provide usable code to construct and train their models on our datasets. We compare FUSION with them in terms of performance and efficiency.
The first baseline is SAFE [23]. SAFE is representative of the methods that solely employ NLP encoders to embed the semantic information of binary codes. It adopts a self-attentive text encoder to generate semantic embeddings for given binary codes. The similarities between codes are based on the semantic embeddings. SAFE aims for one time-efficient solution to the binary similarity measurement task.
The other baseline is PALMTREE [12]. Similar to FUSION, PALMTREE first adopts a PTLM to embed the assembly instructions and then adopts a graph embedding model to embed the enhanced CFG with rich semantic embeddings for nodes. The PTLM in PALMTREE is trained with MLM and another two special PTTs, but all PTTs in PALMTREE consider instructions as the semantic units. The graph-embedding model in PALMTREE is a structure2vec model from Gemini [4].
As a reminder, there are other approaches for binary similarity measurements [10,19]. However, they do not release their code, which makes it difficult to replicate their methods. Moreover, as PALMTREE is the latest approach with promising performance, we do not compare FUSION with those earlier, code-unavailable methods.
We introduce two variants of FUSION to perform the ablation study, so as to understand the benefits of the two improvements in FUSION, i.e., the basic block-based PTTs and the order-sensitive GNN. The two variants are (1) FUSION-P, whose PTLM is trained using the original PTTs for BERT; and (2) FUSION-O, whose GNN is an order-insensitive GraphSAGE.

4.2. Performance Comparison Result

We compare the binary similarity measurement performance of FUSION and the two baselines over the three datasets with distinct setups. To avoid randomness, we ran each method three times and report the mean AUC. Moreover, we show the results of each project for a concrete comparison.
Table 1 presents the results. From the results, we see that FUSION can significantly outperform the two SOTA baselines. Specifically, FUSION improves the average AUC of SAFE and PALMTREE by 7.70 and 4.90 over XO, 10.01 and 3.59 over XC, and 7.78 and 3.19 over XM, respectively. Moreover, FUSION shows fairly stable performances in all three datasets. In comparison, SAFE and PALMTREE are less effective in XC and XO, respectively. Moreover, FUSION stably outperforms the two baselines in all projects. From the above findings, we conclude that FUSION is able to effectively measure the similarity of binary functions in three mainstream setups and delivers a new SOTA performance in the binary similarity measurement task.

4.3. Ablation Study on Performance

To further understand the performance of FUSION, we carried out an ablation study on the two improvements and compared the performances of FUSION-P and FUSION-O, as introduced in Section 4.1.3. Considering that the performances are fairly consistent across different projects and setups, we show the average result on the more challenging XM dataset as the representative.
Figure 4 presents the ROCs of two ablation variants, the full FUSION, and two baselines. Compared to FUSION, we found that FUSION-P performs significantly worse. This shows the importance of using code-specific PTTs that consider the basic block as the semantic unit. Meanwhile, we also found that FUSION-O is less effective than FUSION. This demonstrates that the order of nodes can lead to a more precise embedding for the CFG.
In addition, we found that both FUSION-O and FUSION-P showed better or competitive performances to the baselines. This also confirms the effectiveness of the two improvements in FUSION. More specifically, we noticed that FUSION-O significantly outperformed both baselines, especially when a low false positive rate was allowed. This confirms that our code-specific PTTs can be quite helpful in reducing the false positive rate and improving the precision of this task. PALMTREE also uses MLM and proposes some code-specific PTTs but still regards the instruction as the semantic unit. The superiority of FUSION to PALMTREE shows the helpfulness to regard the basic block as the semantic unit. Moreover, we notice that FUSION-P surpasses SAFE and is competitive with PALMTREE, notably under a low false positive rate. This shows the benefit of using a GNN, especially an order-sensitive one, to extract precise information for the CFGs of binary codes.

4.4. Efficiency Comparison Result

In addition to the performance, we are interested in the efficiency of FUSION. Thus, we compare the total running time of every method in each project. Since different projects have different numbers of cases, to signify the efficiency differences, we report the results of four projects, in which FUSION takes more than 200 s to handle all test samples. Following the previous studies, we are mainly interested in the time for measuring similarity for the given function pairs.
Figure 5 shows the results. In general, FUSION is on par with the two state-of-the-art baselines in terms of efficiency. Moreover, we found that SAFE takes the least time to finish the tasks. However, as discussed above, SAFE has the worst performance. Meanwhile, PALMTREE is the least efficient in general, yet this does not make it the most effective. FUSION is better than (or on par with) PALMTREE in terms of efficiency, and it also performs better than PALMTREE. To summarize, these findings demonstrate that FUSION can balance efficiency and performance and provides an efficient SOTA solution to the binary similarity measurement task.

5. Conclusions and Future Work

Binary similarity measurement is a popular research area in binary analysis. It plays a critical role in many areas of system security research as well. Recent works have proposed many methods based on deep learning and have shown great improvements in this task. However, previous methods paid little attention to code-specific features when preparing their semantic extraction modules. Moreover, they barely considered the direction of edges within the CFG, or realized this in less efficient ways. To alleviate these weaknesses, we proposed a novel method named FUSION. It uses two code-specific PTTs that take basic blocks as semantic units to guide the PTLM to learn to generate precise semantic embeddings for assembly codes. Moreover, it employs a novel order-sensitive GNN to embed the CFG, in which the edge direction is considered, to represent the CFG in a precise and efficient manner. We evaluated FUSION using one widely-used benchmark. The results show that the two improvements help FUSION gain higher precision and efficiency and outperform previous methods by large margins. In the future, we will evaluate our method on more datasets and try to apply it in practice. Moreover, we are interested in whether FUSION can benefit the downstream tasks in system security research.

Author Contributions

Conceptualization, H.G.; methodology, H.G.; software, H.G., T.Z.; validation, H.G., T.Z. and S.C.; formal analysis, H.G.; investigation, H.G.; resources, H.G.; data curation, H.G. and T.Z.; writing—original draft preparation, H.G.; writing—review and editing, H.G. and T.Z. and S.C.; visualization, H.G. and F.Y.; supervision, L.W.; project administration, L.W.; funding acquisition, L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (U1836112, 61876134) and the National Key R&D Program of China (no. 2020YFB1805400, no. 2021YFB3100700).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Brumley, D.; Poosankam, P.; Song, D.; Zheng, J. Automatic patch-based exploit generation is possible: Techniques and implications. In Proceedings of the 2008 IEEE Symposium on Security and Privacy (sp 2008), Oakland, CA, USA, 18–21 May 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 143–157. [Google Scholar]
  2. Bayer, U.; Comparetti, P.M.; Hlauschek, C.; Kruegel, C.; Kirda, E. Scalable, behavior-based malware clustering. NDSS 2009, 9, 8–11. [Google Scholar]
  3. Jang, J.; Woo, M.; Brumley, D. Towards automatic software lineage inference. In Proceedings of the 22nd USENIX Security Symposium (USENIX Security 13), Washington, DC, USA, 14–16 August 2013; pp. 81–96. [Google Scholar]
  4. Xu, X.; Liu, C.; Feng, Q.; Yin, H.; Song, L.; Song, D.X. Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Dallas, TX, USA, 30 October–3 November 2017. [Google Scholar]
  5. Duan, Y.; Li, X.; Wang, J.; Yin, H. Deepbindiff: Learning program-wide code representations for binary diffing. In Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, USA, 23–26 February 2020. [Google Scholar]
  6. David, Y.; Alon, U.; Yahav, E. Neural reverse engineering of stripped binaries using augmented control flow graphs. Proc. Acm Program. Lang. 2020, 4, 1–28. [Google Scholar] [CrossRef]
  7. Massarelli, L.; Luna, G.A.D.; Petroni, F.; Querzoni, L.; Baldoni, R. Investigating Graph Embedding Neural Networks with Unsupervised Features Extraction for Binary Analysis. In Proceedings of the 2019 Workshop on Binary Analysis Research, San Diego, CA, USA, 24–27 February 2019. [Google Scholar]
  8. Ding, S.H.H.; Fung, B.C.M.; Charland, P. Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2019; pp. 472–489. [Google Scholar]
  9. Zuo, F.; Li, X.; Zhang, Z.; Young, P.; Luo, L.; Zeng, Q. Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs. arXiv 2019, arXiv:1808.04706. [Google Scholar]
  10. Yu, Z.; Cao, R.; Tang, Q.; Nie, S.; Huang, J.; Wu, S. Order Matters: Semantic-Aware Neural Networks for Binary Code Similarity Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
  11. Pei, K.; Xuan, Z.; Yang, J.; Jana, S.S.; Ray, B. Trex: Learning Execution Semantics from Micro-Traces for Binary Similarity. arXiv 2020, arXiv:2012.08680. [Google Scholar]
  12. Li, X.; Yu, Q.; Yin, H. PalmTree: Learning an Assembly Language Model for Instruction Embedding. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, Republic of Korea, 15–19 November 2021. [Google Scholar]
  13. Gui, Y.; Wan, Y.; Zhang, H.; Huang, H.; Sui, Y.; Xu, G.; Shao, Z.; Jin, H. Cross-Language Binary-Source Code Matching with Intermediate Representations. arXiv 2022, arXiv:2201.07420. [Google Scholar]
  14. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar]
  15. Feng, Q.; Zhou, R.; Xu, C.; Cheng, Y.; Testa, B.; Yin, H. Scalable Graph-based Bug Search for Firmware Images. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016. [Google Scholar]
  16. Liu, B.; Huo, W.; Zhang, C.; Li, W.; Li, F.; Piao, A.; Zou, W. α Diff: Cross-Version Binary Code Similarity Detection with DNN. In Proceedings of the 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE), Montpellier, France, 3–7 September 2018; pp. 667–678. [Google Scholar]
  17. Li, Y.; Gu, C.; Dullien, T.; Vinyals, O.; Kohli, P. Graph Matching Networks for Learning the Similarity of Graph Structured Objects. In Proceedings of the ICML International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019. [Google Scholar]
  18. Ling, X.; Wu, L.; Wang, S.; Ma, T.; Xu, F.; Wu, C.; Ji, S. Hierarchical Graph Matching Networks for Deep Graph Similarity Learning. arXiv 2020, arXiv:2007.04395. [Google Scholar]
  19. Yu, Z.; Zheng, W.; Wang, J.; Tang, Q.; Nie, S.; Wu, S. CodeCMR: Cross-Modal Retrieval For Function-Level Binary Source Code Matching. In Proceedings of the NeurIPS 2020, Vancouver, BC, Canada, 6–12 December 2020. [Google Scholar]
  20. Bruna, J.; Zaremba, W.; Szlam, A.D.; LeCun, Y. Spectral Networks and Locally Connected Networks on Graphs. CoRR 2014, arXiv:1312.6203. [Google Scholar]
  21. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio’, P.; Bengio, Y. Graph Attention Networks. arXiv 2018, arXiv:1710.10903. [Google Scholar]
  22. Shoshitaishvili, Y.; Wang, R.; Salls, C.; Stephens, N.; Polino, M.; Dutcher, A.; Grosen, J.; Feng, S.; Hauser, C.; Kruegel, C.; et al. SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis. In Proceedings of the IEEE Symposium on Security and Privacy, San Jose, CA, USA, 22–26 May 2016. [Google Scholar]
  23. Massarelli, L.; Luna, G.A.D.; Petroni, F.; Querzoni, L.; Baldoni, R. SAFE: Self-Attentive Function Embeddings for Binary Similarity. In Proceedings of the Detection of Intrusions and Malware, and Vulnerability Assessment—16th International Conference, DIMVA 2019, Gothenburg, Sweden, 19–20 June 2019; Springer: Berlin/Heidelberg, Germany, 2019; Volume 11543, pp. 309–329. [Google Scholar]
  24. Li, Y.; Wang, B.; Hu, B. Semantically find similar binary codes with mixed key instruction sequence. Inf. Softw. Technol. 2020, 125, 106320. [Google Scholar] [CrossRef]
  25. Luo, Z.; Wang, B.; Tang, Y.; Xie, W. Semantic-Based Representation Binary Clone Detection for Cross-Architectures in the Internet of Things. Appl. Sci. 2019, 9, 3283. [Google Scholar] [CrossRef] [Green Version]
  26. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. In Proceedings of the 2017 Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  27. Marcelli, A.; Graziano, M.; Ugarte-Pedrero, X.; Fratantonio, Y.; Mansouri, M.; Balzarotti, D. How Machine Learning Is Solving the Binary Function Similarity Problem. In Proceedings of the 31st USENIX Security Symposium (USENIX Security 22), Boston, MA, USA, 10–12 August 2022. [Google Scholar]
Figure 1. Overview of FUSION to measure the similarity between the binary functions FUN_A and FUN_N.
Figure 2. Illustration of the adapted masked language model (MLM).
Figure 3. Illustration of the Adapted Neighboring Basic-block Prediction (NBP).
Figure 4. Comparison among ROCs of ablation variants and other methods, which shows the importance of using code-specific PTTs that consider the basic block as the semantic unit.
Figure 5. Efficiency (sec.) comparison between FUSION and baseline methods, which demonstrate that FUSION can balance the efficiency and performance to provide one efficient SOTA solution.
Table 1. Performance (AUC) comparison between FUSION and baseline methods.
| Project | XO: SAFE | XO: PALMTREE | XO: FUSION | XC: SAFE | XC: PALMTREE | XC: FUSION | XM: SAFE | XM: PALMTREE | XM: FUSION |
| BinUtils | 86.89 | 91.13 | 96.73 | 85.39 | 92.00 | 95.16 | 88.30 | 92.72 | 96.87 |
| BusyBox | 87.66 | 91.74 | 96.62 | 85.01 | 92.87 | 96.63 | 88.36 | 91.29 | 95.63 |
| Curl | 88.03 | 91.12 | 94.10 | 85.24 | 92.09 | 97.15 | 87.85 | 92.41 | 95.34 |
| CoreUtils | 89.14 | 90.66 | 95.85 | 85.78 | 91.69 | 95.70 | 87.02 | 91.39 | 96.54 |
| DiffUtils | 88.57 | 91.35 | 95.64 | 86.19 | 91.86 | 95.28 | 88.41 | 91.49 | 95.38 |
| FindUtils | 89.52 | 91.74 | 96.35 | 86.85 | 92.18 | 95.41 | 87.42 | 91.62 | 95.02 |
| GMP | 86.90 | 89.09 | 95.01 | 85.78 | 92.24 | 95.23 | 88.00 | 92.24 | 96.43 |
| ImageMagick | 88.33 | 91.70 | 95.49 | 86.10 | 91.29 | 95.89 | 88.05 | 91.29 | 96.15 |
| Libmicrohttpd | 87.63 | 89.02 | 95.36 | 85.36 | 91.73 | 94.10 | 87.54 | 91.00 | 96.49 |
| LibTomCrypt | 88.04 | 90.67 | 95.76 | 86.10 | 91.04 | 96.15 | 87.34 | 93.48 | 95.18 |
| OpenSSL | 87.63 | 91.75 | 96.29 | 85.78 | 93.74 | 94.98 | 87.77 | 91.16 | 96.58 |
| Putty | 86.88 | 89.70 | 94.44 | 86.98 | 93.33 | 96.43 | 88.00 | 93.38 | 95.65 |
| SQLite | 86.90 | 89.31 | 94.51 | 83.37 | 91.54 | 94.78 | 86.17 | 93.38 | 96.26 |
| Zlib | 86.62 | 88.07 | 94.34 | 84.16 | 91.00 | 96.02 | 85.46 | 93.33 | 94.02 |
| Average | 87.80 | 90.60 | 95.50 | 85.66 | 92.08 | 95.67 | 87.60 | 92.19 | 95.38 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
