In the methodology of this paper, we do not undertake the pre-training task ourselves; instead, the pre-training is handled by GraphCodeBERT [28]. This is appropriate because the Solidity language, specifically designed for Ethereum smart contract development, exhibits features such as static typing, inheritance, libraries, and complex user-defined types, drawing influences from JavaScript, Python, and C++. During its pre-training phase, GraphCodeBERT utilized 2.3 million functions from six programming languages and three pre-training tasks: language modeling over source code, predicting edges in the program data flow graph, and aligning variables between the program source code and its graph structure. Our method loads the pre-trained parameters from GraphCodeBERT and then fine-tunes them on a smart contract dataset. The workflow of this method is shown in Figure 1 and is divided into two stages: (a) the Graph Generation Phase, during which the source code is converted into an AST, the AST is parsed to retain different nodes and edges, and a heterogeneous contract semantic graph is generated; and (b) the Vulnerability Detection Phase, in which smart contract vulnerabilities are detected using the pre-trained model. We will now elaborate on these two stages.
3.1. Graph Generation Phase
In the process of generating the AST for Solidity code, most existing work, such as [10,30], predominantly employs the tool tree-sitter-solidity, which is known for its efficiency and incremental parsing capabilities. However, for Solidity code, the AST generated by this tool does not contain sufficiently detailed node information. For example, consider the call msg.sender.call at line 11 of the reentrancy vulnerability example code in Listing A1: Figure 2a shows a portion of the AST nodes obtained from tree-sitter-solidity, whereas Figure 2b shows a portion of the AST nodes parsed by the official Solidity compiler, solc. It is evident that the AST nodes parsed by tree-sitter-solidity only include type and text information in their description, lacking detailed metadata. In contrast, the AST generated by solc not only includes structural information within the AST hierarchy but also provides rich descriptive information for nodes, such as id, src, nodeType, and typeDescriptions, and in particular offers standardized descriptions in typeIdentifier and typeString for nodes critical to vulnerabilities, such as the low-level function call call(). Given that solc has undergone multiple iterations since 2015, the AST structure and node information generated by different versions of solc can vary. The tool solc-typed-ast enables the creation of normalized, typed Solidity ASTs based on solc, mitigating the differences introduced by different compiler versions. Therefore, we opt to use solc-typed-ast to generate the code AST.
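For a concrete view of the richer node metadata discussed above, the sketch below obtains solc's compact AST JSON for a contract and prints each node's typeDescriptions. It assumes a locally installed solc that supports the --ast-compact-json flag (solc-typed-ast normalizes this output across compiler versions); the file name is hypothetical, and the snippet is an illustration rather than our exact pipeline.

```python
import json
import subprocess

def solc_compact_ast(sol_path: str) -> dict:
    """Run solc and return the compact AST JSON for one .sol file.

    Assumes a solc binary on PATH that supports --ast-compact-json;
    solc-typed-ast builds its normalized typed AST on top of this output.
    """
    out = subprocess.run(
        ["solc", "--ast-compact-json", sol_path],
        capture_output=True, text=True, check=True,
    ).stdout
    # solc prints a short header before the JSON document; keep only the JSON.
    return json.loads(out[out.index("{"):])

def walk(node, visit):
    """Depth-first walk over the dict/list structure of the compact AST."""
    if isinstance(node, dict):
        visit(node)
        for value in node.values():
            walk(value, visit)
    elif isinstance(node, list):
        for item in node:
            walk(item, visit)

def show(node):
    if "nodeType" in node:
        type_desc = node.get("typeDescriptions", {}).get("typeString")
        print(node.get("id"), node["nodeType"], type_desc)

if __name__ == "__main__":
    ast = solc_compact_ast("Reentrancy.sol")  # hypothetical input file
    walk(ast, show)
```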
The AST generated by the solc-typed-ast tool includes 56 types of nodes, which we categorize into three classes based on their functionality: structure and operation class nodes, variable and call class nodes, and other class nodes. The detailed classification is shown in Table A1 in Appendix A. The structure and operation class nodes encompass the structure and control information of an entire ‘.sol’ file’s AST, from top to bottom, including files, contracts, functions, statements, and operations. We utilize this information to construct the edges of the heterogeneous contract semantic graph, covering both data flows and control flows. The variable and call class nodes, as components of statements, are constructed as nodes within the heterogeneous contract semantic graph. The other class nodes either do not contribute to vulnerability generation (e.g., import statements and structured documentation) or are basic nodes, such as names, whose information is already encapsulated by higher-level nodes; these can be omitted.
Specifically, in high-level programming languages, the basic units of execution are statements, including both simple and compound statements. Simple statements consist of a single logical line, such as expression statements (ExpressionStatement) and assignment statements (Assignment), whereas compound statements contain other statements (groups of statements) that influence or control the execution of the included statements in some manner, such as if statements (IfStatement) and for statements (ForStatement). The logic within simple statements performs operations on variables and function calls. In our proposed method for constructing heterogeneous contract semantic graphs, we analyze each simple statement in the contract according to the typed-AST structure, extracting variable and function call nodes to serve as nodes within the heterogeneous contract semantic graph. It is important to note that Solidity includes certain special variables that are always present in the global namespace. These built-in variables are essentially of a basic type and describe attributes of blocks and transactions, such as block.prevrandao and msg.sender [34]. Although these variables are not explicitly declared, they are still integrated as nodes in our graph. Solidity also allows mappings, arrays, and structures to be manipulated through dereferencing, but we do not treat the results of such dereferences (e.g., balances[msg.sender]) as individual nodes. Instead, we regard these structures as a whole, which facilitates their inclusion as nodes when constructing the contract semantic graph.
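As a rough illustration of this extraction step, the sketch below walks an AST exported as JSON and collects variable and function-call nodes. The node-type names follow solc's compact AST, and the traversal policy (keeping member and index accesses whole) is a simplified assumption rather than our exact rules.

```python
# Illustrative sketch: collect variable and function-call nodes from a
# solc-style compact AST (a nested dict/list structure loaded from JSON).

def collect_graph_nodes(ast_node, variables=None, calls=None):
    if variables is None:
        variables, calls = [], []
    if isinstance(ast_node, dict):
        node_type = ast_node.get("nodeType")
        if node_type == "FunctionCall":
            calls.append(ast_node)        # call node; keep walking its arguments
        elif node_type == "Identifier":
            variables.append(ast_node)    # plain variable reference
        elif node_type in ("MemberAccess", "IndexAccess"):
            # Built-ins such as msg.sender and dereferences such as
            # balances[msg.sender] are kept as a single, whole node.
            variables.append(ast_node)
            return variables, calls
        for child in ast_node.values():
            collect_graph_nodes(child, variables, calls)
    elif isinstance(ast_node, list):
        for child in ast_node:
            collect_graph_nodes(child, variables, calls)
    return variables, calls
```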
The relationships among simple statements—such as sequence, loops, and conditionals—act as control flows, while the operations within simple statements (e.g., BinaryOperation and IndexAccess) and the inputs and outputs involved in function calls are constructed as data flows. Take the code shown in Listing 1 as an example:
Listing 1. Sample Code.
The visualization of the typed-AST generated by the solc-typed-ast tool is shown in Figure 3. Based on the structural information in the typed-AST, the example code is transformed into a heterogeneous contract semantic graph, as shown in Figure 4.
The variable node features in the contract graph are represented by a four-tuple, F = (id, name, nodeType, typeDescriptions). The function call node features in the contract graph are represented by a five-tuple, F = (id, name, nodeType, typeDescriptions, parameter), where id is the node identifier, name is the variable name (or function name), nodeType is the node type, typeDescriptions is the specific description of the node type, and parameter is the function parameter. The edge features in the contract graph are represented by a four-tuple, F = (order, type, $v_s$, $v_e$), where order is the temporal order, type is the edge type (control flow or data flow), and $v_s$ and $v_e$ are the start and end nodes. The resulting contract graph is denoted as $G = (V, E)$, where $V$ is the node set and $E$ is the edge set.
Figure 4b shows the heterogeneous contract semantic graph built from the contract containing the vulnerability, and Figure 4c shows its edge features.
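The sketch below shows one way these feature tuples could be materialized in code; the dataclass and field names mirror the tuples defined above, while everything else (including the example values) is an illustrative assumption.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VariableNode:            # four-tuple (id, name, nodeType, typeDescriptions)
    id: int
    name: str
    node_type: str
    type_descriptions: str

@dataclass
class CallNode(VariableNode):  # five-tuple adds the function parameters
    parameters: List[str] = field(default_factory=list)

@dataclass
class Edge:                    # four-tuple (order, type, v_s, v_e)
    order: int                 # temporal order of the statement
    kind: str                  # "control_flow" or "data_flow"
    source: int                # id of the start node v_s
    target: int                # id of the end node v_e

@dataclass
class ContractGraph:           # G = (V, E)
    nodes: List[VariableNode] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

# Example: the built-in msg.sender variable feeding a low-level call() node.
g = ContractGraph()
g.nodes.append(VariableNode(7, "msg.sender", "MemberAccess", "address payable"))
g.nodes.append(CallNode(12, "call", "FunctionCall",
                        "function (bytes memory) payable returns (bool, bytes memory)",
                        ["msg.value"]))
g.edges.append(Edge(order=1, kind="data_flow", source=7, target=12))
```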
3.2. Vulnerability Detection Phase
This section offers an in-depth description of how the proposed approach utilizes a graph-based pre-trained model to identify vulnerabilities in smart contracts, covering data preparation, the model architecture, and the training procedure.
First is the data preparation process. This method is based on GraphCodeBERT [28], but it differs in how it handles graph-structured data. In the original GraphCodeBERT architecture, given a source code $C$, a corresponding data flow graph $G(C) = (V, E)$ can be obtained, where $V$ is the set of variables (the nodes of the data flow graph) and $E$ is the set of directed edges indicating the dependencies among variables. The source code and the set of variables are merged into a sequence $I = \{[CLS], C, [SEP], V\}$, with $[CLS]$ as a special token preceding the sets and $[SEP]$ as a separator between the source code $C$ and the variable set $V$. For each token in the sequence $I$, we generate the corresponding position embedding and add it to the token embedding to represent the token. Special position embeddings are assigned to all variables to signify their roles as nodes within the data flow. The final input representation, denoted as $X$, is derived in this manner.
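As a rough sketch of this step, the snippet below builds the merged sequence with the public GraphCodeBERT tokenizer and assigns a shared special position index to the graph-node tokens. The checkpoint name microsoft/graphcodebert-base is the released model; the one-sub-token-per-node and position-indexing choices are simplified assumptions of ours.

```python
from transformers import AutoTokenizer

# Public GraphCodeBERT checkpoint; only the tokenizer is needed here.
tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")

def build_input(source_code: str, node_names: list[str], max_len: int = 512):
    """Build the merged sequence I = [CLS] code [SEP] nodes and its positions.

    Simplified assumptions: one sub-token per graph node, ordinary positions
    for code tokens, and a single shared position index (0) for node tokens
    to mark them as graph nodes rather than text.
    """
    code_tokens = tokenizer.tokenize(source_code)
    node_tokens = [tokenizer.tokenize(name)[0] for name in node_names]

    tokens = ([tokenizer.cls_token] + code_tokens
              + [tokenizer.sep_token] + node_tokens)[:max_len]
    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    n_text = min(len(code_tokens) + 2, len(tokens))     # [CLS] + code + [SEP]
    position_ids = list(range(n_text)) + [0] * (len(tokens) - n_text)
    return input_ids, position_ids

ids, pos = build_input("function withdraw() public { ... }",
                       ["msg.sender", "balances[msg.sender]", "call"])
```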
In our method, the embedding for the source code text follows the approach used in GraphCodeBERT. However, the embedding for the graph structure differs, because the GraphCodeBERT approach, which is suited to simple homogeneous graphs such as data flow graphs, does not adequately capture the information in the heterogeneous contract semantic graph $G = (V, E)$ generated earlier. In our graph embedding process, we choose HGT [33] as the graph structure embedding method to extract the structural information of the heterogeneous contract semantic graph. The resulting node feature sequence replaces the variable sequence used in the GraphCodeBERT method.
The purpose of HGT is to consolidate information from source nodes to obtain the contextual representation of target nodes. This procedure can be segmented into three distinct phases:
The first part is the calculation of heterogeneous mutual attention. This calculation begins by examining the meta relations between a target node $t$ and each of its source nodes $s$. These relations are described by the triplet $\langle \tau(s), \phi(e), \tau(t) \rangle$, representing the source node type, the edge type, and the target node type, respectively. To accommodate the diverse and complex nature of these relations in a heterogeneous graph, the model converts the target node $t$ into a query vector and each source node $s$ into a key vector. Unlike the standard Transformer, which uses a direct inner product for this calculation, HGT utilizes distinct attention matrices tailored to each edge type $\phi(e)$, ensuring that the nuances of different semantic associations are captured effectively.
The second part is the heterogeneous message-passing process. Message passing from the source nodes to the target node proceeds in parallel with the calculation of mutual attention. The goal is to merge the meta relations of the various edges into the message-passing process, which helps to balance the distribution disparities among different types of nodes and edges.
The third part is target-specific aggregation. It uses the attention vector as the weight to aggregate the corresponding messages from the source nodes and obtain an updated vector. The updated vector is linearly mapped and added to the representation of $t$ from the previous layer as a residual connection. In this way, the output of the target node $t$ in the $l$-th layer of HGT is obtained. Stacking $L$ layers yields a rich contextual representation for each node, which serves as the input to the downstream task.
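A minimal sketch of this three-phase update using the HGTConv layer from PyTorch Geometric on a toy heterogeneous graph follows; the node and edge type names mirror our contract semantic graph, while the use of PyTorch Geometric, the feature sizes, and the toy edges are illustrative assumptions rather than our exact implementation.

```python
import torch
from torch_geometric.data import HeteroData
from torch_geometric.nn import HGTConv

# Toy heterogeneous graph with variable and function-call nodes.
data = HeteroData()
data["variable"].x = torch.randn(4, 32)   # 4 variable nodes, 32-dim features
data["call"].x = torch.randn(2, 32)       # 2 function-call nodes
data["variable", "data_flow", "call"].edge_index = torch.tensor([[0, 1], [0, 1]])
data["call", "control_flow", "call"].edge_index = torch.tensor([[0], [1]])

# One HGT layer: type-aware attention, message passing, and aggregation.
conv = HGTConv(in_channels=32, out_channels=64,
               metadata=data.metadata(), heads=4)
out = conv(data.x_dict, data.edge_index_dict)   # dict of updated node embeddings
```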
In order to handle the dynamic nature of the graph, relative temporal encoding is introduced. Traditionally, temporal information has been integrated by constructing a separate graph for each time slot, a method that can lose structural dependencies across different time slots. Moreover, the representation of a node at time $t$ might rely on edges from various other time slots. Thus, the appropriate way to model a dynamic graph is to preserve all edges occurring at various times and permit interactions between nodes and edges with different timings [32]. Specifically, given a source node $s$ and a target node $t$ with an edge $e$ between them, the edge feature includes the temporal information order, which is used as the index for the relative temporal encoding. This encoding is computed with sine and cosine functions to capture the relative temporal dependencies.
In Equations (1) and (2), order is the temporal information carried by the edge in the heterogeneous graph, $i$ is the position index, and $d$ is the dimension of the feature vector in HGT. $Base(order)$ denotes the basic relative temporal encoding calculated by the sine and cosine functions. In Equation (3), a linear projection of the basic relative temporal encoding yields the final relative temporal encoding $RTE(order)$, which is added to the node representation. This allows the resulting node representations to capture the relative temporal information between the source node $s$ and the target node $t$.
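A small sketch of this relative temporal encoding follows: a sinusoidal base encoding indexed by the edge's order, then a linear projection, with the result added to the node representation. The exact frequency schedule shown follows the standard Transformer positional encoding and, together with the module and tensor shapes, is an assumption of ours.

```python
import torch
import torch.nn as nn

class RelativeTemporalEncoding(nn.Module):
    """Sinusoidal encoding of an edge's temporal order, then a linear projection."""

    def __init__(self, dim: int, max_order: int = 512):
        super().__init__()
        position = torch.arange(max_order).unsqueeze(1).float()        # (max_order, 1)
        div = torch.exp(torch.arange(0, dim, 2).float()
                        * (-torch.log(torch.tensor(10000.0)) / dim))   # (dim/2,)
        base = torch.zeros(max_order, dim)
        base[:, 0::2] = torch.sin(position * div)   # sine terms (Equation (1))
        base[:, 1::2] = torch.cos(position * div)   # cosine terms (Equation (2))
        self.register_buffer("base", base)
        self.proj = nn.Linear(dim, dim)             # projection to RTE(order) (Equation (3))

    def forward(self, order: torch.Tensor) -> torch.Tensor:
        return self.proj(self.base[order])

# RTE(order) is added to the source-node representation before message passing.
rte = RelativeTemporalEncoding(dim=64)
h_source = torch.randn(3, 64)                       # three source nodes
order = torch.tensor([0, 1, 2])                     # temporal order on the edges
h_source = h_source + rte(order)
```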
Secondly, the model architecture and training process are as follows. As shown in Figure 5, the sequence $I$ obtained in the data preparation stage is fed to the Join layer, where it is converted into an input vector $X$. Subsequently, the input vector $X$ proceeds through the multi-head attention layer, undergoes layer normalization, and passes through multiple Transformer layers to produce the contextual representations $H^n$. Equations (4) and (5) represent the training process of the model. In Equation (4), $H$ and $X$ are vectors, $\mathrm{MultiAttn}$ denotes the multi-head self-attention operation, and $\mathrm{LN}$ denotes the layer normalization operation. In Equation (5), $\mathrm{FFN}$ represents a two-layer feed-forward network; every Transformer layer shares this identical structure. As shown in Equations (4) and (5), the output $H^{n-1}$ of the previous layer first undergoes the multi-head self-attention operation. The output of the self-attention operation is not passed directly to the next stage; it is first added to $H^{n-1}$ as a residual connection and normalized to obtain the vector $G^n$. The vector $G^n$ then passes through the feed-forward layer, which consists of two linear transformations with an activation function in between, and again undergoes a residual connection and layer normalization to produce the output $H^n$. This process helps the model avoid potential gradient vanishing problems in deep networks while retaining the information from each layer's input.
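A minimal sketch of the per-layer computation described by Equations (4) and (5) follows: multi-head self-attention with a residual connection and layer normalization, then a two-layer feed-forward network with a second residual connection and normalization. The hidden sizes and the use of GELU are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12, ffn_dim: int = 3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(),
                                 nn.Linear(ffn_dim, dim))
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, h_prev: torch.Tensor, attn_mask=None) -> torch.Tensor:
        # Equation (4): G^n = LN(MultiAttn(H^{n-1}) + H^{n-1})
        attn_out, _ = self.attn(h_prev, h_prev, h_prev, attn_mask=attn_mask)
        g = self.ln1(attn_out + h_prev)
        # Equation (5): H^n = LN(FFN(G^n) + G^n)
        return self.ln2(self.ffn(g) + g)

x = torch.randn(1, 16, 768)        # (batch, sequence length, hidden size)
h = TransformerLayer()(x)
```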
For the output $\hat{G}^n$ of the multi-head self-attention in the $n$-th Transformer layer, the calculation process is shown in Equations (6) to (9). In Equation (6), $H^{n-1}$ is the output of the preceding Transformer layer, $X$ and $W$ are vectors, and $Q$, $K$, and $V$ are the query, key, and value triplets: $H^{n-1}$ is linearly projected onto $Q_i$, $K_i$, and $V_i$ using the model parameters $W_i^Q$, $W_i^K$, and $W_i^V$. To incorporate the graph structure within the Transformer and capture the dependencies between graph nodes, we adopt the approach of GraphCodeBERT [28], utilizing graph-guided masked attention to model the interactions among tokens. Graph-guided masked attention is implemented through the mask matrix $M$. In Equation (7), $head_i$ is the $i$-th head of the multi-head attention, $d_k$ is the dimension of a head, and $M \in \mathbb{R}^{|I| \times |I|}$ is the mask matrix: if the $i$-th token and the $j$-th token are associated, then $M_{ij} = 0$; otherwise, $M_{ij} = -\infty$. The calculation of $M$ is shown in Equation (8), where $[CLS]$ is the special mark in front of the sets, $[SEP]$ is the separator, $C$ is the set of tokens in the source code, $E$ is the set of edges $\langle v_i, v_j \rangle$ in the heterogeneous contract semantic graph $G$, representing the control flow and data flow relationships between nodes, and $E'$ is a set indicating the association between the smart contract source code tokens and the nodes in the heterogeneous contract semantic graph. Attention computation is permitted between nodes $v_i$ and $v_j$ when they are directly connected (i.e., $\langle v_i, v_j \rangle \in E$) or are the same node (i.e., $i = j$). To represent the relationship between the source code tokens and the graph nodes, we define the set $E'$: if node $v_i$ is derived from token $c_j$ in the source code (i.e., $\langle v_i, c_j \rangle \in E'$), they are allowed to attend to each other; in all other cases, the attention is masked by assigning an attention score of $-\infty$, which becomes 0 after the Softmax calculation in Equation (7). In Equation (9), $u$ is the number of heads in the multi-head attention, and $W^O$ is a model parameter.
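The sketch below constructs such a graph-guided attention mask for a toy sequence. The layout (code tokens first, then graph-node tokens) and the helper names are our own assumptions based on the description above; the mask is added to the attention scores before the Softmax.

```python
import torch

NEG_INF = float("-inf")

def build_attention_mask(n_code: int, n_nodes: int,
                         node_edges: list[tuple[int, int]],
                         node_to_token: list[tuple[int, int]]) -> torch.Tensor:
    """Mask M for a sequence of n_code code tokens followed by n_nodes graph nodes.

    node_edges:    pairs (i, j) of node indices connected in the contract graph (set E).
    node_to_token: pairs (i, k) linking node i to the code token k it derives from (set E').
    """
    n = n_code + n_nodes
    mask = torch.zeros(n, n)

    # Start with node rows/columns fully masked; code tokens attend to code tokens.
    mask[:, n_code:] = NEG_INF
    mask[n_code:, :] = NEG_INF

    # A node always attends to itself (i = j).
    for i in range(n_nodes):
        mask[n_code + i, n_code + i] = 0.0

    # Nodes connected by a control-flow or data-flow edge may attend to each other.
    for i, j in node_edges:
        mask[n_code + i, n_code + j] = 0.0
        mask[n_code + j, n_code + i] = 0.0

    # A node and the code token it derives from may attend to each other.
    for i, k in node_to_token:
        mask[n_code + i, k] = 0.0
        mask[k, n_code + i] = 0.0

    return mask  # added to the attention scores before the Softmax

M = build_attention_mask(n_code=6, n_nodes=3,
                         node_edges=[(0, 1), (1, 2)],
                         node_to_token=[(0, 2), (1, 4), (2, 5)])
```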
After the final layer of the Transformer model, layer normalization is employed for regularization. Subsequently, the output $y$, which represents the probability of the contract containing a vulnerability, is derived using a linear layer followed by a Sigmoid [35] function, as shown in Equation (10). A loss function is then formulated to measure the discrepancy between the output $y$ and the target value, where the target is set to 1 if the smart contract exhibits the specific vulnerability and 0 otherwise. Finally, the backpropagation algorithm is used to train the network.
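A short sketch of this classification head and loss follows, assuming the representation at the [CLS] position of the final hidden states summarizes the contract; that pooling choice and the dimensions are our assumptions.

```python
import torch
import torch.nn as nn

hidden_size = 768
classifier = nn.Sequential(nn.LayerNorm(hidden_size),
                           nn.Linear(hidden_size, 1),
                           nn.Sigmoid())              # Equation (10): linear layer + Sigmoid
loss_fn = nn.BCELoss()

h_final = torch.randn(8, 20, hidden_size)             # final-layer states: (batch, seq, hidden)
y = classifier(h_final[:, 0, :]).squeeze(-1)           # use the [CLS] position as the summary
target = torch.tensor([1., 0., 1., 0., 0., 1., 0., 1.])  # 1 = contains the vulnerability
loss = loss_fn(y, target)
loss.backward()                                        # backpropagation step
```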