1. Introduction
Ransomware is a type of malicious software that encrypts or locks users’ files or systems and demands a high ransom for their release. If the ransom is not paid in time, the ransomware may leak the user’s private information. Modern ransomware emerged around 2005 and quickly became a viable business strategy for attackers [1,2]. A notorious example is WannaCry, which spread rapidly across computer networks in May 2017 [3,4]; within a few days, it had infected over 200,000 computers in 150 countries [5]. In recent years, ransomware has been growing exponentially, expanding its reach across platforms and continuously evolving. This escalation has made ransomware detection crucial not only for protecting individuals’ private information but also for the overall governance of cyberspace security, making it a primary focus of current information security risk detection research [6,7].
Ransomware can be classified into three main types: crypto, locker, and scareware [8,9], as shown in Figure 1. Crypto ransomware extorts users by encrypting important files stored on their hard drives, making them inaccessible unless a ransom is paid for the decryption key. Locker ransomware locks down the entire operating system, preventing users from accessing their desktop or personal files, and demands a fine to unlock the computer; unlike crypto ransomware, it usually does not encrypt files but severely limits access to the machine. Scareware falsely informs users that their system has security issues or junk files that need cleaning, often using pop-up windows or warnings to intimidate them into paying for the “full version” of the software to fix these non-existent problems. The evolution of ransomware often involves hackers modifying existing code to develop new variants, which, while differing in code structure, share similar behavioral goals. According to recent studies, over 98% of ransomware stems from established families, often incorporating or adapting functions from their predecessors [10]. Hackers may also intersperse normal behaviors within ransomware code to evade detection [11]. The most commonly used ransomware detection methods are static detection and dynamic detection [12]. Static detection extracts and matches features of malicious programs in binary form without executing them. Dynamic detection builds a simulated running environment for specified ransomware samples and records their behavior. Despite their extensive use, these techniques suffer from high rates of false positives and false negatives [13], primarily because they fail to recognize patterns involving non-continuous global word co-occurrences, which are critical for identifying ransomware.
Recent advancements underscore the importance of innovative technologies in combating the evolving threats posed by ransomware. Apruzzese et al. [14] investigated the use of machine learning and graph neural networks in network security, conducting an in-depth analysis of these technologies in intrusion, malware, and spam detection. Their research evaluated the current effectiveness and maturity of these solutions, which are crucial in addressing ransomware’s rapid evolution and complex evasion tactics. In particular, graph neural networks have become a powerful tool for text representation learning, significantly enhancing the understanding of semantic relationships in malicious software code [15].
However, while CNNs [16] and RNNs [17] are adept at extracting features from continuous word sequences, they often overlook global word co-occurrence and the associations between key node steps. Global word co-occurrence information refers to relationships between words that appear discontinuously in the text. Such relationships carry long-distance semantic information that is crucial for ransomware recognition: for example, if the words “encryption” and “ransom” both appear in a text, even non-contiguously, this can be judged a feature of ransomware. Consequently, relying solely on local adjacency information may render these methods ineffective, as they struggle to integrate crucial global context.
To address this challenge, this paper introduces ADC-TextGCN, an innovative ransomware detection model that leverages co-occurrence information and adaptive diffusion learning through a Text Graph Convolutional Network [18]. ADC-TextGCN is an adaptive graph neural network representation learning method based on key ransom text: the text is represented as a graph of word token nodes, where each node represents a word and each edge represents the relationship between two words. ADC-TextGCN uses graph convolution operations to extract global word co-occurrence and association relationships from the word graph for the precise learning and recognition of ransomware. To achieve the adaptive representation and learning of the different text relationships of ransomware, it is essential to effectively aggregate multi-hop node information: only when node representations carry sufficient global relational knowledge does the learning process become effective and the training results valuable.
Specifically, ADC-TextGCN uses a co-occurrence-information-retaining variant of Pointwise Mutual Information (COIR-PMI) to calculate the edge weights between word nodes, preserving sequential and co-occurrence information as much as possible and improving the performance of ADC-TextGCN. In the model’s node-information learning phase, TextGCN typically captures data only from a node and its immediate neighbors. To enrich node representations with broader contextual information during multi-layer propagation, we introduce an Adaptive Diffusion Convolution strategy into the TextGCN framework. This innovation enables the automatic identification of the most informative neighborhood size for each dataset, facilitating the incorporation of information from more distant nodes across layers. Such an approach allows each dataset to have a tailored propagation neighborhood size, consistent across all layers of the graph neural network and feature channels, significantly enhancing the model’s ability to aggregate complex ransomware behavior patterns and improve its overall predictive performance.
To verify the effectiveness and superiority of our method, we conducted experiments on public datasets. We compared the ADC-TextGCN model with several baseline methods, including traditional methods based on CNNs, RNNs, and TextGCN. Experimental results show that our method achieves the best performance on all datasets, reaching a ransomware detection accuracy of over 96.6% and outperforming the benchmark methods. This demonstrates that our method can effectively learn co-occurrence information and improve ransomware detection.
The structure of this paper is as follows. In Section 2, we introduce the related research published in recent years. Section 3 details the proposed method. The experimental setup and results are discussed in Section 4. Finally, Section 5 provides a comprehensive summary and discussion of the findings.
3. Methodology
The main idea of ADC-TextGCN is to learn a dedicated propagation neighborhood for each GNN layer and feature channel so that the GNN architecture is fully coupled with the graph structure, a characteristic of GNNs that differs from traditional neural networks. ADC-TextGCN formalizes this task as a bilevel optimization problem, allowing the customized learning of an optimal propagation neighborhood size for each dataset. A bilevel optimization problem is a special type of optimization problem in which one problem (the upper-level problem) is constrained by the optimal solution of another (the lower-level problem); the two problems interact, forming a hierarchical structure. In the basic setting, during message passing on each graph, all GNN layers and feature channels (dimensions) share the same neighborhood size. Going further in this direction, ADC-TextGCN also allows the automatic learning of custom neighborhood sizes for each GNN layer and each feature channel from the data.
The processing flow of ADC-TextGCN is based on TextGCN, as shown in Figure 2. First, we run both ransomware and benign software in a controlled environment called a sandbox. Then, we extract API call sequences from the behavior logs of these programs. Leveraging both static and dynamic analysis methodologies, a text graph of the API call sequence is constructed and fed into the ADC-TextGCN network. The Adaptive Diffusion Convolution strategy is employed across each layer and feature channel within the ADC-TextGCN network, facilitating the adaptive acquisition of the optimal neighborhood size. Ultimately, the outcomes are produced via a classifier model.
Figure 3 illustrates the structure of the ADC-TextGCN and demonstrates the use of neighborhood radius (circles on the right) for semantic information integration for classification purposes. The concept of neighborhood radius is employed to describe the local connectivity patterns between nodes, signifying the co-occurrence strength of different words within the documents and their contribution to classification. In the diagram on the right, each word node is encompassed by a neighborhood radius that visualizes the density of its relationships with other words, as well as its impact on the document classes ‘Crypto’, ‘Locker’, and ‘Scareware’.
The classifier model used in conjunction with the ADC-TextGCN architecture is designed to utilize the nuanced feature representations extracted by the Adaptive Diffusion Convolution strategy. It consists of multiple layers: an input layer taking the vectorized API call sequences, one or more hidden layers applying adaptive diffusion processes, and an output layer that classifies sequences as ransomware or benign software. Each layer is configured with specific hyperparameters, such as the number of neurons and the type of activation function, to optimize detection performance. Hyperparameters are selected through extensive experimental testing, aiming to balance detection accuracy against computational efficiency.
The training process of the model is summarized in Algorithm 1.
Algorithm 1: Training process of the ADC-TextGCN
Input: Training set $D$
Output: Trained ADC-TextGCN network
1: Dataset organization: Organize the training set $D$.
2: Obtain API call sequences: Extract API call sequences based on the training set $D$.
3: Analyze the five key steps of a ransomware attack: By analyzing the five key steps of a ransomware attack, derive the set of sensitive API functions $S$.
4: Construct the text graph: Build a text graph based on the API call sequences. For the nodes of the text graph, construct a node weight measurement formula based on the set of sensitive API functions $S$, and amplify features related to sensitive APIs.
5: Input the constructed text graph into the TextGCN network, integrate the Adaptive Diffusion Convolution strategy, and use the neighborhood radius formula to dynamically adjust the diffusion process:
$r = \sum_{s=0}^{\infty} \theta_s \, s \big/ \sum_{s=0}^{\infty} \theta_s$,
where $\theta_s$ represents the influence received from nodes at an s-step distance, allowing the model to adjust its information propagation range based on the node’s connectivity density and structural information.
6: Use the classifier model to classify the output of the ADC-TextGCN network and generate results.
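The steps above can be sketched as a minimal skeleton. Everything here (the API names, the sensitive-function set, and the graph representation) is an illustrative stand-in, not the authors' implementation:

```python
# Illustrative skeleton of Algorithm 1; all function bodies and
# API names below are hypothetical stand-ins for the paper's pipeline.

def extract_api_sequences(dataset):
    # Step 2: in the paper these come from Cuckoo Sandbox behavior logs.
    return [sample["api_calls"] for sample in dataset]

def derive_sensitive_apis():
    # Step 3: in the paper, derived from the five key steps of a
    # ransomware attack; the concrete set here is made up.
    return {"CryptEncrypt", "DeleteShadowCopies", "WriteFile"}

def build_text_graph(sequences, sensitive_apis, alpha=35.0):
    # Step 4: count adjacent-call edges, then amplify any edge that
    # touches a sensitive API node by the weight growth factor alpha.
    counts = {}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] = counts.get((a, b), 0.0) + 1.0
    edges = {e: (w * alpha if (e[0] in sensitive_apis or e[1] in sensitive_apis)
                 else w)
             for e, w in counts.items()}
    nodes = sorted({api for seq in sequences for api in seq})
    return nodes, edges

dataset = [{"api_calls": ["OpenFile", "CryptEncrypt", "WriteFile"]},
           {"api_calls": ["OpenFile", "ReadFile", "CloseFile"]}]
sequences = extract_api_sequences(dataset)
graph = build_text_graph(sequences, derive_sensitive_apis())
```

Steps 5 and 6 (adaptive diffusion and classification) are covered in Sections 3.2 and 3.3.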
3.1. Extract API Call Sequences and Sensitive Functions
In the field of ransomware detection, the extraction of API call sequences is a key step, and there are mainly two methods: static analysis and dynamic analysis. The extraction of API call sequences involves obtaining the names and order of APIs called by a program from its code or execution process, which can reflect the program’s functionality and behavior, making it suitable for program analysis tasks such as malware detection, code cloning detection, and code search.
The static analysis method extracts API call sequences without executing the program, analyzing program code at the lexical, syntactic, and semantic levels. However, because it cannot handle dynamically loaded APIs or parse complex control and data flows, it may generate false positives or false negatives.
The dynamic analysis method extracts API call sequences by monitoring, recording, and analyzing the execution process of the program. It can accurately obtain the APIs actually called by the program, is unaffected by code structure, and can handle dynamically loaded APIs. However, it covers only a portion of program branches and paths and is influenced by the input data.
We track the execution process of the program and obtain the API call sequence by running the sample executable file in the Cuckoo Sandbox, as shown in Figure 4. In addition, sandbox technology can be used for online virus analysis and malware behavior detection [40].
Based on the API-service set, we propose a hybrid API weight evaluation method and map it to graph nodes. We regard sensitive API call functions as sensitive nodes. If node i is a sensitive node, then the weight between node i and each of its neighbor nodes j is increased: the original weight between i and j is multiplied by α, a weight growth factor. If neither node i nor node j is sensitive, the original weight between them is retained. The value of α is determined by a grid search.
An example is shown in Figure 5, where the dark-red node represents a sensitive API call function (node 2) and the light-blue nodes represent general API call functions (nodes 1, 3, and 4). The left graph is the original call-sequence text graph. Since node 3 is a neighbor of the sensitive node 2, the weight between nodes 2 and 3 is multiplied by the weight growth factor α. Nodes 1 and 4, by contrast, are both non-sensitive, so the weight between them remains the original A14.
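The amplification rule can be sketched on a toy adjacency matrix loosely following the Figure 5 example (the topology and the α value are illustrative; α = 35 matches the value reported in the α-factor experiment later in the paper):

```python
import numpy as np

# Toy 4-node call-sequence graph; node 2 of the figure (index 1 here)
# is the sensitive API node. Edges incident to a sensitive node are
# multiplied by alpha; edges between non-sensitive nodes are kept.
alpha = 35.0
A = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])
sensitive = {1}  # 0-indexed

A_weighted = A.copy()
for i in sensitive:
    A_weighted[i, :] *= alpha   # amplify outgoing entries of node i
    A_weighted[:, i] *= alpha   # and the symmetric incoming entries
```

After this pass, the (2, 3) edge of the figure carries weight 35 while the (1, 4) edge keeps its original weight of 1.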
Building on this foundation, to capture the relationships between non-adjacent API calls in the sequence, we introduce the Adaptive Diffusion Convolution mechanism. This mechanism evolves from Graph Diffusion Convolution, replacing the manually tuned neighborhood size used for feature aggregation with an adaptively learned one. Adaptive Diffusion Convolution can automatically learn the neighborhood radius of each graph neural network (GNN) layer and each feature dimension. Its adaptive mechanism allows each node to have a neighborhood radius suited to itself, which can be integrated into the learning of ransomware co-occurrence information.
For a sensitive node with a larger weight, a node that is highly similar to it and shares co-occurrence information may not be among its immediate neighbors. The Adaptive Diffusion Convolution method can adaptively enlarge the node’s neighborhood radius until it includes that node. By aggregating more co-occurrence information from such nodes, the representation of the center node is enriched, potentially leading to improved learning results.
3.2. Constructing Graph Network by Combining API Sequences
A graph is constructed from the API call sequence text, incorporating word nodes, document nodes, and weighted edges. This structure aims to preserve as much word order and co-occurrence information as possible. The total number of nodes |V| in the text graph is the sum of the number of documents (ransomware and benign) and the number of unique API call functions. The API call sequence text graph, as implemented in ADC-TextGCN, is a large heterogeneous text graph, enabling it to explicitly model valuable feature information. Within this graph, API call functions serve as word nodes, while ransomware and benign software samples are represented as document nodes. Edges between nodes are established based on the occurrence of words within documents and the co-occurrence of words within the API call sequences.
Term Frequency-Inverse Document Frequency (TF-IDF) is a widely used weighting technique in information retrieval and data mining, which assesses a word’s relevance to a document within a corpus. TF-IDF consists of two parts: TF and IDF.
TF (Term Frequency): This represents the frequency of a term in a text. This number is usually normalized (the term count divided by the total number of words in the document) to prevent a bias toward longer documents (the same word may have a higher raw count in a long document than in a short one, regardless of its importance). The calculation formula for TF is
$\mathrm{tf}_{i,j} = \dfrac{n_{i,j}}{\sum_{k} n_{k,j}}$,
where $n_{i,j}$ denotes the number of occurrences of term $t_i$ in document $d_j$, and $\mathrm{tf}_{i,j}$ represents the frequency of term $t_i$ in document $d_j$.
Inverse Document Frequency (IDF): This metric represents the general significance of a keyword. If fewer documents contain the term, the IDF will be larger, indicating that the term has good category differentiation ability. The IDF of a specific word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the quotient. The calculation formula for IDF is
$\mathrm{idf}_i = \log \dfrac{|D|}{|\{\, j : t_i \in d_j \,\}| + 1}$,
where the denominator is increased by 1 to avoid division by zero, $|D|$ represents the total number of documents, and $|\{\, j : t_i \in d_j \,\}|$ represents the number of documents containing term $t_i$. Finally, the calculation formula for TF-IDF is
$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$.
In this way, TF-IDF tends to filter out common words and retain important ones, effectively providing the weight between document nodes and word nodes. The importance of a word increases in proportion to the number of times it appears in a document but decreases with its frequency across the corpus; multiplying TF and IDF yields the importance weight of a word in a document (common weighting variants include TF × IDF and log(TF + 1) × IDF). To utilize global word co-occurrence information, we slide a fixed-size window over all documents of API call functions to collect global word co-occurrence and word order information. By adopting the COIR-PMI theory of co-occurrence information retention, which modifies the co-occurrence rules of PMI to capture word order, the weight between two word nodes can be calculated.
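The TF-IDF formulas above can be written directly as code. The API-call tokens below are made up for illustration; note the +1 smoothing in the IDF denominator, matching the formula in the text:

```python
import math

# Minimal TF-IDF for document->word edge weights, following the
# formulas above. The three toy "documents" are lists of API calls.
docs = [["NtCreateFile", "CryptEncrypt", "NtCreateFile"],
        ["NtCreateFile", "RegOpenKey"],
        ["RegOpenKey", "RegSetValue"]]

def tf(term, doc):
    # term count normalized by document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (containing + 1))  # +1 avoids division by zero

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)
```

With the +1 smoothing, a term appearing in every document (after the +1) can even receive a slightly negative IDF, which is a known side effect of this variant.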
While building the API call sequence text graph, numerous API functions are invoked. A sliding window of size N scans each executable file’s API call sequence, forming N-length fragments. For instance, consider an executable program that sequentially calls 12 functions, as demonstrated in Table 1. The fragments for this API call sequence when N = 5 are shown in Figure 6.
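The windowing step can be sketched as follows; the 12 numeric call IDs stand in for the actual sequence of Table 1:

```python
# Sliding-window fragments over an API call sequence, as in Figure 6.
def sliding_windows(sequence, n):
    # every contiguous fragment of length n
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

calls = [47, 48, 33, 49, 50, 12, 7, 33, 48, 47, 9, 50]  # hypothetical IDs
windows = sliding_windows(calls, 5)
```

A 12-call sequence with N = 5 yields 12 − 5 + 1 = 8 fragments.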
In the COIR-PMI method, we have modified the PMI statistical rule to compute the weight of the edge between two word nodes. This algorithm uses PMI to measure the correlation between two variables, preserving both word order and co-occurrence information. We employ a sliding-window approach to extract valid sequence relationships for API call functions. The TF-IDF algorithm is utilized to retain API call interfaces that are pertinent to classification.
Calculating COIR-PMI: The PMI value for word pair (i, j) is computed as follows:
$\mathrm{PMI}(i,j) = \log \dfrac{p(i,j)}{p(i)\,p(j)}$, with $p(i,j) = \dfrac{\#W(i,j)}{\#W}$ and $p(i) = \dfrac{\#W(i)}{\#W}$.
In these equations, $\#W(i)$ represents the count of sliding windows in the corpus that contain word i, $\#W(i,j)$ is the count of sliding windows that contain words i and j in order, and $\#W$ is the total count of sliding windows in the API call sequences.
For instance, consider the calculation of the co-occurrence of 47 and 48 in Table 1. The count of co-occurrences of 47 and 48 in the window (47, 48, 33, 49, 50) is 1, but the count in the window (48, 47, 33, 49, 50) is 0, because the order is reversed. Therefore, $\#W(47,48)$ is not equal to $\#W(48,47)$.
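The order-sensitive counting can be sketched as follows (the toy windows are made up, with the first one matching the example above):

```python
import math

# Order-preserving window counts used by COIR-PMI: #W(i, j) counts
# windows in which i appears before j, so #W(i, j) != #W(j, i) in general.
def count_windows_with(windows, word):
    return sum(1 for w in windows if word in w)

def count_ordered(windows, i, j):
    return sum(1 for w in windows
               if i in w and j in w and w.index(i) < w.index(j))

def coir_pmi(windows, i, j):
    n = len(windows)
    p_ij = count_ordered(windows, i, j) / n
    p_i = count_windows_with(windows, i) / n
    p_j = count_windows_with(windows, j) / n
    if p_ij == 0:
        return float("-inf")   # pair never co-occurs in order
    return math.log(p_ij / (p_i * p_j))

windows = [(47, 48, 33, 49, 50),
           (47, 33, 49, 50, 12),
           (33, 49, 50, 12, 7)]
```

Here `count_ordered(windows, 47, 48)` is 1 while `count_ordered(windows, 48, 47)` is 0, reproducing the asymmetry described above.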
Whether COIR-PMI is positive or negative carries practical significance and represents characteristic information. To retain more comprehensive information, we add edges between word pairs regardless of the sign of their PMI values.
After constructing the text graph, we feed it into a simple two-layer GCN. The second-layer node (word/document) embeddings have the same size as the label set and are fed into a softmax classifier:
$Z = \mathrm{softmax}\big(\tilde{A}\,\mathrm{ReLU}(\tilde{A} X W_0)\, W_1\big)$,
where $\tilde{A} = D^{-1/2} A D^{-1/2}$ is the normalized symmetric adjacency matrix, $X$ is the node feature matrix, and $W_0$ and $W_1$ are the trainable weight matrices of the first and second layers.
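A minimal numpy sketch of this forward pass, using toy sizes and random (untrained) weights, only to show the shape of the computation:

```python
import numpy as np

# Toy forward pass of a two-layer GCN:
# Z = softmax(A_hat @ relu(A_hat @ X @ W0) @ W1)
rng = np.random.default_rng(0)

def normalize_adj(A):
    A = A + np.eye(len(A))               # add self-loops
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A @ D_inv_sqrt   # symmetric normalization

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

n_nodes, n_feats, hidden, n_classes = 6, 8, 4, 2
A = (rng.random((n_nodes, n_nodes)) > 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T           # random symmetric adjacency
X = rng.random((n_nodes, n_feats))
W0 = rng.standard_normal((n_feats, hidden)) * 0.1
W1 = rng.standard_normal((hidden, n_classes)) * 0.1

A_hat = normalize_adj(A)
Z = softmax(A_hat @ np.maximum(A_hat @ X @ W0, 0) @ W1)
```

Each row of `Z` is a probability distribution over the label set for one node.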
3.3. Adaptive Diffusion Convolution of ADC-TextGCN
Graph Diffusion Convolution extends the discrete information propagation process inherent in TextGCN to a diffusion process, thereby enabling the aggregation of information from larger neighborhoods. For each input graph, Graph Diffusion Convolution manually fine-tunes the optimal neighborhood size for feature aggregation by conducting a parameter grid search on the validation set. However, this approach exhibits certain limitations and sensitivities in real-world applications. To obviate the need for a manual search for the optimal propagation neighborhood in Graph Diffusion Convolution, the Adaptive Diffusion Convolution strategy was introduced into TextGCN. This strategy supports the automatic learning of the optimal neighborhood from the data. During the message-passing process in each graph, all GCN layers and feature channels (dimensions) share the same neighborhood size. This strategy empowers GCNs to adapt more flexibly to various graph structures, thereby enhancing the model’s generalization capability and performance. The neighborhood radius can serve as a metric for quantifying the propagation distance of features across each layer. The neighborhood radius r is calculated as
$r = \dfrac{\sum_{s=0}^{\infty} \theta_s \, s}{\sum_{s=0}^{\infty} \theta_s}$,
where $\theta_s$ denotes the influence of nodes at an s-step distance. A large r signifies that the model places greater emphasis on distant nodes, accentuating global information; conversely, a small r indicates a focus on local information.
For graph convolutional networks (GCNs), when the neighborhood radius r equals 1, only the range directly connected to a node is covered. To access information beyond the direct connection range, multiple GCN layers must be stacked to probe the high-order neighborhood. Models such as MixHop [41], JKNet [42], and SGC [43] strive to enhance the feature propagation function of the GCN by transitioning from a single-hop neighborhood to a multi-hop neighborhood. However, for all multi-hop models, the discrete nature of the number of hops renders the neighborhood radius r non-differentiable, so r cannot adaptively participate in the back-propagation (BP) algorithm as a parameter. Consequently, the primary challenge in implementing adaptive learning of the neighborhood radius lies in the non-differentiability of the radius.
Specifically, we introduce Adaptive Diffusion Convolution into the construction process of the API call sequence TextGCN to capture high-order text information more effectively. By employing this strategy to automatically learn the optimal neighborhood size from the data, the TextGCN is fully integrated with the graph structure and all feature channels.
The emphasis is placed on the weight coefficients generated by the heat kernel version, denoted by
$\theta_s = e^{-t} \dfrac{t^s}{s!}$.
The heat kernel incorporates prior knowledge into the TextGCN model, implying that feature propagation between nodes adheres to Newton’s cooling law [44], i.e., the speed of feature propagation between two nodes is proportional to the difference in their features. Here, t can be interpreted as the diffusion time of node i.
Upon substituting the heat kernel formula into the generalized neighborhood radius Formula (9), and following a derivation involving the exponential series, it can be deduced that the neighborhood radius r corresponds exactly to the diffusion time t:
$r = \dfrac{\sum_{s=0}^{\infty} e^{-t} \frac{t^s}{s!}\, s}{\sum_{s=0}^{\infty} e^{-t} \frac{t^s}{s!}} = e^{-t}\, t \sum_{s=1}^{\infty} \frac{t^{s-1}}{(s-1)!} = t$.
This demonstrates that t, under the heat kernel, is exactly the neighborhood radius, so t becomes a perfect continuous substitute for the discrete number of hops in multi-hop models. This strategy provides a unique neighborhood size for each layer and channel, achieving more detailed adjustments and circumventing the need for manual tuning.
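The identity r = t can be checked numerically; the sketch below truncates the infinite series at a finite number of terms, which is harmless because the heat-kernel coefficients decay factorially:

```python
import math

# Numerical check that with heat-kernel coefficients
# theta_s = e^{-t} * t^s / s!, the neighborhood radius
# r = sum_s s * theta_s / sum_s theta_s equals the diffusion time t.
def heat_kernel_radius(t, max_s=100):
    theta = math.exp(-t)          # theta_0
    num, den = 0.0, theta
    for s in range(1, max_s):
        theta *= t / s            # theta_s = theta_{s-1} * t / s
        num += s * theta
        den += theta
    return num / den
```

For any moderate t, `heat_kernel_radius(t)` returns t to machine precision.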
4. Results
This section first presents the dataset employed in the experiment. Subsequently, we analyze the detailed steps of the implementation process, followed by a comprehensive evaluation of the experimental performance. Ultimately, we use these data for training and testing to derive the experimental outcomes. Our code and dataset details can be found at www.sogithub.com/Yangggy/ABC (accessed on 18 March 2024).
4.1. Datasets
Our study utilized a comprehensive dataset consisting of 3000 ransomware samples and 2000 benign software samples carefully selected from reputable sources such as VirusShare, VirusTotal, and other well-known repositories. This selection aims to capture a wide range of ransomware behaviors and benign software patterns, ensuring the diversity and representativeness of our dataset. In addition to our main dataset, we integrated data from the Ember dataset, malware_api_class dataset, and UNSW-NB15 dataset. These sources were integrated to expose our model to a wider range of malware features, behaviors, and attack types, including those that are not strictly classified as ransomware, thereby enhancing the robustness and accuracy of our detection model.
We used a Cuckoo Sandbox for the dynamic analysis of each sample, which enables us to capture the detailed execution of API call sequences. This analysis provides a rich feature set for distinguishing ransomware from benign software. We extracted relevant features from the API call sequences, focusing on those with the highest frequency and significance in representing ransomware activities. Irrelevant information, including common stop words and uninformative API calls, was removed to increase the model’s focus on important patterns. All features were normalized to ensure scale consistency, thereby promoting more effective learning. To address the potential imbalance between ransomware and benign software samples and enhance the model’s generalization ability, we adopted the Synthetic Minority Oversampling Technique (SMOTE). This technique generates synthetic ransomware samples by interpolating between existing ones, enriching the diversity of ransomware behavior in our training set. When evaluating model performance, we focus on key indicators such as accuracy, precision, and recall to ensure that our results can effectively identify ransomware and provide practical tools for network security professionals.
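SMOTE's core interpolation step can be sketched as follows. This is a simplified stand-alone version for illustration (real experiments would use a library implementation such as imbalanced-learn's `SMOTE`); the 2-D minority samples are made up:

```python
import numpy as np

# Toy SMOTE step: synthesize a minority-class sample by interpolating
# between a random sample and one of its k nearest minority neighbors.
rng = np.random.default_rng(42)

def smote_sample(X_minority, k=2):
    i = rng.integers(len(X_minority))
    x = X_minority[i]
    dists = np.linalg.norm(X_minority - x, axis=1)
    neighbors = np.argsort(dists)[1:k + 1]   # skip the point itself
    j = rng.choice(neighbors)
    lam = rng.random()                       # interpolation factor in [0, 1)
    return x + lam * (X_minority[j] - x)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sample(X_min)
```

Because the synthetic point lies on the segment between two existing minority samples, it stays inside the minority class's convex hull.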
4.2. Implementation Details
To train the ADC-TextGCN model, our experimental environment comprises the following hardware and software configuration. The core processor is an Intel i9-13900K, ensuring an efficient and stable computation process. All model training and testing were completed on an NVIDIA RTX 4090 GPU. The development environment is based on the Windows 10 operating system and uses the PyTorch framework with CUDA Toolkit 11.2, which significantly improves computational efficiency. In addition, the overall system runs on Ubuntu 22.04.4 LTS, a Long Term Support release that receives regular updates, ensuring compatibility and stability. This configuration provides the computing power and software support needed for efficient ADC-TextGCN model training.
The ADC-TextGCN model is structured to exploit the relational information inherent in textual data through graph-based learning. At its core, the model represents documents and words as nodes in a graph, where edges reflect co-occurrence relationships and document–word associations. This representation facilitates the learning of word and document embeddings in a unified feature space. The model includes two graph convolutional layers: the first layer converts the original feature representation into intermediate embeddings, and the second layer generates embeddings for final classification. The adaptive diffusion convolutional strategy is integrated into this architecture, allowing the model to adaptively adjust the neighborhood information aggregated at each node, thereby enhancing the model’s ability to capture nuanced semantic relationships.
We set the learning rate (LR) to 0.01, determined through preliminary experiments to balance convergence speed and training stability. Two graph convolutional layers are used to capture both local and global textual features without overfitting. Intermediate embeddings in the first layer are 200-dimensional, balancing representational capacity and computational efficiency. A dropout rate of 0.5 is applied to both layers, preventing overfitting by randomly omitting a portion of the feature detectors at each training step. To limit the complexity of model weights, we used weight decay as a regularization method with a coefficient of 0.0005. The ADC-TextGCN is trained using a cross-entropy loss function, suitable for binary classification tasks like ransomware detection. Training is conducted over 300 epochs, with early stopping implemented to cease training if the validation loss does not improve for 10 consecutive epochs. This strategy ensures the efficient use of computational resources while preventing overfitting. The Adam optimizer is selected for its adaptive learning rate properties, facilitating more efficient convergence. A grid search is applied to explore the best combination of learning rate, dropout rate, and weight decay, ensuring optimal model performance. Our model settings are summarized in Table 2.
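The early-stopping rule described above can be sketched as a simplified stand-alone function (not the actual training code; the loss curve below is made up):

```python
# Stop when the validation loss fails to improve for `patience`
# consecutive epochs; otherwise run to completion.
def early_stop_epoch(val_losses, patience=10):
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch               # stop here
    return len(val_losses) - 1         # ran to completion

losses = [1.0, 0.8, 0.7] + [0.71] * 20   # no improvement after epoch 2
```

With a patience of 10, training on this synthetic curve stops at epoch 12, ten epochs after the last improvement.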
4.3. Performance Estimation
The confusion matrix is a universally recognized method for assessing the performance of a model. It allows for the derivation of several key indicators, including the true positive rate (TPR), false positive rate (FPR), accuracy, Receiver Operating Characteristic (ROC), and Area Under the ROC Curve (AUC). In the specific context of ransomware sample detection, the AUC serves as a comprehensive metric of the model’s detection capability, while accuracy validates the model’s ability to correctly identify both benign software and ransomware samples. For this experiment, we selected TPR, FPR, AUC, and accuracy (ACC) as our evaluation metrics. A larger AUC value signifies a more effective classifier. The ROC is a two-dimensional curve, with FPR and TPR serving as its horizontal and vertical coordinates, respectively.
Within the confusion matrix, positive refers to a ransomware sample, while negative refers to a benign software sample. True positive indicates that the predicted result aligns with the actual situation, whereas false positive indicates a discrepancy between the prediction (ransomware) and the actual situation (benign software).
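The evaluation metrics follow directly from the confusion-matrix counts; the sample counts below are made up for illustration:

```python
# Metrics from the confusion matrix: positive = ransomware,
# negative = benign, as defined above.
def metrics(tp, fp, tn, fn):
    tpr = tp / (tp + fn)               # true positive rate (recall)
    fpr = fp / (fp + tn)               # false positive rate
    acc = (tp + tn) / (tp + fp + tn + fn)
    return tpr, fpr, acc

# Hypothetical counts for a 500-sample test set (300 ransomware, 200 benign).
tpr, fpr, acc = metrics(tp=290, fp=10, tn=190, fn=10)
```

The ROC curve is traced by sweeping the decision threshold and plotting each resulting (FPR, TPR) pair; the AUC is the area under that curve.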
Table 3 summarizes the advantages and disadvantages of each model, including ADC-TextGCN, offering a detailed view of its relationship to the other methods in terms of effectiveness, complexity, and application scope.
This comparison table emphasizes the advanced capability of the ADC-TextGCN model to synthesize and learn from extensive and complex ransomware data, which is attributed to its adaptive diffusion strategy. However, it also highlights the need for substantial computing power and a complex model optimization process, which may pose challenges in certain environments.
In contrast, while other models, such as CNNs and LSTM, offer valuable advantages, such as ease of implementation and effectiveness in processing sequential data, they all have limitations that ADC-TextGCN aims to overcome, especially in dealing with long-term dependencies and global context understanding.
4.4. COIR-PMI Experiment and α Factor Experiment
The co-occurrence retention algorithm, leveraging the COIR-PMI theory delineated in
Section 3.2, was examined.
Table 4 showcases the accuracy attained by various statistical methods when N equals 7. The table illustrates that the accuracy is enhanced when the weight between two word nodes is calculated using the COIR-PMI theory to preserve co-occurrence information. This approach bolsters the precision of the classification during training.
The letters in parentheses represent the first letter of each dataset: E denotes the Ember dataset, M the malware_api_class dataset, and U the UNSW-NB15 dataset. As the following table uses the same conventions as the content discussed above, a detailed explanation is not repeated.
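For reference, the standard sliding-window PMI weighting that COIR-PMI modifies can be sketched as follows; the co-occurrence-information-retention adjustment itself is described in Section 3.2 and is not reproduced here. The window size defaults to 7, matching the N = 7 setting in Table 4, and function and variable names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edge_weights(docs, window=7):
    """Classic sliding-window PMI between word pairs, as used to weight
    word-word edges in TextGCN-style graphs. `docs` is a list of token
    lists. Returns positive-PMI weights keyed by unordered word pairs."""
    word_count = Counter()   # windows containing each word
    pair_count = Counter()   # windows containing each word pair
    n_windows = 0
    for doc in docs:
        for i in range(max(1, len(doc) - window + 1)):
            win = set(doc[i:i + window])
            n_windows += 1
            word_count.update(win)
            pair_count.update(frozenset(p) for p in combinations(sorted(win), 2))
    weights = {}
    for pair, c in pair_count.items():
        a, b = tuple(pair)
        # PMI = log( p(a,b) / (p(a) p(b)) ) with window-based probabilities
        pmi = math.log(c * n_windows / (word_count[a] * word_count[b]))
        if pmi > 0:          # keep only positively associated pairs
            weights[pair] = pmi
    return weights
```

Pairs that merely co-occur at chance level receive PMI ≤ 0 and are dropped, so only genuinely associated API-call tokens contribute edges.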
We investigated the influence of the α factor on the comprehensive performance of ADC-TextGCN. As depicted in
Figure 7, our findings reveal that the accuracy initially escalates and subsequently declines with an increase in α, attaining a peak when α is set to 35.
4.5. Ablation Experiments
As depicted in
Figure 8, we conducted a comparative analysis of various ransomware detection models, including the CNN-LSTM [
45] hybrid, TextCNN [
46], TextGCN [
18], and LSTM [
47] models. Initially, the CNN-LSTM hybrid model exhibits high accuracy, but its advantage diminishes as the number of API call sequences increases. As the length of the API call sequences grows, more sensitive API call functions emerge, thereby enhancing the model's detection accuracy. This trend persists until all sensitive API call functions are accounted for, at which point the network accuracy peaks at 0.966.
Figure 8 illustrates that longer sequence lengths yield more precise results. Although the positive impact of increasing the sequence length eventually tapers off, ADC-TextGCN still outperforms the LSTM, CNN-LSTM hybrid, TextGCN, and TextCNN models in terms of accuracy.
The choice of these models for comparison is motivated by their status as classic models in the field of deep learning. Comparing ADC-TextGCN against these classic models provides a clear reference point for readers to understand the performance of our model. The comparison remains fair and meaningful, as all models were evaluated under the same conditions, even if the comparison models are not the latest. Furthermore, the choice of these widely available and easily understood models enhances the reproducibility of our experiments.
The outcomes of the ablation experiment are delineated in
Table 5. The results show a notable enhancement in accuracy with the combined application of the α factor and COIR-PMI, resulting in an accuracy increase of approximately 6.8% for dataset E compared to the base model without any enhancements. Furthermore, the addition of the diffusion radius
t leads to further incremental improvements, achieving an impressive accuracy rate of 96.6% on dataset E. Similar trends are observed on datasets M and U, with accuracy improving to 95.3% and 92.7% respectively when all factors are applied. These results underscore the comprehensive effectiveness of the ADC-TextGCN, which significantly outperforms the baseline and other existing methods in accuracy.
Figure 9 shows the ablation studies under various diffusion settings. Graph Diffusion Convolution can be understood as fixing the heat kernel parameter 't' and applying sparsification. The findings suggest that the efficacy of Graph Diffusion Convolution is partly due to its sparsification of the propagation matrix. Notably, when sparsification is removed, training the diffusion time parameter 't' for each feature channel and layer markedly outperforms merely fixing 't' at its initial value. By comparing the improvements brought about by training 't' at different levels,
Figure 9 also demonstrates that training ‘t’ for each channel and layer makes the most contribution to the performance improvements in the node classification task.
We divided the ablation experiments into four distinct scenarios, of which scenario (1) can be viewed as Graph Diffusion Convolution without sparsification. Results from these experiments show that learning the diffusion time parameter directly from the data significantly enhances performance, especially when sparsification is removed.
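The heat-kernel view discussed above can be made concrete: Graph Diffusion Convolution replaces one-hop propagation with the matrix S = sum_k e^{-t} t^k / k! * T^k, and the adaptive variant learns the diffusion time t per channel and layer rather than fixing it. The sketch below illustrates the fixed-'t' case; the truncation order K and the form of the propagation matrix T are illustrative assumptions.

```python
import math
import numpy as np

def heat_kernel_diffusion(T, t, K=10):
    """Truncated heat-kernel diffusion: S = sum_{k=0}^{K} e^{-t} t^k / k! * T^k.

    T is a transition (propagation) matrix over graph nodes; t is the
    diffusion time that the adaptive strategy learns per channel and
    layer instead of fixing. K is an illustrative truncation order."""
    S = np.zeros_like(T, dtype=float)
    Tk = np.eye(T.shape[0])            # T^0
    for k in range(K + 1):
        theta_k = math.exp(-t) * t ** k / math.factorial(k)
        S += theta_k * Tk
        Tk = Tk @ T                    # advance to T^(k+1)
    return S
```

Because the coefficients sum to roughly 1 for moderate t and K, S is a weighted average over multi-hop neighborhoods; thresholding its small entries afterwards recovers the sparsified Graph Diffusion Convolution variant compared in the ablation.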
Figure 10 presents the accuracy comparison between the traditional TextGCN methods and those enhanced with the Graph Diffusion Convolution and Adaptive Diffusion Convolution strategies. It is evident from the figure that the Text Graph Convolutional Network utilizing the Graph Diffusion Convolution strategy significantly outperforms the traditional methods. Moreover, the ADC-TextGCN surpasses the version with only the Graph Diffusion Convolution strategy applied. These results further validate the efficacy of our research findings and underscore the advantages of the ADC-TextGCN.
To assess the impact of validation set size on model performance, a series of tests were conducted with various ratios between the training and validation sets while maintaining the same experimental conditions. The total number of nodes in the validation set remained unchanged, with only the distribution of nodes per class in the training set being adjusted. The outcomes of these tests are displayed in
Figure 11. In most tested scenarios, ADC-TextGCN consistently outperforms both Graph Diffusion Convolution and the traditional TextGCN model. This indicates that the enhanced performance of ADC-TextGCN is not attributable to training additional parameters on a large validation set, but is inherent to its architectural design and learning approach.
The experiments were conducted on three different datasets: (a) the Ember dataset, (b) the Malware API Class dataset, and (c) the UNSW-NB15 dataset.
Figure 11 is intended to ascertain whether the observed improvement results from overuse of the validation set, and the results indicate that it does not. Thus, we can conclude that the ADC-TextGCN model's superior performance is not due to overfitting on the validation set.
From the results in
Figure 11, we can observe that (1) the ADC-TextGCN model demonstrates a clear advantage over the traditional TextGCN approach in terms of prediction accuracy. For example, in the UNSW-NB15 dataset, the ADC-TextGCN achieves an accuracy approximately 4.8% higher than the traditional TextGCN model. This suggests that the ADC-TextGCN is particularly effective in learning from global co-occurrence information. (2) The ADC-TextGCN also exhibits superior performance compared to the Graph Diffusion Convolution enhanced TextGCN, as evidenced by the higher accuracy rates across all datasets. In the Ember dataset, for instance, the ADC-TextGCN shows a nearly 1.6% increase in accuracy compared to the Graph Diffusion Convolution enhanced model, further underscoring the efficiency and effectiveness of the adaptive diffusion strategy employed by the ADC-TextGCN.
The excellent performance of the ADC-TextGCN model can be attributed to several key factors. The first is the understanding of the global context. Unlike traditional CNN and RNN models that mainly focus on local text features or sequence-based information, the ADC-TextGCN model utilizes global word co-occurrence and semantic relationships between words in the entire dataset. This comprehensive perspective allows for a more detailed understanding of ransomware-related texts. Next is the Adaptive Diffusion Convolution strategy, which enables the model to dynamically adjust the diffusion process, enabling each node to effectively aggregate information from the optimal neighborhood size. This adaptability is particularly beneficial in capturing the complex patterns and behavioral characteristics of ransomware code, which may not be fully represented through fixed-size local windows or sequences. Finally, there is the efficiency of learning from structured data. Graph-based models are inherently adept at capturing irregularities and subtle structural differences in data, such as the interconnectivity of text in ransomware detection tasks. Compared to traditional models, this structural advantage facilitates more effective learning.
5. Discussion and Conclusions
By analyzing the behavioral rules of ransomware, we established that long-distance discontinuous semantic information, word order information, and co-occurrence information are essential for ransomware identification. Our ADC-TextGCN method retains this comprehensive information by applying sensitive API call sequences and the COIR-PMI. To address the limitation that traditional TextGCN learns only from a node and its immediate neighbors, and to include more cross-layer node information during multi-layer propagation training, we also incorporate an Adaptive Diffusion Convolution strategy, which automatically learns the optimal neighborhood from the ransomware data and customizes the best propagation neighborhood size. Experimental results show that our model achieves an accuracy of 0.966. In the future, to further enhance the detection accuracy of this method, we will apply an attention mechanism to improve ADC-TextGCN's performance.
Although our research findings are promising, we acknowledge certain limitations, which pave the way for future research. First, the computational requirements of the ADC-TextGCN model may limit its deployment in resource-constrained environments; further optimizing the model's architecture and exploring more efficient training algorithms could address this problem. In addition, the evolving nature of ransomware attacks requires continuous updating of the model's training data to ensure its relevance and effectiveness. Future research may also explore the integration of multimodal data sources, such as network traffic and user behavior logs, to enrich the model's learning environment and enhance its detection capabilities. Another challenge is that the model relies on a comprehensive and well-structured graph representation of textual data; where text data are sparse or lack clear semantic relationships, the model's performance may suffer.
Our research holds significant practical value for cybersecurity professionals. The ADC-TextGCN model's efficient ransomware detection offers a robust tool against network threats, enabling more precise and effective identification and prevention of potential ransomware attacks. Given the increasingly complex digital environment, frequent ransomware attacks, and continuously evolving attack methods, the significance of this advancement is particularly notable: it establishes a solid line of defense for protecting information security. We emphasize two future research directions, improving model scalability and exploring attention mechanisms, which highlight the considerable potential for further progress in this area. As network security threats continue to evolve, so must the strategies and methods used to counter them. Our research provides insights and tools for cybersecurity professionals and researchers, laying a solid foundation for addressing future security challenges.