Detecting and Analyzing Botnet Nodes via Advanced Graph Representation Learning Tools

Cuzzocrea, Alfredo; Hafsaoui, Abderraouf; Gallo, Carmine

doi:10.3390/a18050253

Detecting and Analyzing Botnet Nodes via Advanced Graph Representation Learning Tools^†

by Alfredo Cuzzocrea ^1,2,*,‡, Abderraouf Hafsaoui ¹ and Carmine Gallo ¹

¹ iDEA Lab, University of Calabria, 87036 Rende, Italy

² Department of Computer Science, University of Paris City, 75006 Paris, France

^* Author to whom correspondence should be addressed.

^† This paper is an extension of the paper: Carpenter, J.; Layne, J.; Serra, E.; Cuzzocrea, A.; Gallo, C. Structural Node Representation Learning for Detecting Botnet Nodes. In Proceedings of the 23rd International Conference on Computational Science and Its Applications, Athens, Greece, 3–6 July 2023; pp. 731–743. https://doi.org/10.1007/978-3-031-36805-9_47.

^‡ This research has been made in the context of the Excellence Chair in Big Data Management and Analytics at University of Paris City, 75006 Paris, France.

Algorithms 2025, 18(5), 253; https://doi.org/10.3390/a18050253

Submission received: 21 December 2024 / Revised: 20 March 2025 / Accepted: 24 March 2025 / Published: 26 April 2025

(This article belongs to the Special Issue Machine Learning Algorithms and Optimization in the Digital Transition (2nd Edition))

Abstract

Private consumers, small businesses, and even large enterprises are all at risk from botnets. These botnets are known for spearheading Distributed Denial-of-Service (DDoS) attacks, spamming large populations of users, and causing critical harm to major organizations. The proliferation of Internet of Things (IoT) devices has led to the use of these devices for cryptocurrency mining, in-transit data interception, and sending logs containing private data to the botnet master. Different techniques have been developed to identify these botnet activities, but only a few use Graph Neural Networks (GNNs) to analyze host activity by representing host communications as a directed graph. Although GNNs are intended to extract structural graph properties, they are prone to overfitting, which leads to failure when they are applied to a previously unseen network. In this study, we test the notion that structural graph patterns can be used for efficient botnet detection. We also present SIR-GN, a structural iterative representation learning methodology for graph nodes. Our approach is built to work well with unseen data, and our model is able to provide a vector representation for every node that captures its structural information. Finally, we demonstrate that, when the collection of node representation vectors is fed into a neural network classifier, our model outperforms the state-of-the-art GNN-based algorithms in the detection of bot nodes within unknown networks.

Keywords:

machine learning; deep learning; botnet detection; graph neural networks; structural graph representation learning; Internet of Things

1. Introduction

With the exponential growth of Internet of Things (IoT) devices in modern big data ecosystems, ranging from smart homes and industrial automation to healthcare (e.g., [1,2,3,4]), the attack surface for cybersecurity threats has expanded dramatically. Botnets, which consist of a network of compromised devices controlled by an attacker, have become a significant threat due to their ability to leverage the always-on connectivity of IoT devices. These botnets are often used to conduct large-scale cyberattacks, including Distributed Denial-of-Service (DDoS) attacks (e.g., [5,6,7]), credential theft, ransomware transmission, and cryptocurrency mining, which cause service disruptions and financial losses across various sectors.

Botnet detection is challenging because botnets constantly evolve their structures and evasion techniques. Traditional botnet detection methods, such as signature-based and rule-based approaches (e.g., [8,9]), rely on predefined patterns and are thus unable to adapt to new attack strategies. Moreover, with the increased adoption of Peer-to-Peer (P2P) communication protocols within botnets (e.g., [10,11,12]), attackers can dynamically modify network structures and distribute command-and-control (C&C) functions, making it significantly harder to detect these networks.

In the recent literature, graph-based models, particularly those utilizing Graph Neural Networks (GNNs), have shown promise in capturing the relational structures of botnets by representing network communications as graph structures (e.g., [13,14,15,16]). In such models, devices (nodes) are connected by their interactions (edges), enabling the extraction of structural patterns indicative of botnet behavior. However, GNNs are highly sensitive to the topological structure of the networks on which they are trained. This dependency makes them prone to overfitting, meaning that models trained on a specific network structure may fail to generalize effectively to other networks with different topologies. This is especially problematic in botnet detection, where botnet configurations and network environments vary widely.
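To make the graph view of network communications concrete, the following minimal sketch turns a list of flow records into a directed graph and derives simple per-host degree features. The flow list, the addresses, and the function names `build_directed_graph` and `degree_features` are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict


def build_directed_graph(flows):
    """Turn (source, destination) flow records into successor/predecessor sets."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for src, dst in flows:
        successors[src].add(dst)      # directed edge src -> dst
        predecessors[dst].add(src)
    nodes = set(successors) | set(predecessors)
    return nodes, successors, predecessors


def degree_features(nodes, successors, predecessors):
    """Return {host: (in_degree, out_degree)}, counting distinct peers only."""
    return {
        n: (len(predecessors.get(n, ())), len(successors.get(n, ())))
        for n in sorted(nodes)
    }


# Hypothetical flow records; the addresses are illustrative only.
flows = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.2", "10.0.0.3"),
    ("10.0.0.3", "10.0.0.1"),
    ("10.0.0.1", "10.0.0.2"),  # repeated flow, collapses into one edge
]
nodes, successors, predecessors = build_directed_graph(flows)
features = degree_features(nodes, successors, predecessors)
for host, (in_deg, out_deg) in features.items():
    print(host, "in:", in_deg, "out:", out_deg)
```

Repeated flows between the same pair of hosts collapse into a single edge here; a weighted variant could keep flow counts instead, depending on what the detector should see.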

Additionally, as cybercriminals increasingly diversify their tactics, botnets are evolving to include more heterogeneous device types and connectivity methods, further complicating detection efforts. For instance, botnets now commonly mix malicious traffic with legitimate network traffic, using encrypted communication channels in order to avoid monitoring systems. This complexity necessitates a detection model that is not only robust to topological differences but also capable of identifying nuanced structural anomalies across a variety of network settings. Consequently, there is a pressing need for advanced botnet detection approaches capable of offering adaptable solutions for detecting botnets in diverse and evolving environments.

Following these considerations, this paper introduces Structural Iterative Representation for Graph Nodes (SIR-GN), a novel approach designed to overcome the limitations of conventional GNNs in botnet detection. The SIR-GN approach focuses on capturing the intrinsic structural properties of nodes within a graph, allowing for robust node representations that remain effective even when network topologies vary. Our method iteratively aggregates information from each node neighborhood by creating a structural representation that encapsulates not only direct connections but also multi-hop relationships. This iterative aggregation enables SIR-GN to capture both local and global structural patterns, making it more resilient to the diverse configurations seen in botnet networks.
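The idea of iteratively aggregating neighborhood information can be illustrated with a deliberately simplified sketch. It is not the SIR-GN algorithm: it is a generic mean-aggregation scheme, assumed here only for illustration, in which each node starts from its degree and each further iteration adds the mean of its neighbors' previous values, so that k iterations reflect structure up to k hops away. The function name and the toy graph are hypothetical.

```python
def iterative_structural_features(adjacency, iterations=2):
    """Generic neighbor-mean aggregation (an illustration, not SIR-GN itself).

    Each node's feature vector is [degree, 1-hop mean, 2-hop mean, ...].
    """
    current = {n: float(len(nbrs)) for n, nbrs in adjacency.items()}
    features = {n: [value] for n, value in current.items()}
    for _ in range(iterations):
        updated = {}
        for n, nbrs in adjacency.items():
            # Mean of the neighbors' values from the previous iteration.
            updated[n] = sum(current[m] for m in nbrs) / len(nbrs) if nbrs else 0.0
        for n, value in updated.items():
            features[n].append(value)
        current = updated
    return features


# Hypothetical undirected path graph a - b - c - d.
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
features = iterative_structural_features(adjacency, iterations=2)
for node, vector in features.items():
    print(node, vector)
```

In this toy graph the two end nodes obtain identical vectors, as do the two inner nodes, because the vectors depend only on the surrounding structure and not on node identities.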

Moreover, this paper is an extension of our previous short paper [17], where we presented the proposed SIR-GN methodology for botnet detection. With respect to [17], this paper substantially extends the contributions of the previous short paper. Overall, in this paper, we make the following contributions: (i) we begin by presenting a comprehensive overview of the context, challenges, and motivations behind the development of advanced botnet detection methods; (ii) we provide a practical case study within Cloud-based environments that highlights how our framework can be used for botnet detection within a big data ecosystem; (iii) we conduct an in-depth state-of-the-art analysis, by providing a clear synthesis of the most relevant work in botnet detection and also highlighting limitations of current GNN-based methods; (iv) we introduce and detail the proposed Structural Iterative Representation for Graph Nodes methodology, by presenting its architecture, key functionalities, and the novel approach to robust node representation across diverse network topologies; (v) we also provide a detailed description of the dataset used in our experimental campaign; (vi) finally, we perform extensive experimental evaluations using real-life and synthetic datasets to rigorously validate the effectiveness, generalizability, and scalability of SIR-GN, demonstrating its superiority over existing botnet detection models.

The remainder of this paper is structured as follows: Section 2 presents an innovative case study focusing on detecting bot nodes within a Cloud-based environment. Section 3 reviews related work in the context of botnet detection. Section 4 presents the state-of-the-art methodologies in this field. In Section 5, we describe the datasets employed in our experimental campaign. After that, the inferential SIR-GN methodology is presented in Section 6, while Section 7 covers its evaluation and the results obtained. Finally, Section 8 concludes the paper and discusses potential directions for future research.

2. Case Study: Detection of Bot Nodes Within Unknown Networks in Cloud Environments

In this section, we provide an innovative case study that demonstrates how our SIR-GN framework can be applied within Cloud-based environments for botnet detection in cutting-edge big data ecosystems.

Figure 1 depicts our case study, demonstrating the implementation of the proposed framework within a Cloud-based environment as part of a big data ecosystem, focused on botnet node identification. This approach makes use of well-known big data management and analysis tools.

As depicted in Figure 1, the framework can operate on a Cloud-based computing infrastructure (“at the nodes”), enabling the big data prediction phase to utilize a collaborative environment where multiple Graph Learning Layers cooperate effectively to enhance the accuracy of botnet node detection. The referenced big data ecosystem adopts a multi-level architecture, comprising the following levels.

- Input Layer: This layer serves as the input for the system, collecting data from both network traffic and node activities in the Cloud. The collected data may consist of various types of information, such as logs, network traffic measurements, and other signals. This information helps in discovering the features and characteristics of the targeted botnet.
- Graph Learning Layer: This part of the system is responsible for generating a graph that contains information about node interactions. At this level, graph learning techniques are used to model the relationships between all nodes in the graph. The goal is to discover groups of nodes exhibiting unusual or suspicious activity. Detecting these nodes therefore reveals the key operational behaviors of active botnets. Additionally, the graph learning model is trained using GNNs. The functionality of the Graph Learning Layer is supported by the integration of the following components:
  - Original Graph: This component represents the primary graph, which provides the foundation for the data used to train and test the models. Graph nodes in the network are represented by circles and connections between them by lines. The graph has different labels or classes, denoted by nodes highlighted in red and blue.
  - Training set and Test set: Both the training set and the test set are extracted from the original graph. These are the data used for training the model and for testing its output. This separation ensures that the generalization capability of the model is verified on data that the model has not seen.
  - K-Fold Split: This step involves the use of the K-Fold Cross-Validation technique to divide the training set into K subsets, known as folds. This technique permits the model to be trained and validated on a variety of datasets, increasing its accuracy and robustness.
  - Base Model: Several instances of GNN are used in the base model, each trained on a different subset of data obtained from K-Fold Split. Each GNN_i (where i varies from 1 to S, with S as the number of folds) represents a version of the model that learns from the graph features and the relationships between nodes in a specific data split.
  - Training and Test Result: Test results for each GNN are generated for its specific fold, which results in a series of evaluations of the model performance. These outcomes can incorporate metrics such as precision, recall, and other accuracy metrics.
  - Average of Results: The results of the different GNNs are combined through an average (indicated with a summation symbol and a final average) to obtain an overall estimate of model performance. By averaging, the variance in the results can be reduced, and the performance measurement becomes more stable.
  - Secondary Training Set and Secondary Test Set: The end of the process involves using secondary training and test sets. An iterative training process or a fine-tuning phase is suggested to refine the model based on a new subset of data.
- Big Data Analytics Layer: This layer applies advanced big data analytics in order to handle the large quantities of data sourced from the Input Layer. Here, behavioral triangulation is carried out across multiple nodes, and higher-level analysis is employed to look for suspicious patterns. Machine learning (ML) and statistical inference techniques are also employed for traffic pattern identification or fraud detection.
- Result Layer: This is the last layer, where the outputs of the analysis are displayed. It presents the final analysis, revealing which nodes are part of a botnet and which nodes have been shown to be clean. The findings can be used to inform the security system or to trigger automatic countermeasures directed at the infected nodes so as to contain the intrusion.
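The K-Fold Split, per-fold evaluation, and averaging steps of the Graph Learning Layer can be sketched as follows. This is a minimal illustration under stated assumptions: the base model is replaced by a hypothetical callable `train_and_predict` (a trivial stand-in is used below instead of a GNN), labels are 1 for bot and 0 for benign, and all function names are invented for this example.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Shuffle item indices and deal them into k disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (bot = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cross_validate(labels, folds, train_and_predict):
    """Evaluate one model per fold and average the per-fold metrics."""
    scores = []
    for i, val_idx in enumerate(folds):
        # All folds except fold i form the training split for model i.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        predictions = train_and_predict(train_idx, val_idx)
        truth = [labels[j] for j in val_idx]
        scores.append(precision_recall(truth, predictions))
    average = (
        sum(s[0] for s in scores) / len(scores),
        sum(s[1] for s in scores) / len(scores),
    )
    return scores, average


def always_bot(train_idx, val_idx):
    """Hypothetical stand-in for a trained base model: flags every node."""
    return [1] * len(val_idx)


# Hypothetical node labels (1 = bot, 0 = benign).
labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
folds = k_fold_indices(len(labels), k=5)
scores, average = cross_validate(labels, folds, always_bot)
print("per-fold (precision, recall):", scores)
print("average (precision, recall):", average)
```

Averaging over folds smooths out the effect of any single lucky or unlucky split, which is the variance-reduction argument made for the Average of Results component.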

The big data ecosystem presented in this paper employs network modeling and statistical inference to detect botnet nodes within Cloud environments, taking into consideration data overload and complex networked interactions among the nodes. This layered approach enables robust and scalable analysis, optimized for use in high-traffic Cloud environments. The scalability of solutions plays a fundamental role. In fact, it is well understood that classic ML algorithms cannot be used “as they are”, and that specialized solutions are needed to manage the well-known 3V characteristics of big data (i.e., volume, velocity, and variety), as highlighted in recent proposals from the active literature (e.g., [18]). The multi-layer structure of the framework enhances the adaptability and integration of advanced analytics tools to capture complex patterns in real time, which makes it a resilient choice for evolving cybersecurity challenges in Cloud-based systems.

3. Related Work

In this section, an overview of the existing contributions in the field of study is provided, highlighting the main theories, methodologies, and results that have already been published. This allows us to place our work within the context of the existing literature.

In [19], the authors examine how the Internet of Things has changed various aspects of modern life, such as healthcare, transportation, home automation, and industrial control systems. However, more connected devices mean more security risks, especially from botnets, and many ML and deep learning (DL) techniques have been proposed for IoT botnet identification in order to counter these threats. Numerous studies have examined the effectiveness of ML and DL techniques in detecting IoT botnet attacks. In particular, a recent systematic review analyzed the most effective ML and DL techniques, considering benchmark datasets, evaluation metrics, and data preprocessing techniques. This review included studies published between 2018 and 2023, selecting 25 relevant studies out of 1567 initial records. The results indicate that ML and DL techniques outperform traditional signature-based methods in detecting IoT botnet attacks, although the effectiveness varies depending on the dataset, features used, and evaluation metrics adopted. A key obstacle in developing these techniques is the lack of good IoT botnet datasets. Current datasets might not represent real-life attacks well, which makes it hard to apply the proposed solutions broadly. Future studies should aim at building more detailed and varied datasets. Another major issue is the vulnerability of ML and DL models to adversarial attacks, which manipulate input data and lead to misclassifications, as well as more false positives and negatives. Future studies could also aim at building more robust models that can withstand these attacks. Additionally, because ML and DL techniques typically require sensitive data, privacy is a major concern. Therefore, future research should concentrate more on developing privacy-preserving techniques that work with encrypted data. Finally, the real-time detection of IoT botnet attacks on resource-constrained devices is made more difficult by the significant processing resources required by many current ML and DL algorithms. Future work should therefore concentrate on developing real-time, lightweight ML and DL techniques that run on IoT devices with limited computing power.

The authors of [20] examine current network security research that uses graphs for data representation and analysis, offering new tools for botnet and intrusion detection. The use of graph-based analysis to enhance network security, and of graphical displays of network traffic data, has been the subject of numerous studies. These graph-based network security approaches include using different types of graphs to display security information, using advanced techniques for analyzing these graphs, and using specific indicators for detecting and tracking threats. Most research works have examined graph models used to display and store network security information, and these studies highlight how graphs can improve the visualization of complex networks. This makes it easier for analysts to detect threats; however, the wide adoption of these techniques faces several challenges. One is the need to select the most appropriate graph model and algorithm without extensive domain knowledge. Another important challenge is the underutilization of graph databases and large-scale graph processing systems: despite their potential, network security applications have yet to take full advantage of these tools. Therefore, the authors conclude that future research should focus on several promising directions, such as improving the selection and use of the most efficient graph models and algorithms, increasing the use of graph processing frameworks, and developing new techniques that can be easily used by security analysts. Additionally, it is critical to address the challenges associated with the scalability and computational complexity of graph-based approaches so that they can actually be applied in real-world contexts.

In [21], the authors analyze botnets, describing them as one of the most challenging threats to cybersecurity and noting that they are becoming increasingly sophisticated and resistant to detection. Despite the specific behaviors of each botnet, there are sufficient similarities among botnets that distinguish their behavior from benign traffic. Several botnet detection systems based on these similarities have been proposed. However, providing a solution to differentiate botnet traffic (even traffic using the same protocol, such as IRC) from normal traffic is not trivial. Extraction of host-level or network-level features to model a botnet has been one of the most prevalent methods in botnet detection. A subset of features, typically selected based on an intuitive understanding of botnets, is used by ML algorithms to classify or cluster botnet traffic. After being tested on two or three botnet traces, these methods have largely produced satisfactory detection results. Their ability to identify other botnets or real traffic, however, is still unknown. Furthermore, little research has been conducted on how well various feature combinations work in terms of wider detection coverage. The authors examine flow-based features used in previous botnet detection research and assess how effective they are. They produced a dataset with a diverse collection of botnet traces and background traffic in order to guarantee sufficient evaluation. To further improve detection capabilities, ML-based detection methods could be strengthened with security tools (such as intrusion detection systems or firewalls).

The authors of [22] focus on botnets, which represent a significant challenge for the operational security of networks due to the continuous threats they pose, so that new approaches are necessary to safeguard the integrity of digital infrastructures. For this purpose, the authors present an ML-based method to detect botnets in network traffic, and their research points to the continued threat posed by botnets, as well as the limitations of existing detection methods. The proposed method detects network botnets using ML techniques such as Support Vector Machines (SVMs) and Regularized Logistic Regression (RLR). Another strength lies in the combination of feature selection techniques and ML algorithms, which improves detection accuracy in the IoT environment. This work also helps distinguish between botnet detection techniques and demonstrates the importance of combining ML techniques.

In [23], the authors focus on the IoT, whose growth has led to an increase in IoT-based DDoS attacks. The paper presents a solution for detecting botnet activities within consumer IoT devices and networks. The authors employ an innovative application of DL to develop a detection model based on a Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN). Word Embedding is used for text recognition and conversion of attack packets into tokenized integer format. The developed BLSTM-RNN detection model is compared with an LSTM-RNN in detecting four attack vectors used by the Mirai botnet and is evaluated for accuracy and loss. The article demonstrates that the bidirectional approach, despite adding overhead and increasing processing time, proves to be a progressively better model over time.

Finally, in [24], the authors notice that, in recent years, the widespread adoption of IoT technology has significantly transformed various industries, driven by enabling technologies that have accelerated this shift. The unprecedented possibilities offered by IoT have led to the development of smart applications integrated into national infrastructure. However, the growing popularity of IoT has also attracted adversaries who exploit the inherent vulnerabilities of IoT devices to launch sophisticated attacks, including Multi-Stage Attacks (MSAs), such as IoT botnet attacks. These attacks have resulted in substantial financial losses across industries, amounting to billions of dollars. To address this issue, the study proposes a two-phase system for IoT botnet detection. The first phase focuses on identifying IoT botnet traffic by subjecting IoT traffic to feature selection and classification model training to differentiate between malicious and normal traffic. The second phase analyzes the malicious traffic identified in the first phase to detect various botnet attack campaigns. This phase employs an alert correlation approach that combines Latent Semantic Analysis (LSA) with graph theory-based techniques. The proposed system was evaluated using a publicly available real IoT traffic dataset and demonstrated promising results.

4. Analysis of State-of-the-Art Approaches

In this section, we present state-of-the-art methodologies related to the context of botnet node detection, and we also highlight the limitations of current GNN-based methods that are still being investigated in this context.

4.1. The Detection of Botnets

Cybercriminals must first persuade people to install malware on their systems in order to gain control over the majority of devices, and the majority of malware used for this purpose is freely available online. We use the Mirai malware as an illustration, which preys on Linux-based IoT systems, including routers, IP cameras, and any home automation equipment. This botnet was utilized against organizations like Krebs on Security, the French web hosting service OVH, and even the DNS provider Dyn, which is a crucial service for regular internet communications, to cause widespread disruptions and produce internet traffic of up to 1 Tbit/s. The Mirai malware is regarded as the first of its kind, and even though its developers were found and turned over to the police, numerous additional varieties have since emerged, including the PureMasuta, Okiru, Satori, and Masuta malware. Figure 2 depicts the conceptual design of a Mirai-based bot that transforms IoT devices into proxy servers.

Both the complexity and the attacks of botnets are constantly evolving. Fortunately, the methods for identifying botnets have evolved along with the networks themselves. It is well known that botnets are able to operate in ways that evade detection. For instance, many honeypots are deployed to attract botnets, yet botnets tend to avoid the honeypots that are known to them, which allows them to bypass these controls and avoid detection. Furthermore, the proliferation of P2P connections has made it challenging to detect botnets. It is essential to recognize that there are two well-known connection architectures: Client–Server and P2P. These two models differ greatly in terms of structure and functionality. Figure 3 depicts a comparison of Client–Server and P2P connection types, highlighting their mode of connection.

In the case of Client–Server botnets, if the C&C control node is located and isolated, the entire botnet can be found and subsequently stopped. In contrast, P2P botnets are capable of sharing C&C commands among peers, and a bot, once discovered, has little knowledge of the other bots, making detection even more difficult. In order to locate the central control node, BotMiner, for instance, clusters nodes with similar communications and malicious traffic. Using fast-flux service networks, P2P botnets can change the addresses of the C&C server node, which allows them to evade traffic monitoring between nodes. Because of this, the traditional approaches, which mostly rely on static properties to characterize network traffic, have been rendered ineffective. These traditional methods, such as domain name and DNS blacklists, also call for a deeper understanding of the network and botnets. Therefore, this type of strategy only works provided that the data are available and have not been altered by the botnet.
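The structural difference between the two architectures can be illustrated with a small sketch that removes one node and counts the connected components that remain. The star and ring graphs below are toy stand-ins, assumed purely for illustration, for a Client–Server botnet and a P2P overlay; node 0 plays the role of the C&C node in the star, and the function names are hypothetical.

```python
from collections import deque


def connected_components(adjacency, removed=frozenset()):
    """List the connected components left after deleting `removed` nodes."""
    seen = set(removed)
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:  # breadth-first search from `start`
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def star(n):
    """Toy Client-Server topology: node 0 is the hub, nodes 1..n-1 are bots."""
    adjacency = {0: set(range(1, n))}
    for i in range(1, n):
        adjacency[i] = {0}
    return adjacency


def ring(n):
    """Toy P2P overlay: every node is linked to its two ring neighbors."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


star_after = connected_components(star(6), removed={0})
ring_after = connected_components(ring(6), removed={0})
print("star without hub:", len(star_after), "components")
print("ring without one peer:", len(ring_after), "component(s)")
```

Isolating the hub shatters the star into isolated bots, whereas the ring stays in one piece after losing a peer, which mirrors why taking down a single node is far less effective against P2P botnets.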

4.2. Graph Representation Learning

The use of unsupervised learning techniques for representing graph data is growing in popularity. Networks store a significant amount of information regarding relationally structured data in a simple data structure: a list of entities, named nodes, and a list of edges, which are the links that connect these nodes. Additionally, a graph representation can be used to readily recognize the structure of a botnet. However, typical ML algorithms operate on a set of attributes, so graph data structures must first be represented in this form, and this representation must fully capture the similarity between nodes. The concept of similarity can be founded either on proximity or on the structural position of the node within the network, together with that of its neighbors. For instance, under a proximity-based notion, highly connected nodes will be close to one another in the feature space. Under a structural notion, it suffices that the structures surrounding two nodes are comparable in order to classify them as similar nodes that play the same structural role; a path between them is not necessary.
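The point that structurally similar nodes need not be connected by a path can be illustrated with a Weisfeiler-Lehman style label refinement. This is a well-known generic technique used here only as an illustration; it is not the method proposed in this paper. The toy graph, with two disconnected star-shaped components and one path, and the function name are assumptions.

```python
def refine_structural_labels(adjacency, iterations=2):
    """Weisfeiler-Lehman style refinement of purely structural node labels.

    Labels start as node degrees; each round pairs a node's label with the
    sorted labels of its neighbors, so node identities never enter the label.
    """
    labels = {node: len(neighbors) for node, neighbors in adjacency.items()}
    for _ in range(iterations):
        labels = {
            node: (labels[node], tuple(sorted(labels[m] for m in neighbors)))
            for node, neighbors in adjacency.items()
        }
    return labels


# Hypothetical graph with three disconnected components:
# a star around "A", a star around "B", and a path p1 - p2 - p3.
adjacency = {
    "A": ["a1", "a2", "a3"], "a1": ["A"], "a2": ["A"], "a3": ["A"],
    "B": ["b1", "b2", "b3"], "b1": ["B"], "b2": ["B"], "b3": ["B"],
    "p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2"],
}
labels = refine_structural_labels(adjacency)
print("A and B share a label:", labels["A"] == labels["B"])
print("a1 and b2 share a label:", labels["a1"] == labels["b2"])
print("a1 and p1 share a label:", labels["a1"] == labels["p1"])
```

The two hubs receive identical labels although no path joins them, while the leaves of a star and the ends of the path are told apart after one round because their neighbors have different degrees.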

Numerous representation learning techniques have been used with considerable success in a variety of academic fields. For instance, the DeepWalk method borrows from Natural Language Processing (NLP) and is based on the Word2Vec algorithm with the Skip-Gram model, which seeks to predict the words in the context of a given center word; applied to graphs, the objective is optimized over the nodes that appear near each node. In other words, DeepWalk extends the Skip-Gram model by switching from word sequences to graphs. It uses random walks to establish this transformation and to provide insights about localized network topologies. The DeepWalk procedure operates as follows:

For each node, perform $N$ random walks starting at that node.
Treat each walk as a sequence of node-id strings.
Train a Word2Vec model using the Skip-Gram algorithm on the string sequences obtained previously.
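The first two steps of the procedure above can be sketched in a few lines. The graph, the function names, and the parameter values are illustrative assumptions. The third step is left to an external library; for example, the resulting corpus could be passed to gensim's `Word2Vec` with `sg=1`, which selects Skip-Gram, but that call is not shown here.

```python
import random


def random_walk(adjacency, start, length, rng):
    """Walk `length` nodes from `start`, picking a uniform random neighbor."""
    walk = [start]
    while len(walk) < length:
        neighbors = adjacency[walk[-1]]
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    # Step 2: each walk becomes a "sentence" of node-id strings.
    return [str(node) for node in walk]


def deepwalk_corpus(adjacency, walks_per_node, walk_length, seed=0):
    """Step 1: perform `walks_per_node` random walks from every node."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    corpus = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # visit start nodes in a fresh order each pass
        for node in nodes:
            corpus.append(random_walk(adjacency, node, walk_length, rng))
    return corpus


# Hypothetical undirected graph given as neighbor lists.
adjacency = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
corpus = deepwalk_corpus(adjacency, walks_per_node=2, walk_length=5)
for sentence in corpus:
    print(" ".join(sentence))
```

Because walks can only follow existing edges, nodes in different connected components never share a sentence, which is the limitation discussed next.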

The result is a node representation learning approach in which nodes with similar neighbors, in terms of shared nodes and connectivity, receive similar representations. Because the graph must be traversed, a major issue arises when the graph is only partially connected, since there are no links between nodes in different components. The Node2vec algorithm follows a comparable methodology, using random walks around the network that begin at a target node. It removes the connectivity requirement and results in performance improvements. Each of these techniques is effective at capturing connectivity data between nodes, but they fall short when it comes to preserving the structural network characteristics that are crucial for bot detection. The most effective techniques for learning representations while maintaining structural information are GNNs, such as Graph Convolutional Neural Networks, as well as Struct2Vec, GraphWave, and iterative procedures. Among these, only the inferential SIR-GN iterative method and neural networks are able to perform inference, offering predictions for graphs that are entirely different from those used for training. The primary goal of our botnet detection method is therefore to learn structural representations with an inference-capable algorithm. Other relevant efforts in the areas of AI and ML are mentioned in [25,26,27,28,29,30,31,32,33,34,35,36,37].

4.3. Cybersecurity Standards

Given the continuous evolution of cyber threats, many cybersecurity frameworks such as the NIST Cybersecurity Framework (NIST CSF) and the MITRE Cybersecurity Criteria have gained wide adoption. NIST CSF [38] focuses on sharing best practices for risk management and cyber resilience, whereas MITRE ATT&CK focuses on a comprehensive classification of adversarial tactics and methods. Various studies have shown that the integration of these frameworks enhances threat detection and mitigation across different industries and tasks (e.g., botnet detection). In addition, CIS Critical Security Controls and ISA/IEC 62443 are other standards that reinforce cybersecurity. Despite these advancements, challenges in adoption persist, as continuous updates are needed and skilled personnel are required to manage emerging threats effectively.

Another line of research is exemplified by [39], which explores the novel role of Large Language Models (LLMs) in cybersecurity, focusing on their emerging capabilities such as threat detection, cryptographic applications, and adversarial learning. It also shows how LLMs increase security by automating cyber defense processes, detecting trends in criminal activity, and using ML to enhance encryption approaches. It further extends to AI-driven intent-based networking, where AI automates network security policy, and to zero-touch network security for 5G/6G infrastructures, which enables self-adaptive and resilient security mechanisms. Additionally, the discussion covers how Blockchain can offer decentralization for LLMs in secure AI collaboration, and how adversarial training can be used to develop automated attack–defense strategies. The work also aligns with existing frameworks, such as the NIST Cybersecurity Framework and MITRE ATT&CK, in identifying open challenges: adversarial robustness, secure AI governance, and the ethics of autonomous cyber defense systems. These developments place AI-driven cybersecurity as a frontier for securing digital ecosystems against evolving threats, including the botnets addressed in this paper.

5. Description of Datasets

In this section, we outline the dataset employed in our experimental campaign, explaining its characteristics and underlying structures. Our choice of dataset is aligned with the nature of the challenges addressed. As mentioned before, current datasets may not accurately represent real-life threats, making it difficult to generalize the proposed solutions. We therefore focus on more thorough and diverse datasets.

We have analyzed two distinct botnet models. The first model features a centralized architecture with a C&C structure, which enables quick identification due to its star topology, where a central hub connects all nodes. In contrast, the second model, known as P2P, is a decentralized system where nodes establish connections across the network in just one or two hops. As previously discussed, P2P botnet clusters are more challenging to detect because there is no central node that acts as a core hub. In our experiments, we employ the P2P model to demonstrate how graph-based representation learning can effectively detect anomalies that traditional detection techniques often miss. For this purpose, we use real background traffic data collected in 2018 from IP addresses on a backbone monitored by CAIDA, as detailed in [40]. Due to the aggregated nature of these data, it is not possible to identify the specific users responsible for generating the traffic.

To account for the various P2P topologies, we randomly selected background traffic across a group of nodes suspected to be part of the botnet. We then constructed controlled networks representing well-known P2P topologies, including De Bruijn [41], Kademlia [42], Chord [43], and Leetchord [44]. Moreover, we incorporate a genuine P2P network to further validate our approach to detecting botnet attacks within communication flows. The log-log plot of the degree distribution within these graphs is presented in Figure 4. This plot is a crucial tool for understanding the connectivity and interaction patterns within the network topologies, highlighting anomalies and unusual node behaviors that standard detection methods might overlook.

As seen in Figure 4, the frequency (number of nodes) decreases as the degree (number of edges) of the nodes increases. This pattern suggests that most nodes have degrees under two and that few nodes are well connected.
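As an illustration of how such a degree distribution can be computed from an edge list, the following minimal sketch counts node degrees and converts the (degree, frequency) pairs to log-log coordinates. The function names and the toy edge list are hypothetical and are not part of the datasets described here.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return a Counter mapping degree -> number of nodes with that degree."""
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1  # edge leaving src
        degree[dst] += 1  # edge entering dst
    return Counter(degree.values())


def log_log_points(distribution):
    """Convert (degree, frequency) pairs to log10 coordinates, sorted by degree."""
    return [(math.log10(d), math.log10(f)) for d, f in sorted(distribution.items())]


# Toy communication graph (hypothetical): host 0 contacts five peers, and 1 contacts 2.
toy_edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2)]
dist = degree_distribution(toy_edges)
points = log_log_points(dist)
print(dict(dist))
```

In this toy graph, three nodes have degree one, two have degree two, and a single hub has degree five, so the frequency falls as the degree grows, which is the same qualitative shape described for Figure 4.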

Each network has 960 P2P graphs with an average of 144,000 nodes, over 1.5 million edges, and 10,000 botnet nodes within each synthetic graph. The average cluster size in the graphs is 0.007. Based on the existing botnet network, which consists of 144k nodes and 3k botnet nodes, 10,000 botnet nodes were used in the training datasets, while 10,000, 1000, and 100 botnet nodes were used in the test datasets. As less than 10% of the network nodes are part of botnets, all of these networks are severely imbalanced. Figure 5 displays the values for the mean node structure across the various datasets.
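To make the degree of imbalance concrete, the short calculation below derives the bot fraction for each of the botnet sizes mentioned above, using the reported average of 144,000 nodes per graph. It is only a back-of-the-envelope sketch; the helper name is hypothetical.

```python
AVG_NODES = 144_000  # average number of nodes per graph, as reported above


def bot_fraction(bot_nodes, total_nodes=AVG_NODES):
    """Fraction of the nodes in a graph that are bots."""
    return bot_nodes / total_nodes


fractions = {bots: bot_fraction(bots) for bots in (10_000, 1_000, 100)}
for bots, frac in fractions.items():
    print(f"{bots:>6} bot nodes: {frac:.3%} of the graph")
```

With 10,000 bots the positive class is roughly 6.9% of the nodes, and it shrinks by a factor of ten for each smaller test configuration, which is why metrics such as precision, recall, and F1-score are more informative than accuracy alone on these graphs.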

As shown in Figure 5, the datasets used in our research consist of different network types, including Chord, Debru, Kadem, Leet, and P2P. Each dataset contains a substantial number of nodes, edges, and bot nodes represented in red, blue, and green, respectively. The average number of edges significantly surpasses the number of nodes across all network types, which highlights the dense connectivity of these botnet structures. For instance, the Debru dataset has the highest number of edges (1,671,000), whereas the Leet dataset has the lowest (1,509,858). The combined dataset aggregates characteristics from all individual networks by showing an overall balance in node and edge distribution, which reflects modern botnet behaviors.

6. The Proposed Methodology: Inferential SIR-GN

In this section, we describe in detail our proposed Structural Iterative Representation for Graph Nodes methodology, by presenting its architecture, key functionalities, and the novel approach to robust node representation across diverse network topologies.

Our proposed approach is based on the inferential SIR-GN, which is an iterative structural representation learning process with inference capabilities. The symbols we utilize in our methodological explanation are shown in Table 1.

Layne and Serra [45] provide a detailed description of the inferential SIR-GN, which is used to extract node representations from directed graphs. The model builds on the SIR-GN methodology, first presented by Joaristi and Serra [46]. In this methodology, a node representation is iteratively updated by first characterizing, then aggregating, its neighbors. At each iteration, the size of a node representation is determined by a user-selected hyperparameter $n_c$. The current node description, which starts as the node degree, is clustered into $n_c$ clusters using K-Means. At each iteration, the representation is normalized before clustering; the distance from each cluster centroid is then converted into the probability that the node belongs to that cluster. A node's structural description is then updated by aggregating its neighbors: for each cluster, the membership probabilities of all neighboring nodes are summed. Algorithm 1 shows the SIR-GN algorithm [45].

Algorithm 1 SIR-GN [45]

Input: Set of Graph Nodes V, Number of Clusters n_c, Number of Iterations k.
Output: Node Representation R.

Begin
    R ← null; v ← null; C ← null; P ← 0;
    for (v ∈ V) do
        R[v] ← initializeNodeRepresentation(v);
    end for
    for (i = 1 to k) do
        for (v ∈ V) do
            R[v].normalize();
            C ← KMeans(v, n_c);
            P ← computeProbability(v, C);
            R[v].update(P);
        end for
    end for
    return R;
End
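The iteration in Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the reference implementation of [45,46]: the min-max normalization, the small K-Means routine, the inverse-distance conversion from distances to probabilities, and all function names are choices made here for the example.

```python
import math
import random


def normalize(rows):
    """Min-max scale each column to [0, 1] (assumed normalization)."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(row, lo, hi)]
            for row in rows]


def kmeans(rows, n_c, iters=20, seed=0):
    """Tiny K-Means: returns n_c centroids for the given rows."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, n_c)]
    for _ in range(iters):
        groups = [[] for _ in range(n_c)]
        for row in rows:
            nearest = min(range(n_c), key=lambda c: math.dist(row, centroids[c]))
            groups[nearest].append(row)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def membership(row, centroids):
    """Convert distances to centroids into probabilities (closer means more probable)."""
    weights = [1.0 / (math.dist(row, c) + 1e-9) for c in centroids]
    total = sum(weights)
    return [w / total for w in weights]


def sir_gn(adjacency, n_c, k):
    """Return a length-n_c structural description per node after k iterations."""
    nodes = sorted(adjacency)
    rep = {v: [float(len(adjacency[v]))] for v in nodes}  # starts as the node degree
    for _ in range(k):
        scaled = normalize([rep[v] for v in nodes])
        centroids = kmeans(scaled, n_c)
        prob = {v: membership(row, centroids) for v, row in zip(nodes, scaled)}
        # Aggregate: for each cluster, sum the neighbours' membership probabilities.
        rep = {v: [sum(prob[u][c] for u in adjacency[v]) for c in range(n_c)]
               for v in nodes}
    return rep


# Toy undirected graph given as adjacency lists (hypothetical data).
graph = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0, 5], 5: [4]}
representation = sir_gn(graph, n_c=2, k=3)
```

Because every neighbor contributes a probability vector that sums to one, the entries of each node's description add up to its degree, so each entry can be read as an expected number of neighbors falling in the corresponding cluster.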

This aggregation corresponds to the expected number of neighbors that each node has in each cluster. Each iteration represents a deeper level of exploration, and after $k$ iterations, a node description includes information about its $k$-hop neighborhood structure. The inferential SIR-GN differs from the conventional model in some ways. First, at each iteration, the structural description of each node is concatenated into a broader representation, which captures the evolution of the structural information through a more in-depth examination of the neighborhood. To prevent information degradation as a result of the size growth, the final representation is then compressed with Principal Component Analysis (PCA) into a dimension selected as a hyperparameter.

For directed graphs, a node's initial representation consists of two vectors of dimension $n_c$, one of which represents the node in-degree and the other its out-degree. Prior to clustering, these two vectors are concatenated. At each iteration, this concatenated vector is clustered, and the neighboring nodes are then aggregated. In the directed case, the aggregation produces two intermediate vectors, one for in-neighbors and one for out-neighbors, which are concatenated for the next iteration. Our model gains its inference capability by pre-training the K-Means and scaling models used at each iteration. These models are trained, together with the PCA model that produces the final node embedding, for each exploration depth; training is carried out on random graphs, and the trained models are saved and later used for inference.
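The directed aggregation step can be sketched as follows. The sketch reads the two intermediate vectors as aggregates over incoming and outgoing edges, which is our interpretation of the description above. For brevity, the initial representation uses the two scalar degrees rather than two vectors of dimension $n_c$. All names and the toy data are hypothetical.

```python
def in_out_neighbors(edges):
    """Build in- and out-neighbor lists from directed (src, dst) edges."""
    incoming, outgoing = {}, {}
    for src, dst in edges:
        for v in (src, dst):
            incoming.setdefault(v, [])
            outgoing.setdefault(v, [])
        outgoing[src].append(dst)
        incoming[dst].append(src)
    return incoming, outgoing


def initial_representation(incoming, outgoing):
    """Concatenate in-degree and out-degree as the starting description."""
    return {v: [float(len(incoming[v])), float(len(outgoing[v]))] for v in incoming}


def aggregate_directed(prob, incoming, outgoing):
    """Sum cluster probabilities over in- and out-neighbors separately, then concatenate."""
    n_c = len(next(iter(prob.values())))
    rep = {}
    for v in prob:
        agg_in = [sum(prob[u][c] for u in incoming[v]) for c in range(n_c)]
        agg_out = [sum(prob[u][c] for u in outgoing[v]) for c in range(n_c)]
        rep[v] = agg_in + agg_out  # vector of length 2 * n_c
    return rep


# Toy directed flows and made-up cluster membership probabilities (n_c = 2).
edges = [("a", "b"), ("a", "c"), ("b", "c")]
incoming, outgoing = in_out_neighbors(edges)
start = initial_representation(incoming, outgoing)
prob = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.0, 1.0]}
rep = aggregate_directed(prob, incoming, outgoing)
```

Keeping the two directions separate lets the description distinguish a node that mostly receives connections from one that mostly initiates them, even when their total degrees are equal.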

To perform inference, we repeatedly apply normalization with the pre-trained models, then clustering, then aggregation, and finally use the PCA model fitted during training to create the final node representations. This approach enables the use of the same pre-trained model across several data sources while also decreasing the inference time; Layne and Serra [45] present it in great depth, including a thorough algorithm and an analysis of the time complexity of the model. Any classifier can use the structural representation vectors of SIR-GN nodes to learn the topology of botnets for automatic botnet discovery. In this work, a three-layer neural network is used to make the final prediction of whether a node (machine) is a bot. Algorithm 2 shows the inferential SIR-GN algorithm [46].

Algorithm 2 Inferential SIR-GN [46]

Input: Set of Graph Nodes V, Number of Clusters n_c, Number of Iterations k, Selected Dimension D.
Output: Node Representation R.

Begin
    R' ← null; v ← null; C ← null; P ← 0;
    for (v ∈ V) do
        R'[v] ← initializeNodeRepresentation(v);
    end for
    for (i = 1 to k) do
        for (v ∈ V) do
            R'[v].normalize();
            C ← KMeans(v, n_c);
            P ← computeProbability(v, C);
            R'[v].update(P);
        end for
    end for
    for (v ∈ V) do
        R.concatenate(R'[v]);
    end for
    R ← PCA(R, D);
    return R;
End
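The inference side of this procedure can be illustrated with the sketch below, in which the per-depth scaling bounds and centroids are frozen and simply applied to a new graph. The stored values are invented for the example, the distance-to-probability conversion is an assumption, and the final PCA compression is omitted for brevity. It should be read as an outline of the idea, not as the implementation in [45].

```python
import math


def scale(row, lo, hi):
    """Apply frozen min-max bounds; values from an unseen graph may fall outside [0, 1]."""
    return [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(row, lo, hi)]


def membership(row, centroids):
    """Convert distances to frozen centroids into probabilities (assumed conversion)."""
    weights = [1.0 / (math.dist(row, c) + 1e-9) for c in centroids]
    total = sum(weights)
    return [w / total for w in weights]


def infer(adjacency, pretrained):
    """Apply frozen per-depth models to a graph; nothing is fitted here."""
    nodes = sorted(adjacency)
    rep = {v: [float(len(adjacency[v]))] for v in nodes}
    history = {v: [] for v in nodes}
    for depth in pretrained:
        n_c = len(depth["centroids"])
        prob = {v: membership(scale(rep[v], depth["lo"], depth["hi"]), depth["centroids"])
                for v in nodes}
        rep = {v: [sum(prob[u][c] for u in adjacency[v]) for c in range(n_c)]
               for v in nodes}
        for v in nodes:
            history[v].extend(rep[v])  # concatenate the description of every depth
    return history


# Hypothetical frozen models for k = 2 depths and n_c = 2 clusters.
pretrained = [
    {"lo": [1.0], "hi": [4.0], "centroids": [[0.0], [1.0]]},
    {"lo": [0.0, 0.0], "hi": [4.0, 4.0], "centroids": [[0.2, 0.8], [0.8, 0.2]]},
]
graph = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0], 4: [0, 5], 5: [4]}
embedding = infer(graph, pretrained)
```

Since no model is refitted, two graphs processed with the same frozen models land in the same representation space, which is what allows a classifier trained on one topology to be applied to another.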

Furthermore, our proposed approach for botnet detection based on inferential SIR-GN has several benefits compared to similar approaches in the active literature, such as the following:

Providing robustness against evolving botnet structures, as it captures both local and global topological patterns, which enables robust detection in unknown network topologies.
Improving generalization capability by allowing the model to perform well on previously unseen datasets with minimal retraining requirements.
Reducing the risk of overfitting, as the iterative learning process ensures that structural representations remain meaningful across diverse network environments.
Enhancing classification accuracy, outperforming existing GNN-based approaches by leveraging a neural network classifier trained on structurally representative node embeddings.

In addition, one of the major challenges in botnet detection is the continuous evolution of adversarial strategies (e.g., [47,48]) aimed at evading security measures. Attackers leverage obfuscation techniques, encrypted communication channels, polymorphic malware, and adversarial perturbations to bypass traditional detection methods. Our proposed SIR-GN model enhances resilience against adversarial attacks through several key mechanisms.

Structural Representation Learning for Robustness: Unlike conventional machine learning models that rely on static feature sets, SIR-GN captures both local and global structural properties of network nodes. This characteristic mitigates the risk of adversarial perturbations that attempt to mask botnet behavior.
Generalization across Diverse Topologies: By leveraging iterative representation learning, our model is designed to generalize well across different network topologies. This ability to adapt to unseen network structures ensures that our detection approach is not restricted to predefined attack patterns.
Adversarial Training and Robust Classification: To further enhance resistance to adversarial attacks, we incorporate adversarial training strategies in our learning process. This involves exposing our model to manipulated botnet samples during training, thereby improving its ability to recognize and neutralize adversarial bot behaviors.
Detection of Covert Communication Patterns: Many botnets use encrypted communication channels to avoid detection. SIR-GN identifies botnets based on their structural communication patterns rather than relying exclusively on packet content analysis.
Resilience to Poisoning and Evasion Attacks: Data poisoning attacks, where attackers inject malicious samples to mislead the learning model, pose a significant threat to machine learning-based detection systems. Our approach mitigates this risk by leveraging an iterative representation learning process that aggregates information from multi-hop neighborhoods, thereby reducing the impact of poisoned samples on individual nodes.

7. Experimental Assessment and Analysis

In this section, we discuss the implementation of the model and the experiments we performed to validate the accuracy of our proposal.

7.1. Setup

The experimental environment consists of a desktop PC with an Intel(R) Core(TM) i7-10700 CPU @ 2.90 GHz, 16 GB RAM, 16 logical cores, and an Intel UHD Graphics 630 graphics card.

Concerning the datasets used in our evaluation, we use four basic botnet topologies to create synthetic datasets (see Section 5), where 960 distinct graphs are produced for each topology by overlaying that topology on actual traffic. The size and number of the graphs are scaled to match a real-life P2P dataset with 960 graphs of actual botnet attacks. We trained inferential SIR-GN on a collection of randomly generated graphs, as described in Section 6. Then, for each set of graphs, we used the trained model to produce structural node representations, which effectively capture each node's structural description. A classifier trained on a relatively small fraction of these representations produced outstanding results. Furthermore, inferential SIR-GN can be used upstream of any classifier and facilitates transfer learning, enabling the creation of a neural network (NN) classifier that outperforms all previously trained models using similar methods.

In our experimental evaluation, we simulate the operation of a network comprising 160 devices and collect real data by deploying actual malware samples, including Mirai, Bashlite, and Torii. The dataset primarily focuses on the propagation phase (both spreading and communication). It consists of 23,340,359 network packets, which are categorized into different types, as shown in Figure 6 and Figure 7.

Furthermore, in our experiments, network traffic features were extracted at regular intervals to summarize all traffic between hosts and protocol communications. Figure 8 shows the number of features across four different types: Type 1 corresponds to traffic generated by the same IP, Type 2 includes traffic from the same IP and MAC address, Type 3 represents traffic between the same source and destination IP addresses, and Type 4 captures traffic between the same source and destination TCP/UDP.
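The four aggregation types can be thought of as four different grouping keys applied to the same packet stream. The sketch below illustrates this with simple packet and byte counters. The record fields, the reading of Type 4 as a source/destination IP and port pair, and the omission of the time window are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical packet records; the field names are illustrative only.
packets = [
    {"src_ip": "10.0.0.1", "src_mac": "aa:aa", "dst_ip": "10.0.0.9",
     "src_port": 5000, "dst_port": 23, "size": 60},
    {"src_ip": "10.0.0.1", "src_mac": "aa:aa", "dst_ip": "10.0.0.9",
     "src_port": 5001, "dst_port": 23, "size": 60},
    {"src_ip": "10.0.0.1", "src_mac": "aa:aa", "dst_ip": "10.0.0.7",
     "src_port": 5002, "dst_port": 2323, "size": 74},
    {"src_ip": "10.0.0.2", "src_mac": "bb:bb", "dst_ip": "10.0.0.9",
     "src_port": 4000, "dst_port": 80, "size": 120},
]

# One grouping key per aggregation type.
KEYS = {
    "type1": lambda p: (p["src_ip"],),
    "type2": lambda p: (p["src_ip"], p["src_mac"]),
    "type3": lambda p: (p["src_ip"], p["dst_ip"]),
    "type4": lambda p: (p["src_ip"], p["src_port"], p["dst_ip"], p["dst_port"]),
}


def summarize(records, key_fn):
    """Accumulate packet and byte counts for every distinct key."""
    stats = defaultdict(lambda: {"count": 0, "bytes": 0})
    for p in records:
        entry = stats[key_fn(p)]
        entry["count"] += 1
        entry["bytes"] += p["size"]
    return dict(stats)


summaries = {name: summarize(packets, key_fn) for name, key_fn in KEYS.items()}
```

In practice, such summaries would be recomputed for each extraction interval, so that every host or host pair contributes one feature row per interval.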

Moreover, as shown in Figure 9, datasets are categorized into eight groups to differentiate between legitimate and malicious instances. This classification process is applied to each type of malware, resulting in approximately 1,000,000 cases across the eight categories.

In order to assess the effectiveness of SIR-GN, we use the Automating Botnet Detection with Graph Neural Networks (ABD-GN) model [49], which is a GNN-based model designed for identifying botnets. Specifically, ABD-GN models network communications as a graph where nodes represent entities in the traffic and edges denote direct interactions. The model processes this graph by iteratively updating the representation of each node based on its neighboring connections, which allows it to capture structural patterns unique to botnets. We selected ABD-GN because it is conceptually very close to our technique, which makes it a reliable baseline for comparison. Other alternatives, such as hybrid ML methods (e.g., [50,51,52]), are also significant comparison techniques to be considered. However, due to their different (e.g., hybrid) nature, their study and comparison with SIR-GN is outside the scope of this paper and is left as future work.

Furthermore, we use the following comparisons to demonstrate how inferential SIR-GN generalizes to unseen data:

First, we compare the inferential SIR-GN plus neural network classifier, trained on 50 graphs from a dataset (botnet topology) and used to classify 96 graphs from that dataset, with the ABD-GN classifier, which was trained on 80% (768) of the dataset graphs and used to classify a test set of 20% (on the same 96 graphs) from the same topology.
Next, we compare the ABD-GN classifier, trained on 768 graphs from one topology and used to classify the test set, with the inferential SIR-GN plus neural network classifier, trained on 50 graphs from a single topology and used to classify the test sets of 96 graphs from each of the other topology datasets and from the real P2P attack data.

7.2. Experimental Results

In this section, we present the experimental results obtained from our extensive campaign. Our experimental evaluation is structured into two main phases. In the first phase, we benchmark the performance of state-of-the-art approaches in the context of botnet detection. This provides a solid comparative foundation for assessing the capabilities of existing approaches. Following this, we conduct a thorough evaluation of our proposed SIR-GN method, applying it to both binary and multi-class classification tasks and analyzing its effectiveness based on multiple evaluation metrics, including accuracy, recall, F1-score, and precision. We compare its performance against a selection of ML and DL algorithms to highlight its robustness and accuracy in detecting botnet nodes. In the second phase, we extend our analysis by directly comparing SIR-GN with ABD-GN, in order to assess its effectiveness in handling both synthetic and real-life data.

Although early detection of botnets is critical for preventing damage, research in this area remains limited. We begin our experimental evaluation by benchmarking several state-of-the-art approaches for botnet detection at both early and late stages. Specifically, we evaluate the following methods: (i) DRL-LR-NB [53]; (ii) PCA-Naïve [54]; (iii) LSTM-RNN [23]; (iv) CNN [55]; (v) DT-KNN [56]; (vi) TRW [57]; (vii) C4.5-DT [58]; (viii) OCSVN-GWO [59]; (ix) KNN-DT-RF [60]; (x) RF-MLPN-LSTM [61]; and (xi) DG-CNN [62]. Figure 10 presents the evaluation scores of these methods in detecting botnets.

Moreover, in order to perform our analysis over datasets, we have implemented three basic ML algorithms, K-Nearest Neighbors (KNN) [63], Decision Tree (DT) [64], and Random Forest (RF) [65], due to the following reasons: (i) KNN is a simple, efficient, and easy-to-apply algorithm used for classification and regression based on similarity scores such as Euclidean Distance. (ii) The DT algorithm supports decisions and potential outcomes with a hierarchical tree structure. It uses directed acyclic graphs, starting from a root node that divides into branches until it reaches leaf nodes, using the entropy coefficient. (iii) RF is a supervised algorithm used for classification and regression. It consists of many decision trees and predicts the final result based on the majority of predictions from all the trees. Figure 11 shows preliminary classification performance in terms of accuracy, recall, F1-score, and precision for all three ML algorithms.
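The three ingredients mentioned above (Euclidean-distance nearest neighbors, entropy as a split criterion, and majority voting across trees) can each be written in a few lines. The sketch below is a didactic illustration on made-up data, not the implementation used in our experiments; all names are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority label among the k training points closest in Euclidean distance."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def entropy(labels):
    """Shannon entropy (in bits) of a label list, as used to score decision-tree splits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def majority_vote(tree_predictions):
    """Random-forest style aggregation: the most frequent prediction wins."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Toy two-feature training set (hypothetical values).
train = [
    ([0.0, 0.0], "benign"),
    ([0.1, 0.2], "benign"),
    ([1.0, 1.0], "bot"),
    ([0.9, 1.1], "bot"),
    ([1.2, 0.8], "bot"),
]
near_bots = knn_predict(train, [1.0, 0.9], k=3)
near_benign = knn_predict(train, [0.05, 0.1], k=3)
mixed = entropy(["bot", "bot", "benign", "benign"])
forest = majority_vote(["bot", "benign", "bot"])
```

A perfectly mixed two-class node has an entropy of one bit, and a decision tree prefers splits that reduce this value the most; a Random Forest then combines many such trees through the vote shown in `majority_vote`.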

As shown in Figure 11, the RF algorithm outperforms the KNN and DT algorithms in terms of all evaluation metrics, including accuracy, precision, recall, and F1-score. This superior performance highlights RF's ability to classify instances with higher reliability and robustness compared to the other models, making it a more suitable choice.

After performing the preliminary analysis using classic ML algorithms, our experiments include both binary and multi-class classifications using our proposed SIR-GN method. Specifically, in the binary classification task, network traffic is categorized as either malicious or legitimate. The performance evaluation results for this classification are presented in Figure 12.
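For reference, the four metrics reported throughout this section can be computed from the binary confusion counts as in the following sketch, where the malicious class is treated as the positive class. The label strings and the toy predictions are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive="malicious"):
    """Accuracy, precision, recall, and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: 3 malicious and 5 legitimate samples, with one error of each kind.
y_true = ["malicious"] * 3 + ["legitimate"] * 5
y_pred = ["malicious", "malicious", "legitimate",
          "malicious", "legitimate", "legitimate", "legitimate", "legitimate"]
scores = binary_metrics(y_true, y_pred)
```

Here accuracy is 0.75 while precision, recall, and F1-score are each two thirds, which shows how accuracy can look better than the per-class metrics when the positive class is the minority.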

Additionally, we conduct two multi-class classification tasks. The first task classifies data into three distinct categories: communication, dissemination, and legitimate, as illustrated in Figure 13. This classification helps in distinguishing between various types of activities based on their nature and intent. The second task involves a more granular classification with four distinct categories, aimed at identifying and differentiating among the Mirai, Bashlite, and Torii botnets, along with a fourth class to capture other unidentified threats. The results of this classification are shown in Figure 14.

On the other hand, we extend our analysis using deep learning algorithms to assess if DL models outperform those using ML algorithms. It should be noted that some experiments do not provide all metrics, which results in some missing scores.

Figure 15 shows a comparative analysis of various ML classifiers based on four performance metrics: accuracy, recall, F1-score, and precision. The classifiers compared include our implementation of ML algorithms (i.e., Our-KNN, Our-DT, and Our-RF), KNN [63], DT [64], and RF [65], along with Cross CNN_LSTM (SNC) [66] and our SIR-GN. The results indicate that SNC and SIR-GN outperform all traditional ML classifiers, achieving accuracy, recall, F1-score, and precision values close to 100%. Among the traditional ML models, RF achieves the highest performance, followed by DT and KNN. Additionally, the customized models (i.e., Our-KNN, Our-DT, and Our-RF) demonstrate improvements over their standard counterparts, suggesting modifications that enhance their classification capabilities.

On the other hand, Figure 16 provides a detailed comparison of DL models by evaluating their performance using the same four metrics: accuracy, recall, F1-score, and precision. The models compared include SNC and SIR-GN, which were also present in Figure 15, and additional deep learning-based architectures (DG-CNN [62] and CNN [55]). The results indicate that SIR-GN achieves the highest performance across all four metrics, closely followed by SNC.

As shown in Figure 15 and Figure 16, DL models significantly outperform traditional ML classifiers, which reflects their superior ability to capture complex patterns in the data. The minimal gap among accuracy, recall, precision, and F1-score indicates that these DL models maintain strong generalization capabilities while minimizing false positives and false negatives.

Finally, to conclude our experimental campaign, we conduct a comprehensive comparative analysis between our proposed inferential SIR-GN model and ABD-GN model [49]. The purpose of this comparison is to evaluate the strengths and limitations of both models in addressing the specific challenges of the botnet detection context.

Figure 17 compares the node representations generated by inferential SIR-GN and ABD-GN, alongside the results obtained from the neural network classifier for different network topologies (see Section 5). Although the SIR-GN model was trained to represent nodes using randomly generated graphs and its classifier was trained on only 50 graphs from each dataset, SIR-GN shows performance closely comparable to ABD-GN across all datasets. In contrast, the ABD-GN neural network model was trained on 80% (768 samples) of the dataset, which highlights the robust performance of inferential SIR-GN even with fewer training samples.

Figure 17 shows the outstanding performance of the ABD-GN model for a single topology. The outcomes, however, range from excellent to very poor when these same trained models are tested against an unseen topology, as seen in Table 2.

When evaluated on all other synthetic datasets, inferential SIR-GN and ABD-GN perform equally when trained on the Kadem topology. Similarly to training on Leet and Chord data, the F1-score for ABD-GN falls below 10%, with no variance in performance on the other synthetic datasets. It is interesting to note that when Debru data are used to train each model, ABD-GN performs poorly in classification across all other topologies in addition to failing to accurately detect bots in real-life datasets, whereas inferential SIR-GN works well in this situation.

As can be seen from Table 2, on the test data produced from the same dataset, SIR-GN and ABD-GN behave similarly when trained on real-life datasets. However, SIR-GN also performs well in classification tests on the four synthetic topologies, whereas ABD-GN significantly underperforms. These results demonstrate that, despite being trained on a much smaller dataset, node representations produced by SIR-GN, combined with a neural network classifier, outperform even the dedicated botnet detection approach on synthetic data. The results also show that the ABD-GN model can only recognize bots in real-life data when it is given prior information about the structure of a particular botnet.

Then, node representations for a combined topology were created using inferential SIR-GN. Training a Random Forest algorithm on these representations results in noticeably superior performance. In Table 2, we show that, in the case of combined botnet attacks, the SIR-GN model offers remarkably accurate classifications on every dataset, including real-life P2P data. Next, using the node representations produced by the inferential SIR-GN model, we examine the training requirements of a neural network classifier. We show that only a modest quantity of data is necessary for transfer learning with inferential SIR-GN, given that it outperformed ABD-GN on real-life botnet classification after transfer learning with 50 graphs.

In Table 3, we demonstrate that, even on real-life test data, training a neural network classifier on structural node representations from only one graph of a synthetic dataset is highly effective.

By comparing the results in Table 2 and Table 3, we can see that transfer learning with a neural network trained on the structural node representations of a single graph is more efficient than transfer learning with ABD-GN using a training set of 80%. Additionally, the inferential SIR-GN model used to create the node representations was itself trained on artificial random graphs, which is also a form of transfer learning.

7.3. Discussion and Remarks

Our proposed framework addresses, investigates, and proposes a proof-of-concept solution to the relevant issue of botnet node detection and analysis. In our opinion, performance and scalability are critical aspects of botnet detection and analysis algorithms, particularly in the context of modern big data challenges (e.g., [67,68,69]). Given that botnet activities generate vast amounts of network traffic with dynamic and evolving structures, datasets in this domain inherently exhibit the classical 3V characteristics (i.e., volume, velocity, and variety). As a result, specialized solutions are required to ensure scalable execution, enabling efficient detection and mitigation of botnets as data volumes grow and network behaviors change.

Beyond this, several open problems remain in the field. In the following, we analyze some of the most important ones in the investigated research context.

Evolving Botnet Architectures. Botnets are constantly evolving, making it difficult to detect and analyze them. Traditional centralized botnets are being replaced by more resilient Peer-to-Peer (P2P) botnets (e.g., [70]), which are harder to detect and take down.
Detection Accuracy. Achieving high detection accuracy while minimizing false positives and false negatives remains a challenge. Machine learning algorithms like XGBoost [71] and Random Forest [72] have shown promise, but there is still room for improvement.
Real-Time Detection. Detecting botnet activities in real-time is crucial to prevent damage. However, real-time detection requires efficient algorithms and significant computational resources (e.g., [73]).
Encrypted Traffic. Many botnets use encryption to hide their communication (e.g., [74]), making it difficult to detect malicious activities. Developing methods to analyze encrypted traffic without compromising privacy is an ongoing challenge.
Scalability. As the number of devices connected to the internet grows, scalable solutions for botnet detection and analysis are needed (e.g., [75]). This includes handling large volumes of data and detecting botnets in diverse environments.
Adversarial Attacks. Botnet creators are constantly developing new techniques to evade detection, such as using adversarial attacks against machine learning models (e.g., [76,77]). Developing robust models that can withstand such attacks is an important area of research.

8. Conclusions and Future Work

We conclude that inferential SIR-GN can produce vectors that represent the structural information of every node in a network, and that an ML technique for botnet identification can be applied on top of these node representations. Combining a neural network classifier with inferential SIR-GN enables the identification of botnets regardless of the topology used for training, which is a noteworthy advantage over models based on prior knowledge of the graph topology. We have also seen how quickly botnet structures can change, mostly as a result of the significant rise in P2P connections, which has increased the variety of botnet topologies. In this setting, in contrast to methods that leverage GNNs, the use of inferential SIR-GN in tandem with a neural network classifier yields a very high botnet detection ability, whereas GNN techniques, which adapt poorly, may not be able to identify previously unknown botnet topologies. A line of future research consists of coupling our framework with emerging big data trends (e.g., [78,79,80,81,82,83,84,85]).

Author Contributions

Conceptualization, A.C.; methodology, A.C.; software, A.H. and C.G.; validation, A.C., A.H. and C.G.; formal analysis, A.C.; investigation, A.C.; resources, A.C.; data curation, A.H. and C.G.; writing—original draft preparation, A.C., A.H. and C.G.; writing—review and editing, A.C.; visualization, C.G.; supervision, A.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by project SERICS (PE00000014) under the MUR National Recovery and Resilience Plan funded by the European Union—NextGenerationEU.

Data Availability Statement

The datasets generated during and/or analyzed during the current study are publicly available.

Acknowledgments

The authors are grateful to Justin Carpenter, Janet Layne, and Edoardo Serra for their contributions to an earlier version of this research.

Conflicts of Interest

The authors declare that they have no conflicts of interest. They also declare that they do not have any competing financial interests.

Figure 1. Case study: big data ecosystem for detection of bot nodes within unknown networks in cloud-based environment.

Figure 2. A Mirai-based bot turning IoT devices into proxy servers.

Figure 3. Client–Server and P2P mode connections.

Figure 4. Log-log distribution of experimental data.

Figure 5. Average node structure of datasets.

Figure 6. Number of dataset devices categorized by traffic type.

Figure 7. Number of dataset packets categorized by traffic type.

Figure 8. Number of features across four different types.

Figure 9. Number of instances across different malware classes.

Figure 10. Evaluation score of state-of-the-art methods for botnet detection in early and late stages.

Figure 11. Classification performance evaluation of ML classifiers.

Figure 12. Binary classification performance of SIR-GN.

Figure 13. Three-class classification performance of SIR-GN.

Figure 14. Four-class classification performance of SIR-GN.

Figure 15. Classification performance of different ML models and SIR-GN.

Figure 16. Classification performance of different DL models and SIR-GN.

Figure 17. Comparative analysis between inferential SIR-GN and ABD-GN across different network topologies.

Table 1. Notations used in model description.

Notation	Description
$n$	The number of clusters chosen for node representation
$ngc$	The number of clusters chosen for graph representation
$k$	The depth of exploration, equal to a node’s $k$-hop neighborhood
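
The depth parameter $k$ in Table 1 refers to a node's $k$-hop neighborhood. As a minimal sketch of that notion only (not the SIR-GN implementation), the set of nodes within $k$ hops can be collected with a breadth-first search. The function name `k_hop_neighborhood`, the adjacency-dictionary format, and the toy graph are hypothetical, and the sketch assumes that only outgoing edges of the directed graph are followed.

```python
from collections import deque


def k_hop_neighborhood(adj, source, k):
    """Return the set of nodes reachable from `source` in at most `k` hops.

    `adj` maps each node to an iterable of its out-neighbors (assumed format).
    """
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            # Nodes at depth k are included but not expanded further.
            continue
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    seen.discard(source)  # the source itself is not part of its neighborhood
    return seen


# Hypothetical toy graph: a -> b -> c -> d
toy_graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
print(sorted(k_hop_neighborhood(toy_graph, "a", 2)))  # ['b', 'c']
```

Larger values of $k$ let the exploration reach more distant structure at the cost of visiting more nodes.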

Table 2. F1-scores of the models trained on one topology and tested on another: comparison between SIR-GN and ABD-GN.

		Tested on
Trained on	Model	Chord	Kadem	Debru	Leet	P2P
Chord	ABD-GN	99.0	97.5	99.6	99.4	0.0
Chord	SIR-GN	99.4	93.0	100.0	99.0	97.0
Debru	ABD-GN	10.0	2.5	100.0	0.0	2.5
Debru	SIR-GN	93.0	94.0	99.5	92.5	97.0
Kadem	ABD-GN	97.0	98.0	99.5	99.0	2.5
Kadem	SIR-GN	99.0	99.0	99.0	99.0	97.5
Leet	ABD-GN	73.0	95.0	100.0	100.0	2.0
Leet	SIR-GN	99.0	99.2	94.5	100.0	98.0
P2P	ABD-GN	15.0	22.5	16.0	17.5	99.5
P2P	SIR-GN	93.0	93.0	93.0	93.0	97.5
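
The entries in Table 2 are F1-scores reported as percentages. As a reminder of how such a score is computed, the sketch below assumes the standard binary definition with bot nodes as the positive class. The function name `f1_score` and the toy labels are hypothetical, and the averaging convention used by the authors is not stated in this section.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = bot node, 0 = benign node
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(round(100 * f1_score(y_true, y_pred), 1))  # 66.7
```

Under this definition, a score of 0.0 means that no bot node was correctly identified on the corresponding test set.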

Table 3. Comparison between SIR-GN and ABD-GN over real-life test data.

		Trained on Chord		Trained on Debru
		1 Graph	50 Graphs	1 Graph	50 Graphs
Chord	SIR-GN	99.5	99.5	92.5	92.5
Chord	ABD-GN	98.7	98.7	90.3	90.5
Debru	SIR-GN	99.9	100.0	99.8	99.8
Debru	ABD-GN	98.0	99.0	98.9	99.1
Kadem	SIR-GN	94.0	96.0	94.0	93.0
Kadem	ABD-GN	92.5	94.0	92.0	92.0
Leet	SIR-GN	92.5	99.0	92.25	92.25
Leet	ABD-GN	88.0	90.0	90.5	90.5
P2P	SIR-GN	98.0	98.0	97.0	97.0
P2P	ABD-GN	95.0	95.0	93.0	90.0
		Trained on Kadem		Trained on Leet
		1 Graph	50 Graphs	1 Graph	50 Graphs
Chord	SIR-GN	99.0	99.0	99.0	99.0
Chord	ABD-GN	98.0	98.0	98.0	98.0
Debru	SIR-GN	98.5	99.0	93.0	95.0
Debru	ABD-GN	95.5	95.5	89.5	90.0
Kadem	SIR-GN	99.25	99.0	99.0	99.25
Kadem	ABD-GN	98.0	98.5	98.5	98.0
Leet	SIR-GN	99.25	99.0	99.5	100.0
Leet	ABD-GN	98.5	98.5	98.0	98.0
P2P	SIR-GN	98.0	97.0	98.0	98.0
P2P	ABD-GN	95.0	95.5	95.0	95.5
		Trained on P2P
		1 Graph	50 Graphs
Chord	SIR-GN	93.0	93.0
Chord	ABD-GN	89.0	90.0
Debru	SIR-GN	93.0	93.0
Debru	ABD-GN	90.0	90.0
Kadem	SIR-GN	93.0	93.0
Kadem	ABD-GN	89.0	90.0
Leet	SIR-GN	93.0	93.0
Leet	ABD-GN	89.0	90.0
P2P	SIR-GN	98.0	98.0
P2P	ABD-GN	95.0	96.0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Cuzzocrea, A.; Hafsaoui, A.; Gallo, C. Detecting and Analyzing Botnet Nodes via Advanced Graph Representation Learning Tools. Algorithms 2025, 18, 253. https://doi.org/10.3390/a18050253
