1. Introduction
The advent and widespread adoption of new paradigms derived from the Internet of Things (IoT), such as Industry 4.0, Smart Cities, Smart Agriculture, etc., have brought wireless connectivity to the forefront. This ubiquitous connectivity is instrumental in managing vast datasets and facilitating intelligent process automation, which promises substantial societal benefits [1]. Among the key enablers of these technological advancements, wireless sensor networks (WSNs) play a fundamental role [2].
The proliferation of connected devices of all types within WSNs and the massive deployment of this technology in the most diverse environments also bring forth several cybersecurity threats [3,4]. Many of these exploit inherent limitations of WSN devices, particularly concerning their computing resources and energy consumption. Consequently, effectively managing cybersecurity in WSNs necessitates a keen focus on devising and implementing strategies for controlling one of the basic tools in cyber attacks: malware.
In this context, an in-depth understanding of the propagation and infection processes of the different specimens of malicious code used in cyber attacks against WSNs is paramount for designing effective detection and control techniques. As a consequence, the development and analysis of mathematical models that simulate malware spread become indispensable in mitigating its impact.
Most of the mathematical models that have appeared in the scientific literature to simulate malware propagation on WSNs are compartmental (the population of devices is divided into different classes based on their epidemiological state: susceptible, infected, recovered, etc.) and of a global nature. That is, the sole intention is to study the temporal evolution of the size (number of devices) of each of these classes. Consequently, the dynamics of malware are usually described by systems of ordinary differential equations [5]. While these models offer theoretical insights, they suffer from significant limitations by overlooking both complex network topology and node-particular characteristics. Fortunately, these limitations can be addressed by adopting the individual-based paradigm [6]. In this sense, individual-based models consider the unique and particular characteristics of each network node/device, including computing resources, energy consumption, location, activity level, communication attributes, and the explicit contact topology. As a result, they provide a detailed representation of the epidemiological state of each device over time, rather than only the global evolution of each compartment. Regrettably, only a handful of such individual-based models have been proposed [7,8,9].
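For reference, a generic global SIRS model of the kind described above can be written as the system below. The rate symbols (infection rate \beta, recovery rate \gamma, immunity-loss rate \delta) and the mass-action incidence term are assumptions chosen for illustration, not necessarily the form used in [5].

```latex
\begin{aligned}
\frac{dS}{dt} &= -\beta\,\frac{S(t)\,I(t)}{N} + \delta\,R(t), \\
\frac{dI}{dt} &= \beta\,\frac{S(t)\,I(t)}{N} - \gamma\,I(t), \\
\frac{dR}{dt} &= \gamma\,I(t) - \delta\,R(t),
\end{aligned}
\qquad S(t) + I(t) + R(t) = N .
```

Only the compartment sizes S(t), I(t), and R(t) appear in such a system; neither the contact topology nor any node-specific attribute enters the equations, which is the limitation that individual-based models address.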
Additionally, mathematical models can be classified into deterministic and stochastic. These types of models represent two fundamental approaches to understanding malware propagation in wireless sensor networks. Deterministic models are based on fixed and predictable coefficients and relations to simulate the spread of malware. They assume that the behavior of malware and its spreading on network devices follow strict rules, making them suitable for analyzing scenarios with well-defined characteristics. On the other hand, stochastic models deal with randomness and uncertainty, recognizing that real-world malware propagation often involves probabilistic characteristics. In this sense, stochastic models incorporate randomness in both epidemiological coefficients (for example, infection or recovery rates) and structural relations (node interactions), making them more suitable for capturing the unpredictability inherent in the dynamics of malware. These two modeling approaches provide valuable insights into different aspects of malware propagation, enabling researchers and cybersecurity experts to develop effective strategies for prediction, prevention, and mitigation.
These mathematical models can be studied from different perspectives, mainly the following four: qualitative study, numerical study, deep learning-based study, and statistical study. Through qualitative study (for example, the study of ergodicity in the case of stochastic models whose dynamics are described by means of Markov chains), we can determine the nature of steady states and consequently understand the dynamics of the system subject to some conditions, in which the basic reproductive number plays a fundamental role. Once initial conditions are established, the numerical study allows us to obtain simulations and also to explore the possibility of developing ad hoc numerical methods. Very recently, different architectures have been proposed in the field of deep learning to solve problems modeled by differential equations. In this sense, malware propagation models can be analyzed using both differential neural networks and physics-informed neural networks. Finally, to complete the analysis of malware propagation models, we understand that a statistical study of the results derived from simulations is also necessary (and very important). This type of study can provide a detailed view of the variability and uncertainty associated with malware behavior in dynamic environments, especially wireless sensor networks. By considering different network topologies, and taking into account various initial conditions and numerical values of epidemiological coefficients, statistical approaches allow for the identification of hidden patterns and the assessment of the robustness of the proposed models. Furthermore, they facilitate the extraction of valuable information about the effectiveness of containment strategies and the vulnerability of the network.
WSN malware propagation models, whether global or individual, deterministic or stochastic, serve a dual purpose: predicting malware behavior and aiding in the development of control and mitigation strategies. Specifically, individual-based models are valuable not only for their ability to predict both global and individual malware evolution with greater detail but also for generating datasets suitable for machine learning malware detection techniques. Consequently, in addition to qualitative and numerical analyses on individual-based or global models, different types of statistical analyses also offer an intriguing avenue for extracting hidden correlations and meaningful insights from data obtained through simulations [10].
Most statistical malware research focuses mainly on malware detection and identification, without predicting malware propagation trends or the nodes that will subsequently be infected. For example, in [11], Recurrent Neural Networks (RNNs) are used to analyze the execution operation codes (OpCodes) of ARM-based IoT applications, and in [12], RNNs are also used to predict whether an executable is malicious or benign within the first five seconds of execution with 94% accuracy. AMalNet, a robust deep learning framework for Android malware detection, is proposed in [13]. This framework uses graph convolutional networks (GCNs) to capture graphical semantics and independent recurrent neural networks (IndRNNs) to decode deep semantic information. In [14], a graph attention network-based (GAN) framework is proposed to detect malware attacks targeting intelligent transportation systems (ITS). In [15], GCNs are also used to detect Android malware. Few studies have attempted to model the dynamics of malware propagation and predict the transient behavior of these dynamics. In this sense, [10] uses a Bayesian structural time series (BSTS) model to study the data-driven dynamics of malware propagation in cyberspace. In [16], a malware propagation prediction model based on representation learning and GCNs is proposed to predict whether potential nodes will be infected by malware. As far as we know, the use of statistical techniques as a complement to the qualitative and numerical studies mentioned above has not been thoroughly explored. This is the motivation that has driven us to initiate this line of research.
The main goal of this work is to explore the use of two statistical techniques—queuing theory and multivariate analysis or, more specifically, the HJ-Biplot—to characterize the behavior of malware specimens as they propagate on wireless sensor networks (that can be mathematically described by complex networks). As mentioned earlier, and to the best of our knowledge, this approach has not been employed before, making the research relevant from an academic perspective. This statistical analysis is performed using a synthetic dataset generated from several simulations obtained from an individual-based stochastic SIRS compartmental model. In this regard, the results and conclusions obtained are limited to the scope of such examples, leaving the extension to other compartmental models for future developments.
Queuing theory has been applied to various fields. For example, it has been used in telecommunications to analyze Ethernet traffic [17], in logistics and supply chains to analyze online shopping stores [18], and to analyze healthcare systems [19]. It has also been used in malware analysis to propose probabilistic models that aim to describe the overall behavior of the system when attacked by malicious nodes. In [20], a probabilistic model based on the susceptible–infected–susceptible (SIS) paradigm and on the theory of closed queuing networks is proposed to analyze the spread of malicious software in ad hoc networks. Queuing theory enhanced with stochastic analysis is employed to overview the dynamics of epidemic cases in [21] and to assess cyber disruptions and attacks to the national airspace system [22].
As far as biplots are concerned, they have also been used to analyze data from different fields of knowledge, such as hydrology [23], transport networks [24], sustainability [25,26], healthcare [27], and the chemosphere [28], but never to analyze malware propagation in wireless sensor networks. The main advantage of biplot methods is that they do not rely on parametric assumptions and are specifically designed for the representation of multivariate data.
Our analytical approach, combining queuing theory and HJ-Biplot with an SIRS model, represents a novel perspective on understanding malware propagation in wireless sensor networks. In connecting our results with previous studies, we emphasize the departure from traditional methods and highlight the potential benefits of our unique combination of analytical tools. By referencing established literature on malware dynamics, we underscore the innovative aspects of our work, positioning it within the evolving landscape of network security research. This integration serves not only to validate our methodology but also to contribute fresh insights to the existing body of knowledge.
Beyond the immediate scope of malware propagation, our results hold broader implications for epidemic modeling and network security. We explore how the principles derived from our study can be applied to understand the dynamics of other infectious phenomena in diverse networked environments. By considering the broader context, we underscore the versatility of our approach and its potential relevance to fields beyond malware research.
Our current work lays the theoretical groundwork for analyzing malware propagation using queuing theory and HJ-Biplot with an SIRS model. Future research should focus on empirical validation using real-world datasets to assess the model’s effectiveness in practical scenarios. Another avenue for further exploration involves a detailed sensitivity analysis of the model parameters, since investigating the impact of variations in key parameters on the model outcomes will provide a more nuanced understanding of the dynamics of malware propagation. Additionally, the generalizability of our approach to diverse topologies is an exciting prospect for future research, as exploring how the model performs in various network structures will enhance its applicability across different wireless sensor network configurations.
The rest of the paper is organized as follows:
Section 2 briefly describes the SIRS models (deterministic and stochastic) and the simulations performed to generate the datasets used in the analysis.
Section 3 presents the methodology used: closed queuing networks and HJ-Biplot.
Section 4 presents the results of applying the above methodology to the datasets obtained from the simulations.
Finally, Section 5 discusses the main conclusions.
2. SIRS Model and Data Description
2.1. SIRS Models
Two mathematical models, one deterministic and the other stochastic, for malware propagation on WSNs are used to generate the datasets employed in the statistical analysis. They are compartmental, individual-based models in which the devices are classified into three different compartments: susceptible S (non-infected devices), infectious I (devices reached by malware), and recovered R (devices within a temporary immunity period). As a recovered device is endowed with only temporary immunity, the dynamics of the models correspond to the SIRS type.
The SIRS models employed in this study are well suited to the current problem because the SIRS scheme closely mimics the dynamics of infectious diseases and, in this context, effectively represents the propagation and recovery dynamics of malware. The deterministic model allows for a precise mathematical description, while the stochastic model introduces randomness, acknowledging the inherent uncertainties in real-world scenarios. The selection of the SIRS models was therefore motivated by both their relevance to real-world malware dynamics and the practical constraints of working with a “simplified” dataset.
The communication topology of the wireless sensor networks determines the contact topology of nodes/devices; that is, the set of devices for which there exists a communication link with the
i-th device is called its neighborhood and is denoted by
. The state of the
i-th device at timestep
t is denoted by
; that is:
The deterministic local transition function that describes the dynamics of the deterministic individual-based model is the following:
If
, then
If
, then
If
, then
where is the set of infectious neighbors of the i-th node at timestep t, is the degree of the i-th node, is a threshold parameter that depends on the particular characteristics of the i-th node at the step of time t, (resp. ) is the length of the infectious (resp. immunity) period associated with the i-th node, and (resp. ) is the number of timesteps that the i-th node has been in the infectious (resp. immunity) period at timestep t.
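A minimal sketch of how a deterministic local transition function built from these ingredients could be coded is given next. It assumes a simple threshold rule (a susceptible node becomes infectious when the fraction of its neighbors that are infectious exceeds its threshold) and fixed-length infectious and immunity periods. The rule itself, the string encoding of states, and all function and parameter names are assumptions made for illustration only.

```python
def deterministic_step(states, timers, neighbors, thresholds, inf_len, imm_len):
    """Advance every node by one timestep (synchronous update).

    states[i]     -- "S", "I" or "R" (hypothetical encoding)
    timers[i]     -- timesteps node i has spent in its current I or R period
    neighbors[i]  -- list of nodes that share a communication link with node i
    thresholds[i] -- assumed threshold on the fraction of infectious neighbors
    inf_len[i], imm_len[i] -- lengths of the infectious / immunity periods
    """
    new_states, new_timers = list(states), list(timers)
    for i, state in enumerate(states):
        if state == "S":
            degree = len(neighbors[i])
            infectious = sum(1 for j in neighbors[i] if states[j] == "I")
            # Assumed rule: infection once the infectious fraction exceeds the threshold.
            if degree > 0 and infectious / degree > thresholds[i]:
                new_states[i], new_timers[i] = "I", 0
        elif state == "I":
            new_timers[i] = timers[i] + 1
            if new_timers[i] >= inf_len[i]:      # infectious period is over
                new_states[i], new_timers[i] = "R", 0
        else:  # state == "R"
            new_timers[i] = timers[i] + 1
            if new_timers[i] >= imm_len[i]:      # temporary immunity is over
                new_states[i], new_timers[i] = "S", 0
    return new_states, new_timers
```

Because the new state of every node is computed from the states at the previous timestep, the update is synchronous, and repeated runs from the same initial configuration always give the same trajectory.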
On the other hand, the stochastic local transition function employed is defined as follows:
If
, then
If
, then
If
, then
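As an illustration, a stochastic local transition function of this kind could be sketched as follows. The sketch assumes three probabilities (infection, recovery, and loss of immunity, named `p_inf`, `p_rec`, and `p_loss` here) and assumes that each infectious neighbor transmits independently. These names and the transmission rule are hypothetical and are not taken from the model definition.

```python
import random


def stochastic_step(states, neighbors, p_inf, p_rec, p_loss, rng):
    """One synchronous update; states are the strings "S", "I", "R"."""
    new_states = list(states)
    for i, state in enumerate(states):
        if state == "S":
            # Assumed rule: each infectious neighbor transmits independently
            # with probability p_inf, so infection occurs with 1 - (1 - p_inf)^k.
            k = sum(1 for j in neighbors[i] if states[j] == "I")
            if rng.random() < 1.0 - (1.0 - p_inf) ** k:
                new_states[i] = "I"
        elif state == "I":
            if rng.random() < p_rec:     # recovery
                new_states[i] = "R"
        else:
            if rng.random() < p_loss:    # loss of temporary immunity
                new_states[i] = "S"
    return new_states
```

Passing a seeded generator, for example `random.Random(42)`, makes an individual run reproducible while still allowing many independent realizations to be generated by changing the seed.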
2.2. Data Description
The SIRS model is used to perform the simulations. These simulations provide the datasets used in the proposed statistical analysis. In all of them, the number of devices is , and the simulations are computed during a period of 100 units of time: . Furthermore, it is assumed that there is only one infectious device (“patient zero”) at , which in all cases will be the device with the highest betweenness centrality. Different types of complex networks—taking into account their degree distributions—are considered to carry out the simulations, namely, full networks, random networks, scale-free networks, and small-world networks. For random networks, ER connecting probabilities of and are considered. The Barabási–Albert (BA) algorithm with values of 3 and 5 for the m parameter is used to generate the scale-free network and, in small-world networks, probabilities of and for the occurrence of a new edge are considered. The choice of patient zero was made on the basis that this assumption would favor a higher initial rate of spread of the epidemic.
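The selection of patient zero described above can be sketched with the standard library alone. The example below builds an Erdős–Rényi random graph and picks the node with the highest betweenness centrality, computed with Brandes' algorithm for unweighted graphs. The function names, the network size, and the connection probability passed by the caller are placeholders for illustration; a graph library such as NetworkX offers equivalent routines.

```python
import random
from collections import deque


def erdos_renyi(n, p, rng):
    """Undirected G(n, p) random graph as an adjacency list."""
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes, unweighted graph)."""
    n = len(adj)
    score = [0.0] * n
    for s in range(n):
        order, preds = [], [[] for _ in range(n)]
        sigma = [0] * n          # number of shortest paths from s
        dist = [-1] * n
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:             # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = [0.0] * n
        for w in reversed(order):    # accumulate dependencies backwards
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return [x / 2.0 for x in score]  # each undirected pair was counted twice


def patient_zero(adj):
    """Index of the node with the highest betweenness centrality."""
    scores = betweenness(adj)
    return max(range(len(adj)), key=scores.__getitem__)
```

A call such as `patient_zero(erdos_renyi(50, 0.1, random.Random(1)))` returns the index of the initially infectious node for one random network; ties are broken in favor of the lowest index.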
Furthermore, deterministic and stochastic local transition functions are employed. When deterministic transition functions are used, the values of the parameters considered are and , as a random integer between 2 and 7 and between 5 and 9, and as a random integer between 3 and 5 and between 4 and 6. If the transition function used is of the stochastic type, the parameter values considered are the probability and , probability and , and probability and . It is important to note that the selection of parameter values listed above is for illustrative purposes only.
By modifying all the above parameters (type of complex network, transition probabilities, length of periods), 56 stochastic and 56 deterministic simulations are carried out. The result of each simulation is a data matrix that stores the state of node i at time unit t (). Therefore, the methodology based on closed queuing networks is applied to a sequence of data matrices, , with , with , generated by a stochastic SIRS model, and another sequence generated by a deterministic SIRS model, each of them obtained by varying the parameters in the simulations. From now on, we will label these matrices as , and we will represent our dataset made of K matrices as .
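To make the structure of one such data matrix concrete, the sketch below summarizes a matrix of node states into the size of each compartment at every time unit. The encoding of states as the strings "S", "I", and "R" and the function name are assumptions for illustration.

```python
def compartment_counts(matrix):
    """matrix[i][t] is the state ("S", "I" or "R") of node i at time unit t.

    Returns one dictionary per time unit with the size of each compartment.
    """
    horizon = len(matrix[0])
    counts = []
    for t in range(horizon):
        column = [row[t] for row in matrix]          # states of all nodes at time t
        counts.append({c: column.count(c) for c in "SIR"})
    return counts
```

This recovers the global, compartment-level view from the individual-level data, so the same matrix can feed both kinds of analysis.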
4. Results
Let us now present a precise description of the results, specifically the plots obtained by applying the HJ-Biplot to the two datasets. These datasets contain information on the propagation and infection dynamics of various simulated malware types, each of which follows either a deterministic or a stochastic SIRS model. In this section, we also interpret these findings from an epidemiological perspective and draw meaningful conclusions.
Figure 5 and Figure 6 show three groups of deterministic SIRS models according to their propagation dynamics. The first group, located on the left side, comprises malware propagation models characterized by high values of three performance measures: the number of nodes in the susceptible state, the time spent in the susceptible state, and the probability that all nodes are susceptible. These models may aptly be labeled as “passive” since they do not induce nodes to transition to the infectious or recovered states; rather, they predominantly keep the nodes in a susceptible state. The second group, situated in the bottom right corner, is characterized by a high number of infectious nodes and a high probability that all nodes are in the infectious state. The third group, positioned in the top right corner, includes models with a high number of nodes in the recovered state and a high probability that all nodes have recovered. These models are further distinguished by rapid transitions from the recovered to the susceptible state and from the infectious to the recovered state, as indicated by high values of the two corresponding performance measures. In summary, models in the second and third groups infect nodes swiftly; the difference is that those in the second group keep nodes in the infectious state, while those in the third group enable nodes to recover quickly.
The performance measures for the time that nodes spend in the infectious state and in the recovered state do not exhibit significant correlations with any group of malware propagation models. This is evident from the angles they form with the other measures and groups, which are approximately 90°. In other words, the behavior of nodes is not notably influenced by how long they remain in the infectious or the recovered state.
Regarding the relationship between these groups, malware models in the first, “passive” group exhibit strong negative correlations with the other two groups, namely those that keep nodes in an “infectious” state and those that facilitate a “quick recovery” of nodes (forming angles close to 180° with these groups). This observation can be interpreted as follows: “passive” models are characterized by high values of the three susceptible-state measures, whereas models that keep nodes “infectious” or enable them to “recover quickly” are characterized by a high value of the measure of transition speed from the susceptible to the infectious state. A “passive” model will therefore have a low value of this measure (a slower transition of nodes from susceptible to infectious). Conversely, a model that keeps nodes “infectious” or promotes “quick recovery” will have low values of the three susceptible-state measures: almost no nodes are susceptible, any that are spend minimal time in that state, and it is highly unlikely that all nodes are susceptible.
Furthermore, groups 2 and 3 exhibit a strong positive correlation with each other and with the measure of transition speed from the susceptible to the infectious state, highlighting their common characteristic of facilitating rapid transitions of nodes from the susceptible to the infectious state.
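The reading of angles used in this interpretation rests on a standard property of the HJ-Biplot: the cosine of the angle between two variable markers approximates the correlation between the corresponding variables. The helper below, with a hypothetical name and hypothetical two-dimensional marker coordinates supplied by the caller, computes that angle.

```python
import math


def marker_angle_deg(a, b):
    """Angle in degrees between two biplot marker vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / norm))   # guard against rounding error
    return math.degrees(math.acos(cosine))
```

Angles near 0° then suggest a strong positive correlation, angles near 90° suggest no linear relationship, and angles near 180° suggest a strong negative correlation, provided the markers are well represented in the plotted plane.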
Again, the different malware propagation models are sorted into three large groups in the stochastic SIRS model plots (Figure 7 and Figure 8). These plots are interpreted in terms of the groups that the models form, the angles between the performance measures, and the relationships between the groups and the measures [31].
A detailed description of the three groups follows. The first group, situated in the bottom left corner, consists of models characterized by a high number of nodes in the susceptible state, a high probability that all nodes are susceptible, and swift transitions from the recovered to the susceptible state. This group resembles the first, “passive” group observed in the deterministic SIRS model plot, with the key distinction that there is a high rate of transition to the susceptible state. Thus, we can still classify these models as “passive”, but with the added feature of a high rate of transition to susceptibility. The second group, positioned in the top left corner, is characterized by a high number of nodes in the recovered state, a high probability that all nodes are recovered, and swift transitions from the susceptible to the infectious state and from the infectious to the recovered state. This group resembles the third, “quick recovery” group from the deterministic SIRS model plot, except that here the speed of transition from recovered to susceptible is not high; instead, nodes quickly transition from the infectious to the recovered state. Lastly, the third group is located on the right side. In these models, nodes are predominantly in the infectious state, and there is a high probability that all nodes are infectious. This is analogous to the second, “stay infectious” group from the deterministic SIRS model plot. However, in this scenario, the nodes spend a significant amount of time in the infectious state, and the speed of transition from susceptible to infectious does not significantly affect their behavior.
In the context of the stochastic SIRS model, the performance measures for the time that nodes spend in the susceptible state and in the recovered state are not significantly correlated with any group of models. This suggests that the behavior of nodes is not substantially influenced by how long they remain susceptible or recovered.
In contrast to the deterministic case, the stochastic SIRS model plot shows a different pattern of relationships between the groups of models. Specifically, models in the first group, the “passive” category, are highly negatively correlated only with the third group, which comprises models that keep the nodes “infectious”. Groups 2 and 3 also exhibit strong negative correlations with each other. Interestingly, “passive” models and models that promote “quick recovery” are positively correlated.
Once the HJ-Biplots have been interpreted to characterize the existing models, they can also be used to make predictions about new, unknown models. Suppose there is another data matrix containing information about an epidemiological process attributed to an unknown model, and the goal is to identify which model is responsible for it. In this scenario, one can plot the new matrix on both the deterministic and the stochastic SIRS model plots and assess which of the original models most closely resembles the unknown one. This approach yields two predictions regarding the model that initiated the observed process: one based on the most similar deterministic SIRS model and another based on the most similar stochastic SIRS model.
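One simple way to make "closest resemblance" operational is sketched below, assuming that the new matrix has already been projected onto the biplot plane and that similarity is measured by Euclidean distance between markers. The distance criterion, the function name, and the model labels are assumptions for illustration, not a prescribed procedure.

```python
import math


def closest_model(new_point, model_points):
    """Label of the known model whose biplot marker is nearest to new_point.

    model_points maps a model label to its (x, y) coordinates in the plot.
    """
    def distance(label):
        return math.sqrt(sum((a - b) ** 2
                             for a, b in zip(new_point, model_points[label])))
    return min(model_points, key=distance)
```

Applying the helper once to the deterministic plot and once to the stochastic plot gives the two predictions mentioned above.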
5. Conclusions
The results of our study hold significant implications for the field of computer security and computational epidemiology. The understanding of malware propagation dynamics in different environments is crucial for developing robust cybersecurity measures. These findings could aid in the creation of more effective strategies to counteract malware propagation, enhancing the security of computer networks and systems.
The generated HJ-Biplots serve as powerful tools for the classification and prediction of new, unknown malware models. By plotting data from an unidentified model onto two distinct epidemiological contexts—deterministic and stochastic SIRS models—one can assess which of the original models closely resembles the new, unknown model. This dual-approach prediction methodology contributes to the potential identification of the responsible malware model in a given epidemiological scenario, thereby enhancing the preparedness of cybersecurity professionals.
In this comprehensive study, we have undertaken an in-depth analysis of the propagation and infection dynamics of simulated malware across both deterministic and stochastic SIRS models. Through the utilization of HJ-Biplots, we have generated insights into the behavior and characteristics of these malware types, yielding valuable knowledge with relevance to cybersecurity and computational epidemiology.
Our exploration began with a close examination of malware propagation in the context of a deterministic SIRS model. This analysis unveiled a division of the malware models into three discernible groups, each possessing unique attributes that profoundly impact propagation dynamics. The first group, labeled as “passive” models, predominantly keeps nodes in a susceptible state, with extended susceptibility periods. The second group is characterized by the rapid transition of nodes to the infectious state, where they remain. The third group places many nodes in the recovered state and exhibits swift transitions from the infectious to the recovered state and from the recovered to the susceptible state. The relationship between these groups unveils a complex interplay that sheds light on the diverse mechanisms behind malware propagation.
Moving on to the stochastic SIRS model, we conducted a similar analysis, once again discovering three predominant groups within the malware models. These groups are distinguished by the number of nodes in specific states and the speed of state transitions. The first group is similar to the “passive” group in deterministic models but involves a high rate of transition to susceptibility. The second group enables quick transitions from the infectious state to recovery, while the third group primarily maintains nodes in an infectious state, with little sensitivity to state transition speeds. The interplay among these groups provides further insights into the stochastic malware propagation landscape.
One of the most intriguing findings pertains to the relationships between these groups. In deterministic models, “passive” models exhibit a negative correlation with the groups that keep nodes “infectious” or facilitate “quick recovery”. This suggests that the tendency to keep nodes in the susceptible state is inversely related to the speed at which nodes move from the susceptible to the infectious state, which is high in the other two groups. Interestingly, positive correlations exist between the groups that keep nodes “infectious” and those that facilitate “quick recovery”, signifying a shared characteristic in terms of facilitating rapid transitions.
It is important to recognize certain inherent limitations of this study that must be taken into account when interpreting its results. Firstly, the validity and generalizability of our conclusions may be affected by the size of the dataset used to build the malware evolution prediction model. In addition, it should be noted that this study relied exclusively on two individual-based SIRS propagation models. Future research could address these limitations by incorporating larger datasets obtained by manipulating a greater number of parameters and exploring different propagation models to improve the robustness and comprehensiveness of the results obtained. As mentioned in the introduction, this is an initial exploration of the uses of statistical techniques to better understand the underlying mechanisms of malware propagation dynamics. Two techniques have been employed (queuing networks and HJ-Biplot), but it is clear that we should also consider, as another future research line, extending the study using additional statistical methods.