1. Introduction
With the close interaction and integration between computational and physical resources, cyber-physical systems (CPSs) have emerged and gained widespread attention. Owing to the application of 3C (computation, communication, control) technologies, CPSs can perform real-time sensing and remote control [1]. Therefore, CPSs are widely applied in critical infrastructure control, aerospace systems, military systems, etc. [2]. In CPSs, wireless sensors are widely used due to their flexibility, low power consumption and easy scalability [3]. However, while improving communication efficiency, the transmission of measurement data over wireless networks introduces security risks that can cause significant damage to industrial systems [4]. For example, Sberbank, Russia's largest bank, was hit by its largest-ever DDoS attack, with peak traffic of 450 GB/s, in May 2022, posing a huge threat to its cybersecurity. The frequent occurrence of such malicious cyber attacks has led many scholars and experts to pay close attention to security issues in CPSs. Hence, CPSs under cyber attacks are studied and countermeasures are proposed to ensure the security of such systems [5,6].
In terms of specific attack types, the cyber attacks encountered by CPSs are mainly divided into DoS attacks [7,8], spoofing attacks [9,10] and injection attacks [11,12,13]. Among them, DoS attacks have received particular attention as one of the most frequent and easiest cyber attacks to implement. They mainly prevent remote estimators from receiving and processing sensor data properly by interfering with the communication channel. Existing modeling approaches for the DoS attack problem are mainly divided into attack-constraint modeling and stochastic modeling. The former constrains the duration and switching of DoS attacks. For example, Refs. [14,15] put constraints on the frequency and duration of DoS attacks and focus on finding the maximum attack frequency and maximum attack duration under which system performance is maintained. The latter is a stochastic modeling approach based on the Bernoulli distribution or Markov chains. This paper adopts the latter approach: DoS attacks occur according to a Bernoulli distribution, the attack frequency is random, and each attack lasts one sampling interval.
For CPSs under DoS attacks, numerous scholars have carried out research on state estimation based on measurement sequences and prior system models, using methods such as Kalman filtering; see [16,17,18,19] and the references therein. From the attacker's perspective, most of the existing literature focuses on DoS attack scheduling strategies. In [17], the optimal scheduling of DoS attacks under energy-limited conditions was investigated based on the signal-to-interference-plus-noise ratio (SINR) of the channel. Other works take the defender's perspective and consider the stability of the system under DoS attacks, or propose DoS attack detection methods [18,19,20]. Among them, a method to protect state privacy by maximizing the state estimation error of eavesdroppers for energy-constrained sensors was studied in [20]. Furthermore, Hasnat [18] used the PMU data of the bus to estimate the state of the attacked components of the system under different DoS attack strengths, so as to attenuate the impact of the attacks. Different from the above literature, this paper not only uses state estimation to evaluate the status of CPSs under DoS attacks, but also introduces game theory to study the optimal strategies of both attackers and defenders.
In CPSs, some scholars study the relationship between DoS attackers and system defenders and regard the confrontation between the two as a two-player game [21,22]. In some papers, the interactive game is further formulated as a Markov decision process, and the behavior strategy of the attacker or defender is optimized to meet the needs of its interaction with the environment [23]. In [24], the game between attackers and defenders was transformed into a static Bayesian game to obtain the optimal strategies of both sides. Furthermore, some papers considered how to reach a Nash equilibrium policy in the game between attackers and defenders; at a Nash equilibrium, neither player can improve its own reward by unilaterally changing its strategy while the other keeps its strategy unchanged [25,26,27]. In [26], the interactive game between the sensor and the attacker was formulated as a Markov game, and an improved Nash learning algorithm was employed to obtain a Nash equilibrium of the two sides. Since the reward gained by the attacker comes entirely from the loss of the defender, Ref. [28] treated the game between the attacker and the defender as a zero-sum matrix game and designed a temporal-difference (TD) learning-based algorithm to obtain the optimal attack strategy. Based on the above discussion, a two-player zero-sum deterministic game is introduced in this paper to describe the interactive decision process between the attacker and the sensor. Since existing static game methods cannot fully satisfy the demand for real-time state updates in CPSs, this paper adopts a linear programming method to obtain the Nash equilibrium strategies of both sides as the optimal strategies.
With the development of artificial intelligence, reinforcement learning methods, which focus on how agents learn optimal strategies by interacting with unknown environments, have attracted much attention [29]. In recent research, reinforcement learning has been used to solve the game between attackers and system defenders in CPSs [30,31]. Numerous studies of reinforcement learning algorithms have been carried out in different scenarios. In [32], reinforcement learning is classified into model-based and model-free approaches, and a model-free reinforcement learning method is designed to solve the attacker's security-aware planning problem. Furthermore, the game between attackers and sensors is studied in the open-loop and closed-loop cases in [33], and centralized and distributed reinforcement learning methods are proposed to find the Nash equilibrium of the two sides. However, some limitations of CPSs make it difficult to gather collective information and perform state estimation efficiently. For instance, sensor devices are often small and carry energy-constrained batteries, which are hard to replace in some situations. Moreover, bandwidth resources are limited, which results in network congestion or packet loss as the number of sensor nodes increases. It is therefore necessary to optimize the utilization of system resources by reducing their needless consumption. To date, little attention has been paid to optimal strategies obtained by reinforcement learning from the perspective of channel reliability and resource utilization, which is one of the motivations of this paper.
In summary, this paper investigates the remote secure state estimation problem of CPSs under DoS attacks, where a reinforcement learning algorithm is designed to achieve the Nash equilibrium policies of the two-player game between the attacker and the sensor. The paper develops two scenarios: a reliable channel, in which DoS attacks are the only possible cause of packet loss, and an unreliable channel, in which packets may also be lost for other reasons. The contributions of the paper are as follows: (i) This paper introduces secure state estimation into existing reinforcement learning methods to evaluate the impact of the policies of attackers and defenders on the state estimation; (ii) A two-player zero-sum game is introduced to describe the game between the sensor and the attacker, and the Nash equilibrium strategies are studied to obtain the optimal actions; besides, resource constraints of the sensor and the attacker are considered in the game; (iii) Reinforcement learning algorithms are designed to enable the sensor and the attacker to dynamically learn and adjust their policies during the interaction, where the ε-greedy policy is improved to achieve a balance between exploration and exploitation; (iv) Considering the influence of channel reliability on CPSs, the reinforcement learning algorithm is studied in two scenarios, reliable channels and unreliable channels, and the packet loss probabilities of the two scenarios are compared.
The rest of this paper is organized as follows. Section 2 formulates the system model and introduces some preliminaries on the two-player zero-sum game and the Q-learning algorithm. In Section 3, the state estimation algorithm based on the Kalman filter is designed, and the influence of DoS attacks on state estimation is described. Reinforcement learning algorithms for reliable and unreliable channels are designed in Section 4 and Section 5, respectively. The simulation results of two cases in Section 6 illustrate the effectiveness and efficiency of the reinforcement learning algorithms. Section 7 draws the conclusion and discusses future directions.
4. Reinforcement Learning for Reliable Channel
In CPSs, reliability is an important indicator of system performance [
41]. The measurement data from the sensor are transmitted over a wireless channel, and packets transmitted over a reliable channel will not be corrupted or lost by timeout, and vice versa for an unreliable channel. Assuming an ideal state in which the packets are transmitted in a reliable channel, there is no congestion and timeout, and packets are lost only because of DoS attacks. According to the description of
Section 3.2, the packet loss probability for a reliable channel can be expressed as:
That is, when the sensor chooses to transmit insecurely and the attacker chooses to perform a DoS attack, the packet loss probability is 100% and in other cases the packet loss probability is 0.
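To make this rule concrete, the following minimal sketch encodes it as a function; the 0/1 action encoding (sensor_secure = 1 for secure transmission, attack = 1 for a DoS attack) and the function name are illustrative assumptions, not notation from this paper.

```python
# Minimal sketch of the reliable-channel packet loss rule described above.
def packet_loss_prob_reliable(sensor_secure: int, attack: int) -> float:
    # Packets are lost only when an insecure transmission meets a DoS attack.
    return 1.0 if (sensor_secure == 0 and attack == 1) else 0.0
```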
According to the above assumptions and the existing reinforcement learning framework, an MDP is established to describe the interactive decision-making problem between the sensor and the attacker. The elements of the MDP are described as follows:
State: We denote the finite set of states in the reinforcement learning problem by S. The state of the system is represented by the estimation error covariance at the remote estimator. At time k, the state of the system is determined by the state and by the actions of the sensor and the attacker at time k − 1.
Action: In the zero-sum deterministic game, the sensor needs to choose whether to spend cost for secure transmission. Besides, the DoS attacker chooses whether to implement the DoS attack. It should be noted that the action selections of the sensor and the attacker are independent, that is to say, the action of one player does not affect the other player. According to the description in the DoS attack model, the decision variable of the sensor indicates whether it transmits securely, and the decision variable of the attacker indicates whether it launches a DoS attack. Hence, the sensor and the attacker each have two strategies at every state, so there are four combinations of sensor and attacker actions at each time k. Thus, we denote by A the set of joint actions, which has four elements.
State transition: As mentioned above, the estimation at the remote estimator is closely related to whether packet loss occurs, and the same is true for the state transition. In the reliable channel, a packet is lost when and only when the sensor chooses insecure transmission and the attacker launches a DoS attack. Hence, the state at time k + 1 is determined by the action combination of the sensor and the attacker, and can be obtained according to (10) as follows:
\[ s_{k+1} = \begin{cases} h(s_k), & \text{if the packet is lost}, \\ \bar{P}, & \text{if the packet arrives}, \end{cases} \tag{14} \]
where h(·) denotes the one-step update of the error covariance in (10) and \(\bar{P}\) is the steady-state error covariance.
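As a sketch of the transition (14), the following assumes the standard covariance prediction h(P) = APAᵀ + Q in place of the paper's Eq. (10), which is not reproduced here; all names are illustrative.

```python
import numpy as np

# Sketch of the reliable-channel state transition (14), assuming the usual
# one-step covariance prediction for h in place of the paper's Eq. (10).
def h(P: np.ndarray, A: np.ndarray, Q: np.ndarray) -> np.ndarray:
    return A @ P @ A.T + Q

def next_state_reliable(P, P_bar, A, Q, sensor_secure, attack):
    lost = (sensor_secure == 0 and attack == 1)  # only DoS causes loss here
    return h(P, A, Q) if lost else P_bar         # reset to steady state on arrival
```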
Reward function: A reward function is defined to evaluate the payoff of both the sensor and the attacker during the game. Under the cost setting in Section 3.2, the reward of the system depends on the state of the system as well as on the costs and strategies of the attacker and the sensor. The sensor's goal is to minimize the reward function, and the attacker's goal is the opposite. At time k, the immediate reward r_k can be calculated as:
\[ r_k = \operatorname{Tr}(s_{k+1}) + c_s\,\mathbb{1}\{\text{secure transmission}\} - c_a\,\mathbb{1}\{\text{DoS attack}\}, \tag{15} \]
where c_s and c_a are the costs paid by the sensor for secure transmission and by the attacker for launching an attack, respectively. Since the sensor is the minimizer and the attacker the maximizer, each player's own cost worsens its objective.
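A minimal sketch of this reward under the additive form given above; the trace term and the cost symbols c_s, c_a are assumptions for illustration.

```python
import numpy as np

# Sketch of the immediate reward (15) under the assumed additive form;
# c_s and c_a are the (assumed) secure-transmission and attack costs.
def reward_reliable(P_next, c_s, c_a, sensor_secure, attack):
    return float(np.trace(P_next)) + c_s * sensor_secure - c_a * attack
```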
Discount factor: In the reinforcement learning problem, since we prefer to focus more on the current reward than on future rewards, a discount factor is set to reduce the impact of future rewards on the current state. The discount factor is a parameter between 0 and 1 that imposes a time-based penalty: the farther a reward lies in the future, the more heavily it is discounted, which also helps the algorithm converge faster.
For the two-player zero-sum game between the sensor and the attacker, this paper proposes a Q-learning-based game algorithm to find the Nash equilibrium policies. The algorithm is divided into the following steps. Firstly, the states, the actions and the game matrix based on Q-values are initialized. Secondly, at each time k, the sensor and the attacker choose actions from the game matrix by employing the ε-greedy strategy. Thirdly, the current reward and the next state are obtained based on the current state and the chosen combination of actions. Then, the Q-value matrix is updated, and the game proceeds to the next time step. Finally, the converged Q matrix is obtained and the Nash equilibrium between the sensor and the attacker is observed. The algorithm is described explicitly as follows:
Step 1: Input the system parameters and initialize the system state, the actions and the Q-value matrix. Under the principles set in Section 3.3, given the costs and the error covariance matrix, the system can be determined to have n states; the initial state is taken to be the steady-state error covariance \(\bar{P}\). Since the system has n states and each state has four combinations of sensor and attacker actions, a Q-value matrix with n rows and four columns is initialized. Each entry of the Q-value matrix is initialized to m, where m is chosen large enough that the Q-values are guaranteed to be monotonically non-increasing during learning.
Step 2: At each time k, the sensor and the attacker select their actions with the ε-greedy strategy. According to this strategy, each player selects an action randomly with probability ε and the optimal action with probability 1 − ε.
Remark 2. At the beginning of the iteration, the value of ε is set to be large, that is, the action selection has great randomness. As the algorithm iterates, the value of ε decreases gradually until the set minimum value is reached. The core idea of this method is to strike a balance between exploration and exploitation. Setting a large ε at the beginning of the iteration allows sensors and attackers to choose actions relatively randomly to learn the rewards for each action combination, which is called exploration. At the end of the iteration, sensors and attackers have observed some data, and the value of ε is very small, so they can choose the action that obtains the highest reward based on the existing data, which is exploitation. In this way, better actions are selected in the case of sufficient data collection, achieving a balance between exploration and exploitation.
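A minimal sketch of the decaying ε-greedy selection described in Remark 2; the multiplicative decay schedule and its parameter values are illustrative assumptions.

```python
import random

# Decaying epsilon-greedy selection (Remark 2): explore a lot early, exploit
# late.  The multiplicative decay and the floor value are assumptions.
class EpsilonSchedule:
    def __init__(self, eps0=1.0, eps_min=0.01, decay=0.995):
        self.eps, self.eps_min, self.decay = eps0, eps_min, decay

    def choose(self, actions, best_action):
        a = random.choice(actions) if random.random() < self.eps else best_action
        self.eps = max(self.eps_min, self.eps * self.decay)  # anneal toward the floor
        return a
```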
The optimal action is obtained by calculating the Nash equilibrium using the linear programming method in Lemma 1.
Lemma 1. Let G be the matrix game formed by the Q-values of the four action combinations at the current state, and let its value be v. Computing the optimal strategies of the sensor and the attacker is equivalent to solving the following linear programming problem:
\[ \max_{y,\, v} \; v \quad \text{s.t.} \quad \sum_{i} y_i\, G_{ij} \ge v \;\; \forall j, \qquad \sum_i y_i = 1, \qquad y_i \ge 0, \]
where y is the mixed strategy of the maximizing player; the minimizing player's mixed strategy follows from the dual problem. The probability distribution of the optimal strategy can be obtained by solving this linear programming problem.
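A sketch of Lemma 1 using scipy.optimize.linprog; the variable ordering and the convention that the row player maximizes are implementation choices, not notation from this paper.

```python
import numpy as np
from scipy.optimize import linprog

# Sketch of Lemma 1: solve a zero-sum matrix game by linear programming.
# Decision variables are [y_0, ..., y_{m-1}, v]: the row maximizer's mixed
# strategy y and the game value v.  We maximize v subject to y^T G >= v
# for every column of G.
def solve_matrix_game(G: np.ndarray):
    m, n = G.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                  # maximize v == minimize -v
    A_ub = np.hstack([-G.T, np.ones((n, 1))])     # v - sum_i y_i G[i, j] <= 0
    b_ub = np.zeros(n)
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)  # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]        # v is a free variable
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    y, v = res.x[:m], res.x[-1]
    return y, v
```

Applying the same routine to −Gᵀ yields the minimizing player's equilibrium strategy, since the minimizer of G is the row maximizer of −Gᵀ.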
According to the solution of the linear program, the Nash equilibrium policies of the sensor and the attacker can be obtained, and the two players implement their optimal actions according to this equilibrium.
Step 3: With the current state and the chosen combination of actions of the sensor and the attacker, the reward can be calculated by (15). Meanwhile, the next state is obtained according to (14). Note that when the packet is lost, in accordance with the Markov chain shown in Figure 2, the next state is determined, namely the error covariance moves one step along the chain.
Step 4: We use Q(s_k, a_k) to denote the Q-value function for state s_k and joint action a_k. In order to update the Q-value matrix, the Q-value function is calculated according to the following iteration rule:
\[ Q_{k+1}(s_k, a_k) = (1 - \alpha_k)\, Q_k(s_k, a_k) + \alpha_k \big( r_k + \gamma\, V_k(s_{k+1}) \big), \tag{16} \]
where \(\alpha_k \in (0, 1]\) is the learning rate, which determines the extent to which the results of new attempts are learned, \(\gamma\) is the discount factor, \(r_k\) is the immediate reward, and \(V_k(s_{k+1})\) is the value of the matrix game formed by the Q-values at the next state, obtained by the maxmin operation of Lemma 1.
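In code, the update (16) can be sketched as follows, reusing solve_matrix_game from the Lemma 1 sketch; the convention that rows index the attacker's action is an assumption made for illustration.

```python
# Sketch of the minimax-Q update (16).  Q maps each state to a 2x2 array of
# Q-values over the four action combinations.  By convention here, rows of
# Q[s] index the attacker (the maximizer) and columns the sensor (the
# minimizer), so applying the Lemma 1 LP to Q[s_next] yields its maxmin value.
def minimax_q_update(Q, s, a, r, s_next, alpha, gamma):
    _, v_next = solve_matrix_game(Q[s_next])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * v_next)
```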
Step 5: Determine whether the termination condition is satisfied. If the termination condition is met, the algorithm terminates; otherwise set k = k + 1 and go back to Step 2.
Step 6: After the loop terminates, the converged Q-value matrix is obtained, and the optimal policies based on the Nash equilibrium can be obtained for each state, namely the optimal mixed strategy of the sensor and the optimal mixed strategy of the attacker.
The Q-learning algorithm for the reliable channel is presented in Algorithm 1.
Algorithm 1 Q-learning Algorithm for Reliable Channel
Input: the system parameters A, C; the steady-state error covariance; the costs of the sensor and the attacker; the learning rate α, the discount factor γ and the exploration rate ε.
Output: the optimal Q-value matrix and the Nash equilibrium strategies of the sensor and the attacker.
Initialize: set the initial state, initialize the Q-value matrix with m for all states and action combinations, and set k = 0.
1: while the termination condition is not met do
2:  if a random number in [0, 1) is less than ε then
3:   Choose actions randomly;
4:  else
5:   Find the optimal actions by the linear programming method.
6:  end if
7:  Observe the reward by (15).
8:  Observe the next state according to (14).
9:  Update the Q-value matrix by (16).
10:  Decrease the exploration rate ε.
11:  k = k + 1.
12: end while
13: Return the converged Q-value matrix for all states.
14: Observe the Nash equilibrium strategies of the sensor and the attacker.
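For concreteness, a condensed, runnable sketch of Algorithm 1 assembled from the helper sketches above (EpsilonSchedule, solve_matrix_game, minimax_q_update); the state indexing along the Markov chain, the reward form and all parameter values are the assumptions flagged earlier, not settings from this paper.

```python
import random
import numpy as np

# States are indices 0..n-1 along the chain {P_bar, h(P_bar), h^2(P_bar), ...};
# `traces` holds the trace of the error covariance at each state.
def sample(p):
    """Draw action 0 or 1 from a two-point mixed strategy p."""
    return 0 if random.random() < p[0] else 1

def run_algorithm1(traces, c_s, c_a, alpha=0.1, gamma=0.9, steps=5000):
    n = len(traces)
    Q = {s: np.full((2, 2), 100.0) for s in range(n)}  # large initial value m
    sched, s = EpsilonSchedule(), 0
    for _ in range(steps):
        if random.random() < sched.eps:                # exploration
            j, i = random.randint(0, 1), random.randint(0, 1)
        else:                                          # exploitation: equilibrium play
            y, _ = solve_matrix_game(Q[s])             # attacker mixes over rows
            x, _ = solve_matrix_game(-Q[s].T)          # sensor mixes over columns
            j, i = sample(y), sample(x)
        lost = (i == 0 and j == 1)                     # reliable channel: only DoS loses
        s_next = min(s + 1, n - 1) if lost else 0      # advance the chain or reset
        r = traces[s_next] + c_s * i - c_a * j         # assumed reward form, cf. (15)
        minimax_q_update(Q, s, (j, i), r, s_next, alpha, gamma)
        sched.eps = max(sched.eps_min, sched.eps * sched.decay)
        s = s_next
    return Q
```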
5. Reinforcement Learning for Unreliable Channel
In practical CPSs, the channels over which packets are transmitted are usually unreliable. In this scenario, packet loss can occur for reasons other than DoS attacks, including signal degradation, channel fading and channel congestion. Since the occurrence of packet loss is related to the choices of the sensor and the attacker, the packet loss probability of an unreliable channel can be described as follows:
\[ \xi = \begin{cases} \xi_1, & \text{secure transmission, no attack}, \\ \xi_2, & \text{secure transmission, DoS attack}, \\ \xi_3, & \text{insecure transmission, no attack}, \\ \xi_4, & \text{insecure transmission, DoS attack}. \end{cases} \tag{17} \]
That is, when the sensor chooses insecure transmission and the attacker chooses not to attack, packets may still be lost for other reasons such as channel congestion, so the packet loss probability \(\xi_3\) lies strictly between 0 and 1. Similarly, the packet loss probabilities under the other action combinations can be obtained.
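A small sketch of such a loss-probability table; the numeric values are purely illustrative assumptions, ordered in the spirit of Remark 3 below.

```python
# Sketch of the unreliable-channel loss probabilities (17).  The values are
# illustrative assumptions only: loss caused by a DoS attack on an insecure
# transmission dominates loss from congestion or fading alone.
XI = {
    (1, 0): 0.05,  # secure transmission, no attack: residual channel losses
    (1, 1): 0.10,  # secure transmission under attack
    (0, 0): 0.10,  # insecure transmission, no attack (e.g., congestion)
    (0, 1): 0.80,  # insecure transmission under attack: dominant loss cause
}

def packet_loss_prob_unreliable(sensor_secure: int, attack: int) -> float:
    return XI[(sensor_secure, attack)]
```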
Remark 3. In practical CPSs, these probabilities usually satisfy \(\xi_1 \le \xi_2, \xi_3 \le \xi_4\), with \(\xi_4\) clearly the largest, for the following main reasons. Firstly, the sensor choosing insecure transmission and the attacker choosing to attack both reduce channel security and increase the packet loss probability. Secondly, DoS attacks are the main cause of packet loss, so their impact on the packet loss probability is greater than that of other causes such as channel congestion.
An MDP is set up to depict the interactive process between the sensor and the attacker in the unreliable channel framework. Among the five elements of the MDP tuple, the state, the action and the discount factor are the same as in the reliable channel case; however, the state transition and the reward function are different, as described below.
State transition: When data packets are transmitted over an unreliable channel, the state transition depends not only on whether a DoS attack occurs, but also on the packet loss probability. Under each action combination of the sensor and the attacker, the packet loss probability \(\xi\) is obtained according to (17). The data packet is lost with probability \(\xi\) and arrives with probability \(1 - \xi\), with corresponding arrival indicators 0 and 1, respectively. Hence, the state transition occurs stochastically, which can be described as:
\[ s_{k+1} = \begin{cases} h(s_k), & \text{with probability } \xi, \\ \bar{P}, & \text{with probability } 1 - \xi. \end{cases} \tag{18} \]
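A minimal sketch of sampling this transition, in the state-index convention used in the Algorithm 1 sketch above.

```python
import random

# Sketch of the stochastic transition (18): the chain advances with
# probability xi (packet lost) and resets to the steady state otherwise.
def next_state_unreliable(s: int, n_states: int, xi: float) -> int:
    lost = random.random() < xi
    return min(s + 1, n_states - 1) if lost else 0
```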
Reward function: Since the packet loss probability changes, the reward of the system at time k changes accordingly. Given the state and the action combination, the immediate reward \(r_k\) at time k can be obtained by replacing the error covariance in (15) with its expectation:
\[ r_k = \operatorname{Tr}\!\big(\mathbb{E}[s_{k+1}]\big) + c_s\,\mathbb{1}\{\text{secure transmission}\} - c_a\,\mathbb{1}\{\text{DoS attack}\}, \tag{19} \]
where \(\mathbb{E}[s_{k+1}]\) represents the expectation of the remote estimation error covariance, which is obtained by:
\[ \mathbb{E}[s_{k+1}] = \xi\, h(s_k) + (1 - \xi)\, \bar{P}. \]
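A sketch of the expected covariance and the resulting reward (19), reusing the assumed h and cost terms from the reliable-channel sketches.

```python
import numpy as np

# Sketch of the expected covariance E[s_{k+1}] and the reward (19); h is the
# covariance prediction assumed in the Eq. (14) sketch, and c_s, c_a are the
# assumed cost terms.
def expected_reward_unreliable(P, P_bar, A, Q_noise, xi, c_s, c_a,
                               sensor_secure, attack):
    E_P_next = xi * h(P, A, Q_noise) + (1 - xi) * P_bar
    return float(np.trace(E_P_next)) + c_s * sensor_secure - c_a * attack
```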
Remark 4. The algorithm for unreliable channels differs from the algorithm for reliable channels in the following respects. First of all, the packet loss probability ξ under each combination of actions must be provided as an input to the algorithm. Second, at each time k, after the sensor and the attacker select their actions according to the ε-greedy strategy, an additional step is needed to obtain the packet loss probability from the action combination. Then, the reward is calculated using the expectation of the error covariance, as in Equation (19). Finally, the next state is observed according to Equation (18). The Q-learning algorithm for the unreliable channel is presented in Algorithm 2.
Algorithm 2 Q-learning Algorithm for Unreliable Channel
Input: the system parameters A, C; the steady-state error covariance; the costs of the sensor and the attacker; the packet loss probability for each action combination; the learning rate α, the discount factor γ and the exploration rate ε.
Output: the optimal Q-value matrix and the Nash equilibrium strategies of the sensor and the attacker.
Initialize: set the initial state, initialize the Q-value matrix with m for all states and action combinations, and set k = 0.
1: while the termination condition is not met do
2:  if a random number in [0, 1) is less than ε then
3:   Choose actions randomly;
4:  else
5:   Find the optimal actions by the linear programming method.
6:  end if
7:  According to the actions of the sensor and the attacker, obtain the packet loss probability by (17).
8:  Observe the reward by (19).
9:  Observe the next state according to (18).
10:  Update the Q-value matrix by (16).
11:  Decrease the exploration rate ε.
12:  k = k + 1.
13: end while
14: Return the converged Q-value matrix for all states.
15: Observe the Nash equilibrium strategies of the sensor and the attacker.
Remark 5. Whether the proposed algorithm can be extended to DDoS attacks is also worth investigating. DDoS attacks are distributed denial-of-service attacks, which combine multiple computers into an attack platform in order to disrupt the normal service of a computer or network. The attack–defense game problem under DDoS attacks adds a many-to-one dimension to the problem under DoS attacks. That is to say, DDoS attacks combine multiple attack sources to simultaneously attack a single sensor, and the attack cost and attack intensity of different attack sources may differ. The attacker needs to coordinate multiple attack sources to minimize the attack cost, while the system defender needs to minimize the impact of multi-source attacks. In the future, we will focus on the impact of DDoS attacks on the attack and defense decisions of single-sensor and multi-sensor systems. One of our future directions is to coordinate multiple attack sources and sensors so that both attackers and defenders can make optimal decisions.