1. Introduction
Cyber–physical systems (CPSs), which integrate computational capabilities with physical processes, have found extensive application across a range of industrial operations and critical infrastructure [
1]. However, unreliable wireless communication networks render these systems highly susceptible to malicious cyber attacks, such as false-data injection (FDI), eavesdropping, and denial-of-service (DoS) attacks, which pose severe threats to their operational security and reliability [
2].
In CPSs, security against malicious attacks, especially DoS attacks, has become a critical concern [
3,
4]. Game theory has emerged as a widespread and powerful analytical tool for modeling such attacker–defender interactions [
5]. However, most existing game-theoretic studies are built upon the Nash equilibrium concept, which assumes simultaneous decision-making among players [
6]. This assumption does not adequately capture the hierarchical structure often present in security scenarios, where a defender typically acts first and an attacker responds accordingly. To address this, Stackelberg games have been introduced, providing a more realistic framework for sequential decision-making in layered security strategies [
7,
8]. Furthermore, the majority of existing Stackelberg models rely on the assumption of complete information, where all players have perfect knowledge of each other’s parameters and payoffs. This is often impractical in real-world communication environments because some critical attributes such as channel conditions, interference gains, or attacker capabilities may not be fully accessible to the defender. The lack of such information significantly affects the strategic design and performance of defensive measures. Although some recent studies have begun to address information asymmetry, their modeling of uncertainty remains limited. For instance, most of these works focus on single-type uncertainty or assume that the attacker’s channel gain follows a simple two-valued distribution [
9]. Such simplifications fail to capture the practical situation where the attacker’s fading channel gain may follow a multi-type probability distribution.
Motivated by these gaps, this paper investigates remote state estimation under SINR-based DoS attacks within a Bayesian Stackelberg game framework. It is worth noting that our work focuses on the strategic response and resource allocation following the detection or presumption of an attack, rather than on the attack detection mechanisms themselves [
10]. While Bayesian Stackelberg models have been studied for general network security, this work distinctively integrates this framework into the SINR-based remote state estimation to handle multi-type channel gain uncertainty and further develops a model-free Stackelberg Q-learning algorithm to derive optimal strategies under this specific information structure. In the incomplete information setting, probabilistic distribution information is incorporated to model the uncertainty of the attacker’s fading channel gain, and a sequential power allocation strategy is developed to balance estimation accuracy and energy efficiency. The proposed approach not only extends current Stackelberg-based security analysis to more realistic communication scenarios but also offers implementable learning-based strategies for deriving equilibrium policies under information asymmetry. The main contributions of this paper are summarized as follows:
We study the strategic interaction between a sensor and a DoS attacker under energy constraints in an SINR-based remote state estimation system with unknown channel gain information. The sequential decision-making process is formulated as a finite-state and finite-action MDP, capturing the dynamic and uncertain nature of the environment.
To address the incomplete information regarding the attacker’s channel gain, we model the interaction as a BSG, where the attacker is treated as having multiple possible types. Within this framework, we define the best-response strategies, prove the existence of an SE, and analytically derive the SE strategies for both players.
A Stackelberg Q-learning algorithm is used to enable both players to learn their optimal strategies without prior knowledge of the opponent’s payoff structure. This model-free approach ensures adaptability and practicality in real-world settings. Finally, numerical simulations validate the effectiveness and robustness of the proposed BSG-based method, demonstrating its clear advantage over conventional Nash-based approaches under information asymmetry.
The remainder of this article is organized as follows.
Section 2 reviews related work on secure state estimation under cyber-attacks.
Section 3 establishes the system model for remote state estimation under SINR-based DoS attacks with incomplete channel gain information.
Section 4 formulates the sequential decision-making process as a two-player Markov Decision Process (MDP). Building upon the MDP,
Section 5 constructs the Bayesian Stackelberg Game (BSG) framework and analyzes the Stackelberg equilibrium.
Section 6 designs a Stackelberg Q-learning algorithm to compute the optimal strategies for both players.
Section 7 provides simulation results to verify the effectiveness of the proposed approach. Finally, conclusions are drawn in
Section 8.
2. Related Work
Accurate and secure state estimation serves as the cornerstone of reliable decision-making in CPSs. However, the widespread dependence on wireless networks for data transmission exposes the estimation process to various cyber attacks, including FDI, eavesdropping, and DoS attacks. These attacks compromise data integrity, availability, and confidentiality [
11]. These attacks can deliberately degrade estimation accuracy by corrupting, intercepting, or blocking data transmissions, which, in turn, destabilize system operation and threaten physical security [
12,
13].
DoS attacks, especially those targeting the signal-to-interference-and-noise ratio (SINR) of wireless channels, are the most disruptive to remote state estimation [
4,
14,
15]. The sensor, as the defender, aims to enhance the system performance with less transmission power. In contrast, the DoS attacker seeks to degrade estimation accuracy and force the sensor to expend more communication resources while minimizing its own attack cost. To describe such adversarial dynamics between attackers and defenders, game theory has emerged as a powerful analytical tool for the interactive decision-making process [
16,
17,
18,
19]. With energy constraints for both the sensor and the attacker, a Nash equilibrium (NE) of a zero-sum game was proven in [
20] to be the optimal strategies for both sides. Then, the authors in [
21] proposed a Markov game framework under SINR-based DoS attacks and applied a modified Nash Q-learning algorithm to obtain the optimal solutions. For multi-channel networks, a two-player zero-sum stochastic game was formulated to design the mixed strategies for the channel selection of the sensor and DoS attacker [
22].
It is worth pointing out that most works on attack–defense games adopt NE as the game solution, based on the assumption that the defender and the attacker choose their actions simultaneously; this is not applicable when the players deploy their strategies sequentially. Consequently, the Stackelberg game, with its explicit “leader–follower” structure, offers a more appropriate and realistic framework for capturing hierarchical interactions [
23]. For the SINR-based fading wireless network, a Stackelberg equilibrium (SE) within a two-player nonzero-sum Markov game framework was constructed in [
24] to obtain the transmission and interference strategies for the sensor and attacker, respectively. Using the Stackelberg game method, the linear quadratic Gaussian (LQG) control strategy was analyzed in [
25,
26] for FDI attacks and DoS attacks, respectively. For distributed Kalman filtering, a Stackelberg game-based distributed reinforcement learning algorithm was developed to produce joint optimal strategies based on local observation information [
27]. Within the Stackelberg game framework, an optimal stealthy robust attack method was designed based on Wasserstein ambiguity sets [
28].
The aforementioned Stackelberg-based studies rely on the assumption of complete information, which rarely holds in practical wireless communication scenarios. For the players in such games, crucial environmental information and the opponent’s key attributes are often difficult to obtain, such as energy budgets, acknowledgment (ACK) information, and channel gains of the wireless network. An anti-jamming Bayesian Stackelberg game was proposed in [
29], where the uncertainties of the channel state information and transmission cost information were incorporated. For the problem of multi-channel power scheduling under SINR-based DoS attacks, the SE was studied under two types of incomplete information, where the existence of the attacker and the total power of the attacker were, respectively, unknown to the defender [
9]. For a multi-hop network under DoS attacks, the defender had no access to the attacker’s available energy; a Bayesian Stackelberg game was therefore formulated, and a Stackelberg Q-learning algorithm was presented to obtain the SE [
30]. Furthermore, a stochastic Bayesian game was formulated in [
31], where the ACK information from the remote estimator to the sensor was hidden from the attacker. While game theory provides a robust framework for strategic analysis, it is also worth noting that formal methods have been extensively explored for the verification and analysis of CPSs under attacks [
32]. In contrast to these verification-based approaches, this paper emphasizes the adaptive decision-making process in scenarios where the sensor faces incomplete information about an attacker’s characteristics.
3. Problem Formulation
Notations: denotes the set of nonnegative integers. and are the n-dimensional Euclidean space and set of all matrices, respectively. and represent the sets of real symmetric positive semi-definite and positive definite matrices, respectively. If , we simply write ( if ). Notation means that the matrix is negative definite. is the probability of a random event. and stand for the expectation and covariance of a random vector, respectively. For functions g, h with appropriate domains, represents the function composition , with .
In this section, we introduce the remote state estimation system under DoS attack depicted in
Figure 1. The local sensor estimates the system states by the Kalman filter and then transmits the estimates to the remote estimator through the signal-to-noise ratio (SNR)-based network. However, the network is unreliable and may be attacked by a DoS attacker.
3.1. System and Sensor Model
Consider the discrete-time linear time invariant system:
where
is the system state and
is the measurement output. The process noise
and the measurement noise
are assumed to be independent white Gaussian noises with covariance matrices
and
, respectively. The initial state
is zero-mean Gaussian with covariance
, and independent of
and
for all
. The pair
is detectable.
The smart sensor runs a local Kalman filter to estimate the system states based on the measurement set up to the current time
k, that is,
. The local minimum mean-squared error (MMSE) estimate of the system state
and the corresponding estimation error are, respectively, defined as
We define the Lyapunov and Riccati operators
h and
g:
as follows:
The estimation error covariance of the Kalman filter converges to a unique value irrespective of the initial condition. For simplicity, it is assumed that the local Kalman filter has reached the steady state, and we let
where
is the steady-state error covariance given by the unique positive semidefinite solution of
.
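As a concrete illustration of this fixed point, the steady-state error covariance can be computed by iterating the Riccati operator g until convergence. The sketch below uses a hypothetical scalar system; all parameter values are illustrative, not the paper's:

```python
def riccati_step(P, A, C, Q, R):
    """One Riccati update g(P) for a scalar system: prediction (the
    Lyapunov step h) followed by the measurement correction."""
    Pp = A * P * A + Q                    # h(P): prediction step
    K = Pp * C / (C * Pp * C + R)         # Kalman gain
    return (1 - K * C) * Pp               # corrected covariance

def steady_state_covariance(A, C, Q, R, tol=1e-12, max_iter=10000):
    """Iterate g until it reaches the unique fixed point P = g(P),
    irrespective of the initial condition."""
    P = Q
    for _ in range(max_iter):
        P_next = riccati_step(P, A, C, Q, R)
        if abs(P_next - P) < tol:
            return P_next
        P = P_next
    return P
```

For a detectable scalar pair (here A = 1.2, C = 1), the iteration converges quickly because g is a contraction near the fixed point.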
3.2. Communication Channel and Attack Model with Incomplete Information
After obtaining the state estimate , the sensor transmits the estimate to the remote estimator in the form of a data packet. However, random data packet drops may occur owing to channel fading and interference. Here, we assume that the sensor communicates with the remote estimator over an additive white Gaussian noise (AWGN) network to model this scenario. The packet dropout probability at time k is described by , where is the fading channel gain of the sensor, is the transmission power of the sensor at time k, and is the additive white Gaussian noise power.
Considering a DoS attacker against the channel, the SNR is revised as the following SINR:
where
is the fading channel gain of the attacker;
is the transmission power of the attacker at time
k.
The transmission of
between the sensor and the remote estimator can be characterized by a binary random process
:
Then, the packet arrival rate
is given as
where
is a parameter;
represents the standard
Q-function. It should be noted that the scalar function
here is distinct from the system noise covariance matrix
Q defined in the system models (1) and (2). The SINR depends not only on the sensor’s transmission power but also on the interference power generated by the DoS attacker. Clearly, a larger SINR leads to a lower packet-loss rate and better estimation performance.
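To illustrate how the arrival rate varies with the SINR, the sketch below assumes one common mapping from SINR to arrival rate via the Gaussian Q-function. The exact form of the mapping depends on the modulation scheme and the parameter α, so the formula here is an illustrative assumption, not the paper's exact expression:

```python
import math

def gaussian_q(x):
    """Standard Gaussian tail function Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def packet_arrival_rate(p_s, p_a, g_s, g_a, sigma2, alpha):
    """Illustrative arrival rate: assumes lambda = 1 - Q(sqrt(alpha * SINR)),
    so a larger SINR yields a lower packet-loss probability."""
    sinr = (g_s * p_s) / (g_a * p_a + sigma2)
    return 1.0 - gaussian_q(math.sqrt(alpha * sinr))
```

Under this model, raising the sensor's power p_s increases the arrival rate, while raising the attacker's power p_a decreases it, which is the monotonicity the text describes.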
In practice, the channel state information is not perfect. The uncertainties of the channel gain information need to be considered. Suppose that the attacker interferes with the transmission under channel gain
with probability
, where
and
. Then, the SINR with the channel interference gain
is defined as
Correspondingly, the packet arrival rate under
is given as
This multi-type probabilistic model captures a realistic scenario where the sensor, as a defender, cannot precisely identify the attacker’s equipment or instantaneous channel condition but can estimate a probability distribution over a few distinct levels of threat severity. In this case, the sensor does not know the channel interference gain of the attacker but possesses the probability distribution of
. Therefore, with different levels of channel interference gain, the environment can be regarded as containing H types of attackers.
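Under this multi-type model, any sensor-side quantity (an arrival rate, a reward) is evaluated in expectation over the H attacker types using the known prior. A minimal sketch of this Bayesian averaging, with illustrative numbers, is:

```python
def expected_over_types(values, probs):
    """Sensor-side Bayesian average over the H attacker types:
    sum_j pi_j * v_j, where v_j is the per-type quantity and pi_j
    is the prior probability of type j."""
    assert abs(sum(probs) - 1.0) < 1e-9, "type probabilities must sum to 1"
    return sum(p * v for p, v in zip(probs, values))
```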
Remark 1. In practice, the acquisition of channel information is often asymmetric. An attacker can frequently intercept the sensor’s open pilot signals or reference transmissions and can then accurately estimate the sensor’s transmission channel gain by processing these known sequences [33,34]. When launching an active attack, the attacker employs non-cooperative, noise-like, or protocol-aware jamming signals, which are deliberately designed to be unpredictable [35]. Thus, it is difficult for the sensor to accurately estimate the interference channel gain of the attacker.
3.3. Remote State Estimation
The MMSE estimate of
and the estimation error covariance at the remote estimator are denoted as
and
, respectively. Depending on whether the data packet is received successfully, the state estimate is given by
As a result, the error covariance
is computed as
Assume that the initial value of the error covariance at the remote estimator also starts from
, i.e.,
. The remote estimator will send ACKs to the sensor to indicate whether it has received the estimate at each time. Hence, the sensor can also calculate
by (9). Note that
takes values from the infinite set
.
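The growth of the remote error covariance with consecutive packet drops can be sketched for a scalar system: after τ consecutive dropouts, the covariance is the τ-fold composition of the Lyapunov operator h applied to the steady-state value. Parameter values below are illustrative:

```python
def h(P, A, Q):
    """Lyapunov operator h(P) = A P A^T + Q, scalar form."""
    return A * P * A + Q

def error_covariance(tau, P_bar, A, Q):
    """Remote error covariance after tau consecutive dropouts:
    the tau-fold composition h^tau(P_bar)."""
    P = P_bar
    for _ in range(tau):
        P = h(P, A, Q)
    return P
```

For an unstable mode (|A| > 1), each dropout strictly inflates the covariance, which is why long holding times are penalized in the payoff.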
3.4. Strategy and Objective Function
Considering that both the sensor and the attacker have power constraints, we assume that the transmission power and attack power take values from the finite sets. Specifically, the transmission power belongs to a finite set with level , which is denoted as . The interference power belongs to a finite set with level , which is denoted as . This discretization models the practical constraints of digital power amplifiers and energy-limited devices and is a common simplification in power control problems.
The strategies of the sensor and the
j-th type attacker at time
k are
and
for all
, respectively. Then, the strategy of the attacker at time
k is denoted as
From (6) and (7), different types of attacker lead to different packet arrival rates and further affect the expected estimation error. The trace of the expected estimation error covariance under the attacker’s channel gain
is given as
The sensor aims to optimize the expected estimation error while reducing the total transmission cost. In contrast, the goal of the attacker is to deteriorate the estimation performance of the remote estimator and consume as much transmission energy of the sensor as possible while minimizing the total attack cost. Therefore, the one-step rewards of the sensor and the attacker with channel gain
are, respectively, given as
where
and
are the costs per unit power for the sensor and DoS attacker, respectively. Then, the corresponding payoff functions over the infinite time horizon for the sensor and the
j-th type attacker are, respectively, given as follows:
where
is the discount factor.
For the considered system under attack, the sensor as the defender first decides its transmission power level and then the DoS attacker chooses the interference power level based on the current situation. For such sequentially interactive decision-making processes with incomplete information, we use the Bayesian Stackelberg game framework to analyze the optimal strategies for the sensor and attacker.
Remark 2. Existing studies on SINR-based attacks against state estimation predominantly design transmission and interference strategies under the assumption of complete channel information [20,21,22,26]. This idealization ignores the reality that channel knowledge is incomplete and asymmetric in actual wireless networks, which gives rise to multiple possible player types and enlarged decision-making spaces. Therefore, investigating the resulting hierarchical game is both highly meaningful and challenging.
4. MDP Formulation
In this section, a Markov decision process is first formulated to describe the dynamics of the state estimation at the remote estimator. Obviously, the scenario involves two players: the sensor and DoS attacker. The sensor does not know the attacker’s channel gain but has knowledge of the probability distribution information. At each time, the sensor and attacker take action based on the current process state and the information that they have previously collected. Then, they respectively receive the rewards, and the process moves to the next state according to the transition probability.
Taking into account the decision-making interaction between the sensor and the attacker, the state estimation process at the remote estimator can be formulated as a two-player MDP model, which consists of the following five essential elements:
(1) Decision epoch: Let T denote the infinite discrete set of decision epochs, i.e., .
(2) State: According to (9), the estimation error covariance can be alternatively written as , where is the holding time at time k, namely, the time interval between the current time k and the latest moment at which a packet was received. Then, the set of possible estimation error covariances can be represented as . Therefore, the state at time k is defined as the holding time, i.e., . Intuitively, this state represents the time elapsed since the last successful packet reception. According to the update Formula (9), it directly determines the current estimation error covariance at the remote estimator. Due to the low probability of a large holding time, the final state K is used to represent all states with . Therefore, the state space is .
(3) Action: At each time, based on the state , the sensor and the j-th type attacker choose the actions and from the action spaces and , respectively.
(4)
Transition probability: Given the action
of the sensor and the action
of the
j-th type attacker, the probability that the state transitions from
to
is given as
.
(5)
Reward: The one-stage reward functions for the sensor and the
j-th type attacker are, respectively, given as
and
Note that the one-stage reward functions are time-invariant and stationary, and can also be denoted as
and
.
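The holding-time dynamics underlying the transition probability can be sketched as follows. The reset-or-increment rule and the truncation at the final state K follow the state definition above; the function names are illustrative:

```python
import random

def transition(tau, tau_next, arrival_prob, K):
    """P(tau' | tau): the holding time resets to 0 with probability
    lambda on successful reception, otherwise grows by one, truncated
    at the final state K; all other transitions have probability 0."""
    if tau_next == 0:
        return arrival_prob
    if tau_next == min(tau + 1, K):
        return 1.0 - arrival_prob
    return 0.0

def simulate_step(tau, arrival_prob, K, rng):
    """Sample the next holding-time state from the kernel above."""
    return 0 if rng.random() < arrival_prob else min(tau + 1, K)
```

Note that the arrival probability itself depends on the actions of both players through the SINR, which is what couples the MDP dynamics to the game.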
The sensor’s strategy
means that the sensor takes action
under state
. Similarly, the strategy of the
j-th type attacker under state
s is denoted as
. Then, the strategies of the sensor and the
j-th type attacker are, respectively, written as
where
and
are the set of all stationary and deterministic policies for the sensor and
j-th type attacker, respectively. Furthermore, the strategy of the attacker is denoted as
where
is the set of all stationary and deterministic policies for the attacker. Then, the payoff functions of the sensor and
j-th type attacker are, respectively, given as
The goal of both the sensor and the attacker is to seek the optimal strategy to maximize the payoff function.
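The discounted payoff structure can be illustrated by a finite truncation of the infinite-horizon sum; this is a generic sketch of discounted accumulation, not the paper's exact payoff expression:

```python
def discounted_payoff(rewards, gamma):
    """Finite truncation of the infinite-horizon discounted payoff:
    sum over k of gamma**k * r_k for a given reward trajectory."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))
```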
5. Bayesian Stackelberg Game
Based on the above MDP framework, a BSG with two players is investigated to design the optimal strategies for both the sensor and the attacker. In this hierarchical game, the sensor, as the leader, first chooses its action according to the state and declares its strategy. It is noted that in a Stackelberg game the sensor knows the reaction of the attacker. Then, the attacker, as the follower, takes its action based on the acquired channel gain and the transmission power of the sensor. The payoff functions of the sensor and the j-th type attacker are given as (18) and (19), respectively. The SE of the BSG is analyzed and the corresponding optimal strategies for both players are provided. The best response for each side is first defined as follows.
Definition 1. The best response is that a player takes an action that optimizes its own payoff while taking into account other players’ actions. Specifically, the best responses for the sensor and j-th type attacker are given asThen, the best responses for the attacker are denoted as The best-response set of the j-th type attacker to the sensor’s strategy is denoted as and is the best-response set of the attacker to strategy .
Lemma 1. For the proposed BSG with two players, the Stackelberg equilibrium of the sensor satisfieswhere is expressed as (22).
Proof. In the BSG, the channel gain and the strategy of the sensor can be obtained and imposed on the attacker; then, each type of attacker takes the best response to maximize its own payoff function. The sensor chooses its optimal strategy by taking into account the follower’s best response so as to maximize its own payoff function, which completes the proof. □
Based on Lemma 1, the solution to the proposed Stackelberg game with incomplete channel gain information is given by the following theorem.
Theorem 1. The solution to the proposed BSG is given by first solvingand then calculatingwhereTherefore, the SE of this game is .
Proof. In the proposed BSG, the sensor, as the leader, and the DoS attacker, as the follower, make decisions sequentially. Given the sensor’s strategy
, the
j-th type attacker will take the best response according to
. The sensor knows the reaction of the attacker; hence, it will choose the optimal strategy to maximize its own payoff function while taking into account the attacker’s strategy, that is,
After obtaining the sensor’s strategy
, the
j-th type attacker will choose the optimal strategy
by calculating
which completes the proof. □
Remark 3. This theorem provides a constructive method to find the SE for the proposed BSG. It confirms that within the finite strategy spaces considered, the sensor can determine its optimal strategy by anticipating and incorporating the attacker’s best response, which, in turn, is computed based on the sensor’s declared action. This fixed-point characterization forms the theoretical foundation for the learning algorithm developed in Section 6.
Proposition 1. There exists at least one SE in the BSG.
Proof. Given the sensor’s strategy , the j-th type attacker will choose its strategy from the set . Given the final state K, the numbers of possible strategies for the sensor and the j-th type attacker are and , respectively. Since the j-th type attacker’s strategy space is finite, its optimal response always exists for the fixed sensor strategy . Moreover, the finite strategy space implies the existence of an equilibrium strategy for the sensor by (24). Therefore, there always exists an SE for the proposed BSG. □
Remark 4. This conclusion substantiates the existence and characterization of the SE in our Bayesian game setting. It implies that despite the incomplete information regarding the attacker’s type, an optimal hierarchical strategy profile exists and can be sought through iterative best-response dynamics.
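The constructive procedure of Theorem 1 can be sketched as a direct enumeration for a one-shot version of the game: the leader tries each action, each attacker type best-responds to it, and the leader keeps the action that maximizes its expected payoff over the type distribution. The function names and the numeric example are illustrative:

```python
def stackelberg_equilibrium(leader_actions, follower_actions, type_probs,
                            leader_payoff, follower_payoff):
    """Enumerative SE for finite action sets: for each declared leader
    action a, every attacker type j plays its best response; the leader
    maximizes its expected payoff weighted by the type prior pi_j."""
    best = None
    for a in leader_actions:
        # best response of each type j to the declared leader action a
        br = [max(follower_actions, key=lambda d: follower_payoff(j, a, d))
              for j in range(len(type_probs))]
        # leader's expected payoff over the attacker types
        u = sum(pi * leader_payoff(a, br[j]) for j, pi in enumerate(type_probs))
        if best is None or u > best[0]:
            best = (u, a, br)
    return best[1], best[2]
```

Existence of the SE follows exactly as in Proposition 1: both loops range over finite sets, so a maximizer is always found.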
6. Reinforcement Learning
As mentioned previously, the sensor is unable to access the SINR of the AWGN channel. In this section, model-free reinforcement learning, namely Q-learning, is introduced to find the SE. Then, a Stackelberg Q-learning algorithm for the two-player BSG is used to obtain the optimal policies for both the sensor and the attacker.
The optimal payoff functions are given as
where the optimal state-action-value functions (Q-functions) for the sensor and the
j-th type attacker are, respectively, defined as follows:
Then, the SE strategy is obtained by
Considering the multi-type channel interference gain of the attacker, the Q-value functions of the sensor and the attacker are recursively updated based on the SE presented in Theorem 1:
where
with the learning rate
,
and
are the Q-values of the SE solutions at state
for the sensor and the
j-th type attacker, respectively.
In the BSG, the sensor and the attacker act as the leader and the follower, respectively; hence, the Stackelberg Q-values are also updated hierarchically. To find the maximum Q-value of the attacker for the state
, the attacker chooses the optimal action from the best response (26). Based on the observation of the sensor’s action, the optimal action that the
j-th type attacker takes in response to the sensor’s action
is determined as
Then the optimal action of the attacker is given as
Based on the BSG framework, the optimal action of the sensor is given by
Consequently, we compute
and
in the Q-functions update as
As a result, the Bayesian Stackelberg Q-learning algorithm for incomplete channel gain information is presented in Algorithm 1. Key implementation details are as follows. The Q-values for both players are initialized to zero, a common practice that biases neither the asymptotic convergence nor the final policy learned. An
-greedy exploration strategy is employed, where, with probability
, the actions are chosen randomly, and with probability
, they are chosen as the SE actions based on current Q-values. The learning rate
for updating the Q-tables follows a schedule that ensures the Robbins-Monro conditions are met (as required by Theorem 2), typically by decaying with the number of visits to each state-action pair. The specific forms of
and
used in our simulations are provided in
Section 7.
| Algorithm 1 Bayesian Stackelberg Q-learning Algorithm with Incomplete Channel Gain Information |
Input: System parameter matrices A, C, Q, R
1: Initialization: Initialize and for all , , , and ; initial state ; total iterations T; exploration probability . Set iteration counter ;
2: while do
3:   Randomize a number ;
4:   if then
5:     Take actions and for based on (36)–(38);
6:   else
7:     Select random actions and for ;
8:   end if
9:   Obtain rewards and ;
10:  Update and , and move to the next state;
11:  ;
12: end while
Output: The optimal strategies and the optimal payoff functions for the sensor and the attacker |
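A minimal tabular sketch of Algorithm 1 is given below. It is an illustrative simplification, not the paper's exact implementation: exploration is ε-greedy, the learning rate decays with per-pair visit counts, and the hierarchical update lets each attacker type best-respond to the leader's greedy action. All function and variable names are assumptions of this sketch:

```python
import random
from collections import defaultdict

def stackelberg_q_learning(states, s_actions, a_actions, type_probs,
                           reward_s, reward_a, step, gamma=0.9,
                           iters=5000, eps=0.2, seed=0):
    """Tabular Stackelberg Q-learning sketch. reward_s(s, ps, pa_list) and
    reward_a(j, s, ps, pa) are one-stage rewards; step(s, ps, pa_list, rng)
    samples the next state from the (unknown) transition kernel."""
    rng = random.Random(seed)
    H = len(type_probs)
    Qs = defaultdict(float)                       # sensor table: (s, ps) -> value
    Qa = [defaultdict(float) for _ in range(H)]   # one table per attacker type
    visits = defaultdict(int)

    def br(j, s, ps):                             # follower best response from Qa
        return max(a_actions[j], key=lambda pa: Qa[j][(s, ps, pa)])

    def leader(s):                                # leader's greedy action from Qs
        return max(s_actions, key=lambda ps: Qs[(s, ps)])

    s = states[0]
    for _ in range(iters):
        if rng.random() < eps:                    # epsilon-greedy exploration
            ps = rng.choice(s_actions)
            pa = [rng.choice(a_actions[j]) for j in range(H)]
        else:                                     # hierarchical SE actions
            ps = leader(s)
            pa = [br(j, s, ps) for j in range(H)]
        s2 = step(s, ps, pa, rng)
        visits[(s, ps)] += 1
        lr = 1.0 / visits[(s, ps)]                # Robbins-Monro-style decay
        ps2 = leader(s2)                          # leader's greedy action at s2
        for j in range(H):                        # follower updates, per type
            key = (s, ps, pa[j])
            target = reward_a(j, s, ps, pa[j]) \
                + gamma * Qa[j][(s2, ps2, br(j, s2, ps2))]
            Qa[j][key] += lr * (target - Qa[j][key])
        target = reward_s(s, ps, pa) + gamma * Qs[(s2, ps2)]  # leader update
        Qs[(s, ps)] += lr * (target - Qs[(s, ps)])
        s = s2
    return Qs, Qa
```

In a full implementation, the leader's action selection would average the anticipated per-type responses with the prior over types, as in Theorem 1; here that step is folded into the learned sensor Q-table for brevity.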
Remark 5. The per-iteration computational cost of the Stackelberg Q-learning algorithm (Algorithm 1) is linear in the product of the state space size and the action space sizes , as it requires updating Q-tables for all state–action pairs encountered. This makes it efficient for the moderately sized problems presented here. For systems with significantly larger state or action spaces, the well-known curse of dimensionality would necessitate the use of function approximation (e.g., deep Q-networks), which is a promising direction for future work, as noted in Section 8. The convergence of Algorithm 1 to the optimal Q-functions is guaranteed under the conditions specified in Theorem 2. The learning rate and exploration schedules described above are designed to satisfy these conditions, in particular the Robbins–Monro requirements for the learning rate.
Theorem 2 ([36]). The Bayesian Stackelberg Q-learning sequences described in (32)
and (33)
will converge to the optimal values if the following assumptions hold:
1. The recursive processes (32) and (33) converge to and with probability 1, respectively.
2. There exists a number and a sequence converging to zero with probability 1, such that where is the Q-value space.
3. The learning rate satisfies that , and converges to infinity uniformly as .
Remark 6. Condition 1 assumes that the recursive updates for both players converge with probability 1, meaning the learning process is inherently stable and reaches a fixed point. This is approximated by implementing ε-greedy exploration in Algorithm 1 and running the training for a large number of steps. Condition 2 ensures that the algorithm is driven toward a unique fixed point while noise vanishes by a contraction operator with a decaying perturbation. The learning rate in Condition 3 must satisfy the Robbins–Monro conditions for stochastic approximation. In the simulation section, the learning rate is designed to be a nonzero decreasing function of time step k and the current state and actions.
7. Simulation Results
In this section, a numerical example is provided to verify the performance of the proposed BSG-based strategy. The following system parameters are considered:
Then the steady-state error covariance
is
. The AWGN power and the modulation parameter are
and
, respectively. The fading channel gain and the action space of the sensor are given as
and
. Assume that the fading channel gains of the DoS attacker are
with probability distribution
. This setting creates a representative asymmetric information scenario in which the sensor needs to design a strategy against an attacker that is most likely to have a moderate interference capability but must also account for a significant possibility of facing a more potent attacker. Correspondingly, the action spaces for the two types of the attacker are
and
, respectively. The unit power costs for the sensor and the attacker are
and
, respectively. Set the final state to
, and the state space is given as
, which means the estimation error takes the value from
. The discount factor for payoff functions is
.
In the Bayesian Stackelberg Q-learning algorithm, the learning rate and the initial exploration rate are set as and , respectively, where is the number of occurrences of the state–action pair . This design ensures that the learning rate satisfies the Robbins–Monro conditions, a standard requirement for the convergence of stochastic approximation algorithms such as Q-learning. The initial values are set as and for all .
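The Robbins–Monro requirements on the learning rate, a divergent sum of step sizes together with a convergent sum of their squares, can be checked numerically. The sketch below compares schedules of the form α_n = 1/n^p:

```python
def rm_partial_sums(power, n_terms):
    """Partial sums of alpha_n = 1/n**power and of alpha_n**2, used to
    check the Robbins-Monro conditions: the first sum must diverge
    while the second remains bounded."""
    s = sum(1.0 / n ** power for n in range(1, n_terms + 1))
    s_sq = sum(1.0 / n ** (2 * power) for n in range(1, n_terms + 1))
    return s, s_sq
```

For p = 1 (a visit-count schedule like the one above), the step-size sum grows like ln n while the sum of squares stays below π²/6, so both conditions hold; for p = 2 even the step-size sum converges, which violates the divergence requirement.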
After 100,000 learning steps, the Q-function values in different states converge. The equilibrium payoffs to the sensor and attacker for states
are given as follows:
The corresponding optimal transmission and interference strategies are given as follows:
The resulting Stackelberg equilibrium strategies are explicitly state-dependent, quantitatively showing how the sensor optimally increases transmission power as the estimation error grows. Finally, the equilibrium payoffs are quantitatively characterized, showing the cost–performance trade-off across states. The optimal strategies under the BSG confirm the intuition that a relatively small error covariance at the remote estimator enables the sensor to use less power for transmission, also leading to less interference power for the attacker. For a larger estimation error covariance, the sensor is inclined to choose a high power level to increase the packet arrival rate and, therefore, improve estimation performance. In this case, the attacker also increases the interference power accordingly.
The algorithm converges to distinct, stable Q-values for different state–action pairs. Due to the high dimensionality of the Q-values
, we next present the learning process for the initial state
as an illustrative example. When the sensor takes the action
, the corresponding Q-functions converge, as shown in
Table 1.
Table 2 gives all converged Q-values of the attacker. We can see that the optimal strategy for the attacker is
under state
.
Figure 2,
Figure 3 and
Figure 4 depict the learning processes for the optimal strategies of the sensor, the first type attacker, and the second type attacker for state
, respectively. In the figures, lines of different colors represent the iterative Q-values under the different actions of the sensor and the attacker. Together, these figures depict the convergence of the Q-functions for all possible action combinations at state
. The high density of lines illustrates the learning algorithm’s exploration across the entire strategy space. While individual lines are not labeled for clarity, their collective trend towards stabilization after around 5000 iterations is evident, confirming the convergence of the learning process. The precise converged Q-values that underpin the optimal strategies are detailed in
Table 1 and
Table 2.
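The iterative updates behind these converged values follow the standard tabular pattern. The sketch below uses placeholder dynamics, rewards, and a placeholder discount factor rather than the paper's model; as noted in the comments, the Stackelberg variant replaces the single-agent max over next-state actions with the Stackelberg equilibrium value of the stage game at the next state:

```python
# Generic tabular Q-learning loop (illustrative placeholder problem).
import random

states = range(4)              # placeholder finite state space
actions = range(3)             # placeholder finite action set
gamma = 0.9                    # placeholder discount factor
Q = {s: {a: 0.0 for a in actions} for s in states}
counts = {s: {a: 0 for a in actions} for s in states}

def q_update(s, a, reward, s_next):
    """One tabular update with a Robbins-Monro step size. In Stackelberg
    Q-learning, the max below is replaced by the Stackelberg equilibrium
    value of the stage game at s_next, computed from both players'
    Q-tables."""
    counts[s][a] += 1
    alpha = 1.0 / counts[s][a]
    target = reward + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

random.seed(0)
for _ in range(5000):
    s = random.choice(list(states))
    a = random.choice(list(actions))
    s_next = random.choice(list(states))
    q_update(s, a, reward=float(s), s_next=s_next)  # toy reward: the state index
```

With the toy reward growing in the state index, the learned Q-values are larger in higher-index states, mirroring the state-dependent pattern of the converged values reported above.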
Next, we consider the situation in which the attacker can launch attacks without any cost constraint; that is, the cost of the interference power is negligible. In the case of
, the equilibrium payoffs to the sensor and attacker for states
are given as follows:
The corresponding optimal transmission and interference strategies are given as follows:
The optimal strategies confirm that when the cost of interference power
is negligible, both types of attacker consistently select the highest available power level in their respective action sets. This behavior aligns with game-theoretic intuition: as the marginal cost of interference vanishes, the rational attacker's objective shifts entirely towards maximizing the degradation of estimation performance, leading it to select the maximum jamming power. This result underscores the critical role of cost parameters in shaping the equilibrium of the security game.
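This cost effect can be made concrete with a minimal sketch. The benefit function and power levels below are hypothetical (not the paper's model); they only illustrate that as the unit jamming cost shrinks, the attacker's best response climbs to the maximum power level in its action set:

```python
# Best response of the attacker as a function of its unit jamming cost
# (illustrative concave benefit, hypothetical discrete power levels).
attacker_powers = [0, 1, 2, 3]  # hypothetical jamming levels

def jamming_gain(a_j):
    # More power degrades estimation, with diminishing returns.
    return 1.0 - 1.0 / (1.0 + a_j)

def best_response(cost):
    return max(attacker_powers, key=lambda a_j: jamming_gain(a_j) - cost * a_j)

for c in (0.3, 0.1, 0.0):
    print(c, best_response(c))  # best response rises as the cost falls
```

At zero cost the marginal benefit of jamming is never outweighed, so the maximum power level is always chosen, matching the equilibrium behavior observed in this case.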
The learning processes for the optimal strategies of the sensor, the first type attacker, and the second type attacker for state
are shown in
Figure 5,
Figure 6 and
Figure 7. Similar to the previous case, the convergence trends for all action combinations are shown. The eventual stabilization of all trajectories validates the algorithm’s effectiveness even in this complex setting. Notably, as seen in
Figure 5, the Q-values of the sensor exhibit significant fluctuation and converge more slowly compared to those of the attackers in
Figure 6 and
Figure 7. This can be attributed to the information asymmetry inherent in the problem: the sensor lacks precise knowledge of the attacker’s channel gain and must learn an optimal strategy based only on the probabilistic distribution over attacker types. This uncertainty inherently increases the exploration burden and complexity of the learning process for the leader.
In summary, the numerical results validate the algorithm's effectiveness in solving the proposed BSG model: the algorithm converges to a stable equilibrium in which both players' strategies are mutually optimal in the sequential sense, as established in
Section 5 and
Section 6. The logical adaptation of strategies to different states and costs further confirms that the learned policies align with game-theoretic intuition, providing strong empirical support for the proposed framework.
8. Conclusions
This paper investigated the Bayesian Stackelberg game for remote state estimation under SINR-based DoS attacks. The two players sequentially decide their transmission and interference powers under incomplete information; specifically, the sensor lacks exact knowledge of the attacker's fading channel gain. The optimization problem over an infinite time horizon was first formulated as an MDP with finite state and action spaces. By exploiting the probabilistic information about the channel interference gain, a BSG was constructed to describe the sequential decision-making process between the sensor and the various attacker types. Based on the solution to the BSG, a Stackelberg Q-learning algorithm was used to obtain the optimal strategies of the two players. Numerical results validated the effectiveness of the proposed game-theoretic approach under uncertain channel gains. The presented framework operates under the core assumptions of a known attacker-type distribution and discrete action spaces, which define its current scope but also present opportunities for generalization. Future work includes analyzing games in which both players have incomplete information and extending the framework to larger-scale systems via function approximation (e.g., deep reinforcement learning) to mitigate the curse of dimensionality.