Article

A Bayesian Stackelberg Game Approach to Remote State Estimation Under SINR-Based DoS Attacks with Incomplete Information

1 Department of Control Science and Engineering, Tongji University, Shanghai 201804, China
2 Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai 201210, China
3 College of Science, National University of Defense Technology, Changsha 410073, China
4 School of Physical Science and Engineering, Tongji University, Shanghai 200092, China
* Author to whom correspondence should be addressed.
Sensors 2026, 26(4), 1272; https://doi.org/10.3390/s26041272
Submission received: 31 December 2025 / Revised: 5 February 2026 / Accepted: 12 February 2026 / Published: 15 February 2026
(This article belongs to the Special Issue Security Issues and Solutions for the Internet of Things)

Abstract

Under limited energy budgets, transmission and interference strategies have received considerable attention in cyber–physical security. In this paper, the Stackelberg game between the sensor and the attacker is investigated for remote state estimation under signal-to-interference-plus-noise ratio (SINR)-based denial-of-service (DoS) attacks. To balance estimation performance and energy consumption, the two players determine the transmission power and interference power sequentially under an incomplete information structure in which the sensor does not know the fading channel gain of the attacker exactly. The scheduling problem over the infinite time horizon is first formulated as a Markov decision process with finite state and action spaces. Then, a Bayesian Stackelberg game (BSG) is constructed by incorporating the probability distribution of the channel interference gain. Based on the definition of the best response, the solution of the BSG is presented and the existence of the Stackelberg equilibrium is proven. Furthermore, a Stackelberg Q-learning algorithm is used to obtain the optimal strategies for the two players. Numerical results demonstrate the effectiveness of the proposed game method when the sensor is unable to access the attacker's channel gain information.


1. Introduction

Cyber–physical systems (CPSs), which integrate computational capabilities with physical processes, have found extensive application across a range of industrial operations and critical infrastructure [1]. However, unreliable wireless communication networks render these systems highly susceptible to malicious cyber attacks, such as false-data injection (FDI), eavesdropping, and denial-of-service (DoS) attacks, which pose severe threats to their operational security and reliability [2].
In CPSs, security against malicious attacks, especially DoS attacks, has become a critical concern [3,4]. Game theory has emerged as a widespread and powerful analytical tool for modeling such attacker–defender interactions [5]. However, most existing game-theoretic studies are built upon the Nash equilibrium concept, which assumes simultaneous decision-making among players [6]. This assumption does not adequately capture the hierarchical structure often present in security scenarios, where a defender typically acts first and an attacker responds accordingly. To address this, Stackelberg games have been introduced, providing a more realistic framework for sequential decision-making in layered security strategies [7,8]. However, the majority of existing Stackelberg models rely on the assumption of complete information, where all players have perfect knowledge of each other's parameters and payoffs. This is often impractical in real-world communication environments because some critical attributes such as channel conditions, interference gains, or attacker capabilities may not be fully accessible to the defender. The lack of such information significantly affects the strategic design and performance of defensive measures. Although some recent studies have begun to address information asymmetry, their modeling of uncertainty remains limited. For instance, most of these works focus on single-type uncertainty or assume that the attacker's channel gain follows a simple two-point distribution [9]. Such simplifications fail to capture the practical situation where the attacker's fading channel gain may follow a multi-type probability distribution.
Motivated by these gaps, this paper investigates remote state estimation under SINR-based DoS attacks within a Bayesian Stackelberg game framework. It is worth noting that our work focuses on the strategic response and resource allocation following the detection or presumption of an attack, rather than on the attack detection mechanisms themselves [10]. While Bayesian Stackelberg models have been studied for general network security, this work distinctively integrates this framework into the SINR-based remote state estimation to handle multi-type channel gain uncertainty and further develops a model-free Stackelberg Q-learning algorithm to derive optimal strategies under this specific information structure. In the incomplete information setting, probabilistic distribution information is incorporated to model the uncertainty of the attacker’s fading channel gain, and a sequential power allocation strategy is developed to balance estimation accuracy and energy efficiency. The proposed approach not only extends current Stackelberg-based security analysis to more realistic communication scenarios but also offers implementable learning-based strategies for deriving equilibrium policies under information asymmetry. The main contributions of this paper are summarized as follows:
  • We study the strategic interaction between a sensor and a DoS attacker under energy constraints in an SINR-based remote state estimation system with unknown channel gain information. The sequential decision-making process is formulated as a finite-state and finite-action MDP, capturing the dynamic and uncertain nature of the environment.
  • To address the incomplete information regarding the attacker’s channel gain, we model the interaction as a BSG, where the attacker is treated as having multiple possible types. Within this framework, we define the best-response strategies, prove the existence of an SE, and analytically derive the SE strategies for both players.
  • A Stackelberg Q-learning algorithm is used to enable both players to learn their optimal strategies without prior knowledge of the opponent’s payoff structure. This model-free approach ensures adaptability and practicality in real-world settings. Finally, numerical simulations validate the effectiveness and robustness of the proposed BSG-based method, demonstrating its clear advantage over conventional Nash-based approaches under information asymmetry.
The remainder of this article is organized as follows. Section 2 reviews related work on secure state estimation under cyber-attacks. Section 3 establishes the system model for remote state estimation under SINR-based DoS attacks with incomplete channel gain information. Section 4 formulates the sequential decision-making process as a two-player Markov Decision Process (MDP). Building upon the MDP, Section 5 constructs the Bayesian Stackelberg Game (BSG) framework and analyzes the Stackelberg equilibrium. Section 6 designs a Stackelberg Q-learning algorithm to compute the optimal strategies for both players. Section 7 provides simulation results to verify the effectiveness of the proposed approach. Finally, conclusions are drawn in Section 8.

2. Related Work

Accurate and secure state estimation serves as the cornerstone of reliable decision-making in CPSs. However, the widespread dependence on wireless networks for data transmission exposes the estimation process to various cyber attacks, including FDI, eavesdropping, and DoS attacks, which compromise data integrity, availability, and confidentiality [11]. Such attacks can deliberately degrade estimation accuracy by corrupting, intercepting, or blocking data transmissions, which, in turn, destabilizes system operation and threatens physical security [12,13].
DoS attacks, especially those targeting the signal-to-interference-plus-noise ratio (SINR) of wireless channels, are among the most disruptive to remote state estimation [4,14,15]. The sensor, as the defender, aims to enhance the system performance with less transmission power. On the contrary, the goal of the DoS attacker is to reduce estimation accuracy while forcing the defender to consume more communication resources. To describe such adversarial dynamics between attackers and defenders, game theory has emerged as a powerful analytical tool for the interactive decision-making process [16,17,18,19]. With energy constraints on both the sensor and the attacker, a Nash equilibrium (NE) of a zero-sum game was proven in [20] to yield the optimal strategies for both sides. Then, the authors in [21] proposed a Markov game framework under SINR-based DoS attacks and applied a modified Nash Q-learning algorithm to obtain the optimal solutions. For multi-channel networks, a two-player zero-sum stochastic game was formulated to design the mixed strategies for the channel selection of the sensor and the DoS attacker [22].
It is worth pointing out that most works on attack–defense games adopt the NE as the game solution based on the assumption that the defender and the attacker choose their actions simultaneously, which is not applicable to situations where the players deploy their strategies sequentially. Consequently, the Stackelberg game, with its explicit “leader–follower” structure, offers a more appropriate and realistic framework for capturing hierarchical interactions [23]. For the SINR-based fading wireless network, a Stackelberg equilibrium (SE) within a two-player nonzero-sum Markov game framework was constructed in [24] to obtain the transmission and interference strategies for the sensor and attacker, respectively. Using the Stackelberg game method, the linear quadratic Gaussian (LQG) control strategy was analyzed in [25,26] under FDI attacks and DoS attacks, respectively. For distributed Kalman filtering, a Stackelberg game-based distributed reinforcement learning algorithm was developed to produce joint optimal strategies based on local observation information [27]. Within the Stackelberg game framework, an optimal stealthy robust attack method was designed based on Wasserstein ambiguity sets [28].
The aforementioned Stackelberg-based studies rely on the assumption of complete information, which rarely holds in practical wireless communication scenarios. For players in games, some crucial environmental information and the other players' key attributes are often difficult to obtain, such as energy budgets, acknowledgment (ACK) information, and channel gains of the wireless network. An anti-jamming Bayesian Stackelberg game was proposed in [29], where uncertainties in the channel state information and the transmission cost information were incorporated. For the problem of multi-channel power scheduling under SINR-based DoS attacks, the SE was studied under two types of incomplete information, where the existence of the attacker and the total power of the attacker were, respectively, unknown to the defender [9]. For a multi-hop network under DoS attacks, the defender had no access to the available energy of the attacker; accordingly, a Bayesian Stackelberg game was implemented and a Stackelberg Q-learning algorithm was presented to obtain the SE [30]. While game theory provides a robust framework for strategic analysis, it is also worth noting that a stochastic Bayesian game was formulated in [31], where the ACK information from the remote estimator to the sensor was hidden from the attacker. In addition, formal methods have been extensively explored for the verification and analysis of CPSs under attacks [32]. In contrast to these verification-based approaches, this paper emphasizes the adaptive decision-making process in scenarios where the sensor faces incomplete information about an attacker's characteristics.

3. Problem Formulation

Notations: $\mathbb{N}_0$ denotes the set of nonnegative integers. $\mathbb{R}^n$ and $\mathbb{R}^{m \times n}$ are the $n$-dimensional Euclidean space and the set of all $m \times n$ real matrices, respectively. $\mathbb{S}^n_{\geq 0}$ and $\mathbb{S}^n_{>0}$ represent the sets of $n \times n$ real symmetric positive semi-definite and positive definite matrices, respectively. If $X \in \mathbb{S}^n_{\geq 0}$, we simply write $X \geq 0$ ($X > 0$ if $X \in \mathbb{S}^n_{>0}$). The notation $X < Y$ means that the matrix $X - Y$ is negative definite. $\Pr(\cdot)$ is the probability of a random event. $\mathbb{E}[\cdot]$ and $\mathrm{Cov}[\cdot]$ stand for the expectation and covariance of a random vector, respectively. For functions $g$, $h$ with appropriate domains, $g \circ h(\cdot)$ represents the function composition $g(h(\cdot))$, with $h^n(\cdot) \triangleq h(h^{n-1}(\cdot))$.
In this section, we introduce the remote state estimation system under DoS attack depicted in Figure 1. The local sensor estimates the system states by the Kalman filter and then transmits the estimates to the remote estimator through the signal-to-noise ratio (SNR)-based network. However, the network is unreliable and may be attacked by a DoS attacker.

3.1. System and Sensor Model

Consider the discrete-time linear time-invariant system:
$$x_{k+1} = A x_k + w_k,$$
$$y_k = C x_k + v_k,$$
where $x_k \in \mathbb{R}^n$ is the system state and $y_k \in \mathbb{R}^m$ is the measurement output. The process noise $w_k \in \mathbb{R}^n$ and the measurement noise $v_k \in \mathbb{R}^m$ are assumed to be independent white Gaussian noises with covariance matrices $Q \in \mathbb{S}^n_{\geq 0}$ and $R \in \mathbb{S}^m_{>0}$, respectively. The initial state $x_0$ is zero-mean Gaussian with covariance $P_0 \in \mathbb{S}^n_{\geq 0}$, and is independent of $w_k$ and $v_k$ for all $k \in \mathbb{N}_0$. The pair $(A, C)$ is detectable.
The smart sensor runs a local Kalman filter to estimate the system state based on the measurements up to the current time $k$, that is, $Y_k = \{y_1, y_2, \ldots, y_k\}$. The local minimum mean-squared error (MMSE) estimate of the system state $x_k$ and the corresponding estimation error covariance are, respectively, defined as
$$\hat{x}^s_k = \mathbb{E}[x_k \mid Y_k], \quad P^s_k = \mathbb{E}\big[(x_k - \hat{x}^s_k)(x_k - \hat{x}^s_k)^\top \mid Y_k\big].$$
We define the Lyapunov and Riccati operators $h, g: \mathbb{S}^n_{\geq 0} \to \mathbb{S}^n_{\geq 0}$ as follows:
$$h(X) \triangleq A X A^\top + Q, \quad g(X) \triangleq X - X C^\top \big(C X C^\top + R\big)^{-1} C X.$$
The estimation error covariance of the Kalman filter converges to a unique value irrespective of the initial condition. For simplicity, it is assumed that the local Kalman filter has reached the steady state, and we let
$$P^s_0 = \bar{P},$$
where $\bar{P}$ is the steady-state error covariance given by the unique positive semidefinite solution of $g \circ h(X) = X$.
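For concreteness, $\bar{P}$ can be obtained numerically by iterating $g \circ h$ to its fixed point, which converges under the stated detectability assumption. The following is a minimal Python sketch; the function names, starting point, and tolerance are our own illustrative choices rather than the authors' code.

```python
import numpy as np

def h(X, A, Q):
    """Lyapunov operator: h(X) = A X A^T + Q."""
    return A @ X @ A.T + Q

def g(X, C, R):
    """Riccati measurement-update operator: g(X) = X - X C^T (C X C^T + R)^{-1} C X."""
    S = C @ X @ C.T + R
    return X - X @ C.T @ np.linalg.solve(S, C @ X)

def steady_state_covariance(A, C, Q, R, tol=1e-10, max_iter=100_000):
    """Iterate g∘h from Q until it reaches the fixed point of g(h(X)) = X."""
    X = Q.copy()
    for _ in range(max_iter):
        X_next = g(h(X, A, Q), C, R)
        if np.max(np.abs(X_next - X)) < tol:
            return X_next
        X = X_next
    raise RuntimeError("Riccati iteration did not converge")
```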

3.2. Communication Channel and Attack Model with Incomplete Information

After obtaining the state estimate $\hat{x}^s_k$, the sensor transmits it to the remote estimator in the form of a data packet. However, random packet drops may occur owing to channel fading and interference. Here, we assume that the sensor communicates with the remote estimator over an additive white Gaussian noise (AWGN) channel to model this scenario. The packet dropout probability at time $k$ is determined by the signal-to-noise ratio $SNR_k = \frac{\alpha \lambda_k}{\sigma_k}$, where $\alpha > 0$ is the fading channel gain of the sensor, $\lambda_k$ is the transmission power of the sensor at time $k$, and $\sigma_k$ is the additive white Gaussian noise power.
Considering a DoS attacker against the channel, the SNR is revised as the following SINR:
$$SINR_k = \frac{\alpha \lambda_k}{\sigma_k + \beta \delta_k},$$
where $\beta > 0$ is the fading channel gain of the attacker and $\delta_k$ is the interference power of the attacker at time $k$.
The transmission of $\hat{x}^s_k$ between the sensor and the remote estimator can be characterized by a binary random process $\{\gamma_k\}$:
$$\gamma_k = \begin{cases} 1, & \text{if } \hat{x}^s_k \text{ is received successfully}, \\ 0, & \text{otherwise}. \end{cases}$$
Then, the packet arrival rate $\mu_k$ is given as
$$\mu_k \triangleq \Pr(\gamma_k = 1) = 1 - 2Q\big(\sqrt{\kappa\, SINR_k}\big),$$
where $\kappa > 0$ is a parameter and $Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} \exp\big(-\frac{u^2}{2}\big)\, du$ is the standard Q-function. It should be noted that the scalar function $Q(\cdot)$ here is distinct from the system noise covariance matrix $Q$ defined in the system models (1) and (2). The SINR depends not only on the sensor's transmission power but also on the interference power generated by the DoS attacker. Obviously, a larger SINR leads to a lower packet loss rate and better estimation performance.
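As a quick illustration of the mapping from power levels to the packet arrival rate, the formula above can be evaluated directly. This is a minimal Python sketch under the paper's model; the helper names are ours.

```python
from math import erfc, sqrt

def q_function(x: float) -> float:
    """Standard Gaussian tail probability Q(x) = 1 - Phi(x)."""
    return 0.5 * erfc(x / sqrt(2.0))

def packet_arrival_rate(lam: float, delta: float, alpha: float,
                        beta: float, sigma: float, kappa: float) -> float:
    """mu = 1 - 2 Q(sqrt(kappa * SINR)), with SINR = alpha*lam / (sigma + beta*delta)."""
    sinr = alpha * lam / (sigma + beta * delta)
    return 1.0 - 2.0 * q_function(sqrt(kappa * sinr))

# With the Section 7 parameters (alpha=0.7, beta=0.5, sigma=0.5, kappa=0.2),
# packet_arrival_rate(1, 4, 0.7, 0.5, 0.5, 0.2) evaluates to roughly 0.19.
```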
In practice, the channel state information is not perfect, so uncertainties in the channel gain information need to be considered. Suppose that the attacker interferes with the transmission under channel gain $\beta_j$ with probability $\eta(\beta_j)$, where $j \in N = \{1, 2, \ldots, H\}$, $\sum_{j=1}^{H} \eta(\beta_j) = 1$, and $H \geq 2$. Then, the SINR under channel interference gain $\beta_j$ is defined as
$$SINR^j_k = \frac{\alpha \lambda_k}{\sigma_k + \beta_j \delta^j_k}.$$
Correspondingly, the packet arrival rate under $SINR^j_k$ is given as
$$\mu^j_k \triangleq \Pr(\gamma_k = 1) = 1 - 2Q\big(\sqrt{\kappa\, SINR^j_k}\big).$$
This multi-type probabilistic model captures a realistic scenario in which the sensor, as the defender, cannot precisely identify the attacker's equipment or instantaneous channel condition but can estimate a probability distribution over a few distinct levels of threat severity. In this case, the sensor does not know the attacker's channel interference gain but possesses the probability distribution of $\beta$. Therefore, with different levels of channel interference gain, the environment can be regarded as containing $H$ types of attackers.
Remark 1.
In practice, the acquisition of channel information is often asymmetric. An attacker can frequently intercept the sensor's open pilot signals or reference transmissions and can then accurately estimate the sensor's transmission channel gain by processing these known sequences [33,34]. When launching an active attack, the attacker employs non-cooperative, noise-like, or protocol-aware jamming signals, which are deliberately designed to be unpredictable [35]. Thus, it is difficult for the sensor to accurately estimate the interference channel gain of the attacker.

3.3. Remote State Estimation

The MMSE estimate of $x_k$ and the estimation error covariance at the remote estimator are denoted as $\hat{x}_k$ and $P_k$, respectively. According to whether the data are received successfully, the state estimate is given by
$$\hat{x}_k = \begin{cases} \hat{x}^s_k, & \gamma_k = 1, \\ A \hat{x}_{k-1}, & \gamma_k = 0. \end{cases}$$
As a result, the error covariance $P_k$ is computed as
$$P_k = \begin{cases} \bar{P}, & \gamma_k = 1, \\ h(P_{k-1}), & \gamma_k = 0. \end{cases}$$
Assume that the initial value of the error covariance at the remote estimator also starts from $\bar{P}$, i.e., $P_0 = \bar{P}$. The remote estimator sends an ACK to the sensor at each time step to indicate whether it has received the estimate. Hence, the sensor can also calculate $P_k$ by (9). Note that $P_k$ takes values from the infinite set $\{\bar{P}, h(\bar{P}), h^2(\bar{P}), \ldots\}$.
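The recursion above is simple enough to state in one line; the following sketch mirrors the two cases of the covariance update (9), assuming the Lyapunov operator from the earlier sketch:

```python
def remote_covariance_update(P_prev, gamma, P_bar, A, Q):
    """Eq. (9): reset to P̄ on a successful reception, otherwise apply h."""
    return P_bar if gamma == 1 else A @ P_prev @ A.T + Q
```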

3.4. Strategy and Objective Function

Considering that both the sensor and the attacker have power constraints, we assume that the transmission power and the interference power take values from finite sets. Specifically, the transmission power $\lambda_k$ belongs to a finite set with $l_d$ levels, denoted as $\Lambda = \{\lambda^1, \ldots, \lambda^{l_d}\}$. The interference power $\delta^j_k$ of the $j$-th type attacker belongs to a finite set with $l^j_a$ levels, denoted as $\Delta^j = \{\delta^j_1, \ldots, \delta^j_{l^j_a}\}$. This discretization models the practical constraints of digital power amplifiers and energy-limited devices and is a common simplification in power control problems.
The strategies of the sensor and the $j$-th type attacker at time $k$ are $\lambda_k \in \Lambda$ and $\delta^j_k \in \Delta^j$ for all $k \in \mathbb{N}_0$, respectively. Then, the strategy of the attacker at time $k$ is denoted as
$$\delta_k = \{\delta^1_k, \delta^2_k, \ldots, \delta^H_k\}.$$
From (6) and (7), different attacker types lead to different packet arrival rates and further affect the expected estimation error. The trace of the expected estimation error covariance under the attacker's channel gain $\beta_j$ is given as
$$f^j_k \triangleq \mathbb{E}[\mathrm{tr}(P^j_k)] = \mathrm{tr}\big(\mu^j_k \bar{P} + (1 - \mu^j_k)\, h(P_{k-1})\big).$$
The sensor aims to optimize the expected estimation error while reducing the total transmission cost. In contrast, the attacker aims to deteriorate the estimation performance of the remote estimator and consume as much of the sensor's transmission energy as possible while minimizing its total attack cost. Therefore, the one-step rewards of the sensor and the attacker with channel gain $\beta_j$ are, respectively, given as
$$r_{k,d}(\lambda_k, \delta_k) = -\sum_{j=1}^{H} \eta(\beta_j) f^j_k - C_d \lambda_k,$$
$$r^j_{k,a}(\lambda_k, \delta^j_k) = f^j_k + C_d \lambda_k - C_a \delta^j_k,$$
where $C_d$ and $C_a$ are the costs per unit power for the sensor and the DoS attacker, respectively. Then, the corresponding payoff functions over the infinite time horizon for the sensor and the $j$-th type attacker are, respectively, given as follows:
$$J_d(\lambda_k, \delta_k) = \sum_{k=1}^{\infty} \rho^{k-1} r_{k,d}(\lambda_k, \delta_k),$$
$$J^j_a(\lambda_k, \delta^j_k) = \sum_{k=1}^{\infty} \rho^{k-1} r^j_{k,a}(\lambda_k, \delta^j_k),$$
where $\rho \in (0, 1)$ is the discount factor.
For the considered system under attack, the sensor as the defender first decides its transmission power level and then the DoS attacker chooses the interference power level based on the current situation. For such sequentially interactive decision-making processes with incomplete information, we use the Bayesian Stackelberg game framework to analyze the optimal strategies for the sensor and attacker.
Remark 2.
Existing studies on SINR-based attacks against state estimation predominantly design transmission and interference strategies under the assumption of complete channel information [20,21,22,26]. This idealization ignores the reality that channel knowledge is incomplete and asymmetric in actual wireless networks, which gives rise to multiple player types and enlarged decision-making spaces. Therefore, investigating the resulting hierarchical game is both highly meaningful and challenging.

4. MDP Formulation

In this section, a Markov decision process is first formulated to describe the dynamics of the state estimation at the remote estimator. Obviously, the scenario involves two players: the sensor and DoS attacker. The sensor does not know the attacker’s channel gain but has knowledge of the probability distribution information. At each time, the sensor and attacker take action based on the current process state and the information that they have previously collected. Then, they respectively receive the rewards, and the process moves to the next state according to the transition probability.
Taking into account the decision-making interaction between the sensor and the attacker, the state estimation process at the remote estimator can be formulated as a two-player MDP model, which consists of the following five essential elements:
(1) Decision epoch: Let $\mathcal{T}$ denote the infinite discrete set of decision epochs, i.e., $\mathcal{T} = \{1, 2, \ldots\}$.
(2) State: According to (9), the estimation error covariance can be alternatively written as $P_k = h^{\tau_k}(\bar{P})$, where $\tau_k \in \{0, 1, 2, \ldots\}$ is the holding time at time $k$, namely, the time interval between the current time $k$ and the latest packet reception. Then, the set of possible estimation error covariances $P_k$ can be represented as $Z_k = \{\bar{P}, h(\bar{P}), \ldots, h^k(\bar{P})\}$. Therefore, the state at time $k$ is defined as the holding time at time $k-1$, i.e., $s_k = \tau_{k-1}$. Intuitively, the state $s_k$ represents the time elapsed since the last successful packet reception; according to the update formula (9), it directly determines the current estimation error covariance at the remote estimator. Due to the low probability of a large holding time, a final state $K$ is used to represent all states with $\tau_i \geq K$. Therefore, the state space is $S = \{0, 1, \ldots, K\}$.
(3) Action: At each time, based on the state $s_k$, the sensor and the $j$-th type attacker choose the actions $\lambda_k$ and $\delta^j_k$ from the action spaces $A_d = \Lambda$ and $A^j_a = \Delta^j$, respectively.
(4) Transition probability: Given the action $\lambda_k$ of the sensor and the action $\delta^j_k$ of the $j$-th type attacker, the probability that the state moves from $s_k$ to $s_{k+1}$ is given as
$$\Pr(s_{k+1} \mid s_k, \lambda_k, \delta^j_k) = \begin{cases} \mu^j_k, & s_{k+1} = 0, \\ 1 - \mu^j_k, & s_{k+1} = s_k + 1, \\ 0, & \text{otherwise}. \end{cases}$$
(5) Reward: The one-stage reward functions for the sensor and the $j$-th type attacker are, respectively, given as
$$r_{k,d}(s_k, \lambda_k, \delta_k) = -\sum_{j=1}^{H} \eta(\beta_j)\, \mathrm{tr}\big[\mu^j_k \bar{P} + (1 - \mu^j_k)\, h^{s_k + 1}(\bar{P})\big] - C_d \lambda_k,$$
and
$$r^j_{k,a}(s_k, \lambda_k, \delta^j_k) = \mathrm{tr}\big[\mu^j_k \bar{P} + (1 - \mu^j_k)\, h^{s_k + 1}(\bar{P})\big] + C_d \lambda_k - C_a \delta^j_k.$$
Note that the one-stage reward functions are time-invariant and stationary, so they can also be denoted as $r_d(s_k, \lambda_k, \delta_k)$ and $r^j_a(s_k, \lambda_k, \delta^j_k)$.
The sensor's strategy $\pi_d(s) = \lambda_k(s)$ means that the sensor takes action $\lambda_k$ under state $s_k$. Similarly, the strategy of the $j$-th type attacker under state $s$ is denoted as $\pi^j_a(s) = \delta^j_k(s)$. Then, the strategies of the sensor and the $j$-th type attacker are, respectively, written as
$$\pi_d = \{\pi_d(s) \mid s \in S\} \in \Pi_d, \quad \pi^j_a = \{\pi^j_a(s) \mid s \in S\} \in \Pi^j_a,$$
where $\Pi_d$ and $\Pi^j_a$ are the sets of all stationary deterministic policies for the sensor and the $j$-th type attacker, respectively. Furthermore, the strategy of the attacker is denoted as
$$\pi_a = [\pi^1_a, \pi^2_a, \ldots, \pi^H_a] \in \Pi_a,$$
where $\Pi_a$ is the set of all stationary deterministic policies of the attacker. Then, the payoff functions of the sensor and the $j$-th type attacker are, respectively, given as
$$J_d(s, \pi_d, \pi_a) = \sum_{k=1}^{\infty} \rho^{k-1} r_d(s_k, \lambda_k, \delta_k),$$
$$J^j_a(s, \pi_d, \pi^j_a) = \sum_{k=1}^{\infty} \rho^{k-1} r^j_a(s_k, \lambda_k, \delta^j_k).$$
The goal of both the sensor and the attacker is to seek the optimal strategy to maximize the payoff function.
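To make the MDP dynamics concrete, the sketch below simulates one decision epoch: it computes the per-type arrival rates, forms both players' one-stage rewards as defined above, and samples the next holding-time state. It reuses packet_arrival_rate from the Section 3.2 sketch; the sign conventions and the truncation at K follow our reading of the model, so treat it as illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def h_iter(X, n, A, Q):
    """Apply the Lyapunov operator n times: h^n(X)."""
    for _ in range(n):
        X = A @ X @ A.T + Q
    return X

def mdp_step(s, lam, deltas, eta, alpha, beta, sigma, kappa, K, P_bar, A, Q, C_d, C_a):
    """One epoch of the two-player MDP from state s (the holding time).

    deltas[j], beta[j], eta[j]: power, channel gain, and prior of type j.
    Returns (s_next, r_d, r_a) where r_a is a list over attacker types.
    """
    H = len(beta)
    mu = [packet_arrival_rate(lam, deltas[j], alpha, beta[j], sigma, kappa)
          for j in range(H)]
    # Expected one-step error trace f_k^j for each attacker type
    tr_good = np.trace(P_bar)
    tr_bad = np.trace(h_iter(P_bar, s + 1, A, Q))
    f = [m * tr_good + (1.0 - m) * tr_bad for m in mu]

    r_d = -sum(eta[j] * f[j] for j in range(H)) - C_d * lam       # sensor reward
    r_a = [f[j] + C_d * lam - C_a * deltas[j] for j in range(H)]  # per-type attacker reward

    # Sample the true attacker type, then the packet outcome, to move the state
    j = int(rng.choice(H, p=eta))
    s_next = 0 if rng.random() < mu[j] else min(s + 1, K)
    return s_next, r_d, r_a
```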

5. Bayesian Stackelberg Game

Based on the above MDP framework, a BSG with two players is investigated to design the optimal strategies for both the sensor and the attacker. In this hierarchical game, the sensor, as the leader, first takes its action according to the state and declares its strategy; it is noted that in a Stackelberg game the leader anticipates the reaction of the follower. Then, the attacker, as the follower, takes its action based on its acquired channel gain and the transmission power of the sensor. The payoff functions of the sensor and the $j$-th type attacker are given as (18) and (19), respectively. The SE of the BSG is analyzed and the corresponding optimal strategies for both players are provided. The best response of each side is first defined as follows.
Definition 1.
The best response is that a player takes an action that optimizes its own payoff while taking into account other players’ actions. Specifically, the best responses for the sensor and j-th type attacker are given as
$$\phi^*_d(\pi_a) = \arg\max_{\pi_d \in \Pi_d} J_d(s, \pi_d, \pi_a),$$
$$\phi^{j,*}_a(\pi_d) = \arg\max_{\pi^j_a \in \Pi^j_a} J^j_a(s, \pi_d, \pi^j_a), \quad \forall j \in N.$$
Then, the best responses for the attacker are denoted as
$$\phi^*_a(\pi_d) = \big[\phi^{1,*}_a(\pi_d), \phi^{2,*}_a(\pi_d), \ldots, \phi^{H,*}_a(\pi_d)\big].$$
The best-response set of the $j$-th type attacker to the sensor's strategy $\pi_d$ is denoted as $R^j_a(\pi_d)$, and $R_a(\pi_d) = [R^1_a(\pi_d), \ldots, R^H_a(\pi_d)]$ is the best-response set of the attacker to strategy $\pi_d$.
Lemma 1.
For the proposed two-player BSG, the Stackelberg equilibrium strategy $\pi^{SE}_d$ of the sensor satisfies
$$J_d\big(s, \pi^{SE}_d, \phi^*_a(\pi^{SE}_d)\big) \geq J_d\big(s, \pi_d, \phi^*_a(\pi_d)\big), \quad \forall \pi_d \in \Pi_d,$$
where $\phi^*_a(\pi^{SE}_d)$ is expressed as in (22).
Proof. 
In the BSG, each type of attacker knows its own channel gain and observes the declared strategy of the sensor, and therefore takes the best response that maximizes its own payoff function. The sensor chooses the optimal strategy by taking the followers' best responses into account to maximize its own payoff function, which completes the proof. □
Based on Lemma 1, the solution to the proposed Stackelberg game with incomplete channel gain information is given by the following theorem.
Theorem 1.
The solution to the proposed BSG is given by first solving
$$\pi^{SE}_d = \phi^*_d \circ \phi^*_a(\pi^{SE}_d),$$
and then calculating
$$\pi^{SE}_a = \big[\pi^{1,SE}_a, \pi^{2,SE}_a, \ldots, \pi^{H,SE}_a\big],$$
where
$$\pi^{j,SE}_a = \phi^{j,*}_a(\pi^{SE}_d), \quad j \in N.$$
Therefore, the SE of this game is $(\pi^{SE}_d, \pi^{SE}_a) \in \Pi_d \times \Pi_a$.
Proof. 
In the proposed BSG, the sensor as the leader and the DoS attacker as the follower make decisions sequentially. Given the sensor's strategy $\pi_d$, the $j$-th type attacker takes the best response $\pi^j_a = \phi^{j,*}_a(\pi_d)$. The sensor anticipates the reaction of the attacker; hence, it chooses the optimal strategy that maximizes its own payoff function while taking the attacker's best response into account, that is,
$$\pi^{SE}_d = \arg\max_{\pi_d \in \Pi_d} J_d\big(s, \pi_d, \phi^*_a(\pi_d)\big) = \phi^*_d \circ \phi^*_a(\pi^{SE}_d).$$
After obtaining the sensor's strategy $\pi^{SE}_d$, the $j$-th type attacker chooses the optimal strategy
$$\pi^{j,SE}_a = \phi^{j,*}_a(\pi^{SE}_d), \quad j \in N,$$
which completes the proof. □
Remark 3.
This theorem provides a constructive method to find the SE for the proposed BSG. It confirms that within the finite strategy spaces considered, the sensor can determine its optimal strategy by anticipating and incorporating the attacker’s best response, which, in turn, is computed based on the sensor’s declared action. This fixed-point characterization forms the theoretical foundation for the learning algorithm developed in Section 6.
Proposition 1.
There exists at least one SE in the BSG.
Proof. 
Given the sensor's strategy $\pi_d$, the $j$-th type attacker chooses its strategy from the set $\Pi^j_a$. Given the final state $K$, the numbers of possible strategies for the sensor and the $j$-th type attacker are $l_d^{K+1}$ and $(l^j_a)^{K+1}$, respectively. Since the $j$-th type attacker's strategy space $\Pi^j_a$ is finite, its best response $\phi^{j,*}_a(\pi_d)$ always exists for any fixed sensor strategy $\pi_d$. Moreover, the finiteness of the sensor's strategy space implies the existence of an equilibrium strategy for the sensor by (24). Therefore, there always exists an SE for the proposed BSG. □
Remark 4.
This conclusion substantiates the existence and characterization of the SE in our Bayesian game setting. It implies that despite the incomplete information regarding the attacker’s type, an optimal hierarchical strategy profile exists and can be sought through iterative best-response dynamics.
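Because the strategy spaces are finite, the leader-follower solution at a fixed state reduces to plain enumeration: each attacker type best-responds to every candidate sensor action, and the sensor then maximizes its own value given those anticipated responses. A minimal sketch follows; the array shapes and names are our own conventions, not the authors' implementation.

```python
import numpy as np

def stackelberg_step(Qd, Qa_list):
    """Compute the Stackelberg-equilibrium actions at one state.

    Qd: array of shape (n_lam, n_d1, ..., n_dH) with the sensor's values.
    Qa_list[j]: array of shape (n_lam, n_dj) with the type-j attacker's values.
    """
    candidates = []
    for lam in range(Qd.shape[0]):
        # each follower type best-responds to the declared sensor action
        resp = tuple(int(np.argmax(Qa[lam])) for Qa in Qa_list)
        candidates.append((Qd[(lam,) + resp], lam, resp))
    # the leader maximizes its own value, anticipating those responses
    _, lam_star, resp_star = max(candidates, key=lambda t: t[0])
    return lam_star, resp_star
```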

6. Reinforcement Learning

As mentioned previously, the sensor is unable to access the exact SINR of the AWGN channel. In this section, model-free reinforcement learning, namely Q-learning, is introduced to find the SE. A Stackelberg Q-learning algorithm for the two-player BSG is then used to obtain the optimal policies of both the sensor and the attacker.
The optimal payoff functions are given as
$$J^*_d(s) = \max_{\pi_d \in \Pi_d} \min_{\pi_a \in R_a(\pi_d)} Q^*_d(s, \lambda, \delta),$$
$$J^{j,*}_a(s) = \max_{\pi^j_a \in R^j_a(\pi^{SE}_d)} Q^{j,*}_a(s, \lambda^{SE}, \delta^j),$$
where the optimal state–action value functions (Q-functions) for the sensor and the $j$-th type attacker are, respectively, defined as follows:
$$Q^*_d(s, \lambda, \delta) = r_d(s, \lambda, \delta) + \rho \sum_{s' \in S} \sum_{j=1}^{H} \eta(\beta_j) \Pr(s' \mid s, \lambda, \delta^j)\, J^*_d(s'),$$
$$Q^{j,*}_a(s, \lambda, \delta^j) = r^j_a(s, \lambda, \delta^j) + \rho \sum_{s' \in S} \Pr(s' \mid s, \lambda, \delta^j)\, J^{j,*}_a(s').$$
Then, the SE strategy is obtained by
$$\pi_d(s) = \arg\max_{\pi_d \in \Pi_d} \min_{\pi_a \in R_a(\pi_d)} Q^*_d(s, \lambda, \delta),$$
$$\pi^j_a(s) = \arg\max_{\pi^j_a \in R^j_a(\pi^{SE}_d)} Q^{j,*}_a(s, \lambda^{SE}, \delta^j).$$
Considering the multi-type channel interference gain of the attacker, the Q-value functions of the sensor and the attacker are recursively updated according to the SE presented in Theorem 1:
$$Q_{k+1,d}(s, \lambda, \delta) = (1 - \alpha_k)\, Q_{k,d}(s, \lambda, \delta) + \alpha_k \Psi_k\big(Q_{k,d}(s, \lambda, \delta)\big),$$
$$Q^j_{k+1,a}(s, \lambda, \delta^j) = (1 - \alpha_k)\, Q^j_{k,a}(s, \lambda, \delta^j) + \alpha_k \Psi_k\big(Q^j_{k,a}(s, \lambda, \delta^j)\big),$$
where
$$\Psi_k\big(Q_{k,d}(s, \lambda, \delta)\big) = r_d(s, \lambda, \delta) + \rho \sum_{s' \in S} \Big( \sum_{j=1}^{H} \eta(\beta_j) \Pr(s' \mid s, \lambda, \delta^j) \Big)\, \mathrm{stackelberg}\, Q_{k,d}(s', \lambda, \delta),$$
$$\Psi_k\big(Q^j_{k,a}(s, \lambda, \delta^j)\big) = r^j_a(s, \lambda, \delta^j) + \rho \sum_{s' \in S} \Pr(s' \mid s, \lambda, \delta^j)\, \mathrm{stackelberg}\, Q^j_{k,a}(s', \lambda, \delta^{j,*}),$$
with learning rate $\alpha_k \in [0, 1)$; $\mathrm{stackelberg}\, Q_{k,d}(s', \lambda, \delta)$ and $\mathrm{stackelberg}\, Q^j_{k,a}(s', \lambda, \delta^{j,*})$ are the Q-values of the SE solutions at state $s'$ for the sensor and the $j$-th type attacker, respectively.
In the BSG, the sensor and the attacker act as the leader and the follower, respectively; hence, the Stackelberg Q-values are also updated hierarchically. To find the maximum Q-value of the attacker at state $s'$, the attacker chooses the optimal action from the best response (26). Based on the observation of the sensor's action, the optimal action that the $j$-th type attacker takes in response to the sensor's action $\lambda$ is determined as
$$\phi^{j,*}_a(\lambda) = \arg\max_{\delta^j} Q^j_{k,a}(s', \lambda, \delta^j).$$
Then, the optimal action of the attacker is given as
$$\phi^*_a(\lambda) = \big[\phi^{1,*}_a(\lambda), \ldots, \phi^{H,*}_a(\lambda)\big].$$
Based on the BSG framework, the optimal action of the sensor is given by
$$\lambda^* = \arg\max_{\lambda} Q_{k,d}\big(s', \lambda, \phi^*_a(\lambda)\big).$$
Consequently, we compute $\mathrm{stackelberg}\, Q_{k,d}(s', \lambda, \delta)$ and $\mathrm{stackelberg}\, Q^j_{k,a}(s', \lambda, \delta^{j,*})$ in the Q-function updates as
$$\mathrm{stackelberg}\, Q_{k,d}(s', \lambda, \delta) = Q_{k,d}\big(s', \lambda^*, \phi^*_a(\lambda^*)\big),$$
$$\mathrm{stackelberg}\, Q^j_{k,a}(s', \lambda, \delta^{j,*}) = Q^j_{k,a}\big(s', \lambda^*, \phi^{j,*}_a(\lambda^*)\big).$$
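Combining the updates (32) and (33) with the hierarchical action selection above, one learning step can be sketched as follows. It reuses stackelberg_step from the Section 5 sketch and treats actions as table indices; this is an illustrative reading of the update, not the authors' code.

```python
def stackelberg_q_update(Qd, Qa_list, s, lam, d_idx, r_d, r_a, s_next, rho, alpha_k):
    """One update of (32) and (33), with the 'stackelberg' target obtained by
    re-solving the leader-follower step on the next-state slices of the tables.

    Qd has shape (n_states, n_lam, n_d1, ..., n_dH); Qa_list[j] has shape
    (n_states, n_lam, n_dj). lam and d_idx are the indices of the actions taken.
    """
    lam_star, resp_star = stackelberg_step(Qd[s_next], [Qa[s_next] for Qa in Qa_list])

    # Leader: move Q_d toward its reward plus the discounted SE value at s_next
    idx_d = (s, lam) + tuple(d_idx)
    target_d = r_d + rho * Qd[(s_next, lam_star) + resp_star]
    Qd[idx_d] = (1.0 - alpha_k) * Qd[idx_d] + alpha_k * target_d

    # Followers: each type moves toward its reward plus its own SE value
    for j, Qa in enumerate(Qa_list):
        idx_a = (s, lam, d_idx[j])
        target_a = r_a[j] + rho * Qa[s_next, lam_star, resp_star[j]]
        Qa[idx_a] = (1.0 - alpha_k) * Qa[idx_a] + alpha_k * target_a
```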
As a result, the Bayesian Stackelberg Q-learning algorithm for incomplete channel gain information is presented in Algorithm 1. Key implementation details are as follows. The Q-values of both players are initialized to zero, a common practice that biases neither the asymptotic convergence nor the final learned policy. An $\epsilon$-greedy exploration strategy is employed: with probability $\epsilon$ the actions are chosen randomly, and with probability $1 - \epsilon$ they are chosen as the SE actions based on the current Q-values. The learning rate $\alpha_k$ for updating the Q-tables follows a schedule that satisfies the Robbins–Monro conditions (as required by Theorem 2), typically by decaying with the number of visits to each state–action pair. The specific forms of $\alpha_k$ and $\epsilon$ used in our simulations are provided in Section 7.
Algorithm 1 Bayesian Stackelberg Q-learning Algorithm with Incomplete Channel Gain Information
Input: System parameter matrices $A$, $C$, $Q$, $R$, and noise power $\sigma^2$
1: Initialization: Initialize $Q_{0,d}(s, \lambda, \delta)$ and $Q^j_{0,a}(s, \lambda, \delta^j)$ for all $j \in N$, $s \in S$, $\lambda \in \Lambda$, and $\delta^j \in \Delta^j$; initial state $s_0$; total iterations $T$; exploration probability $\varepsilon$. Set the iteration counter $k = 1$;
2: while $k < T$ do
3:   Draw a random number $\zeta \in [0, 1]$;
4:   if $\zeta \in [\varepsilon, 1]$ then
5:     Take actions $\lambda$ and $\delta^j$ for $j \in N$ based on (36)–(38);
6:   else
7:     Select random actions $\lambda$ and $\delta^j$ for $j \in N$;
8:   end if
9:   Obtain rewards $r_d(s, \lambda, \delta)$ and $r^j_a(s, \lambda, \delta^j)$;
10:  Update $Q_{k+1,d}(s, \lambda, \delta)$ and $Q^j_{k+1,a}(s, \lambda, \delta^j)$ by (32) and (33), and move to the next state;
11:  $k = k + 1$;
12: end while
Output: The optimal strategies and the optimal payoff functions for the sensor and the attacker
Remark 5.
The per-iteration computational cost of the Stackelberg Q-learning algorithm (Algorithm 1) is linear in the product of the state space size | S | and the action space sizes | Λ | × | Δ j | , as it requires updating Q-tables for all state–action pairs encountered. This makes it efficient for the moderately sized problems presented here. For systems with significantly larger state or action spaces, the well-known curse of dimensionality would necessitate the use of function approximation (e.g., deep Q-networks), which is a promising direction for future work, as noted in Section 8.
The convergence of Algorithm 1 to the optimal Q-functions is guaranteed under the conditions specified in Theorem 2. The learning rate and exploration schedules described above are designed to satisfy these conditions, in particular the Robbins–Monro requirements for the learning rate.
Theorem 2
([36]). The Bayesian Stackelberg Q-learning sequences described in (32) and (33) converge to the optimal values if the following assumptions hold:
1. The recursive processes (32) and (33) converge to $Q^*_d(s, \lambda, \delta)$ and $Q^{j,*}_a(s, \lambda, \delta^j)$ with probability 1, respectively.
2. There exist a number $a \in (0, 1)$ and a sequence $\varsigma_k \geq 0$ converging to zero with probability 1 such that
$$\|\Psi_k(Q_d) - \Psi(Q^*_d)\| \leq a\, \|Q_d - Q^*_d\| + \varsigma_k, \quad \forall Q_d \in \mathcal{Q},$$
$$\|\Psi_k(Q^j_a) - \Psi(Q^{j,*}_a)\| \leq a\, \|Q^j_a - Q^{j,*}_a\| + \varsigma_k, \quad \forall Q^j_a \in \mathcal{Q},$$
where $\mathcal{Q}$ is the Q-value space.
3. The learning rate $\alpha_k$ satisfies $\alpha_k \in [0, 1)$, and $\sum_{k=1}^{n} \alpha_k$ diverges to infinity uniformly as $n \to \infty$.
Remark 6.
Condition 1 assumes that the recursive updates for both players converge with probability 1, meaning that the learning process is inherently stable and reaches a fixed point; this is approximated by implementing ε-greedy exploration in Algorithm 1 and running the training for a large number of steps. Condition 2 requires the update operator to be a contraction with a vanishing perturbation, which drives the algorithm toward a unique fixed point as the noise decays. The learning rate in Condition 3 must satisfy the Robbins–Monro conditions for stochastic approximation. In the simulation section, the learning rate $\alpha_k$ is designed as a nonzero decreasing function of the time step $k$ and the current state and actions.

7. Simulation Results

In this section, a numerical example is provided to verify the performance of the proposed game-theoretic strategy. The following system parameters are considered:
$$A = \begin{bmatrix} 1 & 0.5 \\ 0 & 1.05 \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 0 \end{bmatrix}, \quad Q = 0.5 I_2, \quad R = 0.5.$$
Then, the steady-state error covariance is $\bar{P} = \begin{bmatrix} 0.3802 & 0.2840 \\ 0.2840 & 1.6894 \end{bmatrix}$. The AWGN power and the modulation parameter are $\sigma_k = 0.5$ and $\kappa = 0.2$, respectively. The fading channel gain and the action space of the sensor are $\alpha = 0.7$ and $A_d = \{1, 2, 3\}$. Assume that the fading channel gains of the DoS attacker are $[0.5, 0.8]$ with probability distribution $[0.8, 0.2]$. This setting creates a representative asymmetric-information scenario in which the sensor must design a strategy against an attacker that is most likely to have moderate interference capability while also accounting for a significant possibility of facing a more potent attacker. Correspondingly, the action spaces of the two attacker types are $A^1_a = \{4, 5, 6\}$ and $A^2_a = \{1, 2, 3\}$, respectively. The unit power costs for the sensor and the attacker are $C_d = 1$ and $C_a = 2$, respectively. Set the final state to $K = 4$; the state space is then $S = \{0, 1, 2, 3, 4\}$, which means the estimation error covariance takes values from $\{\bar{P}, h(\bar{P}), \ldots, h^4(\bar{P})\}$. The discount factor of the payoff functions is $\rho = 0.9$.
In the Bayesian Stackelberg Q-learning algorithm, the learning rate and the initial exploration rate are set as $\alpha_k = 10 / [15 + \mathrm{count}(s, \lambda, \delta)]$ and $\epsilon_0 = 0.98$, respectively, where $\mathrm{count}(s, \lambda, \delta)$ is the number of occurrences of the state–action pair $(s, \lambda, \delta)$. This design ensures that the learning rate satisfies the Robbins–Monro conditions, a standard requirement for the convergence of stochastic approximation algorithms such as Q-learning. The initial values are set as $Q_{0,d}(s, \lambda, \delta) = 0$ and $Q^j_{0,a}(s, \lambda, \delta^j) = 0$ for all $j \in N$.
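For reproducibility, the simulation setup can be assembled from the earlier sketches as follows; the variable names are ours, and the printed $\bar{P}$ should match the value reported above.

```python
import numpy as np

# System and game parameters from this section
A = np.array([[1.0, 0.5], [0.0, 1.05]])
C = np.array([[1.0, 0.0]])
Q = 0.5 * np.eye(2)
R = np.array([[0.5]])
sigma, kappa, alpha_gain = 0.5, 0.2, 0.7
beta, eta = [0.5, 0.8], [0.8, 0.2]      # attacker channel gains and their prior
Lam = [1, 2, 3]                          # sensor power levels
Delta = [[4, 5, 6], [1, 2, 3]]           # per-type attacker power levels
C_d, C_a, rho, K = 1.0, 2.0, 0.9, 4

P_bar = steady_state_covariance(A, C, Q, R)  # from the sketch in Section 3.1
print(np.round(P_bar, 4))                    # expect approx [[0.3802, 0.2840], [0.2840, 1.6894]]

def learning_rate(count: int) -> float:
    """alpha_k = 10 / (15 + count(s, lam, delta)), decaying per visit."""
    return 10.0 / (15.0 + count)
```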
After 100,000 learning steps, the Q-function values in the different states converge. The equilibrium payoffs of the sensor and the attacker for states $s = \bar{P}, \ldots, h^4(\bar{P})$ are given as follows:
$$J^*_d(s) = [-131.9945, -141.6237, -149.8064, -155.1217, -155.1217],$$
$$J^{1,*}_a(s) = [81.5995, 87.9412, 95.6919, 101.0546, 101.0546],$$
$$J^{2,*}_a(s) = [97.7494, 103.6945, 111.4222, 116.5348, 116.5348].$$
The corresponding optimal transmission and interference strategies are given as follows:
$$\pi^*_d(s) = [1, 3, 3, 3, 3], \quad \pi^{1,*}_a(s) = [4, 4, 5, 6, 6], \quad \pi^{2,*}_a(s) = [1, 3, 3, 3, 3].$$
The resulting Stackelberg equilibrium strategies are explicitly state-dependent, quantitatively showing how the sensor optimally increases transmission power as the estimation error grows. Finally, the equilibrium payoffs are quantitatively characterized, showing the cost–performance trade-off across states. The optimal strategies under the BSG confirm the intuition that a relatively small error covariance at the remote estimator enables the sensor to use less power for transmission, also leading to less interference power for the attacker. For a larger estimation error covariance, the sensor is inclined to choose a high power level to increase the packet arrival rate and, therefore, improve estimation performance. In this case, the attacker also increases the interference power accordingly.
The algorithm converges to distinct, stable Q-values for different state–action pairs. Because the Q-tables $Q_k(s, \lambda, \delta)$ are high-dimensional, we next present the learning process for the initial state $s = \bar{P}$ as an illustrative example. When the sensor takes action $\lambda = 1$, the corresponding Q-functions converge as shown in Table 1. Table 2 lists all converged Q-values of the attacker. We can see that the optimal strategy of the attacker is $\delta = (4, 1)$ under state $s = \bar{P}$.
Figure 2, Figure 3 and Figure 4 depict the learning processes for the optimal strategies of the sensor, the first type attacker, and the second type attacker for state P ¯ , respectively. In the figures, different colored lines respectively represent the iterative Q-values under different actions of the sensor and attacker. These figures collectively depict the convergence of Q-functions for all possible action combinations at state P ¯ . The high density of lines illustrates the learning algorithm’s exploration across the entire strategy space. While individual lines are not labeled for clarity, their collective trend towards stabilization after around 5000 iterations is evident, confirming the convergence of the learning process. The precise converged Q-values that underpin the optimal strategies are detailed in Table 1 and Table 2.
Next, we consider the situation in which the attacker can launch attacks almost without cost constraints, that is, the cost of the interference power can be ignored. In the case of $C_a = 0.01$, the equilibrium payoffs of the sensor and the attacker for states $s = \bar{P}, \ldots, h^4(\bar{P})$ are given as follows:
$$J^*_d(s) = [-138.7763, -148.5242, -156.6659, -161.9812, -161.9812],$$
$$J^{1,*}_a(s) = [189.0304, 196.9475, 205.1795, 210.5454, 210.5454],$$
$$J^{2,*}_a(s) = [155.1126, 162.1956, 169.9786, 175.0912, 175.0912].$$
The corresponding optimal transmission and interference strategies are given as follows:
$$\pi^*_d(s) = [1, 3, 3, 3, 3], \quad \pi^{1,*}_a(s) = [6, 6, 6, 6, 6], \quad \pi^{2,*}_a(s) = [3, 3, 3, 3, 3].$$
The optimal strategies confirm that when the cost of interference power C a is negligible, both types of attackers consistently select the highest available power level in their respective action sets. This behavior aligns perfectly with game-theoretic intuition: as the marginal cost of interference diminishes, the rational objective for the attacker shifts overwhelmingly towards maximizing the degradation of the estimation performance, leading to the selection of maximum jamming power. This result underscores the critical role of cost parameters in shaping the equilibrium of the security game.
The learning processes for the optimal strategies of the sensor, the first type attacker, and the second type attacker for state P ¯ are shown in Figure 5, Figure 6 and Figure 7. Similar to the previous case, the convergence trends for all action combinations are shown. The eventual stabilization of all trajectories validates the algorithm’s effectiveness even in this complex setting. Notably, as seen in Figure 5, the Q-values of the sensor exhibit significant fluctuation and converge more slowly compared to those of the attackers in Figure 6 and Figure 7. This can be attributed to the information asymmetry inherent in the problem: the sensor lacks precise knowledge of the attacker’s channel gain and must learn an optimal strategy based only on the probabilistic distribution over attacker types. This uncertainty inherently increases the exploration burden and complexity of the learning process for the leader.
Therefore, the numerical results validate the algorithm's effectiveness by demonstrating its capability to solve the proposed BSG model: the algorithm converges to a stable equilibrium in which both players' strategies are mutually optimal in the sequential sense, as theorized in Section 5 and Section 6. The logical adaptation of strategies to different states and costs further confirms that the learned policies align with game-theoretic intuition, providing strong empirical support for the proposed framework.

8. Conclusions

This paper investigated the Bayesian Stackelberg game for remote state estimation under SINR-based DoS attacks. The two players sequentially decide their transmission and interference powers under incomplete information. Specifically, the sensor lacks exact knowledge of the attacker’s fading channel gain. The optimization problem over an infinite-time horizon is first formulated as an MDP with finite state and action spaces. By taking advantage of the probabilistic information about the channel interference gain, a BSG is modeled to describe the iterative decision-making process between the sensor and the various types of attackers. Based on the solution to the BSG, a Stackelberg Q-learning algorithm is used to obtain the optimal strategies of the two players. Numerical results validate the effectiveness of the proposed game-theoretic approach in the case of uncertain channel gains. The presented framework operates under core assumptions of a known attacker type distribution and discrete action spaces, which define its current scope but also present opportunities for future generalization. Future work also includes analyzing games where both players have incomplete information and extending the framework to larger-scale systems via function approximation (e.g., deep reinforcement learning) to mitigate the curse of dimensionality.

Author Contributions

Conceptualization, D.D.; methodology, D.D.; software, D.D.; validation, D.D., P.Y. and M.Q.; formal analysis, D.D.; investigation, D.D.; resources, D.D.; data curation, D.D.; writing—original draft preparation, D.D.; writing—review and editing, D.D., P.Y. and M.Q.; visualization, D.D.; supervision, P.Y. and M.Q.; project administration, M.Q. and P.Y.; funding acquisition, P.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ahmed, S.H.; Kim, G.; Kim, D. Cyber physical system: Architecture, applications and research challenges. In Proceedings of the 2013 IFIP Wireless Days (WD), Valencia, Spain, 13–15 November 2013; pp. 1–5. [Google Scholar]
  2. Duo, W.; Zhou, M.; Abusorrah, A. A survey of cyber attacks on cyber physical systems: Recent advances and challenges. IEEE/CAA J. Autom. Sin. 2022, 9, 784–800. [Google Scholar] [CrossRef]
  3. Sun, Y.-C.; Gao, K.; Chen, L.; Yang, F.; An, L. Optimal transmission scheduling for remote state estimation under active eavesdropping-based DoS attacks. IEEE Trans. Syst. Man Cybern. Syst. 2025, 55, 7487–7498. [Google Scholar] [CrossRef]
  4. Zhang, H.; Cheng, P.; Shi, L.; Chen, J. Optimal DoS attack scheduling in wireless networked control system. IEEE Trans. Control Syst. Technol. 2016, 24, 843–852. [Google Scholar] [CrossRef]
  5. Saiyed, M.F.; Al-Anbagi, I. A game theoretic model for strategic defence selection against DDoS attacks in IoT networks. IEEE Trans. Netw. Serv. Manag. 2025, 22, 4509–4524. [Google Scholar] [CrossRef]
  6. Cai, X.; Xiao, F.; Wei, B. Resilient Nash equilibrium seeking in multiagent games under false data injection attacks. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 275–284. [Google Scholar] [CrossRef]
  7. Lian, J.; Jia, P.; Wu, F.; Huang, X. A Stackelberg game approach to the stability of networked switched systems under Dos attacks. IEEE Trans. Netw. Sci. Eng. 2023, 10, 2086–2097. [Google Scholar] [CrossRef]
  8. Wang, Q.; Liu, C.; Lan, J.; Ren, X.; Meng, Y.; Wang, X. Distributed secure surrounding control for multiple USVs against deception attacks: A Stackelberg game approach with reinforcement learning. IEEE Trans. Intell. Veh. 2024, 9, 7003–7015. [Google Scholar] [CrossRef]
  9. Liu, H. SINR-based multi-channel power schedule under DoS attacks: A Stackelberg game approach with incomplete information. Automatica 2019, 100, 274–280. [Google Scholar] [CrossRef]
  10. Wang, X.; Zhu, H.; Luo, X.; Guan, X. Data-driven-based detection and localization framework against false data injection attacks in DC microgrids. IEEE Internet Things J. 2025, 12, 36079–36093. [Google Scholar] [CrossRef]
  11. Zhou, J.; Shang, J.; Chen, T. Cybersecurity landscape on remote state estimation: A comprehensive review. IEEE/CAA J. Autom. Sin. 2024, 11, 851–865. [Google Scholar] [CrossRef]
  12. Attkan, A.; Ranga, V. Cyber-physical security for IoT networks: A comprehensive review on traditional, blockchain and artificial intelligence based key-security. Complex Intell. Syst. 2022, 8, 3559–3591. [Google Scholar] [CrossRef]
  13. Kim, S.; Park, K.-J.; Lu, C. A survey on network security for cyber-physical systems: From threats to resilient design. IEEE Commun. Surv. Tutor. 2022, 24, 1534–1573. [Google Scholar] [CrossRef]
  14. Huseinović, A.; Mrdović, S.; Bicakci, K.; Uludag, S. A survey of denial-of-service attacks and solutions in the smart grid. IEEE Access 2020, 8, 177447–177470. [Google Scholar] [CrossRef]
  15. Huang, M.; Ding, K.; Dey, S.; Li, Y.; Shi, L. Learning-based DoS attack power allocation in multiprocess systems. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 8017–8030. [Google Scholar] [CrossRef]
  16. Alpcan, T.; Başar, T. Network Security: A Decision and Game-Theoretic Approach; Cambridge University Press: Cambridge, UK, 2010. [Google Scholar]
  17. Li, H.; Lai, L.; Qiu, R.C. A denial-of-service jamming game for remote state monitoring in smart grid. In Proceedings of the 2011 45th Annual Conference on Information Sciences and Systems, Baltimore, MD, USA, 23–25 March 2011; pp. 1–6. [Google Scholar]
  18. Liu, S.; Liu, P.X.; Saddik, A.E. A stochastic game approach to the security issue of networked control systems under jamming attacks. J. Frankl. Inst. 2014, 351, 4570–4583. [Google Scholar] [CrossRef]
  19. Li, Y.; Quevedo, D.E.; Dey, S.; Shi, L. A game-theoretic approach to fake-acknowledgment attack on cyber-physical systems. IEEE Trans. Signal Inf. Process. Over Netw. 2017, 3, 1–11. [Google Scholar] [CrossRef]
  20. Li, Y.; Shi, L.; Cheng, P.; Chen, J.; Quevedo, D.E. Jamming attacks on remote state estimation in cyber-physical systems: A game-theoretic approach. IEEE Trans. Autom. Control 2015, 60, 2831–2836. [Google Scholar] [CrossRef]
  21. Li, Y.; Quevedo, D.E.; Dey, S.; Shi, L. SINR-based DoS attack on remote state estimation: A game-theoretic approach. IEEE Trans. Control Netw. Syst. 2017, 4, 632–642. [Google Scholar] [CrossRef]
  22. Ding, K.; Li, Y.; Quevedo, D.E.; Dey, S.; Shi, L. A multi-channel transmission schedule for remote state estimation under DoS attacks. Automatica 2017, 78, 194–201. [Google Scholar] [CrossRef]
  23. Wilczyński, A.; Jakóbik, A.; Kołodziej, J. Stackelberg security games: Models, applications and computational aspects. J. Telecommun. Inf. Technol. 2016, 65, 70–79. [Google Scholar] [CrossRef]
  24. Feng, Y.; Shou, Y.; Yu, X. Jamming on remote estimation over wireless links under faded uncertainty: A Stackelberg game approach. IEEE Trans. Circuits Syst. II Express Briefs 2021, 68, 2593–2597. [Google Scholar] [CrossRef]
  25. Li, Y.; Shi, D.; Chen, T. False data injection attacks on networked control systems: A Stackelberg game analysis. IEEE Trans. Autom. Control 2018, 63, 3503–3509. [Google Scholar] [CrossRef]
  26. Xing, W.; Zhao, X.; Li, Y.; Liu, L. Denial-of-service attacks on cyber-physical systems against linear quadratic control: A Stackelberg-game analysis. IEEE Trans. Autom. Control 2025, 70, 595–602. [Google Scholar] [CrossRef]
  27. Sun, Y.-C.; Gao, K.; Chen, L.; Yang, F.; Yao, L. Optimal power schedule for distributed Kalman filtering under DoS attacks: A Stackelberg game strategy. Int. J. Syst. Sci. 2025, 56, 2067–2081. [Google Scholar] [CrossRef]
  28. Chen, L.; An, L. Optimal stealthy robust attacks on stackelberg game. Automatica 2025, 177, 112310. [Google Scholar] [CrossRef]
  29. Jia, L.; Yao, F.; Sun, Y.; Niu, Y.; Zhu, Y. Bayesian Stackelberg game for antijamming transmission with incomplete information. IEEE Commun. Lett. 2016, 20, 1991–1994. [Google Scholar] [CrossRef]
  30. Wang, Y.; Xing, W.; Zhang, J.; Liu, L.; Zhao, X. Remote state estimation under DoS attacks in CPSs with arbitrary tree topology: A Bayesian Stackelberg game approach. IEEE Trans. Signal Inf. Process. Over Netw. 2024, 10, 527–538. [Google Scholar] [CrossRef]
  31. Ding, K.; Ren, X.; Quevedo, D.E.; Dey, S.; Shi, L. DoS attacks on remote state estimation with asymmetric information. IEEE Trans. Control Netw. Syst. 2019, 6, 653–666. [Google Scholar] [CrossRef]
  32. Cong, X.; Yu, Z.; Fanti, M.P.; Mangini, A.M.; Li, Z. Predictability verification of fault patterns in labeled Petri nets. IEEE Trans. Autom. Control 2025, 70, 1973–1980. [Google Scholar] [CrossRef]
  33. Burgat, J.; Doré, J.-B.; Farah, J.; Crussière, M. Vulnerability analysis of dynamic directional modulations: A multi-sensor receiver attack. In Proceedings of the 2024 IEEE Military Communications Conference (MILCOM), Washington, DC, USA, 28 October–1 November 2024; pp. 1–6. [Google Scholar]
  34. Zayyani, H.; Salman, M.; Hilal, H.A. Joint measurement and channel design of a malicious sensor in distributed estimation based on maximum disturbance in a sensor network. IEEE Sens. Lett. 2025, 9, 7000104. [Google Scholar] [CrossRef]
  35. Hiraoka, S.; Nakashima, Y.; Yamazato, T.; Arai, S.; Tadokoro, Y.; Tanaka, H. Interference-aided detection of subthreshold signal using beam control in polarization diversity reception. IEEE Commun. Lett. 2018, 22, 1926–1929. [Google Scholar] [CrossRef]
  36. Könönen, V. Asymmetric multiagent reinforcement learning. Web Intell. Agent Syst. 2004, 2, 105–121. [Google Scholar]
Figure 1. Remote state estimation over an AWGN channel under DoS attacks.
Figure 2. Convergence of Q-values for the sensor at state $\bar{P}$. The multitude of colored lines represents the learning trajectories of $Q_{k,d}(s, \lambda, \delta)$ for all possible combinations of the sensor's transmission power $\lambda$ and the attacker's interference power $\delta = (\delta^1, \delta^2)$. The collective convergence of all lines after approximately 5000 iterations demonstrates the stabilization of the learning process. The specific equilibrium Q-values that define the optimal strategy are listed in Table 1.
Figure 3. Convergence of Q-values for the first-type attacker at state $\bar{P}$. Each colored line corresponds to the learning trajectory of $Q^1_{k,a}(s, \lambda, \delta^1)$ for a specific pair of sensor power and attacker power. The convergence of all lines illustrates the algorithm's stability for this attacker type. The resulting equilibrium values are part of the dataset summarized in Table 2.
Figure 4. Convergence of Q-values for the second-type attacker at state $\bar{P}$. Each colored line corresponds to the learning trajectory of $Q^2_{k,a}(s, \lambda, \delta^2)$ for a specific pair of sensor power and attacker power. The convergence of all lines illustrates the algorithm's stability for this attacker type. The resulting equilibrium values are part of the dataset summarized in Table 2.
Figure 5. Convergence of Q-values for the sensor at state $\bar{P}$ under $C_a = 0.01$. The multitude of colored lines represents the learning trajectories of $Q_{k,d}(s, \lambda, \delta)$ for all possible combinations of the sensor's transmission power $\lambda$ and the attacker's interference power $\delta = (\delta^1, \delta^2)$.
Figure 6. Convergence of Q-values for the first-type attacker at state $\bar{P}$ under $C_a = 0.01$. Each colored line corresponds to the learning trajectory of $Q^1_{k,a}(s, \lambda, \delta^1)$ for a specific pair of sensor power and attacker power. The convergence of all lines illustrates the algorithm's stability for this attacker type.
Figure 7. Convergence of Q-values for the second-type attacker at state $\bar{P}$ under $C_a = 0.01$. Each colored line corresponds to the learning trajectory of $Q^2_{k,a}(s, \lambda, \delta^2)$ for a specific pair of sensor power and attacker power. The convergence of all lines illustrates the algorithm's stability for this attacker type.
Table 1. Converged Q-values $Q_d(s, \lambda, \delta)$ of the sensor for state $s = \bar{P}$ and sensor action $\lambda = 1$. Each row corresponds to a joint attacker action $\delta = (\delta^1, \delta^2)$.

δ = (δ¹, δ²)    Q_d(s, λ, δ)
(4, 1)          -131.5542
(5, 1)          -131.6878
(6, 1)          -131.7834
(4, 2)          -131.6996
(5, 2)          -131.8332
(6, 2)          -131.9288
(4, 3)          -131.7653
(5, 3)          -131.8989
(6, 3)          -131.9945
Table 2. Converged Q-values for both attacker types at state $s = \bar{P}$. Each row corresponds to a sensor action $\lambda$ and the attacker actions $\delta^1$, $\delta^2$, with columns showing the Q-values for attacker type 1 ($Q^1_a$ under action $\delta^1$) and type 2 ($Q^2_a$ under action $\delta^2$).

λ    δ¹   δ²   Q_a¹(s, λ, δ¹)   Q_a²(s, λ, δ²)
1    4    1    81.5995          97.7494
1    5    2    80.6239          96.0405
1    6    3    79.6724          94.4824
2    4    1    80.7636          97.4259
2    5    2    79.9470          96.3323
2    6    3    79.1444          95.2768
3    4    1    79.8811          96.7319
3    5    2    79.1792          95.9321
3    6    3    78.4862          95.1471
