1. Introduction
Cyber–physical systems (CPSs), which integrate computational capabilities with physical processes, have found extensive application across a range of industrial operations and critical infrastructure [
1]. However, unreliable wireless communication networks render these systems highly susceptible to malicious cyber attacks, such as false-data injection (FDI), eavesdropping, and denial-of-service (DoS) attacks, which pose severe threats to their operational security and reliability [
2].
In CPSs, security against malicious attacks, especially DoS attacks, has become a critical concern [
3,
4]. Game theory has emerged as a widespread and powerful analytical tool for modeling such attacker–defender interactions [
5]. However, most existing game-theoretic studies are built upon the Nash equilibrium concept, which assumes simultaneous decision-making among players [
6]. This assumption does not adequately capture the hierarchical structure often present in security scenarios, where a defender typically acts first and an attacker responds accordingly. To address this, Stackelberg games have been introduced, providing a more realistic framework for sequential decision-making in layered security strategies [
7,
8]. Furthermore, the majority of existing Stackelberg models rely on the assumption of complete information, where all players have perfect knowledge of each other’s parameters and payoffs. This is often impractical in real-world communication environments because some critical attributes such as channel conditions, interference gains, or attacker capabilities may not be fully accessible to the defender. The lack of such information significantly affects the strategic design and performance of defensive measures. Although some recent studies have begun to address information asymmetry, their modeling of uncertainty remains limited. For instance, most of these works focus on single-type uncertainty or assume that the attacker’s channel gain follows a simple two-valued distribution [
9]. Such simplifications fail to capture the practical situation where the attacker’s fading channel gain may follow a multi-type probability distribution.
Motivated by these gaps, this paper investigates remote state estimation under SINR-based DoS attacks within a Bayesian Stackelberg game framework. It is worth noting that our work focuses on the strategic response and resource allocation following the detection or presumption of an attack, rather than on the attack detection mechanisms themselves [
10]. While Bayesian Stackelberg models have been studied for general network security, this work distinctively integrates this framework into the SINR-based remote state estimation to handle multi-type channel gain uncertainty and further develops a model-free Stackelberg Q-learning algorithm to derive optimal strategies under this specific information structure. In the incomplete information setting, probabilistic distribution information is incorporated to model the uncertainty of the attacker’s fading channel gain, and a sequential power allocation strategy is developed to balance estimation accuracy and energy efficiency. The proposed approach not only extends current Stackelberg-based security analysis to more realistic communication scenarios but also offers implementable learning-based strategies for deriving equilibrium policies under information asymmetry. The main contributions of this paper are summarized as follows:
We study the strategic interaction between a sensor and a DoS attacker under energy constraints in an SINR-based remote state estimation system with unknown channel gain information. The sequential decision-making process is formulated as a finite-state and finite-action MDP, capturing the dynamic and uncertain nature of the environment.
To address the incomplete information regarding the attacker’s channel gain, we model the interaction as a BSG, where the attacker is treated as having multiple possible types. Within this framework, we define the best-response strategies, prove the existence of an SE, and analytically derive the SE strategies for both players.
A Stackelberg Q-learning algorithm is used to enable both players to learn their optimal strategies without prior knowledge of the opponent’s payoff structure. This model-free approach ensures adaptability and practicality in real-world settings. Finally, numerical simulations validate the effectiveness and robustness of the proposed BSG-based method, demonstrating its clear advantage over conventional Nash-based approaches under information asymmetry.
The remainder of this article is organized as follows.
Section 2 reviews related work on secure state estimation under cyber-attacks.
Section 3 establishes the system model for remote state estimation under SINR-based DoS attacks with incomplete channel gain information.
Section 4 formulates the sequential decision-making process as a two-player Markov Decision Process (MDP). Building upon the MDP,
Section 5 constructs the Bayesian Stackelberg Game (BSG) framework and analyzes the Stackelberg equilibrium.
Section 6 designs a Stackelberg Q-learning algorithm to compute the optimal strategies for both players.
Section 7 provides simulation results to verify the effectiveness of the proposed approach. Finally, conclusions are drawn in
Section 8.
2. Related Work
Accurate and secure state estimation serves as the cornerstone of reliable decision-making in CPSs. However, the widespread dependence on wireless networks for data transmission exposes the estimation process to various cyber attacks, including FDI, eavesdropping, and DoS attacks. These attacks compromise data integrity, availability, and confidentiality [
11]. These attacks can deliberately degrade estimation accuracy by corrupting, intercepting, or blocking data transmissions, which, in turn, destabilize system operation and threaten physical security [
12,
13].
DoS attacks, especially those targeting the signal-to-interference-and-noise ratio (SINR) of wireless channels, are the most disruptive to remote state estimation [
4,
14,
15]. The sensor, as the defender, aims to enhance the system performance with less transmission power. In contrast, the DoS attacker seeks to degrade estimation accuracy and force the sensor to expend more communication resources while minimizing its own attack cost. To describe such adversarial dynamics between attackers and defenders, game theory has emerged as a powerful analytical tool for the interactive decision-making process [
16,
17,
18,
19]. With energy constraints for both the sensor and the attacker, a Nash equilibrium (NE) of a zero-sum game was proven in [
20] to be the optimal strategies for both sides. Then, the authors in [
21] proposed a Markov game framework under SINR-based DoS attacks and applied a modified Nash Q-learning algorithm to obtain the optimal solutions. For multi-channel networks, a two-player zero-sum stochastic game was formulated to design the mixed strategies for the channel selection of the sensor and DoS attacker [
22].
It is worth pointing out that most works on attack–defense games adopt NE as the game solution, based on the assumption that the defender and the attacker choose their actions simultaneously; this is not applicable when the players deploy their strategies sequentially. Consequently, the Stackelberg game, with its explicit “leader–follower” structure, offers a more appropriate and realistic framework for capturing hierarchical interactions [
23]. For the SINR-based fading wireless network, a Stackelberg equilibrium (SE) within a two-player nonzero-sum Markov game framework was constructed in [
24] to obtain the transmission and interference strategies for the sensor and attacker, respectively. Using the Stackelberg game method, the linear quadratic Gaussian (LQG) control strategy was analyzed in [
25,
26] for FDI attacks and DoS attacks, respectively. For distributed Kalman filtering, a Stackelberg game-based distributed reinforcement learning algorithm was developed to produce joint optimal strategies based on local observation information [
27]. Within the Stackelberg game framework, an optimal stealthy robust attack method was designed based on Wasserstein ambiguity sets [
28].
The aforementioned Stackelberg-based studies rely on the assumption of complete information, which rarely holds in practical wireless communication scenarios. For the players in such games, crucial environmental information and the opponent’s key attributes are often difficult to obtain, such as energy budgets, acknowledgment (ACK) information, and channel gains of the wireless network. An anti-jamming Bayesian Stackelberg game was proposed in [
29], where the uncertainties of the channel state information and transmission cost information were incorporated. For the problem of multi-channel power scheduling under SINR-based DoS attacks, the SE was studied under two types of incomplete information, where the existence of the attacker and the total power of the attacker were, respectively, unknown to the defender [
9]. For a multi-hop network under DoS attacks, the defender had no access to the attacker’s available energy; a Bayesian Stackelberg game was therefore formulated, and a Stackelberg Q-learning algorithm was presented to obtain the SE [
30]. Furthermore, a stochastic Bayesian game was formulated in [
31], where the ACK information from the remote estimator to the sensor was hidden from the attacker. While game theory provides a robust framework for strategic analysis, it is also worth noting that formal methods have been extensively explored for the verification and analysis of CPSs under attacks [
32]. In contrast to these verification-based approaches, this paper emphasizes the adaptive decision-making process in scenarios where the sensor faces incomplete information about an attacker’s characteristics.
3. Problem Formulation
Notations: denotes the set of nonnegative integers. and are the n-dimensional Euclidean space and set of all matrices, respectively. and represent the sets of real symmetric positive semi-definite and positive definite matrices, respectively. If , we simply write ( if ). Notation means that the matrix is negative definite. is the probability of a random event. and stand for the expectation and covariance of a random vector, respectively. For functions g, h with appropriate domains, represents the function composition , with .
In this section, we introduce the remote state estimation system under DoS attack depicted in
Figure 1. The local sensor estimates the system states by the Kalman filter and then transmits the estimates to the remote estimator through the signal-to-noise ratio (SNR)-based network. However, the network is unreliable and may be attacked by a DoS attacker.
3.1. System and Sensor Model
Consider the discrete-time linear time invariant system:
where
is the system state and
is the measurement output. The process noise
and the measurement noise
are assumed to be independent white Gaussian noises with covariance matrices
and
, respectively. The initial state
is zero-mean Gaussian with covariance
, and independent of
and
for all
. The pair
is detectable.
The smart sensor runs a local Kalman filter to estimate the system states based on the measurement set up to the current time
k, that is,
. The local minimum mean-squared error (MMSE) estimate of the system state
and the corresponding estimation error are, respectively, defined as
We define the Lyapunov and Riccati operators
h and
g:
as follows:
The estimation error covariance of the Kalman filter converges to a unique value irrespective of the initial condition. For simplicity, it is assumed that the local Kalman filter has reached the steady state, and we let
where
is the steady-state error covariance given by the unique positive semidefinite solution of
.
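As a concrete illustration of this fixed point, the steady-state error covariance can be computed by iterating the Riccati operator g until convergence. The sketch below uses a hypothetical scalar system; all parameter values are illustrative, not the paper's:

```python
def riccati_step(P, A, C, Q, R):
    """One Riccati update g(P) for a scalar system: prediction (the
    Lyapunov step h) followed by the measurement correction."""
    Pp = A * P * A + Q                    # h(P): prediction step
    K = Pp * C / (C * Pp * C + R)         # Kalman gain
    return (1 - K * C) * Pp               # corrected covariance

def steady_state_covariance(A, C, Q, R, tol=1e-12, max_iter=10000):
    """Iterate g until it reaches the unique fixed point P = g(P),
    irrespective of the initial condition."""
    P = Q
    for _ in range(max_iter):
        P_next = riccati_step(P, A, C, Q, R)
        if abs(P_next - P) < tol:
            return P_next
        P = P_next
    return P
```

For a detectable scalar pair (here A = 1.2, C = 1), the iteration converges quickly because g is a contraction near the fixed point.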
3.2. Communication Channel and Attack Model with Incomplete Information
After obtaining the state estimate , the sensor transmits the estimate to the remote estimator in the form of a data packet. However, random data packet drops may occur owing to channel fading and interference. Here, we assume that the sensor communicates with the remote estimator over an additive white Gaussian noise (AWGN) network to model this scenario. The packet dropout probability at time k is described by , where is the fading channel gain of the sensor, is the transmission power of the sensor at time k, and is the additive white Gaussian noise power.
Considering a DoS attacker against the channel, the SNR is revised as the following SINR:
where
is the fading channel gain of the attacker;
is the transmission power of the attacker at time
k.
The transmission of
between the sensor and the remote estimator can be characterized by a binary random process
:
Then, the packet arrival rate
is given as
where
is a parameter;
represents the standard
Q-function. It should be noted that the scalar function
here is distinct from the system noise covariance matrix
Q defined in the system models (1) and (2). The SINR depends not only on the sensor’s transmission power but also on the interference power generated by the DoS attacker. Clearly, a larger SINR leads to a lower packet-loss rate and better estimation performance.
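To illustrate how the arrival rate varies with the SINR, the sketch below assumes one common mapping from SINR to arrival rate via the Gaussian Q-function. The exact form of the mapping depends on the modulation scheme and the parameter α, so the formula here is an illustrative assumption, not the paper's exact expression:

```python
import math

def gaussian_q(x):
    """Standard Gaussian tail function Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def packet_arrival_rate(p_s, p_a, g_s, g_a, sigma2, alpha):
    """Illustrative arrival rate: assumes lambda = 1 - Q(sqrt(alpha * SINR)),
    so a larger SINR yields a lower packet-loss probability."""
    sinr = (g_s * p_s) / (g_a * p_a + sigma2)
    return 1.0 - gaussian_q(math.sqrt(alpha * sinr))
```

Under this model, raising the sensor's power p_s increases the arrival rate, while raising the attacker's power p_a decreases it, which is the monotonicity the text describes.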
In practice, the channel state information is not perfect. The uncertainties of the channel gain information need to be considered. Suppose that the attacker interferes with the transmission under channel gain
with probability
, where
and
. Then, the SINR with the channel interference gain
is defined as
Correspondingly, the packet arrival rate under
is given as
This multi-type probabilistic model captures a realistic scenario where the sensor, as a defender, cannot precisely identify the attacker’s equipment or instantaneous channel condition but can estimate a probability distribution over a few distinct levels of threat severity. In this case, the sensor does not know the channel interference gain of the attacker but possesses the probability distribution of
. Therefore, with different levels of channel interference gain, the environment can be regarded as containing H types of attackers.
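Under this multi-type model, any sensor-side quantity (an arrival rate, a reward) is evaluated in expectation over the H attacker types using the known prior. A minimal sketch of this Bayesian averaging, with illustrative numbers, is:

```python
def expected_over_types(values, probs):
    """Sensor-side Bayesian average over the H attacker types:
    sum_j pi_j * v_j, where v_j is the per-type quantity and pi_j
    is the prior probability of type j."""
    assert abs(sum(probs) - 1.0) < 1e-9, "type probabilities must sum to 1"
    return sum(p * v for p, v in zip(probs, values))
```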
Remark 1. In practice, the acquisition of channel information is often asymmetric. An attacker can frequently intercept the sensor’s open pilot signals or reference transmissions and can then accurately estimate the sensor’s transmission channel gain by processing these known sequences [33,34]. When launching an active attack, the attacker employs non-cooperative, noise-like, or protocol-aware jamming signals, which are deliberately designed to be unpredictable [35]. Thus, it is difficult for the sensor to accurately estimate the interference channel gain of the attacker.
3.3. Remote State Estimation
The MMSE estimate of
and the estimation error covariance at the remote estimator are denoted as
and
, respectively. Depending on whether the data packet is received successfully, the state estimate is given by
As a result, the error covariance
is computed as
Assume that the initial value of the error covariance at the remote estimator also starts from
, i.e.,
. The remote estimator will send ACKs to the sensor to indicate whether it has received the estimate at each time. Hence, the sensor can also calculate
by (9). Note that
takes values from the infinite set
.
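The growth of the remote error covariance with consecutive packet drops can be sketched for a scalar system: after τ consecutive dropouts, the covariance is the τ-fold composition of the Lyapunov operator h applied to the steady-state value. Parameter values below are illustrative:

```python
def h(P, A, Q):
    """Lyapunov operator h(P) = A P A^T + Q, scalar form."""
    return A * P * A + Q

def error_covariance(tau, P_bar, A, Q):
    """Remote error covariance after tau consecutive dropouts:
    the tau-fold composition h^tau(P_bar)."""
    P = P_bar
    for _ in range(tau):
        P = h(P, A, Q)
    return P
```

For an unstable mode (|A| > 1), each dropout strictly inflates the covariance, which is why long holding times are penalized in the payoff.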
3.4. Strategy and Objective Function
Considering that both the sensor and the attacker have power constraints, we assume that the transmission power and attack power take values from the finite sets. Specifically, the transmission power belongs to a finite set with level , which is denoted as . The interference power belongs to a finite set with level , which is denoted as . This discretization models the practical constraints of digital power amplifiers and energy-limited devices and is a common simplification in power control problems.
The strategies of the sensor and the
j-th type attacker at time
k are
and
for all
, respectively. Then, the strategy of the attacker at time
k is denoted as
From (6) and (7), different types of attacker lead to different packet arrival rates and further affect the expected estimation error. The trace of the expected estimation error covariance under the attacker’s channel gain
is given as
The sensor aims to optimize the expected estimation error while reducing the total transmission cost. In contrast, the goal of the attacker is to deteriorate the estimation performance of the remote estimator and consume as much transmission energy of the sensor as possible while minimizing the total attack cost. Therefore, the one-step rewards of the sensor and the attacker with channel gain
are, respectively, given as
where
and
are the costs per unit power for the sensor and DoS attacker, respectively. Then, the corresponding payoff functions over the infinite time horizon for the sensor and the
j-th type attacker are, respectively, given as follows:
where
is the discount factor.
For the considered system under attack, the sensor as the defender first decides its transmission power level and then the DoS attacker chooses the interference power level based on the current situation. For such sequentially interactive decision-making processes with incomplete information, we use the Bayesian Stackelberg game framework to analyze the optimal strategies for the sensor and attacker.
Remark 2. Existing studies on SINR-based attacks against state estimation predominantly design transmission and interference strategies under the assumption of complete channel information [20,21,22,26]. This idealization ignores the reality that channel knowledge is incomplete and asymmetric in actual wireless networks, which gives rise to multiple possible player types and enlarged decision-making spaces. Therefore, investigating the resulting hierarchical game is both highly meaningful and challenging.
4. MDP Formulation
In this section, a Markov decision process is first formulated to describe the dynamics of the state estimation at the remote estimator. Obviously, the scenario involves two players: the sensor and DoS attacker. The sensor does not know the attacker’s channel gain but has knowledge of the probability distribution information. At each time, the sensor and attacker take action based on the current process state and the information that they have previously collected. Then, they respectively receive the rewards, and the process moves to the next state according to the transition probability.
Taking into account the decision-making interaction between the sensor and the attacker, the state estimation process at the remote estimator can be formulated as a two-player MDP model, which consists of the following five essential elements:
(1) Decision epoch: Let T denote the infinite discrete set of decision epochs, i.e., .
(2) State: According to (9), the estimation error covariance can be alternatively written as , where is the holding time at time k, namely, the time interval between the current time k and the latest moment at which a packet was received. Then, the set of possible estimation error covariances can be represented as . Therefore, the state at time k is defined as the holding time, i.e., . Intuitively, this state represents the time elapsed since the last successful packet reception. According to the update Formula (9), it directly determines the current estimation error covariance at the remote estimator. Due to the low probability of a large holding time, the final state K is used to represent all states with . Therefore, the state space is .
(3) Action: At each time, based on the state , the sensor and the j-th type attacker choose the actions and from the action spaces and , respectively.
(4)
Transition probability: Given the action
of the sensor and the action
of the
j-th type attacker, the probability that the state transitions from
to
is given as
.
(5)
Reward: The one-stage reward functions for the sensor and the
j-th type attacker are, respectively, given as
and
Note that the one-stage reward functions are time-invariant and stationary, and can also be denoted as
and
.
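The holding-time dynamics underlying the transition probability can be sketched as follows. The reset-or-increment rule and the truncation at the final state K follow the state definition above; the function names are illustrative:

```python
import random

def transition(tau, tau_next, arrival_prob, K):
    """P(tau' | tau): the holding time resets to 0 with probability
    lambda on successful reception, otherwise grows by one, truncated
    at the final state K; all other transitions have probability 0."""
    if tau_next == 0:
        return arrival_prob
    if tau_next == min(tau + 1, K):
        return 1.0 - arrival_prob
    return 0.0

def simulate_step(tau, arrival_prob, K, rng):
    """Sample the next holding-time state from the kernel above."""
    return 0 if rng.random() < arrival_prob else min(tau + 1, K)
```

Note that the arrival probability itself depends on the actions of both players through the SINR, which is what couples the MDP dynamics to the game.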
The sensor’s strategy
means that the sensor takes action
under state
. Similarly, the strategy of the
j-th type attacker under state
s is denoted as
. Then, the strategies of the sensor and the
j-th type attacker are, respectively, written as
where
and
are the set of all stationary and deterministic policies for the sensor and
j-th type attacker, respectively. Furthermore, the strategy of the attacker is denoted as
where
is the set of all stationary and deterministic policies for the attacker. Then, the payoff functions of the sensor and
j-th type attacker are, respectively, given as
The goal of both the sensor and the attacker is to seek the optimal strategy to maximize the payoff function.
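The discounted payoff structure can be illustrated by a finite truncation of the infinite-horizon sum; this is a generic sketch of discounted accumulation, not the paper's exact payoff expression:

```python
def discounted_payoff(rewards, gamma):
    """Finite truncation of the infinite-horizon discounted payoff:
    sum over k of gamma**k * r_k for a given reward trajectory."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))
```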
5. Bayesian Stackelberg Game
Based on the above MDP framework, a BSG with two players is investigated to design the optimal strategies for both the sensor and the attacker. In this hierarchical game, the sensor, as the leader, first chooses its action according to the state and declares its strategy. It is noted that in a Stackelberg game the sensor knows the reaction of the attacker. Then, the attacker, as the follower, takes its action based on the acquired channel gain and the transmission power of the sensor. The payoff functions of the sensor and the j-th type attacker are given as (18) and (19), respectively. The SE of the BSG is analyzed and the corresponding optimal strategies for both players are provided. The best response for each side is first defined as follows.
Definition 1. The best response is that a player takes an action that optimizes its own payoff while taking into account other players’ actions. Specifically, the best responses for the sensor and j-th type attacker are given asThen, the best responses for the attacker are denoted as The best-response set of the j-th type attacker to the sensor’s strategy is denoted as and is the best-response set of the attacker to strategy .
Lemma 1. For the proposed BSG with two players, the Stackelberg equilibrium of the sensor satisfieswhere is expressed as (22).
Proof. In the BSG, the channel gain and the strategy of the sensor can be obtained and imposed on the attacker; then, each type of attacker takes the best response to maximize its own payoff function. The sensor chooses its optimal strategy by taking into account the follower’s best response so as to maximize its own payoff function, which completes the proof. □
Based on Lemma 1, the solution to the proposed Stackelberg game with incomplete channel gain information is given by the following theorem.
Theorem 1. The solution to the proposed BSG is given by first solvingand then calculatingwhereTherefore, the SE of this game is .
Proof. In the proposed BSG, the sensor, as the leader, and the DoS attacker, as the follower, make decisions sequentially. Given the sensor’s strategy
, the
j-th type attacker will take the best response according to
. The sensor knows the reaction of the attacker; hence, it will choose the optimal strategy to maximize its own payoff function while taking into account the attacker’s strategy, that is,
After obtaining the sensor’s strategy
, the
j-th type attacker will choose the optimal strategy
by calculating
which completes the proof. □
Remark 3. This theorem provides a constructive method to find the SE for the proposed BSG. It confirms that within the finite strategy spaces considered, the sensor can determine its optimal strategy by anticipating and incorporating the attacker’s best response, which, in turn, is computed based on the sensor’s declared action. This fixed-point characterization forms the theoretical foundation for the learning algorithm developed in Section 6.
Proposition 1. There exists at least one SE in the BSG.
Proof. Given the sensor’s strategy , the j-th type attacker will choose its strategy from the set . Given the final state K, the numbers of possible strategies for the sensor and the j-th type attacker are and , respectively. Since the j-th type attacker’s strategy space is finite, its optimal response always exists for the fixed sensor strategy . Moreover, the finite strategy space implies the existence of an equilibrium strategy for the sensor by (24). Therefore, there always exists an SE for the proposed BSG. □
Remark 4. This conclusion substantiates the existence and characterization of the SE in our Bayesian game setting. It implies that despite the incomplete information regarding the attacker’s type, an optimal hierarchical strategy profile exists and can be sought through iterative best-response dynamics.
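The constructive procedure of Theorem 1 can be sketched as a direct enumeration for a one-shot version of the game: the leader tries each action, each attacker type best-responds to it, and the leader keeps the action that maximizes its expected payoff over the type distribution. The function names and the numeric example are illustrative:

```python
def stackelberg_equilibrium(leader_actions, follower_actions, type_probs,
                            leader_payoff, follower_payoff):
    """Enumerative SE for finite action sets: for each declared leader
    action a, every attacker type j plays its best response; the leader
    maximizes its expected payoff weighted by the type prior pi_j."""
    best = None
    for a in leader_actions:
        # best response of each type j to the declared leader action a
        br = [max(follower_actions, key=lambda d: follower_payoff(j, a, d))
              for j in range(len(type_probs))]
        # leader's expected payoff over the attacker types
        u = sum(pi * leader_payoff(a, br[j]) for j, pi in enumerate(type_probs))
        if best is None or u > best[0]:
            best = (u, a, br)
    return best[1], best[2]
```

Existence of the SE follows exactly as in Proposition 1: both loops range over finite sets, so a maximizer is always found.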
6. Reinforcement Learning
As mentioned previously, the sensor is unable to access the SINR of the AWGN channel. In this section, model-free reinforcement learning, namely Q-learning, is introduced to find the SE. Then, a Stackelberg Q-learning algorithm for the two-player BSG is used to obtain the optimal policies for both the sensor and the attacker.
The optimal payoff functions are given as
where the optimal state-action-value functions (Q-functions) for the sensor and the
j-th type attacker are, respectively, defined as follows:
Then, the SE strategy is obtained by
Considering the multi-type channel interference gain of the attacker, the Q-value functions of the sensor and the attacker are recursively updated based on the SE presented in Theorem 1:
where
with the learning rate
,
and
are the Q-values of the SE solutions at state
for the sensor and the
j-th type attacker, respectively.
In the BSG, the sensor and the attacker act as the leader and the follower, respectively; hence, the Stackelberg Q-values are also updated hierarchically. To find the maximum Q-value of the attacker for the state
, the attacker chooses the optimal action from the best response (26). Based on the observation of the sensor’s action, the optimal action that the
j-th type attacker takes in response to the sensor’s action
is determined as
Then the optimal action of the attacker is given as
Based on the BSG framework, the optimal action of the sensor is given by
Consequently, we compute
and
in the Q-functions update as
As a result, the Bayesian Stackelberg Q-learning algorithm for incomplete channel gain information is presented in Algorithm 1. Key implementation details are as follows. The Q-values for both players are initialized to zero, a common practice that biases neither the asymptotic convergence nor the final policy learned. An
-greedy exploration strategy is employed, where, with probability
, the actions are chosen randomly, and with probability
, they are chosen as the SE actions based on current Q-values. The learning rate
for updating the Q-tables follows a schedule that ensures the Robbins-Monro conditions are met (as required by Theorem 2), typically by decaying with the number of visits to each state-action pair. The specific forms of
and
used in our simulations are provided in
Section 7.
| Algorithm 1 Bayesian Stackelberg Q-learning Algorithm with Incomplete Channel Gain Information |
Input: System parameter matrices A, C, Q, R
1: Initialization: Initialize and for all , , , and ; initial state ; total iterations T; exploration probability . Set iteration counter ;
2: while do
3:   Randomize a number ;
4:   if then
5:     Take actions and for based on (36)–(38);
6:   else
7:     Select random actions and for ;
8:   end if
9:   Obtain rewards and ;
10:  Update and , and move to the next state;
11:  ;
12: end while
Output: The optimal strategies and the optimal payoff functions for the sensor and the attacker |
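A minimal tabular sketch of Algorithm 1 is given below. It is an illustrative simplification, not the paper's exact implementation: exploration is ε-greedy, the learning rate decays with per-pair visit counts, and the hierarchical update lets each attacker type best-respond to the leader's greedy action. All function and variable names are assumptions of this sketch:

```python
import random
from collections import defaultdict

def stackelberg_q_learning(states, s_actions, a_actions, type_probs,
                           reward_s, reward_a, step, gamma=0.9,
                           iters=5000, eps=0.2, seed=0):
    """Tabular Stackelberg Q-learning sketch. reward_s(s, ps, pa_list) and
    reward_a(j, s, ps, pa) are one-stage rewards; step(s, ps, pa_list, rng)
    samples the next state from the (unknown) transition kernel."""
    rng = random.Random(seed)
    H = len(type_probs)
    Qs = defaultdict(float)                       # sensor table: (s, ps) -> value
    Qa = [defaultdict(float) for _ in range(H)]   # one table per attacker type
    visits = defaultdict(int)

    def br(j, s, ps):                             # follower best response from Qa
        return max(a_actions[j], key=lambda pa: Qa[j][(s, ps, pa)])

    def leader(s):                                # leader's greedy action from Qs
        return max(s_actions, key=lambda ps: Qs[(s, ps)])

    s = states[0]
    for _ in range(iters):
        if rng.random() < eps:                    # epsilon-greedy exploration
            ps = rng.choice(s_actions)
            pa = [rng.choice(a_actions[j]) for j in range(H)]
        else:                                     # hierarchical SE actions
            ps = leader(s)
            pa = [br(j, s, ps) for j in range(H)]
        s2 = step(s, ps, pa, rng)
        visits[(s, ps)] += 1
        lr = 1.0 / visits[(s, ps)]                # Robbins-Monro-style decay
        ps2 = leader(s2)                          # leader's greedy action at s2
        for j in range(H):                        # follower updates, per type
            key = (s, ps, pa[j])
            target = reward_a(j, s, ps, pa[j]) \
                + gamma * Qa[j][(s2, ps2, br(j, s2, ps2))]
            Qa[j][key] += lr * (target - Qa[j][key])
        target = reward_s(s, ps, pa) + gamma * Qs[(s2, ps2)]  # leader update
        Qs[(s, ps)] += lr * (target - Qs[(s, ps)])
        s = s2
    return Qs, Qa
```

In a full implementation, the leader's action selection would average the anticipated per-type responses with the prior over types, as in Theorem 1; here that step is folded into the learned sensor Q-table for brevity.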
Remark 5. The per-iteration computational cost of the Stackelberg Q-learning algorithm (Algorithm 1) is linear in the product of the state space size and the action space sizes , as it requires updating Q-tables for all state–action pairs encountered. This makes it efficient for the moderately sized problems presented here. For systems with significantly larger state or action spaces, the well-known curse of dimensionality would necessitate the use of function approximation (e.g., deep Q-networks), which is a promising direction for future work, as noted in Section 8. The convergence of Algorithm 1 to the optimal Q-functions is guaranteed under the conditions specified in Theorem 2. The learning rate and exploration schedules described above are designed to satisfy these conditions, in particular the Robbins–Monro requirements for the learning rate.
Theorem 2 ([36]). The Bayesian Stackelberg Q-learning sequences described in (32)
and (33)
will converge to the optimal values if the following assumptions hold:
1. The recursive processes (32) and (33) converge to and with probability 1, respectively.
2. There exists a number and a sequence converging to zero with probability 1, such that where is the Q-value space.
3. The learning rate satisfies that , and converges to infinity uniformly as .
Remark 6. Condition 1 assumes that the recursive updates for both players converge with probability 1, meaning the learning process is inherently stable and reaches a fixed point. This is approximated by implementing ε-greedy exploration in Algorithm 1 and running the training for a large number of steps. Condition 2 ensures that the algorithm is driven toward a unique fixed point while noise vanishes by a contraction operator with a decaying perturbation. The learning rate in Condition 3 must satisfy the Robbins–Monro conditions for stochastic approximation. In the simulation section, the learning rate is designed to be a nonzero decreasing function of time step k and the current state and actions.
7. Simulation Results
In this section, a numerical example is provided to verify the performance of the proposed BSG-based strategy. The following system parameters are considered:
Then the steady-state error covariance
is
. The AWGN power and the modulation parameter are
and
, respectively. The fading channel gain and the action space of the sensor are given as
and
. Assume that the fading channel gains of the DoS attacker are
with probability distribution
. This setting creates a representative asymmetric information scenario in which the sensor needs to design a strategy against an attacker that is most likely to have a moderate interference capability but must also account for a significant possibility of facing a more potent attacker. Correspondingly, the action spaces for the two types of the attacker are
and
, respectively. The unit power costs for the sensor and the attacker are
and
, respectively. Set the final state to
, and the state space is given as
, which means the estimation error takes the value from
. The discount factor for payoff functions is
.
In the Bayesian Stackelberg Q-learning algorithm, the learning rate and the initial exploration rate are set as and , respectively, where is the number of occurrences of the state–action pair . This design ensures that the learning rate satisfies the Robbins–Monro conditions, a standard requirement for the convergence of stochastic approximation algorithms such as Q-learning. The initial values are set as and for all .
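The Robbins–Monro requirements on the learning rate, a divergent sum of step sizes together with a convergent sum of their squares, can be checked numerically. The sketch below compares schedules of the form α_n = 1/n^p:

```python
def rm_partial_sums(power, n_terms):
    """Partial sums of alpha_n = 1/n**power and of alpha_n**2, used to
    check the Robbins-Monro conditions: the first sum must diverge
    while the second remains bounded."""
    s = sum(1.0 / n ** power for n in range(1, n_terms + 1))
    s_sq = sum(1.0 / n ** (2 * power) for n in range(1, n_terms + 1))
    return s, s_sq
```

For p = 1 (a visit-count schedule like the one above), the step-size sum grows like ln n while the sum of squares stays below π²/6, so both conditions hold; for p = 2 even the step-size sum converges, which violates the divergence requirement.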
After 100,000 learning steps, the Q-function values in different states converge. The equilibrium payoffs to the sensor and attacker for states
are given as follows:
The corresponding optimal transmission and interference strategies are given as follows:
The resulting Stackelberg equilibrium strategies are explicitly state-dependent, quantitatively showing how the sensor optimally increases transmission power as the estimation error grows. Finally, the equilibrium payoffs are quantitatively characterized, showing the cost–performance trade-off across states. The optimal strategies under the BSG confirm the intuition that a relatively small error covariance at the remote estimator enables the sensor to use less power for transmission, also leading to less interference power for the attacker. For a larger estimation error covariance, the sensor is inclined to choose a high power level to increase the packet arrival rate and, therefore, improve estimation performance. In this case, the attacker also increases the interference power accordingly.
The algorithm converges to distinct, stable Q-values for different state–action pairs. Due to the high dimensionality of the Q-values
, we next present the learning process for the initial state
as an illustrative example. When the sensor takes the action
, the corresponding Q-functions converge, as shown in
Table 1.
Table 2 gives all converged Q-values of the attacker. We can see that the optimal strategy for the attacker is
under state
.
Figure 2,
Figure 3 and
Figure 4 depict the learning processes for the optimal strategies of the sensor, the first type attacker, and the second type attacker for state
, respectively. In the figures, lines of different colors represent the iterative Q-values under the different actions of the sensor and the attacker. Together, these figures depict the convergence of the Q-functions for all possible action combinations at state
. The high density of lines illustrates the learning algorithm’s exploration across the entire strategy space. While individual lines are not labeled for clarity, their collective trend towards stabilization after around 5000 iterations is evident, confirming the convergence of the learning process. The precise converged Q-values that underpin the optimal strategies are detailed in
Table 1 and
Table 2.
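The iterative updates behind these converged values follow the standard tabular pattern. The sketch below uses placeholder dynamics, rewards, and a placeholder discount factor rather than the paper's model; as noted in the comments, the Stackelberg variant replaces the single-agent max over next-state actions with the Stackelberg equilibrium value of the stage game at the next state:

```python
# Generic tabular Q-learning loop (illustrative placeholder problem).
import random

states = range(4)              # placeholder finite state space
actions = range(3)             # placeholder finite action set
gamma = 0.9                    # placeholder discount factor
Q = {s: {a: 0.0 for a in actions} for s in states}
counts = {s: {a: 0 for a in actions} for s in states}

def q_update(s, a, reward, s_next):
    """One tabular update with a Robbins-Monro step size. In Stackelberg
    Q-learning, the max below is replaced by the Stackelberg equilibrium
    value of the stage game at s_next, computed from both players'
    Q-tables."""
    counts[s][a] += 1
    alpha = 1.0 / counts[s][a]
    target = reward + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

random.seed(0)
for _ in range(5000):
    s = random.choice(list(states))
    a = random.choice(list(actions))
    s_next = random.choice(list(states))
    q_update(s, a, reward=float(s), s_next=s_next)  # toy reward: the state index
```

With the toy reward growing in the state index, the learned Q-values are larger in higher-index states, mirroring the state-dependent pattern of the converged values reported above.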
Next, we consider the situation in which the attacker can launch attacks without any cost constraint; that is, the cost of the interference power is negligible. In the case of
, the equilibrium payoffs to the sensor and attacker for states
are given as follows:
The corresponding optimal transmission and interference strategies are given as follows:
The optimal strategies confirm that when the cost of interference power
is negligible, both types of attacker consistently select the highest available power level in their respective action sets. This behavior aligns with game-theoretic intuition: as the marginal cost of interference vanishes, the rational attacker's objective shifts entirely towards maximizing the degradation of estimation performance, leading it to select the maximum jamming power. This result underscores the critical role of cost parameters in shaping the equilibrium of the security game.
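This cost effect can be made concrete with a minimal sketch. The benefit function and power levels below are hypothetical (not the paper's model); they only illustrate that as the unit jamming cost shrinks, the attacker's best response climbs to the maximum power level in its action set:

```python
# Best response of the attacker as a function of its unit jamming cost
# (illustrative concave benefit, hypothetical discrete power levels).
attacker_powers = [0, 1, 2, 3]  # hypothetical jamming levels

def jamming_gain(a_j):
    # More power degrades estimation, with diminishing returns.
    return 1.0 - 1.0 / (1.0 + a_j)

def best_response(cost):
    return max(attacker_powers, key=lambda a_j: jamming_gain(a_j) - cost * a_j)

for c in (0.3, 0.1, 0.0):
    print(c, best_response(c))  # best response rises as the cost falls
```

At zero cost the marginal benefit of jamming is never outweighed, so the maximum power level is always chosen, matching the equilibrium behavior observed in this case.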
The learning processes for the optimal strategies of the sensor, the first type attacker, and the second type attacker for state
are shown in
Figure 5,
Figure 6 and
Figure 7. Similar to the previous case, the convergence trends for all action combinations are shown. The eventual stabilization of all trajectories validates the algorithm’s effectiveness even in this complex setting. Notably, as seen in
Figure 5, the Q-values of the sensor exhibit significant fluctuation and converge more slowly compared to those of the attackers in
Figure 6 and
Figure 7. This can be attributed to the information asymmetry inherent in the problem: the sensor lacks precise knowledge of the attacker’s channel gain and must learn an optimal strategy based only on the probabilistic distribution over attacker types. This uncertainty inherently increases the exploration burden and complexity of the learning process for the leader.
In summary, the numerical results validate the algorithm's effectiveness in solving the proposed BSG model: the algorithm converges to a stable equilibrium in which both players' strategies are mutually optimal in the sequential sense, as established in
Section 5 and
Section 6. The logical adaptation of strategies to different states and costs further confirms that the learned policies align with game-theoretic intuition, providing strong empirical support for the proposed framework.
8. Conclusions
This paper investigated the Bayesian Stackelberg game for remote state estimation under SINR-based DoS attacks. The two players sequentially decide their transmission and interference powers under incomplete information; specifically, the sensor lacks exact knowledge of the attacker's fading channel gain. The optimization problem over an infinite time horizon was first formulated as an MDP with finite state and action spaces. By exploiting the probabilistic information about the channel interference gain, a BSG was constructed to describe the sequential decision-making process between the sensor and the various attacker types. Based on the solution to the BSG, a Stackelberg Q-learning algorithm was used to obtain the optimal strategies of the two players. Numerical results validated the effectiveness of the proposed game-theoretic approach under uncertain channel gains. The presented framework operates under the core assumptions of a known attacker-type distribution and discrete action spaces, which define its current scope but also present opportunities for generalization. Future work includes analyzing games in which both players have incomplete information and extending the framework to larger-scale systems via function approximation (e.g., deep reinforcement learning) to mitigate the curse of dimensionality.