Article

Learning Deceptive Tactics for Defense and Attack in Bayesian–Markov Stackelberg Security Games

by
Julio B. Clempner
Escuela Superior de Física y Matemáticas (School of Physics and Mathematics), Instituto Politécnico Nacional (National Polytechnic Institute), Edificio 9 U.P. Adolfo Lopez Mateos, Col. San Pedro Zacatenco, Mexico City 07730, Mexico
Math. Comput. Appl. 2025, 30(2), 29; https://doi.org/10.3390/mca30020029
Submission received: 1 January 2025 / Revised: 23 February 2025 / Accepted: 10 March 2025 / Published: 17 March 2025
(This article belongs to the Special Issue Applied Optimization in Automatic Control and Systems Engineering)

Abstract

In this paper, we address the challenges posed by limited knowledge in security games by proposing a novel system grounded in Bayesian–Markov Stackelberg security games (SSGs). These SSGs involve multiple defenders and attackers and serve as a framework for managing incomplete information effectively. To tackle the complexity inherent in these games, we introduce an iterative proximal-gradient approach to compute the Bayesian Equilibrium, which captures the optimal strategies of both defenders and attackers. This method enables us to navigate the intricacies of the game dynamics, even when the specifics of the Markov games are unknown. Moreover, our research emphasizes the importance of Bayesian approaches in solving the reinforcement learning (RL) algorithm, particularly in addressing the exploration–exploitation trade-off. By leveraging Bayesian techniques, we aim to minimize the expected total discounted costs, thus optimizing decision-making in the security domain. In pursuit of effective security game implementation, we propose a novel random walk approach tailored to fulfill the requirements of the scenario. This innovative methodology enhances the adaptability and responsiveness of defenders and attackers, thereby improving overall security outcomes. To validate the efficacy of our proposed strategy, we provide a numerical example that demonstrates its benefits in practice. Through this example, we showcase how our approach can effectively address the challenges posed by limited knowledge, leading to more robust and efficient security solutions. Overall, our paper contributes to advancing the understanding and implementation of security strategies in scenarios characterized by incomplete information. By combining Bayesian and Markov Stackelberg games, reinforcement learning algorithms, and innovative random walk techniques, we offer a comprehensive framework for enhancing security measures in real-world applications.

1. Introduction

1.1. Brief Review

In safeguarding vital infrastructures, strategically allocating limited security resources is crucial. The primary objective is to prevent attacks from unpredictable assailants targeting critical assets. However, in selecting defense techniques, the balance between defenders’ private information disclosure and attackers’ observational capabilities must be carefully considered. This equilibrium ensures effective protection against potential threats while minimizing vulnerabilities arising from asymmetrical information.
Game theory offers a powerful mathematical framework for optimizing decision-making in strategic scenarios, which is particularly evident in SSGs tackling high-complexity problems. SSGs encapsulate situations where defenders (leaders) and attackers (followers) strategically engage in sequential decision-making to achieve their objectives. In this model, defenders are represented by the leaders, while attackers are represented by the followers, highlighting the strategic interplay between protective and offensive roles in the system. Multiplayer game theory presents significant complexity and challenges in finding solutions due to the interaction of multiple agents with varying strategies and objectives. The dynamic nature and interdependencies of players’ actions create a highly intricate environment, requiring sophisticated approaches to model, analyze, and achieve equilibrium in these games. Initially, defenders commit to a strategy, which attackers then observe before determining their own course of action [1,2]. Defenders employ random techniques to safeguard potential targets, anticipating attackers’ objectives within the game’s dynamics. Meanwhile, attackers adopt best-reply strategies while cognizant of these defensive tactics. Mathematically, the defenders’ advantage lies in revealing their random plans, yet this overlooks potential knowledge loss due to incomplete observations. For instance, when attackers disguise themselves as civilians to infiltrate a target, defenders may struggle to ascertain the available security resources accurately. In such scenarios, revealing complete information is not always advantageous for defenders. Instead, strategic disclosure of incomplete information can be more beneficial [3,4]. This approach acknowledges the complexity of real-world security dilemmas, where optimal decision-making hinges on dynamically balancing information disclosure and strategic concealment. By leveraging game theory’s mathematical rigor, SSGs provide a framework for navigating these strategic challenges, ultimately enhancing security decision-making in complex and dynamic environments.

1.2. Related Work

Various game-theoretical frameworks have been employed to model the interaction between defenders and attackers in security scenarios, with Bayesian Stackelberg games emerging as a particularly successful solution [5]. Conitzer and Sandholm [6] adapted the Bayesian Stackelberg game into a normal-form representation using Harsanyi’s approach [5], determining game equilibrium by evaluating each follower’s plan to ascertain if it constitutes an optimal reaction. Paruchuri et al. [7] proposed a mixed-integer linear programming (MILP) approach to solve Bayesian Stackelberg games, leveraging optimization techniques to identify optimal strategies for both defenders and attackers. Jain et al. [8] introduced a strategy combining hierarchical decomposition with branch and bound techniques, aiming to efficiently explore the solution space and achieve equilibrium in the game. Yin and Tambe [9] suggested a hybrid technique merging best-first search with heuristic branching rules to solve Bayesian Stackelberg games effectively. This approach enhances the computational efficiency of finding equilibrium strategies by guiding the search process through the solution space. Overall, these methodologies contribute to advancing our understanding of security game dynamics and provide practical solutions for effectively addressing security challenges in real-world scenarios, highlighting the versatility and adaptability of game-theoretical approaches in security analysis and decision-making.
For further exploration, information can be found in Wilczynski’s comprehensive survey [10]. Sayed Ahmed [11] introduced a deception-based Stackelberg game anti-jamming mechanism. Gan et al. delved into security games involving uncoordinated defenders cooperating to protect targets, with each defender optimizing their resource allocation selfishly for utility maximization [12]. These studies underscore the breadth of applications and innovative approaches within the realm of SSGs, showcasing their relevance in modern security paradigms.
SSGs find diverse applications in everyday scenarios, contributing significantly to the enhancement of security measures across various domains. Wilczynski [10] provided a comprehensive survey highlighting the breadth of SSG applications. Basilico et al. [13] determined the minimum number of robots required to patrol a given environment, computed optimal patrolling strategies across various coordination dimensions, and experimentally evaluated the proposed techniques, demonstrating their effectiveness in ensuring comprehensive coverage and the efficient use of robotic resources. Clempner [14] proposed a security model implemented with a temporal difference method that incorporates prior information to effectively address security issues. By leveraging continuous learning and adapting to evolving threats, this model aims to enhance the robustness and resilience of security systems. Albarran et al. [15] applied SSGs to distribute security resources across airport terminals, leveraging partially observed Markov game settings to address complex security challenges. In urban settings, Trejo et al. [16] employed reinforcement learning within SSG frameworks to adapt attacker and defender strategies, improving security measures at geographically dispersed shopping malls. Solis et al. [17] extended SSG applications to maritime security, utilizing a multiplayer ship differential SSG to model pursuit–evasion scenarios in continuous time. Alcantara et al. [18] incorporated topographical information into SSGs to generate realistic patrol plans for defenders in major cities, enhancing target identification and security deployment strategies. Sayed Ahmed [11] proposed a deception-based SSG anti-jamming solution, addressing vulnerabilities in communication systems. Gan et al. [12] explored SSGs where uncoordinated defenders collaborate to protect targets, with each defender optimizing their resource allocation selfishly. Clempner and Poznyak [19] introduced an attacker–defender SSG system, leveraging ergodic Markov models and non-decreasing Lyapunov-like functions to represent solutions. Rahman and Oh [20] investigated online patrolling robot route-planning tasks using SSG frameworks, enhancing security surveillance capabilities. Wang et al. [21] developed the DeDOL method based on deep reinforcement learning within SSG models, offering advanced solutions for security optimization. Li et al. [22] proposed a Bayesian Stackelberg Markov game for adversarial federated learning, utilizing meta-RL-based pre-training and adaptation to combat diverse attacks, achieving robust, adaptive, and efficient federated learning defense strategies. Sengupta and Kambhampati [23] framed adaptive moving target defense as a Bayesian Stackelberg Markov game, using multi-agent reinforcement learning to derive defense strategies against adaptive attackers. Shukla et al. [24] presented a cybersecurity game for networked control systems, where an attacker disrupts communication and a defender protects key nodes. A cost-based Stackelberg equilibrium and robust defense method optimize security. Genetic algorithms enhance strategies for large power systems.
The versatility and effectiveness of SSGs highlight their importance in enhancing security strategies and protecting critical assets across various contexts. Whether applied to scheduling security patrols at airports, optimizing resource allocation in urban settings, or addressing vulnerabilities in communication systems, SSGs offer valuable insights and solutions. By integrating game-theoretical frameworks with real-world security challenges, SSGs enable proactive and adaptive approaches to security management, ultimately contributing to safer environments and mitigating potential threats effectively.

1.3. Main Results

Improving the existing security framework involves addressing the challenges posed by incomplete information. This is achieved through several key approaches. Firstly, an iterative proximal-gradient approach is proposed to compute the Bayesian Stackelberg equilibrium, enabling effective decision-making despite uncertainty. Additionally, a Bayesian reinforcement learning (BRL) algorithm is introduced, allowing for adaptive strategies that learn from past experiences and observations. Furthermore, a random walk approach is developed to implement SSGs, offering a dynamic and flexible method for responding to evolving threats. By integrating these techniques, the security framework gains the ability to adapt and respond effectively in the face of incomplete information, thereby enhancing overall resilience and effectiveness in mitigating security risks. These methods enable decision-makers to make more informed choices and optimize security strategies in complex and uncertain environments. A numerical example is used to provide useful recommendations for defenders’ resource allocation against attackers.

1.4. Organization of the Paper

The paper follows a structured layout, with Section 2 providing preliminary information. In this section, foundational concepts and background knowledge relevant to the study are presented to establish a basis for subsequent discussions. Section 3 elaborates on the problems under consideration and introduces a Bayesian Stackelberg game model, which serves as the theoretical framework for analyzing strategic interactions between defenders and attackers. This section outlines the key components of the model and discusses its applicability to the specific security scenario under investigation. In Section 4, the paper proposes a learning algorithm tailored to the context of the Bayesian Stackelberg game, aiming to optimize decision-making strategies for both defenders and attackers. The algorithm is designed to adapt and improve over time based on feedback and observed outcomes. Section 5 introduces a novel random walk model, offering an alternative approach to addressing security challenges within the Bayesian Stackelberg game framework. This model presents a new perspective on strategic decision-making in dynamic security environments. A numerical example illustrating the application of the proposed methodologies is provided in Section 6, demonstrating their effectiveness and practical relevance in real-world scenarios. Finally, Section 7 concludes the paper by summarizing key findings, drawing conclusions based on the analysis conducted, and discussing potential avenues for future research and development in the field of SSGs.

2. Preliminaries

In a discrete-time and finite-horizon framework, we consider an environment characterized by private and independent values, as described in prior works [3,25,26]. In this setup, players denoted by $l \in \mathcal{L}$, where $\mathcal{L} = \{1, 2, \ldots, n\}$, receive rewards in discrete time periods $t \in \mathcal{T}$, with $\mathcal{T} \subseteq \mathbb{N}$. These rewards are contingent upon the current physical allocation $a_t^l \in A^l$ and the type $\theta^l \in \Theta^l$ of the player. Here, $A^l$ represents the feasible set of allocations available to player $l$ at time $t$, while $\Theta^l$ denotes the set of possible types for player $l$. The reward obtained by player $l$ at time $t$ is determined by their chosen allocation and their specific type. This framework captures the dynamics of decision-making in scenarios where players have private information and their actions impact the outcomes over a finite time horizon. Such models find applications in various domains, including economics, game theory, and decision science, where understanding the interplay between private information and strategic decision-making is crucial.
In period $t$, the type vector is denoted by $\theta_t = (\theta_t^1, \ldots, \theta_t^n) \in \Theta$, where $\Theta = \times_{l \in \mathcal{L}} \Theta^l$. The set of feasible allocations in period $t$ may be contingent on the vector of past allocations $a_t = (a_t^1, \ldots, a_t^n) \in A$, where $A = \times_{l \in \mathcal{L}} A^l$. We denote $\Theta^{-l} = \times_{i \in \mathcal{L}, i \neq l} \Theta^i$, and $\Delta(A)$ as the set of all probability distributions over $A$. Additionally, we assume that $A^l$ and $\Theta^l$ are finite sets, ensuring a manageable and well-defined problem space. This setup enables the modeling of dynamic interactions and decision-making processes among multiple players with diverse types and feasible allocation options.
Each player $l$ has a common prior distribution $P^l(\theta_0^l)$. The type $\theta_t^l$ and action $a_t^l$ determine a probability distribution for the variable $\theta_{t+1}^l$ on $\Theta^l$, which is denoted by $p^l(\theta_{t+1}^l \mid \theta_t^l, a_t^l)$. This distribution captures the probabilistic relationship between a player's current type and action and their subsequent type in the next period, facilitating the modeling of dynamic decision-making processes within the game.
A Markov chain is described by the transition matrix $p^l(\theta_{t+1}^l \mid \theta_t^l, a_t^l)$ and the common prior distribution $P^l(\theta_0^l)$, where $P^l : \Theta^l \to \Delta(\Theta^l)$ and $\Delta(\Theta^l)$ denotes the set of probability distributions over $\Theta^l$. We assume that each chain $(P^l, p^l(\theta_{t+1}^l \mid \theta_t^l, a_t^l))$ is ergodic, ensuring that the system will eventually reach a steady state regardless of its initial conditions.
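As a brief illustration of this ergodicity assumption, the sketch below uses a hypothetical three-type transition kernel for one fixed action (the values are not from the paper). It computes the stationary distribution as the normalized left eigenvector associated with eigenvalue one and shows that iterating the chain from an arbitrary prior approaches the same limit.

```python
import numpy as np

# Hypothetical 3-type transition matrix p^l(theta' | theta, a) for one fixed action a.
# Rows index the current type theta, columns the next type theta'; rows sum to 1.
P = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
stationary = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
stationary /= stationary.sum()

# Ergodicity in practice: iterating the chain from any prior P^l(theta_0) reaches the same limit.
prior = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    prior = prior @ P

print("stationary distribution:", np.round(stationary, 4))
print("after 200 steps from a degenerate prior:", np.round(prior, 4))
```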
The asymmetric information is determined by the private observation of $\theta_t^l$. Given the prior distributions $P^l(\theta_0^l)$ and the transition matrices $p^l(\theta_{t+1}^l \mid \theta_t^l, a_t^l)$ for each player $l$, the information of player $l$, given by $\theta_{t+1}^l$, does not depend on $\theta_t^i$ for $i \neq l$. The participants then transmit messages $m_t^l$ simultaneously, and the message profile is made public.
A (behavioral) strategy $\sigma^l(m_t^l \mid \theta_t^l)$ for player $l$ is a mapping $\sigma^l : M^l \times \Theta^l \to \Delta(M^l)$, which represents the likelihood with which player $l$ of type $\theta_t^l$ transmits the message $m_t^l$. The (behavioral) strategy set is given by
$$S_{adm}^l = \left\{ \sigma^l(m_t^l \mid \theta_t^l) \geq 0 \,\middle|\, \sum_{m_t^l \in M^l} \sigma^l(m_t^l \mid \theta_t^l) = 1, \ \theta_t^l \in \Theta^l \right\}.$$
A strategy $\pi$ is a sequence $\pi^l(a_t^l \mid m_t^l)$ such that, for each period $t$, $\pi^l(a_t^l \mid m_t^l)$ is a stochastic kernel on $A^l$ given the history $H_n$. The set of all admissible strategies is
$$\Pi_{adm}^l = \left\{ \pi^l(a_t^l \mid m_t^l) \geq 0 \,\middle|\, \sum_{a_t^l \in A^l} \pi^l(a_t^l \mid m_t^l) = 1, \ m_t^l \in M^l \right\}.$$
Remark 1.
A strategy π is typically defined as a policy or plan of action for a player that depends only on the information they possess. In this case, π depends solely on the player’s own message, and this suggests that π is a strategy based on private information or the part of the information that is most relevant to the player. The assumption here might be that each player has access to their own private message but does not directly depend on the messages of other players in their strategy formulation. This can simplify the analysis in certain games. A strategy σ, on the other hand, often refers to a more general form of strategy that could depend on all available information, such as public messages or even the outcomes of communication between players. σ involves communication, and it might be labeled as a communication strategy, where players decide how to share or process information.
Let us introduce the cost function $v^l(a_t^l, \theta_t^l, m_t^l)$, which characterizes the losses incurred by player $l$ when taking action $a_t^l$ based on the message $m_t^l$ generated under type $\theta_t^l$.
The average cost over $[0, T]$ of player $l$ is
$$V_T^l(\pi, \sigma) = \sum_{t \in \mathcal{T}} \sum_{\theta_t^l \in \Theta^l} \sum_{m_t^l \in M^l} \sum_{a_t^l \in A^l} v^l\big(a_t^l, \theta_t^l, m_t^l\big)\, p^l\big(\theta_{t+1}^l \mid \theta_t^l, a_t^l\big) \prod_{i \in \mathcal{L}} \pi^i\big(a_t^i \mid m_t^i\big)\, \sigma^i\big(m_t^i \mid \theta_t^i\big)\, P^i\big(\theta_t^i\big)$$
$$= \sum_{t \in \mathcal{T}} \sum_{\theta_t^l \in \Theta^l} \sum_{m_t^l \in M^l} \sum_{a_t^l \in A^l} W^l\big(a_t^l, \theta_t^l, m_t^l\big) \prod_{i \in \mathcal{L}} \pi^i\big(a_t^i \mid m_t^i\big)\, \sigma^i\big(m_t^i \mid \theta_t^i\big)\, P^i\big(\theta_t^i\big),$$
where
$$W^l\big(a_t^l, \theta_t^l, m_t^l\big) = v^l\big(a_t^l, \theta_t^l, m_t^l\big)\, p^l\big(\theta_{t+1}^l \mid \theta_t^l, a_t^l\big).$$
We assume that players know their payoffs. The strategy $\sigma^l(m_t^l \mid \theta_t^l)$ minimizes the weighted cost function $V^l(\pi, \sigma)$, realizing the rule given by
$$(\pi^*, \sigma^*) \in \operatorname{Arg\,min}_{\pi \in \Pi_{adm}} \min_{\sigma \in S_{adm}} \sum_{l \in \mathcal{L}} V^l(\pi, \sigma).$$
The policy $\pi^*$ and the strategy $\sigma^*$ are said to satisfy the Bayesian–Nash equilibrium if, for all $\pi \in \Pi_{adm}$ and $\sigma \in S_{adm}$, the following condition holds:
$$V^l(\pi^*, \sigma^*) \leq V^l\big(\pi, \sigma^l, \sigma^{*,-l}\big),$$
where $\sigma^{*,-l} = \big(\sigma^{*,1}, \ldots, \sigma^{*,l-1}, \sigma^{*,l+1}, \ldots, \sigma^{*,n}\big)$.
Let us introduce the auxiliary variable $\xi$ as follows:
$$\xi^l\big(\theta_t^l, m_t^l, a_t^l\big) := \pi^l\big(a_t^l \mid m_t^l\big)\, \sigma^l\big(m_t^l \mid \theta_t^l\big)\, P^l\big(\theta_t^l\big),$$
such that
$$\Xi_{adm}^l := \left\{ \xi^l\big(\theta_t^l, m_t^l, a_t^l\big) \,\middle|\, \begin{array}{l} \displaystyle\sum_{m_t^l \in M^l} \sum_{\theta_t^l \in \Theta^l} \sum_{a_t^l \in A^l} \xi^l\big(\theta_t^l, m_t^l, a_t^l\big) = 1, \\[6pt] \displaystyle\sum_{m_t^l \in M^l} \sum_{a_t^l \in A^l} \xi^l\big(\theta_t^l, m_t^l, a_t^l\big) = P^l\big(\theta_t^l\big) > 0, \\[6pt] \displaystyle\sum_{m_t^l \in M^l} \sum_{a_t^l \in A^l} \sum_{\theta_t^l \in \Theta^l} \big[\delta_{\theta_t^l, \theta_{t+1}^l} - p^l\big(\theta_{t+1}^l \mid \theta_t^l, a_t^l\big)\big]\, \xi^l\big(\theta_t^l, m_t^l, a_t^l\big) = 0, \quad \theta_{t+1}^l \in \Theta^l \end{array} \right\},$$
where $\delta_{\theta_t^l, \theta_{t+1}^l}$ is Kronecker's delta. It should be noted that the following relations hold:
$$\sum_{m_t^l \in M^l} \sigma^l\big(m_t^l \mid \theta_t^l\big) = 1, \qquad \sum_{\theta_t^l \in \Theta^l} P^l\big(\theta_t^l\big) = 1.$$
We obtain that $\xi^l \in \Delta^l$, where
$$\Delta^l := \left\{ \xi^l\big(\theta_t^l, m_t^l, a_t^l\big) \,\middle|\, \sum_{m_t^l \in M^l} \sum_{\theta_t^l \in \Theta^l} \sum_{a_t^l \in A^l} \xi^l\big(\theta_t^l, m_t^l, a_t^l\big) = 1, \ \sum_{m_t^l \in M^l} \sum_{a_t^l \in A^l} \xi^l\big(\theta_t^l, m_t^l, a_t^l\big) = P^l\big(\theta_t^l\big) > 0 \right\}.$$
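The set $\Xi_{adm}^l$ is what the subsequent optimization and learning steps operate on, so it is useful to see its three defining conditions as executable checks. The sketch below uses hypothetical dimensions and a randomly generated candidate $\xi^l$ (which will generally violate the stationarity condition); it is an illustration of the constraints, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_theta, n_m, n_a = 3, 2, 2                     # hypothetical |Theta|, |M|, |A|

# Hypothetical transition kernel p(theta' | theta, a) with layout p[theta_next, theta, a];
# the distribution over theta_next sums to 1 for each (theta, a).
p = rng.random((n_theta, n_theta, n_a))
p /= p.sum(axis=0, keepdims=True)

# Hypothetical candidate xi(theta, m, a), normalized so that all entries sum to 1.
xi = rng.random((n_theta, n_m, n_a))
xi /= xi.sum()

# (1) Normalization: the entries of xi sum to 1.
c1 = np.isclose(xi.sum(), 1.0)

# (2) The marginal over (m, a) is a strictly positive prior P(theta).
P = xi.sum(axis=(1, 2))
c2 = np.all(P > 0)

# (3) Stationarity: sum_{theta,m,a} [delta(theta,theta') - p(theta'|theta,a)] xi(theta,m,a) = 0
#     for every theta'.  A randomly drawn xi will generally violate this condition.
lhs = P - np.einsum('jia,ima->j', p, xi)        # j = theta', i = theta
c3 = np.allclose(lhs, 0.0, atol=1e-9)

print("normalization:", c1, "| positive prior:", c2, "| stationarity:", c3)
```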

3. Security Game Model

In the game involving multiple players, we consider two distinct groups: the defenders, denoted by $\mathcal{L} = \{1, \ldots, l\}$, and the attackers, denoted by $\mathcal{F} = \{1, \ldots, f\}$. Utilizing conventional game-theoretic notation, we denote the player index set as $\mathcal{I}$. The joint action profile of all participants, excluding agent $q$, is represented as $x^{-q} = (x^h)_{h \in \mathcal{I} \setminus \{q\}}$. In the context of our Stackelberg game, $\mathcal{I} = \mathcal{L} \cup \mathcal{F}$, where players are indexed by $\iota = \overline{1, l}$ for the defenders and $r = \overline{1, f}$ for the attackers. This indexing scheme allows for clear differentiation between the defenders and attackers, facilitating the analysis of strategic interactions and decision-making dynamics within the game framework.
Let us say the defenders' strategies are denoted by $x^\iota \in X^\iota$, where $X$ is a convex and compact set and
$$x^\iota := \operatorname{col}\big(\xi_{ik}^\iota\big), \qquad X^\iota := \Xi_{adm}^\iota, \qquad X := \prod_{\iota=1}^{l} X^\iota,$$
such that $\operatorname{col}$ is a column-converting operator for the matrix $\xi_{ik}^\iota$. The joint strategy of the defenders is denoted by $x = (x^1, \ldots, x^l) \in X$, while the complementary strategy is denoted by $\hat{x} = x^{-\iota}$. Here,
$$x^{-\iota} := \big(x^1, \ldots, x^{\iota-1}, x^{\iota+1}, \ldots, x^l\big) \in X^{-\iota} := \prod_{h=1,\, h \neq \iota}^{l} X^h,$$
such that $x = (x^\iota, x^{-\iota})$.
Similarly, let $y^r \in Y^r$ ($r = \overline{1, f}$) represent the attackers' strategies and let $Y$ be a convex and compact set, where
$$y^r := \operatorname{col}\big(\xi_{ik}^r\big), \qquad Y^r := \Xi_{adm}^r \ \big(r = \overline{1, f}\big), \qquad Y := \prod_{r=1}^{f} Y^r.$$
The joint strategy of the attackers is denoted by $y = (y^1, \ldots, y^f) \in Y$, and $\hat{y} = y^{-r}$ denotes the strategies of the players complementary to $y^r$, i.e.,
$$y^{-r} := \big(y^1, \ldots, y^{r-1}, y^{r+1}, \ldots, y^f\big) \in Y^{-r} := \prod_{h=1,\, h \neq r}^{f} Y^h,$$
such that $y = (y^r, y^{-r})$.
In the scenario we investigate, defenders and attackers engage in a Nash game within the framework of a simultaneous play game that is restricted to a Stackelberg game. In simultaneous play games, the Nash equilibrium serves as the notion of equilibrium. Here, each player independently selects their strategy, aiming to maximize their own utility given the strategies chosen by other players. A Nash equilibrium is reached when no player can unilaterally deviate from their strategy to achieve a better outcome. In contrast, hierarchical play games, such as Stackelberg games, involve sequential decision-making, where one player (the defender) commits to a strategy first, and the other player (the attacker) observes this strategy before determining their own. The equilibrium concept in hierarchical play games differs from the Nash equilibrium, as it involves optimizing strategies in anticipation of the actions of other players in the sequence of play. Formalizing the Stackelberg game within this framework involves defining the roles of defenders and attackers, specifying their strategies, and analyzing the equilibrium outcomes under the sequential decision-making structure.
Let $W = X \times Y$. If a player $q \in \mathcal{I}$ has a cost function $\varphi^q(w^q, w^{-q})$, then we have the following:
Definition 1.
The joint strategy $w^* \in W$ is a Nash equilibrium if, for each $q \in \mathcal{I}$,
$$\varphi^q(w^*) \leq \varphi^q\big(w^q, w^{*,-q}\big), \quad \forall\, w^q \in W^q.$$
In the game progression, defenders forecast the attackers' behavior by playing non-cooperatively, anticipating the attackers' actions at a Stackelberg equilibrium. This strategic anticipation enables defenders to make informed decisions when committing to their strategies, considering the likely responses of attackers within the framework of the Nash equilibrium. By strategically forecasting attackers' behavior, defenders aim to optimize their own outcomes in the game. To achieve the game's aim, the defenders must first identify a strategy $x^* = (x^{*,1}, \ldots, x^{*,l}) \in X$ that is satisfactory for any admissible $x^\iota \in X^\iota$ and any $\iota = \overline{1, l}$:
$$\Lambda(x) := \sum_{\iota=1}^{l} \Big[ \min_{x^\iota \in X^\iota} \varphi^\iota\big(x^\iota, x^{-\iota}\big) - \varphi^\iota\big(x^\iota, x^{-\iota}\big) \Big].$$
Here, $\varphi^\iota(x^\iota, x^{-\iota})$ is the cost function of defender $\iota$, who plays the strategy $x^\iota \in X^\iota$ while the complementary defenders play the strategy $x^{-\iota} \in X^{-\iota}$.
Taking the utopia point as
$$\bar{x}^\iota := \arg\min_{x^\iota \in X^\iota} \varphi^\iota\big(x^\iota, x^{-\iota}\big),$$
one can describe Equation (6) in the following manner:
$$\Lambda(x) := \sum_{\iota=1}^{l} \Big[ \varphi^\iota\big(\bar{x}^\iota, x^{-\iota}\big) - \varphi^\iota\big(x^\iota, x^{-\iota}\big) \Big].$$
The functions $\varphi^\iota(x^\iota, x^{-\iota})$, $\iota = \overline{1, l}$, are supposed to be convex in all their arguments.
The function $\Lambda(x)$ fulfills Nash's condition
$$\max_{x \in X} g(x) = \sum_{\iota=1}^{l} \Big[ \varphi^\iota\big(\bar{x}^\iota, x^{-\iota}\big) - \varphi^\iota\big(x^\iota, x^{-\iota}\big) \Big] \leq 0$$
for any $x^\iota \in X^\iota$ and all $\iota = \overline{1, l}$.
A strategy $x^* \in X$ is a Nash equilibrium if
$$x^* \in \operatorname{Arg\,min}_{x \in X} \Lambda(x).$$
If $\Lambda(x)$ is strictly convex, then
$$x^* = \arg\min_{x \in X} \Lambda(x).$$
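To make the role of $\Lambda(x)$ concrete, the toy sketch below assumes two defenders with bilinear costs $\varphi^\iota(x^\iota, x^{-\iota}) = (x^\iota)^\top C^\iota x^{-\iota}$ on probability simplices; the cost matrices are illustrative choices, not values from the paper. Because the inner minimum of a linear function over a simplex is attained at a vertex, the utopia term reduces to the smallest entry of $C^\iota x^{-\iota}$; the resulting $\Lambda$ is nonpositive and vanishes at the equilibrium of this toy game.

```python
import numpy as np

# Assumed bilinear costs for two defenders on probability simplices (illustrative only):
#   phi^1(x1, x2) = x1 @ C1 @ x2,   phi^2(x2, x1) = x2 @ C2 @ x1
C1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
C2 = -C1

def regret_lambda(x1, x2):
    """Lambda(x) = sum over players of (utopia cost - actual cost).

    Each cost is linear in the player's own strategy, so the inner minimum over
    the simplex is attained at a vertex, i.e. the smallest entry of C @ x_other.
    """
    g1 = C1 @ x2                      # player 1's cost vector against x2
    g2 = C2 @ x1                      # player 2's cost vector against x1
    return (g1.min() - x1 @ g1) + (g2.min() - x2 @ g2)

uniform = np.array([0.5, 0.5])
skewed = np.array([0.9, 0.1])
print("Lambda at the equilibrium (uniform, uniform):", regret_lambda(uniform, uniform))  # 0.0
print("Lambda away from equilibrium:", regret_lambda(skewed, uniform))                   # < 0
```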
In addition, in this process, the attackers attempt to reach one of the Nash equilibria and try to find a joint strategy $y^* = (y^{*,1}, \ldots, y^{*,f}) \in Y$ that is satisfactory for any admissible $y^r \in Y^r$ and any $r = \overline{1, f}$:
$$\Psi(y) := \sum_{r=1}^{f} \Big[ \min_{y^r \in Y^r} \varphi^r\big(y^r, y^{-r}\big) - \varphi^r\big(y^r, y^{-r}\big) \Big].$$
Here, $\varphi^r(y^r, y^{-r})$ is the cost function of attacker $r$, who plays the strategy $y^r \in Y^r$ while the complementary attackers play the strategy $y^{-r} \in Y^{-r}$.
Bearing in mind the utopia point
$$\bar{y}^r := \arg\min_{y^r \in Y^r} \varphi^r\big(y^r, y^{-r}\big),$$
it is possible to rewrite the definition of $\Psi$ as follows:
$$\Psi(y) := \sum_{r=1}^{f} \Big[ \varphi^r\big(\bar{y}^r, y^{-r}\big) - \varphi^r\big(y^r, y^{-r}\big) \Big].$$
The functions $\varphi^r(y^r, y^{-r})$, $r = \overline{1, f}$, are supposed to be convex in all their arguments.
The function $\Psi(y)$ satisfies the Nash condition
$$\max_{y \in Y} w(y) = \sum_{r=1}^{f} \Big[ \varphi^r\big(\bar{y}^r, y^{-r}\big) - \varphi^r\big(y^r, y^{-r}\big) \Big] \leq 0$$
for any $y^r \in Y^r$ and all $r = \overline{1, f}$.
The defenders' goal is to find a solution to the optimization challenge given by the following definition:
Definition 2.
A Stackelberg game is a game with $l$ defenders and $f$ attackers; it is called a Stackelberg–Nash game if the defenders strive to solve the problem presented by
$$\min_{x \in X} \Big\{ \Lambda(x \mid y) \ \Big|\ y \in \operatorname{Arg\,min}_{v \in Y} \Psi(v \mid x) \Big\}$$
and the attackers try to solve the problem
$$\min_{y \in Y} \Psi(y \mid x),$$
such that
$$\Lambda(x \mid y) := \sum_{\iota=1}^{l} \Big[ \varphi^\iota\big(\bar{x}^\iota, x^{-\iota} \mid y\big) - \varphi^\iota\big(x^\iota, x^{-\iota} \mid y\big) \Big]$$
and
$$\Psi(y \mid x) := \sum_{r=1}^{f} \Big[ \varphi^r\big(\bar{y}^r, y^{-r} \mid x\big) - \varphi^r\big(y^r, y^{-r} \mid x\big) \Big].$$
Then, the equilibrium notion in games is the Nash equilibrium in simultaneous play games and the Stackelberg equilibrium in hierarchical play games.
Definition 3.
(Stackelberg equilibrium) In a game with $l$ defenders, the strategy $x^* \in X$ is said to be a Stackelberg–Nash equilibrium strategy for the defenders if
$$\max_{y \in \rho(x^*)} \Lambda\big(x^* \mid y\big) \leq \max_{y \in \rho(x)} \Lambda\big(x \mid y\big) \quad \text{for all } x \in X,$$
such that
$$\rho(x) = \big\{ y \in Y \ \big|\ \Psi(y \mid x) \leq \Psi(v \mid x),\ \forall\, v \in Y \big\}$$
is the best-reply set of the attackers.
The definition of the Stackelberg equilibrium given above can be restated for the set of the attackers when the set $\rho(x)$ is substituted with the set of Nash equilibria, considering that the defenders play the strategy $x$ and that the attackers' best reply is then a Nash equilibrium.
The general format of the iterative version ($n = 0, 1, \ldots$) of the proximal-gradient method for computing the Stackelberg equilibrium is as follows:
1. The first half-step (prediction):
$$\hat{v}_n = \arg\min_{\tilde{w} \in X \times Y} \Big\{ \tfrac{1}{2} \big\| \tilde{w} - \tilde{v}_n \big\|^2 + \gamma\, \Psi\big(\tilde{w}, \tilde{v}_n\big) \Big\}$$
2. The second (basic) half-step:
$$\tilde{v}_{n+1} = \arg\min_{\tilde{w} \in X \times Y} \Big\{ \tfrac{1}{2} \big\| \tilde{w} - \tilde{v}_n \big\|^2 + \gamma\, \Psi\big(\tilde{w}, \hat{v}_n\big) \Big\}$$
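A minimal numerical sketch of these two half-steps is given below for an assumed toy case: $\Psi$ is taken to be bilinear (a zero-sum matrix game between one defender and one attacker on probability simplices), so each half-step reduces to a Euclidean projection of a gradient step onto the simplex, which is the classical extragradient realization of the scheme. The step size $\gamma$, the cost matrix, the dimensions, and the initial strategies are illustrative choices, not values from the paper.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # illustrative zero-sum cost matrix
gamma = 0.1                                 # illustrative step size
x = np.array([0.9, 0.1])                    # defender strategy on the simplex
y = np.array([0.2, 0.8])                    # attacker strategy on the simplex

for n in range(500):
    # First half-step (prediction): proximal step evaluated at the current point.
    x_hat = project_simplex(x - gamma * (A @ y))
    y_hat = project_simplex(y + gamma * (A.T @ x))
    # Second (basic) half-step: proximal step evaluated at the predicted point.
    x = project_simplex(x - gamma * (A @ y_hat))
    y = project_simplex(y + gamma * (A.T @ x_hat))

print("x* ~", np.round(x, 3), " y* ~", np.round(y, 3))   # both approach (0.5, 0.5)
```

The prediction step stabilizes the iteration for game mappings that are not strongly monotone, which is why the scheme uses two projections per iteration instead of one.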

4. Learning

Reinforcement learning (RL) addresses the challenge of learning optimal actions in an unknown Markov setting through interaction [27]. Players aim to devise strategies that minimize their expected costs. In a discrete-time setting, at each time step $t$, players observe a cost $v^i$, where $i$ denotes the player. The overarching objective is to minimize the average cost $V$ over time, guiding decision-making towards more favorable outcomes. By iteratively adjusting actions based on observed costs and environmental feedback, RL algorithms seek to discover optimal strategies that lead to the most advantageous long-term outcomes in dynamic and uncertain environments.
We consider RL problems where the underlying environment is a Bayesian–Markov decision process. Specifically, in the Stackelberg game, players indexed by $\iota = \overline{1, l}$ are the defenders and players indexed by $r = \overline{1, f}$ are the attackers. We develop the results in general for a player $i \in \mathcal{I}$ ($\mathcal{I} = \mathcal{L} \cup \mathcal{F}$) and specify defenders and attackers when necessary. We propose a setting based on experiences, which is computed by accumulating the number $\chi$ of experiences as follows: let $h_t \in H_t$ ($t \in \mathcal{T}$) be the history at time $t$ and $(\theta_t, a_t) \in \Theta \times A$. For each $(\theta_{t+1}, \theta_t, a_t) \in \Theta \times \Theta \times A$, let
$$\chi^i\big(\tilde{\theta}_{t+1}^i, \tilde{\theta}_t^i, a_t^i\big) = \sum_{t \in \mathcal{T}} \mathbb{1}\big(\tilde{\theta}_{t+1}^i, \tilde{\theta}_t^i, a_t^i\big) = \mathbb{E}\big[\mathbb{1}(\tilde{\theta}_{t+1}^i)\, \mathbb{1}(\tilde{\theta}_t^i)\, \mathbb{1}(a_t^i)\big]$$
denote the experimentally observed absolute average number of transitions from type $\tilde{\theta}_t^i$ when applying action $a_t^i$. We obtain the normally distributed and asymptotically unbiased maximum likelihood estimator of $p^i(\tilde{\theta}_{t+1}^i \mid \tilde{\theta}_t^i, a_t^i)$, given by
$$\tilde{p}^i\big(\tilde{\theta}_{t+1}^i \mid \tilde{\theta}_t^i, a_t^i\big) = \frac{\chi^i\big(\tilde{\theta}_{t+1}^i, \tilde{\theta}_t^i, a_t^i\big)}{\chi^i\big(\tilde{\theta}_t^i, a_t^i\big)}.$$
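The estimator $\tilde{p}^i$ is a count-and-normalize rule. The sketch below (hypothetical three-type chain under one fixed action; the values are illustrative) simulates a trajectory, accumulates the transition counts $\chi$, and normalizes each row; for a long enough trajectory the estimate approaches the true kernel.

```python
import numpy as np

rng = np.random.default_rng(1)
n_theta = 3
# Hypothetical true kernel p(theta' | theta) for one fixed action; rows sum to 1.
p_true = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
])

counts = np.zeros((n_theta, n_theta))   # chi[theta, theta'] = observed transitions
theta = 0
for _ in range(20000):
    theta_next = rng.choice(n_theta, p=p_true[theta])
    counts[theta, theta_next] += 1
    theta = theta_next

# Maximum likelihood estimate: chi(theta', theta, a) / chi(theta, a).
p_hat = counts / counts.sum(axis=1, keepdims=True)
print(np.round(p_hat, 3))               # close to p_true for a long trajectory
```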
Therefore, the cost estimate of player $i \in \mathcal{I}$ is
$$\tilde{v}^i\big(\tilde{\theta}_{t+1}^i, \tilde{\theta}_t^i, a_t^i\big) = \frac{\sum_{t \in \mathcal{T}} \alpha^i\big(\tilde{\theta}_{t+1}^i, \tilde{\theta}_t^i, a_t^i\big)\, \chi^i\big(\tilde{\theta}_{t+1}^i, \tilde{\theta}_t^i, a_t^i\big)}{\chi^i\big(\tilde{\theta}_{t+1}^i, \tilde{\theta}_t^i, a_t^i\big)},$$
where
$$\alpha^i\big(\tilde{\theta}_{t+1}^i, \tilde{\theta}_t^i, a_t^i\big) = v^i\big(\tilde{\theta}_{t+1}^i, \tilde{\theta}_t^i, a_t^i\big) + \alpha^i r$$
and $\alpha^i \leq u^i$, such that $r$ randomly takes the value $-1$ or $1$.
We assume that $\det P^i(\theta_t^i \mid m_t^i) \neq 0$, so that $\Sigma = \big[P^i(\theta_t^i \mid m_t^i)\big]^{-1}$ (the inverse of $P^i(\theta_t^i \mid m_t^i)$) exists. We suggest a framework based on experiences that is computed by counting the number $\chi$ of unobserved experiences recursively as follows:
$$\chi^i\big(\acute{m}_t^i, m_t^i, a_t^i\big) = \sum_{t \in \mathcal{T}} \mathbb{1}\big(\acute{m}_t^i \mid m_t^i, a_t^i\big) = \mathbb{E}\big[\mathbb{1}(\acute{m}_t^i)\, \mathbb{1}(m_t^i)\, \mathbb{1}(a_t^i)\big],$$
where $\chi^i(m_t^i, a_t^i)$ is the number of visits to the state $m_t^i$ under action $a_t^i$ and $\chi^i(\acute{m}_t^i, m_t^i, a_t^i)$ is the total number of times that the system changes from $m_t^i$ to $\acute{m}_t^i$ when applying $a_t^i$. We obtain that $\chi^i(m_t^i, a_t^i) = \sum_{\acute{m}_t^i \in M^i} \chi^i(\acute{m}_t^i, m_t^i, a_t^i)$.
The estimated transition matrix $\tilde{p}^i(\tilde{\theta}_{t+1}^i \mid \tilde{\theta}_t^i, a_t^i)$ is given by
$$\tilde{p}^i\big(\tilde{\theta}_{t+1}^i \mid \tilde{\theta}_t^i, a_t^i\big) = \frac{\sum_{\acute{m}_t^i \in M^i} \sum_{m_t^i \in M^i} \Sigma^i\big(m_t^i, \tilde{\theta}_t^i\big)\, \Sigma^i\big(\acute{m}_t^i, \tilde{\theta}_{t+1}^i\big)\, \chi^i\big(\acute{m}_t^i, m_t^i, a_t^i\big)}{\sum_{\hat{\theta}_{t+1}^i \in \Theta^i} \sum_{\acute{m}_t^i \in M^i} \sum_{m_t^i \in M^i} \Sigma^i\big(m_t^i, \tilde{\theta}_t^i\big)\, \Sigma^i\big(\acute{m}_t^i, \hat{\theta}_{t+1}^i\big)\, \chi^i\big(\acute{m}_t^i, m_t^i, a_t^i\big)}.$$
In order to recover the variables of interest, we calculate, for each defender $\iota$,
$$\sigma^\iota\big(m_t^\iota \mid \theta_t^\iota\big) = \frac{\sum_{a_t^\iota \in A^\iota} \xi^\iota\big(\theta_t^\iota, m_t^\iota, a_t^\iota\big)}{\sum_{m_t^\iota \in M^\iota} \sum_{a_t^\iota \in A^\iota} \xi^\iota\big(\theta_t^\iota, m_t^\iota, a_t^\iota\big)},$$
$$\pi^\iota\big(a_t^\iota \mid m_t^\iota\big) = \frac{1}{|\Theta^\iota|} \sum_{\theta_t^\iota \in \Theta^\iota} \frac{\xi^\iota\big(\theta_t^\iota, m_t^\iota, a_t^\iota\big)}{\sum_{a_t^\iota \in A^\iota} \xi^\iota\big(\theta_t^\iota, m_t^\iota, a_t^\iota\big)},$$
$$P^\iota\big(\theta_t^\iota\big) = \sum_{m_t^\iota \in M^\iota} \sum_{a_t^\iota \in A^\iota} \xi^\iota\big(\theta_t^\iota, m_t^\iota, a_t^\iota\big) > 0,$$
and, for each attacker $r$,
$$\sigma^r\big(m_t^r \mid \theta_t^r\big) = \frac{\sum_{a_t^r \in A^r} \xi^r\big(\theta_t^r, m_t^r, a_t^r\big)}{\sum_{m_t^r \in M^r} \sum_{a_t^r \in A^r} \xi^r\big(\theta_t^r, m_t^r, a_t^r\big)},$$
$$\pi^r\big(a_t^r \mid m_t^r\big) = \frac{1}{|\Theta^r|} \sum_{\theta_t^r \in \Theta^r} \frac{\xi^r\big(\theta_t^r, m_t^r, a_t^r\big)}{\sum_{a_t^r \in A^r} \xi^r\big(\theta_t^r, m_t^r, a_t^r\big)},$$
$$P^r\big(\theta_t^r\big) = \sum_{m_t^r \in M^r} \sum_{a_t^r \in A^r} \xi^r\big(\theta_t^r, m_t^r, a_t^r\big) > 0.$$
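These recovery formulas are direct marginalizations of $\xi$. A compact transcription for a single player is sketched below with a hypothetical $\xi$ tensor of shape $(|\Theta|, |M|, |A|)$; the resulting $\sigma$, $\pi$, and $P$ are row-stochastic (or sum to one), which the final line verifies.

```python
import numpy as np

rng = np.random.default_rng(2)
n_theta, n_m, n_a = 4, 4, 2                     # dimensions matching the numerical example
xi = rng.random((n_theta, n_m, n_a))            # hypothetical xi(theta, m, a)
xi /= xi.sum()                                  # normalize so that all entries sum to 1

# sigma(m | theta): marginalize over actions, then normalize over messages for each type.
sigma = xi.sum(axis=2) / xi.sum(axis=(1, 2))[:, None]

# pi(a | m): average over types of xi(theta, m, a) / sum_a xi(theta, m, a).
ratio = xi / xi.sum(axis=2, keepdims=True)
pi = ratio.mean(axis=0)

# P(theta): full marginal of xi over messages and actions.
P = xi.sum(axis=(1, 2))

print(np.allclose(sigma.sum(axis=1), 1.0),      # each row of sigma is a distribution over m
      np.allclose(pi.sum(axis=1), 1.0),         # each row of pi is a distribution over a
      np.isclose(P.sum(), 1.0))                 # P is a distribution over theta
```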
The BRL algorithm for the SSG is outlined in Algorithms 1 and 2. At time $t = 0$, for each defender $\iota = 1, \ldots, l$ and each attacker $r = 1, \ldots, f$, we initialize the parameters. Specifically, the initial belief states for the defenders and the attackers are drawn as $\tilde{\theta}_0^\iota \sim P^\iota(\theta_0^\iota)$ and $\tilde{\theta}_0^r \sim P^r(\theta_0^r)$, where $P^\iota$ and $P^r$ are probability distributions over the possible initial types $\theta_0$ of the defenders and attackers, respectively. These beliefs are updated throughout the game as players take actions and observe the outcomes.
Algorithm 1: BRL algorithm for the SSG
Input: $\iota = \overline{1, l}$ and $r = \overline{1, f}$.
Set $t = 1$.
Draw $\tilde{\theta}_0^\iota \sim P^\iota(\theta_0^\iota)$ and $\tilde{\theta}_0^r \sim P^r(\theta_0^r)$.
Set $\tilde{p}^\iota = p^\iota$ and $\tilde{p}^r = p^r$.
Fix $\varepsilon > 0$.
Compute $e_t^\iota \leftarrow \mathrm{Error}(\hat{p}_{t-1}^\iota, \hat{p}_t^\iota)$ and $e_t^r \leftarrow \mathrm{Error}(\hat{p}_{t-1}^r, \hat{p}_t^r)$.
while ($\varepsilon > e_t^\iota$ and $\varepsilon > e_t^r$) do
(i) $x \leftarrow \arg\min_{x \in X} \{\Lambda(x \mid y) \mid y \in \arg\min_{\alpha \in Y} \Psi(\alpha \mid x)\}$ and $y \leftarrow \arg\min_{y \in Y} \Psi(y \mid x)$.
(ii) Recover $\pi$ via Equations (14)–(17) and $\sigma$ via Equations (13)–(16).
(iii) Draw $m_t^\iota \sim \sigma^\iota(m_t^\iota \mid \tilde{\theta}_t^\iota)$ and $m_t^r \sim \sigma^r(m_t^r \mid \tilde{\theta}_t^r)$.
(iv) Draw $a_t^\iota \sim \pi^\iota(a_t^\iota \mid m_t^\iota)$ and $a_t^r \sim \pi^r(a_t^r \mid m_t^r)$.
(v) Draw $\tilde{\theta}_{t+1}^\iota \sim \hat{p}^\iota(\theta_{t+1}^\iota \mid \theta_t^\iota, a_t^\iota)$ and $\tilde{\theta}_{t+1}^r \sim \hat{p}^r(\theta_{t+1}^r \mid \theta_t^r, a_t^r)$.
(vi) Update $\chi^\iota(\tilde{\theta}_{t+1}^\iota, \tilde{\theta}_t^\iota, a_t^\iota) \mathrel{+}= 1$, $\chi^\iota(\tilde{\theta}_t^\iota, a_t^\iota) \mathrel{+}= 1$ and $\chi^r(\tilde{\theta}_{t+1}^r, \tilde{\theta}_t^r, a_t^r) \mathrel{+}= 1$, $\chi^r(\tilde{\theta}_t^r, a_t^r) \mathrel{+}= 1$.
(vii) Update $\tilde{v}^\iota(\tilde{\theta}_{t+1}^\iota, \tilde{\theta}_t^\iota, a_t^\iota)$ via Equation (11) and $\tilde{v}^r(\tilde{\theta}_{t+1}^r, \tilde{\theta}_t^r, a_t^r)$ via Equation (11).
(viii) Update $\tilde{p}^\iota(\tilde{\theta}_{t+1}^\iota \mid \tilde{\theta}_t^\iota, a_t^\iota)$ via Equation (12) and $\tilde{p}^r(\tilde{\theta}_{t+1}^r \mid \tilde{\theta}_t^r, a_t^r)$ via Equation (12).
(ix) Compute $e_t^\iota \leftarrow \mathrm{Error}(\hat{p}_{t-1}^\iota, \hat{p}_t^\iota)$ and $e_t^r \leftarrow \mathrm{Error}(\hat{p}_{t-1}^r, \hat{p}_t^r)$.
(x) Set $\tilde{\theta}_t^\iota = \tilde{\theta}_{t+1}^\iota$, $\tilde{\theta}_t^r = \tilde{\theta}_{t+1}^r$, $p^\iota = \hat{p}^\iota$, $p^r = \hat{p}^r$, $v^\iota = \tilde{v}^\iota$, $v^r = \tilde{v}^r$, and $t = t + 1$.
end while
Algorithm 2: Computation of the error
function Error($\hat{p}_{t-1}, \hat{p}_t$)
$e = \sum_{a_t \in A} \operatorname{tr}\big((\hat{p}_{t-1} - \hat{p}_t)(\hat{p}_{t-1} - \hat{p}_t)^\top\big)$.
return $e$.
end function
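The stopping rule of Algorithm 2 can be written in a few lines: for each action it accumulates the trace of $(\Delta p)(\Delta p)^\top$, i.e., the squared entrywise distance between consecutive transition estimates. The sketch below uses hypothetical array shapes (actions × types × types) for illustration.

```python
import numpy as np

def error(p_prev, p_curr):
    """Algorithm 2: e = sum_a trace((p_prev_a - p_curr_a) @ (p_prev_a - p_curr_a).T).

    p_prev, p_curr: arrays of shape (n_actions, n_theta, n_theta) holding the
    estimated transition matrices p_hat(theta' | theta, a) at steps t-1 and t.
    """
    diff = p_prev - p_curr
    return sum(np.trace(d @ d.T) for d in diff)

# Hypothetical illustration: two nearly identical estimates give a small error.
rng = np.random.default_rng(3)
p_prev = rng.random((2, 4, 4))
p_prev /= p_prev.sum(axis=2, keepdims=True)
p_curr = p_prev + 1e-3 * rng.standard_normal(p_prev.shape)
print(error(p_prev, p_curr))
```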
The transition probabilities for each player's states are defined as $\tilde{p}^\iota = p^\iota(\tilde{\theta}_1^\iota \mid \tilde{\theta}_0^\iota, a_0^\iota)$ and $\tilde{p}^r = p^r(\tilde{\theta}_1^r \mid \tilde{\theta}_0^r, a_0^r)$, where $\tilde{p}^\iota$ and $\tilde{p}^r$ represent the likelihood of transitioning from one state to another given the actions $a_0^\iota$ and $a_0^r$ taken by the defenders and the attackers. These transition probabilities are essential for predicting the future states of the game based on current strategies.
An important aspect of the algorithm is to account for estimation errors in the parameters of the model. We denote the allowable error in the estimated parameters by $\varepsilon > 0$, which sets a threshold for how much deviation is acceptable. During each step of the algorithm, the actual error $e_t^\iota$ for the defenders and $e_t^r$ for the attackers is computed to track the accuracy of the estimated parameters.
At each time step $t$, the BRL algorithm computes the optimal strategy $\sigma(m_t^l \mid \theta_t^l)$ and the corresponding policy $\pi(a_t^l \mid m_t^l)$ for both the defenders and the attackers. The strategy $\sigma$ determines the probability distribution over the possible messages $m_t^l$ given the type $\theta_t^l$, while the policy $\pi$ dictates the probability of taking a particular action $a_t^l$ given the message $m_t^l$. These strategies and policies are central to the Bayesian Stackelberg equilibrium, where players optimize their decisions based on their beliefs about the other players' actions.
For each step, random messages $m_t^\iota$ and $m_t^r$ are drawn from the respective optimal strategies $\sigma^\iota(m_t^\iota \mid \tilde{\theta}_t^\iota)$ and $\sigma^r(m_t^r \mid \tilde{\theta}_t^r)$. Based on these messages, actions are chosen randomly according to the policies $\pi^\iota(a_t^\iota \mid m_t^\iota)$ for the defenders and $\pi^r(a_t^r \mid m_t^r)$ for the attackers.
Once the actions are selected and executed, the game transitions to the next state. The updated states $\tilde{\theta}_{t+1}^\iota$ and $\tilde{\theta}_{t+1}^r$ are determined by the transition probabilities $\hat{p}^\iota(\theta_{t+1}^\iota \mid \theta_t^\iota, a_t^\iota)$ and $\hat{p}^r(\theta_{t+1}^r \mid \theta_t^r, a_t^r)$, respectively.
After transitioning to the new states, we update the counts $\chi^\iota(\tilde{\theta}_{t+1}^\iota, \tilde{\theta}_t^\iota, a_t^\iota)$ and $\chi^r(\tilde{\theta}_{t+1}^r, \tilde{\theta}_t^r, a_t^r)$, which track how frequently each state–action pair is encountered. These counts are critical for refining the players' estimates of the value functions $\tilde{v}^\iota(\tilde{\theta}_{t+1}^\iota, \tilde{\theta}_t^\iota, a_t^\iota)$ and $\tilde{v}^r(\tilde{\theta}_{t+1}^r, \tilde{\theta}_t^r, a_t^r)$. These value functions represent the expected cost for each player given their current and next states and the actions taken.
As the game progresses, the algorithm updates the transition probabilities $\tilde{p}^\iota(\tilde{\theta}_{t+1}^\iota \mid \tilde{\theta}_t^\iota, a_t^\iota)$ and $\tilde{p}^r(\tilde{\theta}_{t+1}^r \mid \tilde{\theta}_t^r, a_t^r)$ based on observed outcomes. At each step, the mean-square errors $e_t^\iota$ and $e_t^r$ are calculated to measure the difference between the estimated and actual outcomes. The algorithm continues to iterate as long as these errors exceed the predefined threshold $\varepsilon$, ensuring that the estimates converge to an acceptable level of accuracy.
Once the errors $e_t^\iota$ and $e_t^r$ fall below the threshold $\varepsilon$, the resulting strategy profile $\sigma(m_t \mid \theta_t)$ and policy $\pi(a_t \mid m_t)$ form a Bayesian Stackelberg equilibrium. This equilibrium represents the optimal solution for the game, where the defenders and attackers have optimized their respective utilities given the available information.

5. Random Walk Algorithm

Random walks are basic models in probability theory with deep mathematical features and a wide range of applications in SSGs [16,28]. The long-term asymptotic behavior of the defenders' pursuit of the attackers is a key question for these models.
The random walk process for the SSG is described by Algorithm 3 as follows. We will look at a game in which four players compete: defenders and attackers. Both sides have the option of using a randomized approach. The defenders and attackers can both move to any state at the same time. The defenders' goal is to catch the attackers in the fewest possible steps. The attackers, on the other hand, choose a strategy that maximizes the amount of time it takes for the defenders to catch them. If one of the defenders moves to a state where an attacker is, the attacker is said to be caught. The game is over when the defenders have caught the attackers. We refer to a random walk as a discrete-time Markov process $(X_t, t \geq 0)$ on a finite type space $\Theta$. The random walk is assumed to be time-homogeneous; this means that the distribution of $X_{t+1}$ given $(X_0, \ldots, X_t)$ depends solely on $X_t$ and not on the time $t$. To ensure that the random walk does not become "stuck" in some part of the state space, we also require some type of irreducibility; the set $\Theta$ is finite.
Given a type $\theta_t$, messages are chosen randomly by the defenders and the attackers from the behavior strategies, $m_t^\iota \sim \sigma^\iota(m_t^\iota \mid \theta_t^\iota)$ and $m_t^r \sim \sigma^r(m_t^r \mid \theta_t^r)$. They then choose actions $a_t^\iota \sim \pi^\iota(a_t^\iota \mid m_t^\iota)$ and $a_t^r \sim \pi^r(a_t^r \mid m_t^r)$. To minimize and maximize the chance of damage, the transition matrices $p^\iota(\theta_{t+1}^\iota \mid \theta_t^\iota, a_t^\iota)$ and $p^r(\theta_{t+1}^r \mid \theta_t^r, a_t^r)$ are used to select the next state in the process. The defenders and attackers aim to finish the SSG at the next time step ($t + 1$). The process continues until the SSG meets the game-over or capture status given by the following:
$$\exists\, \iota \in \mathcal{L},\ r \in \mathcal{F},\ \theta_{t+1}^\iota \in \Theta^\iota,\ \theta_{t+1}^r \in \Theta^r : \quad \mathbb{1}\big(\theta_{t+1}^\iota\big)\, \mathbb{1}\big(\theta_{t+1}^r\big) = 1.$$
Algorithm 3: Random walk for the SSG
1. Choose randomly an initial type for the defenders, $\theta_0^\iota \sim P^\iota(\theta_0^\iota)$, and the attackers, $\theta_0^r \sim P^r(\theta_0^r)$.
2. Let $\pi^\iota(a_t^\iota \mid m_t^\iota)$ and $\pi^r(a_t^r \mid m_t^r)$ be the resulting policies and $\sigma^\iota(m_t^\iota \mid \theta_t^\iota)$ and $\sigma^r(m_t^r \mid \theta_t^r)$ the resulting strategies.
3. For each attacker $r$ and defender $\iota$, do:
4. Choose messages $m_t^\iota \sim \sigma^\iota(m_t^\iota \mid \theta_t^\iota)$ and $m_t^r \sim \sigma^r(m_t^r \mid \theta_t^r)$.
5. Choose actions $a_t^\iota \sim \pi^\iota(a_t^\iota \mid m_t^\iota)$ and $a_t^r \sim \pi^r(a_t^r \mid m_t^r)$.
6. From $p^\iota(\theta_{t+1}^\iota \mid \theta_t^\iota, a_t^\iota)$ select $\theta_{t+1}^\iota$ considering $p^\iota(\cdot \mid \theta_t^\iota, a_t^\iota)$, and from $p^r(\theta_{t+1}^r \mid \theta_t^r, a_t^r)$ select $\theta_{t+1}^r$ considering $p^r(\cdot \mid \theta_t^r, a_t^r)$.
7. Update the original values by adding $\theta_{t+1}^\iota$ and $\theta_{t+1}^r$ to the random walk process.
8. Repeat steps 3–7 until the status expressed in Equation (19) is met.
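A compact simulation sketch of this random walk is given below for one defender and one attacker with hypothetical (randomly generated) priors, strategies, policies, and kernels: messages are drawn from $\sigma$, actions from $\pi$, next types from $p$, and the walk stops when the defender and the attacker occupy the same type, which corresponds to the capture status of Equation (19).

```python
import numpy as np

rng = np.random.default_rng(4)
n_theta, n_m, n_a = 4, 4, 2

def random_player():
    """Hypothetical prior P(theta_0), strategy sigma(m|theta), policy pi(a|m), kernel p(theta'|theta,a)."""
    P0 = rng.random(n_theta); P0 /= P0.sum()
    sigma = rng.random((n_theta, n_m)); sigma /= sigma.sum(axis=1, keepdims=True)
    pi = rng.random((n_m, n_a)); pi /= pi.sum(axis=1, keepdims=True)
    p = rng.random((n_theta, n_a, n_theta)); p /= p.sum(axis=2, keepdims=True)
    return P0, sigma, pi, p

defender, attacker = random_player(), random_player()

def step(theta, player):
    P0, sigma, pi, p = player
    m = rng.choice(n_m, p=sigma[theta])          # message from the behavior strategy
    a = rng.choice(n_a, p=pi[m])                 # action from the policy
    return rng.choice(n_theta, p=p[theta, a])    # next type from the transition kernel

theta_d = rng.choice(n_theta, p=defender[0])     # initial types drawn from the priors
theta_a = rng.choice(n_theta, p=attacker[0])
t = 0
while theta_d != theta_a:                        # capture status: both occupy the same type
    theta_d, theta_a = step(theta_d, defender), step(theta_a, attacker)
    t += 1
print(f"attacker captured at type {theta_d} after {t} steps")
```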

6. Numerical Example

We will employ three players to simulate the SSG: one defender ($\iota = 1$) and two attackers ($r = 1, 2$). The primary objective of the defender is to minimize or halt the damage inflicted by the attackers, while the attackers aim to maximize the expected damage they can cause to a set of targets. This dynamic reflects a typical security scenario, where an agent (the defender) must safeguard resources or locations, while adversarial players (attackers) try to exploit weaknesses.
The game is structured in a sequential manner and is commonly modeled as a Stackelberg game. The defender acts as the leader, committing to a strategy first, which is observed by the attackers. Once the defender’s strategy is known, the attackers, acting as followers, select their strategies in response. This turn-based nature highlights the strategic advantage of the defender, as their actions guide the attackers’ responses.
Each player in the game has a set of possible states and actions. In this particular setup, we assume that each player has $|\Theta| = 4$ states, representing different possible situations or configurations in which they can find themselves during the game. Additionally, each player has $|A| = 2$ possible actions they can take in any given state, representing the limited yet strategic decisions available at each step of the game.
The key variables that drive the analysis of the SSG are the policies ( π ), behavior strategies ( σ ), and distribution vectors (P) for both the defender and attackers. The policy ( π ) defines the overall plan of action for each player, essentially specifying the probabilities with which they select each available action in a given state. Behavior strategies ( σ ) represent the specific actions taken by players as they progress through the game, reflecting how their strategies evolve over time. The distribution vectors (P) capture the likelihood of players being in different states throughout the course of the game.
These variables will be recovered analytically, meaning that through mathematical derivations and strategic analysis, we will determine the optimal policies, strategies, and distributions for both the defender and the attackers. This will allow us to predict the outcomes of the game under various scenarios, providing insight into how effective the defender’s strategy is in mitigating the attackers’ impact and how the attackers can best exploit potential vulnerabilities.
The generated policies are supplied by using the proposed approach as follows:
$$\pi^1\big(a^1 \mid m^1\big) = \begin{pmatrix} 0.3925 & 0.6075 \\ 0.3925 & 0.6075 \\ 0.3925 & 0.6075 \\ 0.3961 & 0.6039 \end{pmatrix}, \quad \pi^2\big(a^2 \mid m^2\big) = \begin{pmatrix} 0.4748 & 0.5252 \\ 0.4748 & 0.5252 \\ 0.4748 & 0.5252 \\ 0.2492 & 0.7508 \end{pmatrix}, \quad \pi^3\big(a^3 \mid m^3\big) = \begin{pmatrix} 0.7903 & 0.2097 \\ 0.7904 & 0.2096 \\ 0.7904 & 0.2096 \\ 0.6700 & 0.3300 \end{pmatrix}$$
The defender’s resultant (behavior) strategies are provided by
$$\sigma^1\big(m^1 \mid \theta^1\big) = \begin{pmatrix} 0.3204 & 0.3205 & 0.3204 & 0.0387 \\ 0.0223 & 0.0223 & 0.0223 & 0.9332 \\ 0.2253 & 0.2253 & 0.2253 & 0.3240 \\ 0.2500 & 0.2500 & 0.2500 & 0.2500 \end{pmatrix}$$
and the (behavior) strategies of the attackers are
$$\sigma^2\big(m^2 \mid \theta^2\big) = \begin{pmatrix} 0.0919 & 0.0919 & 0.0919 & 0.7243 \\ 0.0573 & 0.0573 & 0.0573 & 0.8280 \\ 0.2493 & 0.2495 & 0.2494 & 0.2518 \\ 0.2499 & 0.2501 & 0.2500 & 0.2500 \end{pmatrix}, \quad \sigma^3\big(m^3 \mid \theta^3\big) = \begin{pmatrix} 0.0287 & 0.0287 & 0.0287 & 0.9139 \\ 0.2501 & 0.2503 & 0.2501 & 0.2495 \\ 0.2491 & 0.2494 & 0.2498 & 0.2517 \\ 0.2498 & 0.2498 & 0.2500 & 0.2504 \end{pmatrix}$$
The distribution vectors are as follows:
$$P^1\big(\theta^1\big) = \begin{pmatrix} 0.4657 & 0.4492 & 0.0448 & 0.0403 \end{pmatrix}, \quad P^2\big(\theta^2\big) = \begin{pmatrix} 0.1097 & 0.1752 & 0.5679 & 0.1472 \end{pmatrix}, \quad P^3\big(\theta^3\big) = \begin{pmatrix} 0.3492 & 0.2401 & 0.1065 & 0.3041 \end{pmatrix}$$
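As a quick sanity check on the reported values, each recovered policy and strategy matrix is a conditional distribution, so its rows should sum to one. The snippet below verifies this for the defender's policy $\pi^1$ reported above.

```python
import numpy as np

# Defender 1's policy pi^1(a | m) as reported above (4 messages x 2 actions).
pi1 = np.array([
    [0.3925, 0.6075],
    [0.3925, 0.6075],
    [0.3925, 0.6075],
    [0.3961, 0.6039],
])
print(np.allclose(pi1.sum(axis=1), 1.0))   # True: each row is a distribution over actions
```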
The algorithm ensures the convergence of the strategies shown in Figure 1, Figure 2 and Figure 3, optimizing decision-making and adapting efficiently to dynamic threats in real-time environments. Figure 4, Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9 demonstrate the convergence of the error in the reinforcement learning algorithm, illustrating its efficiency and effectiveness in optimizing strategies and adapting to dynamic environments over time. By integrating prior information and utilizing a continuous-time approach, the model enhances its robustness, ensuring that the security system remains resilient against evolving threats. These visualizations confirm the model’s capability to maintain optimal performance and adaptiveness.
The primary goal of patrol scheduling is to efficiently allocate security teams to safeguard fixed targets, taking into account limited workforce resources. To address this challenge, Algorithm 3 is utilized to illustrate a realization of the SSG, where target visitations are determined based on the resulting game policy $\pi(\cdot \mid m)$. This algorithm aids in the planning and deployment of patrols to maximize the protection of critical assets.
Figure 10 provides a visual representation of an instance of the SSG, showcasing the outcomes of engagements between attackers and defenders. In this particular scenario, attacker 2 is apprehended at state 1 by defender 1 after 10 time steps, indicating the successful interception of the threat. Additionally, attacker 1 is captured at state 1, this time after 12 time steps, by defender 1. These outcomes demonstrate the effectiveness of the patrol planning process in thwarting attacks and protecting the designated targets. By employing Algorithm 3 and visualizing the SSG outcomes in Figure 10, security planners can gain insights into the effectiveness of their patrol scheduling strategies. This approach allows for the optimization of patrol routes and deployment strategies, ultimately enhancing the security posture despite workforce limitations.
In the alternate realization depicted in Figure 11, a different sequence of events unfolds within the SSG. Here, attacker 2 faces a swift apprehension at state 1 by defender 1 after three time steps. Simultaneously, attacker 1 is intercepted at state 1, albeit after a longer duration of nine time steps, by defender 1. Despite the varying durations and defender involvement, both attackers are ultimately captured. The prompt resolution of attacker 2 by defender 1 underscores the importance of timely responses in security operations. This rapid intervention effectively neutralizes the threat before it can progress further. Meanwhile, the prolonged engagement leading to the capture of attacker 1 highlights the complexities and challenges inherent in security patrolling and target defense.
With the apprehension of both attackers, the game reaches its conclusion. The successful outcomes achieved by the defenders validate the effectiveness of the patrol scheduling strategies implemented. Such realizations provide valuable insights for refining future security tactics, emphasizing the importance of adaptability and resource allocation in mitigating security risks effectively.

7. Conclusions

This research introduces a novel mathematical framework that enhances security measures by integrating optimization techniques to mitigate existing risks while addressing real-time threats. The framework is based on a Bayesian–Markov Stackelberg game model, designed to capture adversarial interactions in security scenarios with incomplete information, where players have limited knowledge of each other’s strategies.
A key feature of this approach is the use of an optimization strategy grounded in the proximal-gradient method, which significantly improves computational efficiency compared to traditional Bayesian–Markov Stackelberg solutions. By leveraging this method, the framework effectively reduces security risks associated with persistent threats. Notably, it incorporates a unique reinforcement learning algorithm that derives rewards from a prior Bayesian distribution. This contribution is particularly significant, as it integrates structured historical data into decision-making, enabling more adaptive and informed responses to evolving security threats.
The framework’s practical effectiveness is demonstrated through numerical examples, showcasing its ability to utilize past information to guide present decisions. These results provide both theoretical insights and empirical evidence of its applicability in real-world security scenarios. Overall, this research presents a comprehensive security optimization approach that bridges the gap between mitigating existing risks and countering ongoing threats in real time. By combining advanced optimization techniques with an innovative reinforcement learning paradigm, this framework offers a promising avenue for strengthening security in complex, dynamic environments.
Looking ahead, several challenges remain. A key technical endeavor is the implementation of a game-theoretic approach using novel reinforcement learning methods, where capture conditions depend on states and incomplete information, enhancing the realism of patrolling schedules. Additionally, applying this methodology in fortification games and conducting controlled trials to evaluate player responses to game-theoretic scheduling in real-world settings represent significant challenges. These trials can yield valuable insights into the practical implications of game theory for strategic decision-making and scheduling in uncertain environments, ultimately ensuring more robust and adaptable security measures.

Funding

This research received no external funding.

Data Availability Statement

All data required for this article are included within the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Trejo, K.K.; Clempner, J.B.; Poznyak, A.S. A Stackelberg security game with random strategies based on the extraproximal theoretic approach. Eng. Appl. Artif. Intell. 2015, 37, 145–153.
  2. Solis, C.U.; Clempner, J.B.; Poznyak, A.S. Modeling Multi-Leader-Follower Non-Cooperative Stackelberg Games. Cybern. Syst. 2016, 47, 650–673.
  3. Clempner, J.B.; Poznyak, A.S. A nucleus for Bayesian Partially Observable Markov Games: Joint observer and mechanism design. Eng. Appl. Artif. Intell. 2020, 95, 103876.
  4. Clempner, J.B.; Poznyak, A.S. Optimization and Games for Controllable Markov Chains: Numerical Methods with Application to Finance and Engineering; Springer: Cham, Switzerland, 2023.
  5. Harsanyi, J.C.; Selten, R. A generalized Nash solution for two-person bargaining games with incomplete information. Manag. Sci. 1972, 18, 80–106.
  6. Conitzer, V.; Sandholm, T. Computing the Optimal Strategy to Commit to. In Proceedings of the 7th ACM Conference on Electronic Commerce, Ann Arbor, MI, USA, 11–15 June 2006; pp. 82–90.
  7. Paruchuri, P.J.; Pearce, P.; Marecki, J.; Tambe, M.; Ordonez, F.; Kraus, S. Playing Games for Security: An Efficient Exact Algorithm for Solving Bayesian Stackelberg Games. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems, Estoril, Portugal, 12–16 May 2008; Volume 2, pp. 895–902.
  8. Jain, M.; Kiekintveld, C.; Tambe, M. Quality-Bounded Solutions for Finite Bayesian Stackelberg Games: Scaling up. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems, Taipei, Taiwan, 2–6 May 2011; Volume 3, pp. 997–1004.
  9. Yin, Z.; Tambe, M. A Unified Method for Handling Discrete and Continuous Uncertainty in Bayesian Stackelberg Games. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, Valencia, Spain, 4–8 June 2012; Volume 2, pp. 855–862.
  10. Wilczyński, A.; Jakóbik, A.; Kołodziej, J. Stackelberg Security Games: Models, Applications and Computational Aspects. J. Telecommun. Inf. Technol. 2016, 3, 70–79.
  11. Sayed Ahmed, I. Stackelberg-Based Anti-Jamming Game for Cooperative Cognitive Radio Networks. Ph.D. Thesis, University of Calgary, Calgary, AB, Canada, 2017.
  12. Gan, J.; Elkind, E.; Wooldridge, M. Stackelberg Security Games with Multiple Uncoordinated Defenders. In Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Stockholm, Sweden, 10–15 July 2018; pp. 703–711.
  13. Basilico, N.; Gatti, N.; Villa, F. Asynchronous multi-robot patrolling against intrusions in arbitrary topologies. In Proceedings of the Conference for the Advancement of Artificial Intelligence (AAAI), Atlanta, GA, USA, 11–15 July 2010; pp. 345–350.
  14. Clempner, J.B. Learning attack-defense response in continuous-time discrete-states Stackelberg Security Markov games. J. Exp. Theor. Artif. Intell. 2022, to be published.
  15. Albarran, S.; Clempner, J.B. A Stackelberg security Markov game based on partial information for strategic decision making against unexpected attacks. Eng. Appl. Artif. Intell. 2019, 81, 408–419.
  16. Trejo, K.K.; Clempner, J.B.; Poznyak, A.S. Adapting Attackers and Defenders Preferred Strategies: A Reinforcement Learning Approach in Stackelberg Security Games. J. Comput. Syst. Sci. 2018, 95, 35–54.
  17. Solis, C. Ship differential game approach for multiple players: Stackelberg security games. Optim. Control Appl. Methods 2020, 41, 312–326.
  18. Alcantara-Jiménez, G.; Clempner, J. Repeated Stackelberg security games: Learning with incomplete state information. Reliab. Eng. Syst. Saf. 2020, 195, 106695.
  19. Clempner, J.B.; Poznyak, A.S. Stackelberg Security Games: Computing The Shortest-Path Equilibrium. Expert Syst. Appl. 2015, 42, 3967–3979.
  20. Rahman, M.; Oh, J. Online Learning for Patrolling Robots Against Active Adversarial Attackers. In Recent Trends and Future Technology in Applied Intelligence; Lecture Notes in Computer Science; Mouhoub, M., Sadaoui, S., Ait Mohamed, O., Ali, M., Eds.; Springer: Montreal, QC, Canada, 2018; Volume 10868.
  21. Wang, Y.; Shi, Z.; Yu, L.; Wu, Y.; Singh, R.; Joppa, L.; Fang, F. Deep Reinforcement Learning for Green Security Games with Real-Time Information. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, HI, USA, 27 January–1 February 2019; pp. 1401–1408.
  22. Li, T.; Li, H.; Pan, Y.; Xu, T.; Zheng, Z.; Zhu, Q. Meta Stackelberg Game: Robust Federated Learning against Adaptive and Mixed Poisoning Attacks. arXiv 2024, arXiv:2410.17431.
  23. Sengupta, S.; Kambhampati, S. Multi-agent Reinforcement Learning in Bayesian Stackelberg Markov Games for Adaptive Moving Target Defense. arXiv 2020, arXiv:2007.10457.
  24. Shukla, P.; An, L.; Chakrabortty, A.; Duel-Hallen, A. A robust Stackelberg game for cyber-security investment in networked control systems. IEEE Trans. Control Syst. Technol. 2022, 31, 856–871.
  25. Clempner, J.B.; Poznyak, A.S. Analytical Method for Mechanism Design in Partially Observable Markov Games. Mathematics 2021, 9, 321.
  26. Clempner, J.B. A Markovian Stackelberg game approach for computing an optimal dynamic mechanism. Comput. Appl. Math. 2021, 40, 186.
  27. Asiain, E.; Clempner, J.B.; Poznyak, A.S. Controller Exploitation-Exploration: A Reinforcement Learning Architecture. Soft Comput. 2019, 23, 3591–3604.
  28. Benson, A.R.; Gleich, D.F.; Lim, L.K. The Spacey Random Walk: A Stochastic Process for Higher-Order Data. SIAM Rev. 2017, 59, 321–345.
Figure 1. Convergence for defender 1’s strategy $\xi^1$.
Figure 2. Convergence for attacker 2’s strategy $\xi^2$.
Figure 3. Convergence for attacker 3’s strategy $\xi^3$.
Figure 4. Convergence for defender 1’s error of $p^1$.
Figure 5. Convergence for attacker 2’s error of $p^2$.
Figure 6. Convergence for attacker 3’s error of $p^3$.
Figure 7. Convergence for defender 1’s error of $v^1$.
Figure 8. Convergence for attacker 2’s error of $v^2$.
Figure 9. Convergence for attacker 3’s error of $v^3$.
Figure 10. Game realization 1: the attackers are intercepted by defender 1 (blue).
Figure 11. Game realization 2: the attackers are intercepted by defender 1 (blue).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
