Article

Deceptive Path Planning via Count-Based Reinforcement Learning under Specific Time Constraint

College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(13), 1979; https://doi.org/10.3390/math12131979
Submission received: 20 May 2024 / Revised: 24 June 2024 / Accepted: 24 June 2024 / Published: 26 June 2024

Abstract

Deceptive path planning (DPP) aims to find a path that minimizes the probability of the observer identifying the real goal of the observed before the observed reaches it. It is important for addressing issues such as public safety, strategic path planning, and logistics route privacy protection. Existing traditional methods often rely on “dissimulation”—hiding the truth—to obscure paths while ignoring time constraints. Building upon the theory of probabilistic goal recognition based on cost difference, we proposed DPP_Q, a DPP method based on count-based Q-learning, for solving DPP problems in discrete path-planning domains under specific time constraints. Furthermore, to extend this method to continuous domains, we proposed a new model of probabilistic goal recognition called the Approximate Goal Recognition Model (AGRM) and verified its feasibility in discrete path-planning domains. Finally, we proposed DPP_PPO, a DPP method based on proximal policy optimization for continuous path-planning domains under specific time constraints. DPP methods of this kind have not yet been explored in the field of path planning. Experimental results show that, in discrete domains, DPP_Q enhances the average deceptiveness of paths more effectively than traditional methods, improving it by 12.53% on average. In continuous domains, DPP_PPO shows significant advantages over random-walk methods. Both DPP_Q and DPP_PPO demonstrate good applicability in path-planning domains with uncomplicated obstacles.

1. Introduction

Deception is a longstanding topic in computer science and a significant hallmark of intelligence [1]. In the field of multi-agent systems and robotics, path planning is the cornerstone of collaborative task completion, and deception is an important factor enabling agents to gain an edge in games and adversarial environments. Deceptive planning in adversarial environments empowers humans or AI to conceal their true intentions and mislead the situational awareness of opponents. It is prevalent in many multi-agent applications, such as multi-agent negotiations [2,3] and fugitive pursuit [4], and finds applications in scenarios like network intrusion [5], robot soccer matches [6], privacy protection [7], and many other real-world problems. Deceptive path planning (DPP) emerges as a representative task in this context.
Consider a combat environment that can be abstracted as a two-dimensional grid map containing one real goal and one false goal. A military commander is mobilizing the troops. If they fail to maintain the secrecy of the real goal, they risk being ambushed by the enemy. This necessitates that the commander maneuvers the troops efficiently to ensure they arrive at the real goal on time while minimizing the exposure of their true intentions to the enemy. The troops need to plan a deceptive path, departing from the start node and reaching the real target within a specific time constraint while also deceiving the observers as much as possible to blur their judgment of the true destination of the troops. Alternatively, consider an example involving a warehouse treasure. Many areas within the warehouse can potentially store the treasure, with thieves monitoring the movements of those attempting to retrieve it. If someone were to head directly towards the treasure area, their intentions could be exposed prematurely. However, taking a route that is ambiguous or deceptive can delay the exposure of the location of the treasure. However, the existing DPP methods cannot solve such problems with specific time constraints. Therefore, we propose two count-based reinforcement learning methods (DPP_Q for discrete domains and DPP_PPO for continuous domains) to solve these kinds of DPP problems.
There are three main technical contributions in this article. Firstly, we proposed a method for solving DPP problems under specific time constraints based on count-based Q-learning (DPP_Q), which is applicable to discrete path-planning domains. DPP_Q builds on a traditional cost-difference-based goal recognition model [8], which was proposed to solve goal recognition problems in discrete domains. Because this goal recognition model makes use of the full observation sequence, its posterior probability calculation for each goal is precise; in this paper, we refer to it as the Precise Goal Recognition Model (PGRM). We denote the quantified deception value as “deceptiveness”. DPP_Q takes the deceptiveness at every grid point fully into consideration so that it can improve the average deceptiveness of the path. Different from traditional DPP methods [9,10], DPP_Q introduces specific time constraints. Assuming the discrete path-planning domains are fully observable and incorporating time cost into the state space of the observed, the observer can adequately consider the impact of the time cost already incurred by the observed on the posterior probability of the real goal, which traditional DPP methods cannot achieve. This allows for a precise quantitative calculation of the deceptiveness of every grid point covered by the path.
Secondly, it is challenging to compute the shortest time cost from the start node to the real or false goals in continuous path-planning domains. This renders PGRM impractical in continuous domains. To address this issue, the maps were discretized into grids, which facilitates the use of simple path-planning methods to compute the shortest time cost. But it may result in the observation sequence containing illegal actions in the discrete map because of the continuous action space (such as the velocity angle of the observed agent being 37.3°, whereas the velocity angle may only have special angles like 0°, 45°, 90°, 135°, etc., for a discrete map). Consequently, computing the time cost the observed has consumed becomes a new challenge. Therefore, we proposed some improvements to PGRM to eliminate all illegal actions in discrete domains. Inspired by the magnitude-based DPP method [11], we proposed a new model for goal recognition called Approximate Goal Recognition Model (AGRM). Even if the observer captures the complete observation sequence, only the last observation point, i.e., the current location of the observed, is utilized for calculating the posterior probabilities of the goals. Since AGRM does not consider the complete observation sequence, the deceptiveness calculated is an approximation. However, it may offer a method for the observer to perform goal recognition in continuous path-planning domains. We conducted statistical experiments to verify that there is no significant difference in the average deceptiveness of paths under discrete path-planning domains when the reward function of DPP_Q is based on PGRM and AGRM. Therefore, it is reliable to divide the continuous map into discrete grids and use AGRM-based DPP_Q to solve DPP problems.
Thirdly, we proposed a method to solve DPP problems under specific time constraints based on count-based proximal policy optimization (DPP_PPO), which utilizes deep reinforcement learning to achieve DPP tasks in continuous domains. AGRM provides reliable rewards to the observed trained by DPP_PPO. Obviously, this method also introduces specific time constraints.
The structure of this paper is as follows. Firstly, we introduced the concepts and traditional methods of goal recognition and DPP problems. Secondly, we proposed DPP_Q for deceptive path generation in discrete path-planning domains; its effectiveness is validated through experiments on 100 random 10 × 10 maps. Thirdly, we introduced PGRM and AGRM and conducted statistical experiments to verify whether there is a significant difference in the solutions of DPP problems in discrete path-planning domains when the reward function is based on PGRM versus AGRM. Then, we proposed the DPP_PPO method for DPP in continuous path-planning domains; its effectiveness is validated through experiments on a set of 50 × 50 maps, followed by quantitative analyses of the experimental results. Finally, we discussed the advantages and limitations of the experimental results and models and gave future research directions.

2. Background

2.1. Goal Recognition Based on Competitive Relationship and Level of Rationality

In most cases, there are two important factors in studying the goal recognition of agents: the competitive relationship and the level of rationality. The relationship between the observed and the observer can be classified as Keyhole [12,13,14,15], where the observed are not sensitive to the behavior of the observer, as there is no competitive or cooperative relationship between the two parties; Adversarial [16], where the observed adopt a hostile attitude towards the behavior of the observer; and Intended [17], where the observed adopt a cooperative and open attitude towards the observer, even assisting it in the recognition process. The study of DPP algorithms focuses on the adversarial relationship between the observer and the observed.
According to the level of rationality exhibited by the observed, behavior can be classified into two categories: optimal and non-optimal [18]. Intent recognition based on planning often assumes that the observed is optimal [19]. This is because planning-based intent recognition methods generate effective plans from the entire set of goals to achieve intent recognition, requiring the method of generating plans to be optimal. However, there are also studies on intent recognition that consider the non-optimality of the behavior of the observed. Keren et al. consider Bounded Non-Optimal [9], which assumes that agents have a certain budget or resources that allow them to deviate from the optimal path. If it is assumed that agents are fully rational, deviating from the optimal path results in deceptive behavior.

2.2. Plan Recognition as Planning

Techniques such as Bayesian networks [20], Hidden Markov Models [21], and tree grammars [22] can address goal recognition problems. In essence, goal recognition is a planning problem [23], as agents continuously select grid points to move forward in the map, which is itself a planning process. In most cases, the observation sequence in plan recognition can be matched against the operation sequences stored in a plan library, indicating a clear relationship between plan recognition and planning. Ramirez and Geffner [23] matched observation results with pre-existing plans and created two plans for each possible goal in the set, calculating their costs separately. Assuming agents are rational, it is possible to evaluate the goals they pursue. Building on this idea, Masters et al. proposed the traditional cost-difference-based goal recognition model [8], establishing the probability of each goal based on the cost difference between the optimal cost that matches the current observations and the optimal cost that ignores the observation sequence.

2.3. Deceptive Path-Planning Methods

Maithripala et al. [24] conducted research on trajectory deception in radar networks for drone countermeasure systems. Hajieghrary et al. [25] considered the configuration parameters between drones and set false trajectories, using automatic control methods to ensure consistency between drones and set trajectories. Lee et al. [26] also conducted research on drone trajectory deception in three-dimensional space environments, mainly from the perspective of control theory. They used the feasible sequential quadratic programming method to first convert the drone trajectory-solving problem into an optimal control problem and then solve it. All of the above algorithms require certain prior knowledge.
Masters et al. [10] formalized DPP problems in discrete path-planning domains, demonstrating that qualitative comparisons of the posterior probabilities of the possible goals do not require the full observation sequence, only its last node. Based on this conclusion, they introduced the concept of the Last Deceptive Point (LDP) and proposed a set of traditional DPP methods based on the A* algorithm [27] and heuristic pruning. These methods do not explicitly model time constraints and mainly consider the concept of “dissimulation”, without precise quantification of the deceptiveness of each step (grid). Cai et al. [28] studied DPP problems in dynamic environments based on two-dimensional grid maps. Xu et al. [11] applied a mixed integer programming method to estimate the deceptiveness of each grid based on magnitude, proposing a DPP method based on mixed integer programming. It weights time consumption against quantified deceptiveness and achieves good results, but it does not consider the full observation sequence and lacks a time constraint.
In recent years, Liu et al. [29] introduced the Ambiguity Model, which weights the optimal Q-value of path planning with deceptiveness to obtain an ambiguous path. Subsequently, Lewis [30] addressed the issue of poor performance of the Ambiguity Model when learning the Q-function through model-free methods by proposing the Deceptive Exploration Ambiguity Model, which achieved better effects.

2.4. Count-Based Reinforcement Learning Exploration Methods

Count-based methods are essentially exploratory approaches in reinforcement learning grounded in intrinsic motivation. Intrinsic motivation, originating from concepts in behavioral and psychological studies, was first discussed in 1950 when Harlow observed the sustained interest of monkeys in solving mechanical puzzles without external rewards [31]. In general, intrinsic motivation stems from the natural interest of humans in activities that offer novelty and curiosity [32], driving humans to focus on the exploration process itself.
Count-based methods draw inspiration from the UCB (upper confidence bound) algorithm and adhere to the philosophy of OFU (optimism in the face of uncertainty) [33]. For instance, MBIE-EB (model-based interval estimation exploration bonus) [34], in the context of episodic MDPs (Markov decision processes), derives upper confidence bounds for the state–action pair reward values and state transition vectors, incorporating exploration bonuses into the state–action value update formula. Bellemare [35] was among the first to discuss how to adapt the UCB method, traditionally used for exploration in reinforcement learning, to deep reinforcement learning, employing the CTS model. To address the limitations of the CTS model in interpretability, scalability, and data efficiency, Ostrovski et al. [36] replaced the CTS model with a convolutional neural network density model based on PixelCNN (pixel convolutional neural networks) [37]. Tang et al. [38] aimed to establish a simple counting method and to resolve the issue that the PixelCNN and CTS models are limited to images and cannot be used in continuous control. They proposed an exploration algorithm that maps high-dimensional states to an integer space using a hash function and defines exploration rewards accordingly. Although this algorithm relies heavily on the choice of hash function to ensure an appropriate discretization granularity, it is straightforward to implement and can be applied to continuous action spaces.

3. Preliminaries

Before explaining the specific details of DPP_Q, AGRM, and DPP_PPO, it is necessary to explain their underlying techniques. In this section, we first presented the definition of discrete path-planning domains, followed by the definition of path-planning problems, and finally, the definition of deceptive path-planning problems [10].
Definition 1.
A discrete path-planning domain is a triple $D = \langle N, E, c \rangle$:
  • $N$ is a non-empty set of nodes (or locations);
  • $E \subseteq N \times N$ is a set of edges between nodes;
  • $c: E \to \mathbb{R}_0^+$ returns the cost of traversing each edge.
A path $\pi$ in a discrete path-planning domain is a sequence of nodes $\pi = \langle n_0, n_1, \ldots, n_k \rangle$ in which $\langle n_i, n_{i+1} \rangle \in E$ for each $i \in \{0, 1, \ldots, k-1\}$. The cost of $\pi$ is the cost of traversing all edges in $\pi$ from the start node to the goal node, i.e., $cost(\pi) = \sum_{i=0}^{k-1} c(\langle n_i, n_{i+1} \rangle)$. A path-planning problem in a discrete path-planning domain is the problem of finding a path from the start node to the goal node.
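To make Definition 1 concrete, the following Python sketch encodes a discrete path-planning domain as an 8-connected grid, the setting used throughout this paper; the class and function names (GridDomain, path_cost) are illustrative and not taken from the authors' implementation.

```python
import math

# Minimal sketch of Definition 1 on an 8-connected grid: nodes are free (x, y)
# cells, edges connect neighbouring cells, and c assigns cost 1 to straight
# moves and sqrt(2) to diagonal moves.
class GridDomain:
    MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1),
             (1, 1), (1, -1), (-1, 1), (-1, -1)]

    def __init__(self, width, height, obstacles):
        self.width, self.height = width, height
        self.obstacles = set(obstacles)      # blocked cells

    def nodes(self):
        # N: the set of free locations
        return {(x, y) for x in range(self.width) for y in range(self.height)
                if (x, y) not in self.obstacles}

    def neighbours(self, node):
        # E: edges to the free 8-neighbours of a node
        x, y = node
        for dx, dy in self.MOVES:
            nxt = (x + dx, y + dy)
            if 0 <= nxt[0] < self.width and 0 <= nxt[1] < self.height \
                    and nxt not in self.obstacles:
                yield nxt

    def edge_cost(self, a, b):
        # c: E -> R_0^+, i.e. 1 for straight moves and sqrt(2) for diagonal moves
        return math.hypot(b[0] - a[0], b[1] - a[1])


def path_cost(domain, path):
    # cost(pi) = sum of c(<n_i, n_{i+1}>) along pi = <n_0, ..., n_k>
    return sum(domain.edge_cost(a, b) for a, b in zip(path, path[1:]))
```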
Definition 2.
A path-planning problem is a tuple $\langle D, s, g \rangle$:
  • $D = \langle N, E, c \rangle$ is a discrete path-planning domain;
  • $s \in N$ is the start node;
  • $g \in N$ is the goal node.
The solution path for a path-planning problem in the discrete path-planning domain $D$ is a path $\pi = \langle n_0, n_1, \ldots, n_k \rangle$ in which $s = n_0$ and $g = n_k$. An optimal path is a solution path with the lowest cost among all solution paths. The optimal cost for two nodes is the cost of an optimal path between them, which is denoted by $optc(n_i, n_j)$. The A* algorithm [27], as a well-known best-first search algorithm, is used by typical AI approaches to find the optimal path between two nodes.
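The optimal cost $optc(n_i, n_j)$ can be computed with any optimal search procedure. The paper uses A*; the hedged sketch below uses Dijkstra's algorithm instead, which returns the same optimal cost on these small grids and keeps the code short. It reuses the GridDomain sketch above.

```python
import heapq

def optimal_cost(domain, start, goal):
    # Dijkstra's algorithm over the grid domain; returns optc(start, goal).
    dist = {start: 0.0}
    frontier = [(0.0, start)]
    while frontier:
        d, node = heapq.heappop(frontier)
        if node == goal:
            return d
        if d > dist.get(node, float("inf")):
            continue                         # stale queue entry
        for nxt in domain.neighbours(node):
            nd = d + domain.edge_cost(node, nxt)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(frontier, (nd, nxt))
    return float("inf")                      # goal unreachable
```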
DPP presents a departure from conventional path planning. While typical path planning endeavors to find the most cost-effective route to a destination, DPP introduces a layer of complexity by acknowledging the potential for the movements to be tracked. In this context, the objective extends beyond mere navigation to the goal; it includes minimizing the likelihood of an observer identifying the real goal among a set of possibilities. In the scenario of hiding a treasure among several possible locations, the purpose of DPP is to minimize the likelihood of the observer correctly identifying the actual location of the treasure. According to Definitions 1 and 2, the definition of a DPP problem is shown in Definition 3.
Definition 3.
A deceptive path-planning (DPP) problem is a quintuple $\langle D, s, g_r, G, P \rangle$:
  • $D$ is a discrete path-planning domain;
  • $s \in N$ is the start node;
  • $g_r \in N$ is the real goal node;
  • $G = \{g_r\} \cup G_f$ is a set of possible goal nodes, in which $g_r$ is the single real goal and $G_f$ is the set of false goals;
  • $P(G|O)$ is the posterior probability distribution over $G$ based on the observation sequence $O$. Its calculation is determined by the goal recognition model of the observer.
The quality of the solution depends on the magnitude, density, and extent of the deceptiveness [10]. Now introducing time constraints, we approached the analysis of DPP problems from a new perspective: since the deceptiveness is defined at each individual step, each grid point traversed by the observed from the start to the end of the path was expected to exhibit a higher level of quantified deceptiveness. The quality of solutions depends on the average deceptiveness of all nodes, excluding the start node s and the real goal node g r , in the deceptive path planned by the observed. This is also the main focus of the upcoming research and discussion in this paper.

4. Method

This section was divided into three parts. Firstly, we presented the definition of PGRM and proposed AGRM. Secondly, we introduced the DPP method in discrete domains—DPP_Q. Finally, we introduced the DPP method in continuous domains—DPP_PPO.

4.1. PGRM and AGRM

In Definition 3, the calculation of P G | O is determined by the goal recognition model of the observer. Assuming that the path-planning domains are discrete, fully observable, and deterministic, we presented PGRM and proposed AGRM below.
Definition 4.
The Precise Goal Recognition Model (PGRM) is a quadruple $\langle D, G, O, P \rangle$:
  • $D$ is a discrete path-planning domain;
  • $G = \{g_r\} \cup G_f$ is a set of goals consisting of the real goal $g_r$ and a set of false goals $G_f$;
  • $O = \langle o_1, o_2, \ldots, o_{|O|} \rangle$ is the observation sequence, representing the sequence of all grid points that the observed has passed through from the start node to the current node;
  • $P$ is a conditional probability distribution $P(G|O)$ over $G$, given the observation sequence $O$.
Based on Definition 4, the formula for calculating the cost difference is as follows:
$costdiff(s, O, g) = optc(s, O, g) - optc_{\neg}(s, O, g), \ \text{for all } g \in G$    (1)
where $costdiff(s, O, g)$ represents the cost difference for each $g$ in $G$; $optc(s, O, g)$ represents the optimal cost for the observed to reach $g$ from $s$ given the observation sequence $O$; and $optc_{\neg}(s, O, g)$ represents the optimal cost for the observed to reach $g$ from $s$ without the need to satisfy the observation sequence $O$.
From Equation (1), the cost difference for all goals in the set G can be computed. Subsequently, Equation (2) allows us to calculate the posterior probability distribution of them. The formula for computing P ( G | O ) for each goal is as follows:
$P(G|O) = \alpha \, e^{-\beta \, costdiff(s, O, G)} / (1 + e^{-\beta \, costdiff(s, O, G)})$    (2)
where $\alpha$ is a normalization factor and $\beta$ is a positive constant satisfying $0 \le \beta \le 1$. $\beta$ describes the degree to which the observer assumes the behavior of the observed to be rational or irrational, namely “soft rationality”; it indicates the sensitivity of the observer to whether the observed is rational. When the observed is fully rational, it will choose the least costly method (the optimal path) to reach its real goal. The larger $\beta$, the more the observer believes that the behavior of the observed is rational. When $\beta = 0$, $P(g_i|O) = P(g_j|O)$ for all $g_i, g_j \in G$, which means that the observed is considered completely irrational, so the posterior probabilities of all goals are equal.
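As a concrete reading of Equations (1) and (2), the sketch below computes the PGRM posterior for a fully observed path. It assumes that, under full observability, the cheapest plan consistent with the full observation sequence is the cost already spent along $O$ plus the optimal cost from the last observed node to the goal; path_cost and optimal_cost are the helper sketches from Section 3, and pgrm_posterior is an illustrative name rather than the authors' code.

```python
import math

def pgrm_posterior(domain, start, goals, observations, beta=1.0):
    # observations: the full sequence of visited nodes <o_1, ..., o_|O|>.
    # Under full observability, optc(s, O, g) = cost(O) + optc(o_|O|, g), so
    # costdiff(s, O, g) = cost(O) + optc(o_|O|, g) - optc(s, g).
    o_last = observations[-1]
    spent = path_cost(domain, observations)
    scores = {}
    for g in goals:
        costdiff = spent + optimal_cost(domain, o_last, g) - optimal_cost(domain, start, g)
        scores[g] = math.exp(-beta * costdiff) / (1.0 + math.exp(-beta * costdiff))
    alpha = 1.0 / sum(scores.values())       # normalization factor
    return {g: alpha * v for g, v in scores.items()}
```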
Obviously, PGRM is a goal recognition method applied in discrete path-planning domains. In order to apply it to continuous domains, we proposed the Approximate Goal Recognition Model (AGRM). Firstly, the map is discretized into grids. Then, by only using the last node of the observation sequence (the current node of the observed) to calculate the posterior probability of the goal set $G$, an approximate posterior probability can be obtained. The definition of AGRM is shown in Definition 5.
Definition 5.
The Approximate Goal Recognition Model (AGRM) is a quadruple $\langle D, G, o_{|O|}, P \rangle$:
  • $D$ is a path-planning domain;
  • $G = \{g_r\} \cup G_f$ is a set of goals consisting of a single real goal $g_r$ and a set of false goals $G_f$;
  • $o_{|O|}$ represents the current node of the observed captured by the observer;
  • $P$ is a conditional probability distribution $P(G|o_{|O|})$ over $G$, given the current observation point $o_{|O|}$.
Different from PGRM, which matches the full observation sequence, the observer based on AGRM only matches the current node of the observed, $o_{|O|}$. Specifically, the formula for calculating the cost difference is:
$costdiff(s, o_{|O|}, g) = optc(s, o_{|O|}) + optc(o_{|O|}, g) - optc(s, g)$    (3)
The formula for computing P ( G | o | O | ) for each goal is as follows:
$P(G|o_{|O|}) = \alpha \, e^{-\beta \, costdiff(s, o_{|O|}, G)} / (1 + e^{-\beta \, costdiff(s, o_{|O|}, G)})$    (4)
The difference between AGRM and PGRM lies in AGRM employing the approximate calculation $P(G|o_{|O|})$ instead of the precise calculation $P(G|O)$ used in PGRM.
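For concreteness, the corresponding AGRM computation per Equations (3) and (4) can be sketched as follows; optimal_cost is the helper sketch from Section 3, and the function name is illustrative.

```python
import math

def agrm_posterior(domain, start, goals, o_last, beta=1.0):
    # o_last: the current node of the observed, the only observation AGRM uses.
    scores = {}
    for g in goals:
        costdiff = (optimal_cost(domain, start, o_last)
                    + optimal_cost(domain, o_last, g)
                    - optimal_cost(domain, start, g))
        scores[g] = math.exp(-beta * costdiff) / (1.0 + math.exp(-beta * costdiff))
    alpha = 1.0 / sum(scores.values())        # normalization factor
    return {g: alpha * v for g, v in scores.items()}
```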
Figure 1 illustrates the differences and connections between the key elements of PGRM and AGRM for a goal recognition problem, where the blue dot represents the real goal $g_r$, the teal dot represents the false goal $g_f$, the orange dot represents the start node $s$, and the yellow dot represents the current node of the observed $o_{|O|}$. Assuming that there are no obstacles, the shortest path between two dots is represented by a line connecting them (i.e., the blue and teal lines). Figure 1a shows the elements of PGRM, where the orange curve represents the observation sequence $O$. Figure 1b shows the elements of AGRM, where the orange dashed line represents the observation sequence ignored by the observer.
Verifying through statistical experiments whether there is a significant difference in the solutions of DPP problems in discrete path-planning domains when using PGRM-based and AGRM-based reward functions is important for transitioning the research from discrete domains to continuous domains. Additionally, even without taking continuous domains into consideration, this work is meaningful. If the experimental results show no significant difference between PGRM and AGRM in this respect, then in adversarial environments it is unnecessary for the observed to know whether the observer is observing them in real time. When the observed realize they are being observed, they do not need to know how many traces they have already left. However, there is a premise that the goal recognition method of the observer is based on the theory of plan recognition as planning [23].

4.2. Deceptive Path Planning (DPP) via DPP_Q in Discrete Domains

The Q-learning algorithm, based on MDP, can directly address path-planning problems in discrete domains. A DPP problem under specific time constraints can be abstracted as a path-planning problem where each grid point is assigned a weight (deceptiveness), with the objective for the observed to pass through grid points with higher weight as much as possible during the process of reaching the real goal.
It is assumed that the observed cannot remain stationary at its current position so that its path length can be considered equivalent to the time cost. Since deceptiveness is defined at each grid point, the observed was expected to fully utilize all of the time to pass through grid points with higher deceptiveness. Although the observed cannot remain stationary, it can linger between grid points. Therefore, the observed is not expected to have any remaining time, as it can always use this time to linger in areas with higher deceptiveness, thereby increasing the average deceptiveness of the path. Based on this concept, we proposed DPP_Q. Our optimization objective is to maximize the total deceptiveness by allowing the observed to utilize the specific time as much as possible. This is approximately equivalent to maximizing the average deceptiveness at each grid point along the path.

4.2.1. The State Space

In MDP, the state space contains the environmental information and dynamic changes perceived by the observed. Based on DPP_Q, the state space of the observed includes its current coordinate $(x, y)$, as well as the number of straight and diagonal movements made by itself (denoted as $n_{straight}$ and $n_{diagonal}$, respectively). The state space $S$ is specifically defined as a four-dimensional vector:
$S = \{x, y, n_{straight}, n_{diagonal}\}$

4.2.2. The Action Space

The observed can take actions of two types: straight movements (up, down, left, right) and diagonal movements (up–left, down–left, up–right, down–right). The time required for a straight movement is $1$, while a diagonal movement takes $\sqrt{2}$. Specifically, the action space is a matrix:
$Actions = [[0, 1, 1, 0],\ [0, -1, 1, 0],\ [1, 0, 1, 0],\ [-1, 0, 1, 0],\ [1, 1, 0, 1],\ [1, -1, 0, 1],\ [-1, 1, 0, 1],\ [-1, -1, 0, 1]]$
It means the observed has eight actions, each represented as a 4-dimensional vector corresponding to the state space $S$. The first two dimensions update the current coordinate of the observed $(x, y)$, while the last two dimensions update the counts of straight and diagonal movements made by the observed. The special part of this design is that, given the state $s_T$, the next state $s_{T+1}$ can be obtained by simple vector addition:
$s_{T+1} = s_T + Actions[i]$
where $i \in \{0, 1, 2, \ldots, 7\}$ represents the index of the action taken by the observed.
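The following snippet illustrates this encoding: the numpy array ACTIONS lists the eight moves as (dx, dy, straight-count increment, diagonal-count increment); the sign layout shown is illustrative rather than the authors' exact ordering.

```python
import math
import numpy as np

# Eight actions as 4-dimensional vectors aligned with S = (x, y, n_straight, n_diagonal).
ACTIONS = np.array([
    [0,  1, 1, 0], [0, -1, 1, 0], [1,  0, 1, 0], [-1,  0, 1, 0],    # straight moves
    [1,  1, 0, 1], [1, -1, 0, 1], [-1,  1, 0, 1], [-1, -1, 0, 1],   # diagonal moves
])

state = np.array([5, 1, 0, 0])                # (x, y, n_straight, n_diagonal)
next_state = state + ACTIONS[4]               # successor state by vector addition
time_spent = next_state[2] + math.sqrt(2) * next_state[3]   # elapsed time cost
```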

4.2.3. The Reward Function

Since it is assumed that the observed is fully observable, after the observed moves one grid, the observer calculates the posterior probability $P(G|O)$ of all goals in $G$ based on PGRM. Let $P_Q^* = 1 - P(g_r|O)$ denote the deceptiveness at the current grid point obtained from this calculation, which serves as the primary component of the reward function. The objective is to maximize the average deceptiveness $\overline{P_Q^*}$ of the path. Denoting the path planned by the observed as $\pi = \langle n_0, n_1, \ldots, n_k \rangle$, the calculation of $\overline{P_Q^*}$ is as follows:
$\overline{P_Q^*} = \dfrac{\sum_{i=1}^{k-1} (1 - P(g_r|O_i))}{k-1}$
We neglected the condition where the real goal is only one grid point away from the start node, setting $k \ge 2$. Therefore, for each node $n_i$ the observed passes through, there exists the observation sequence $O_i = \langle n_0, \ldots, n_i \rangle$.
We employed the count-based method to encourage the observed to explore unknown states. Besides the current coordinate of the observed, its state includes the length of the path it has already planned (i.e., the time cost already spent). The count-based method can guide the observed to explore state–action pairs with higher uncertainty to confirm their high rewards. The uncertainty of the observed relative to its environment can be measured by $\delta / \sqrt{N(s, a)}$ [34], where $\delta$ is a constant and $N(s, a)$ represents the number of times the state–action pair $(s, a)$ has been visited. Specifically, $\delta / \sqrt{N(s, a)}$ is set as an additional reward used to train the observed:
$r_{add\_Q}(s, a) = \delta / \sqrt{N(s, a)}$
Intuitively, if the observed has visited a state–action pair $(s, a)$ less frequently (i.e., $N(s, a)$ is smaller), the corresponding additional reward will be larger; thus, it should be more inclined to visit this state–action pair to confirm whether it has a high reward. Unlike the classical reinforcement learning approach to path-planning problems, the observed is not allowed to collide with obstacles at all, rather than merely being penalized for doing so. The rules for the observed to receive rewards are as follows:
  • If the shortest time the observed needs to reach the real goal exceeds the remaining time, it cannot arrive at the real goal within the time constraint. This scenario is denoted as $Condition\ A$, with a reward of −9 given.
  • The observed successfully reaches the real goal, which is denoted as C o n d i t i o n   B , with a reward of +100 given.
  • Typically, the observed receives the deceptiveness of each grid point it traverses.
Specifically, the observed receives exploration rewards to encourage it to explore unknown state–action pairs.
Overall, the reward function is shown as follows:
$r_Q(s, a) = \begin{cases} -9 + r_{add\_Q}(s, a), & \text{Condition A} \\ +100 + r_{add\_Q}(s, a), & \text{Condition B} \\ P_Q^* + r_{add\_Q}(s, a), & \text{else} \end{cases}$
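A hedged sketch of this reward rule is given below. It reuses the pgrm_posterior and optimal_cost sketches from earlier sections; the value of δ and the dictionary-based counter are illustrative choices, while the −9/+100 magnitudes and the ordering of the conditions follow the rules above.

```python
import math

def reward_q(state, counts, sa_key, domain, start, goals, real_goal,
             obs_seq, time_constraint, delta=0.1):
    # state = (x, y, n_straight, n_diagonal) reached after the move;
    # counts[sa_key] is N(s, a) for the state-action pair that produced it.
    x, y, n_straight, n_diagonal = state
    bonus = delta / math.sqrt(max(counts[sa_key], 1))         # count-based exploration bonus
    remaining = time_constraint - (n_straight + math.sqrt(2) * n_diagonal)
    if remaining < optimal_cost(domain, (x, y), real_goal):   # Condition A: goal unreachable in time
        return -9 + bonus
    if (x, y) == real_goal:                                   # Condition B: real goal reached
        return 100 + bonus
    deceptiveness = 1.0 - pgrm_posterior(domain, start, goals, obs_seq)[real_goal]
    return deceptiveness + bonus                              # otherwise: P_Q* of the current grid
```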
Our experiments selected a simple scenario where the number of false goals is only one. Under the condition that there is only one false goal, Algorithm 1 provides a detailed explanation of DPP_Q.
Firstly, it reads information about the DPP problem and initializes the Q-table and the counting matrix $N$ (lines 1 to 2). $D$ is a $2 \times 10 \times 10$ matrix, where $D[0]$ and $D[1]$, respectively, store the shortest path lengths (equivalent to the optimal time cost) from each grid point in the discrete path-planning domain to the real goal and the false goal. Secondly, the Q-table is directly modified, with a highly negative value (−1000) assigned to state–action pairs that take the observed out of the map (line 3). The illegal actions of the agent are pruned (lines 8 to 12). Subsequently, the observed selects an action according to the $\varepsilon$-greedy policy, and the state–action pairs are counted (lines 14 to 18). Then, the observed receives a reward (line 19). Following that, the Q-table and the state of the observed are updated (lines 21 to 23). Finally, the termination condition is checked (lines 25 to 27). There are two scenarios leading to the termination of an episode: one is when the shortest time the observed needs to reach the real goal exceeds the remaining time, indicating that the observed cannot reach the real goal within the time constraint, i.e., $TC - (s[2] + \sqrt{2} \times s[3]) < D[0, s[0], s[1]]$; the other is when the current coordinate of the observed coincides with the coordinate of the real goal, i.e., $s_{coord} = g_r$.
Algorithm 1 DPP_Q
Require: A DPP problem with a 10 × 10 grid map.
Parameter: Learning rate α, discount factor γ, epsilon-greedy parameter ε.
1: Initialize distance matrix D, time constraint TC, collection of obstacles Wall_coord, start node s₀, real goal g_r, false goal g_f of the DPP problem.
2: Initialize Q-table and N-table with zeros.
3: Set Q(s_i, a_i) = −1000 for all illegal (s_i, a_i) resulting in the agent being out of the map.
4: for episode in Episodes do
5:     Initialize s
6:     while True:
7:         /* Collision Detection */
8:         for a in Actions:
9:             if s + a in Wall_coord:
10:                Q(s, a) = −1000
11:            end if
12:        end for
13:        /* Action Selection */
14:        Choose an action a satisfying Q(s, a) ≠ −1000:
15:            - Select the action with the maximum Q-value with probability 1 − ε
16:            - Select a random action with probability ε
17:        /* Update Counts & Give Rewards */
18:        N(s, a) = N(s, a) + 1
19:        Calculate r_Q(s, a)
20:        /* Update Q-table */
21:        s′ = s + a
22:        Q(s, a) = (1 − α)Q(s, a) + α(r_Q + γ max_a′ Q(s′, a′))
23:        s = s′
24:        /* Check for Termination Condition */
25:        if TC − (s[2] + √2 × s[3]) < D[0, s[0], s[1]] or s_coord = g_r:
26:            break
27:        end if
28:    end while
29: end for
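For readers who prefer executable code, the skeleton below mirrors Algorithm 1, reusing the GridDomain, ACTIONS, optimal_cost and reward_q sketches above. The dictionary-based Q-table and the default arguments are illustrative; the paper stores Q and N as tables over the four-dimensional state space and reports its hyperparameters in Section 6.

```python
import math
import random
from collections import defaultdict

import numpy as np

def dpp_q(domain, start, real_goal, false_goal, time_constraint,
          episodes=100_000, alpha=0.999, gamma=0.999, eps=0.1):
    Q = defaultdict(float)            # Q-table keyed by (state, action index)
    N = defaultdict(int)              # visit counts keyed by (state, action index)
    free = domain.nodes()             # free, in-bounds cells
    goals = [real_goal, false_goal]

    def legal(state, i):
        # pruning of Algorithm 1: never select actions that leave the map or hit a wall
        return (state[0] + int(ACTIONS[i][0]), state[1] + int(ACTIONS[i][1])) in free

    for _ in range(episodes):
        state = (start[0], start[1], 0, 0)
        obs_seq = [start]
        while True:
            candidates = [i for i in range(len(ACTIONS)) if legal(state, i)]
            a = (random.choice(candidates) if random.random() < eps
                 else max(candidates, key=lambda i: Q[(state, i)]))    # epsilon-greedy
            N[(state, a)] += 1
            nxt = tuple(int(v) for v in np.add(state, ACTIONS[a]))
            obs_seq.append((nxt[0], nxt[1]))
            r = reward_q(nxt, N, (state, a), domain, start, goals, real_goal,
                         obs_seq, time_constraint)
            best_next = max((Q[(nxt, i)] for i in range(len(ACTIONS)) if legal(nxt, i)),
                            default=0.0)
            Q[(state, a)] = (1 - alpha) * Q[(state, a)] + alpha * (r + gamma * best_next)
            state = nxt
            remaining = time_constraint - (state[2] + math.sqrt(2) * state[3])
            if remaining < optimal_cost(domain, (state[0], state[1]), real_goal) \
                    or (state[0], state[1]) == real_goal:              # termination check
                break
    return Q
```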

4.3. Deceptive Path Planning (DPP) via DPP_PPO in Continuous Domains

Inspired by DPP_Q, we further proposed DPP_PPO, which can solve DPP problems in continuous path-planning domains. In our experimental design, the difference between continuous and discrete path-planning domains lies in whether the state space and action space of the observed are continuous or discrete, while the framework of the map remains unchanged. Similar to DPP_Q, the calculation of the reward function depends on the discretized grid partition.

4.3.1. The State Space

The state space of the observed is defined by its position, velocity angle, and remaining time. Such a simple setup reduces the dimensionality of the neural network, which is beneficial for neural network training. The state of the observed is shown as follows:
S = { p , ρ , T }
where p = ( x , y ) represents the position vector of the observed, ρ denotes the current velocity angle, and T indicates the remaining time.

4.3.2. The Action Space

We set the action space as a continuous variable. It is simply defined as follows:
$Actions = \{\rho\}, \quad \rho \in [-\pi/4,\ \pi/4]$
where ρ represents the change in velocity angle relative to that of the previous step. We set the magnitude of the velocity | v | to be constant; thus, there is no need to explicitly represent it in the state space or action space. The direction of velocity changes while its magnitude remains constant.
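The sketch below shows the motion model this implies: the action is a bounded change of the velocity angle, the speed is held constant, and the position is integrated over one time step. The speed and time step follow the settings reported in Section 6.3; the explicit Euler integration is an assumption.

```python
import math

def step(position, rho, d_rho, speed=1.0, dt=0.1):
    # Clip the requested angle change to the action bounds, rotate the constant-
    # magnitude velocity, and advance the position by one time step.
    d_rho = max(-math.pi / 4, min(math.pi / 4, d_rho))
    rho = rho + d_rho
    x, y = position
    return (x + speed * dt * math.cos(rho), y + speed * dt * math.sin(rho)), rho
```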

4.3.3. The Reward Function

In continuous domains, the observed was set to receive a reward every $n$ time steps. To design the reward function, we first discretized the continuous map into grids with a certain granularity, then calculated the posterior probability $P(G|o_{|O|})$ for each grid based on AGRM, and set $P_{Q_{o_{|O|}}}^* = 1 - P(g_r|o_{|O|})$ as the base reward, which denotes the deceptiveness at the current grid point obtained from this calculation. The objective is to maximize the average deceptiveness $\overline{P_{Q_{o_{|O|}}}^*}$ of the path. Denoting the path planned by the observed as $\pi = \langle n_0, n_1, \ldots, n_k \rangle$, the calculation of $\overline{P_{Q_{o_{|O|}}}^*}$ is as follows:
$\overline{P_{Q_{o_{|O|}}}^*} = \dfrac{\sum_{i=1}^{k-1} (1 - P(g_r|n_i))}{k-1}$
We also set $k \ge 2$.
In other words, once the map is determined (including the distribution of obstacles, the coordinate of real and false goals, and the granularity of grid partition), the reward of each grid is also determined, which has been calculated before training. During training, the deceptiveness of the grid point mapped to by the current coordinate will serve as the base reward.
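A sketch of this precomputation is shown below, reusing the agrm_posterior sketch from Section 4.1; the unit-grid mapping from a continuous position to its cell is an illustrative choice.

```python
def precompute_deceptiveness(domain, start, goals, real_goal, beta=1.0):
    # For every free grid cell, store 1 - P(g_r | o_|O|) computed by AGRM once,
    # before training; the table is then looked up during training.
    return {cell: 1.0 - agrm_posterior(domain, start, goals, cell, beta)[real_goal]
            for cell in domain.nodes()}

def position_to_cell(x, y):
    # Map a continuous position to its grid cell (assuming unit-sized cells).
    return (int(x), int(y))
```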
In continuous domains, although pseudo-counts can be used to evaluate the frequency of state occurrences by designing a density model, this requires a significant amount of learning cost. Therefore, dividing the state–action pairs of the observed into intervals, we adopted the discretized counting method in DPP_PPO. The additional reward is defined as follows:
$r_{add\_Q_{o_{|O|}}}(s, a) = \delta / \sqrt{N(Div(s, a))}$
where D i v is a function that maps continuous state–action pairs to the discrete space. The rules for the observed to receive rewards are as follows:
  • If the shortest time the observed needs to reach the real goal exceeds the remaining time, it cannot arrive at the real goal within the specific time constraint; in addition, the observed may collide with obstacles. Both scenarios are denoted as $Condition\ A$, with a reward of −9 given.
  • The observed successfully reaches the real goal, which is denoted as C o n d i t i o n   B , with a reward of +100 given.
  • Typically, the observed receives the deceptiveness of each grid point it traverses.
Specifically, the observed receives exploration rewards to encourage it to explore unknown state–action pairs.
Overall, the reward function is defined as follows:
$r_{Q_{o_{|O|}}}(s, a) = \begin{cases} -9 + r_{add\_Q_{o_{|O|}}}(s, a), & \text{Condition A} \\ +100 + r_{add\_Q_{o_{|O|}}}(s, a), & \text{Condition B} \\ P_{Q_{o_{|O|}}}^* + r_{add\_Q_{o_{|O|}}}(s, a), & \text{else} \end{cases}$
Unlike DPP_Q, DPP_PPO cannot effectively prune illegal actions in continuous domains. The observed is therefore allowed to collide with obstacles, but it will receive corresponding penalties.
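The sketch below combines these pieces into a reward routine for DPP_PPO: Div() bins the continuous state–action pair, N counts visits to each bin, and the base reward is the precomputed grid deceptiveness. The bin widths, δ and the collision handling are illustrative; the −9/+100 magnitudes and the condition ordering follow the rules above, and optimal_cost, position_to_cell and the deceptiveness table are the earlier sketches.

```python
import math

def div(state, action, xy_bin=0.5, angle_bin=math.pi / 8):
    # Map a continuous state-action pair to a discrete key (illustrative bin widths).
    (x, y), rho, t_remaining = state
    return (round(x / xy_bin), round(y / xy_bin),
            round(rho / angle_bin), round(t_remaining), round(action / angle_bin))

def reward_ppo(state, action, counts, deceptiveness, domain, real_goal,
               collided, delta=0.1):
    key = div(state, action)
    counts[key] = counts.get(key, 0) + 1
    bonus = delta / math.sqrt(counts[key])                    # discretized count bonus
    (x, y), _, t_remaining = state
    cell = position_to_cell(x, y)
    # Condition A: collision, or the real goal is no longer reachable in time
    # (speed is 1, so the optimal cost equals the shortest time needed).
    if collided or t_remaining < optimal_cost(domain, cell, real_goal):
        return -9 + bonus
    if math.dist((x, y), real_goal) < 1:                      # Condition B: real goal reached
        return 100 + bonus
    return deceptiveness[cell] + bonus                        # base reward: grid deceptiveness
```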

5. Experiments

We divided the experiments into three parts. Firstly, we tested the performance of DPP_Q on 100 random 10 × 10 grid maps. Secondly, we conducted statistical experiments to verify if there is a significant difference in the results of DPP_Q when using reward functions based on PGRM and AGRM. Lastly, we tested the performance of DPP_PPO on a set of four maps.
Specifically, the experiments only consider the scenario with a single false goal.

5.1. Experiment 1: Performance Evaluation of DPP_Q

Considering the need to collect a sufficient amount of experimental data to ensure the credibility of the performance testing results for DPP_Q, we conducted experiments based on 100 10 × 10 grid maps. In each map, we generated 4~9 randomly distributed obstacle grids, one start node randomly placed in the lower center area, one real goal randomly placed in the upper left corner area, and one false goal randomly placed in the upper right corner area. The reason for this design is to highlight the specificity of DPP problems. This is because the special nature of DPP problems can only be reflected when there is a certain distance between the start node, the real goal, and the false goal. Separating the three elements also helps to make the planned paths more observable. In a scenario where the false goal is very close to the real goal, the grids covered by the deceptive path would have only a small difference in posterior probabilities between the two goals, making the demonstration of DPP problems less evident.
We selected four traditional methods as competing baselines to compare the performance of DPP_Q.
π d 1 [9]: This is the simplest simulation method, which first takes an optimal path towards the false goal. This strategy generates the path π d 1 = s ,   ,   g f ,   ,   g r . This achieves a strongly deceptive path but the path cost is likely to be high.
π d 2 [10]: This is a basic DPP method of dissimulation, aiming to seek an ambiguous path. This strategy takes an optimal path direct from s to LDP, then on to g r , which generates the cheapest path that can pass through LDP.
π d 3 [10]: π d 3 is improved based on π d 2 . A path π d 3 can be assembled using a modified heuristic so that, while still targeting LDP, whenever there is a choice of routes, it favours the false goal, increasing its likelihood of remaining deceptive.
π d 4 [10]: There is a precalculated “heatmap” of probabilities used to prune nodes with low deceptiveness in π d 4 . The average deceptiveness of π d 4 is generally higher than π d 2 and π d 3 , but still not as high as π d 1 for the same DPP problems most of the time.
We conducted research on DPP problems under specific time constraints to test the performance of DPP_Q. The experimental design, i.e., the process of the DPP_Q experiments, is illustrated in Figure 2. Initially, the time constraint was computed according to the baseline methods $\pi_{d1}$~$\pi_{d4}$, based on which the observed was trained by PGRM-based DPP_Q. We aim to compare the average deceptiveness of paths generated by PGRM-based DPP_Q with those generated by $\pi_{d1}$~$\pi_{d4}$, and this comparison is only meaningful when the time constraints are the same. The time constraint can be arbitrary for DPP_Q but not for $\pi_{d1}$~$\pi_{d4}$. We decided whether to conduct paired t-tests or Wilcoxon signed-rank tests according to the results of normality tests and homogeneity-of-variance tests. After that, we further conducted a comprehensive analysis of the data, including measures such as means, medians, and other statistical parameters.
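As an illustration of this test-selection step, the sketch below checks the paired differences for normality with a Shapiro–Wilk test and then applies either a paired t-test or a Wilcoxon signed-rank test via scipy; the 0.05 threshold is an assumption, and the homogeneity-of-variance check is omitted for brevity.

```python
from scipy import stats

def compare_average_deceptiveness(dpp_q_scores, baseline_scores, threshold=0.05):
    # Paired comparison of average path deceptiveness on the same DPP problems.
    differences = [a - b for a, b in zip(dpp_q_scores, baseline_scores)]
    if stats.shapiro(differences).pvalue > threshold:          # differences look normal
        name, result = "paired t-test", stats.ttest_rel(dpp_q_scores, baseline_scores)
    else:                                                      # fall back to a non-parametric test
        name, result = "Wilcoxon signed-rank test", stats.wilcoxon(dpp_q_scores, baseline_scores)
    return name, result.pvalue
```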

5.2. Experiment 2: Significance Testing for PGRM and AGRM

This experimental design is illustrated in Figure 3. For each of the 100 maps used in Experiment 1, four kinds of time constraints were generated based on $\pi_{d1}$~$\pi_{d4}$, corresponding to four DPP problems under specific time constraints. Therefore, DPP_Q solved these 400 DPP problems under specific time constraints in total, with two reward functions based on PGRM and AGRM. After the average deceptiveness values of the two deceptive paths (denoted as $\overline{P_Q^*}$ and $\overline{P_{Q_{o_{|O|}}}^*}$, respectively) for all DPP problems were collected into two arrays, we conducted the significance testing.

5.3. Experiment 3: Performance Experiments of DPP_PPO

In Experiment 3, we evaluated the performance of DPP_PPO using the maps numbered 1 to 4 in Experiment 1. Each map was partitioned into 50 × 50 grids (the deceptiveness of every grid is computed by AGRM based on this partition). For experimental convenience, we set the coordinates of obstacles, start nodes, and goal nodes to be integers.
We selected the random walk observed with the aim of achieving the real goal as the baseline to compare the performance of DPP_PPO.

6. Results

This section is divided into three parts, in which we analyze the three sets of experimental results. We denoted the deceptive paths generated by PGRM-based DPP_Q as $\pi_Q$, those generated by AGRM-based DPP_Q as $\pi_{Q_{o_{|O|}}}$, and those generated by the baseline methods as $\pi_{d1}$~$\pi_{d4}$.

6.1. Results of Experiment 1

Figure 4 visualizes the paths generated by PGRM-based DPP_Q and $\pi_{d1}$~$\pi_{d4}$ on the 10 × 10 grid maps numbered from 1 to 5 (i.e., a total of twenty DPP problems under specific time constraints), in which the green and red arrows represent the paths generated by DPP_Q and $\pi_{d1}$~$\pi_{d4}$, respectively. The label “map” indicates the map number in Experiment 1.
Figure 5a–d, respectively, depict the comparison of the average deceptiveness of paths generated by DPP_Q and $\pi_{d1}$~$\pi_{d4}$, with the proportional function y = x (red dashed line) as reference. Every blue point represents a comparison between the two solutions to a DPP problem. A point above the line indicates that the result of DPP_Q is superior to that of $\pi_{d1}$~$\pi_{d4}$. The red points represent the average deceptiveness of the paths on the maps numbered 1~5, shown in Figure 4. Figure 5e–h are box plots of the data in Figure 5a–d. It can be observed that DPP_Q generally outperforms $\pi_{d1}$~$\pi_{d4}$, although it is not consistently better on every DPP problem, which is especially evident against $\pi_{d4}$.
Table 1 demonstrates there is a significant difference in the average deceptiveness of paths generated by DPP_Q and π d 1 ~ π d 4 .   Table 2 displays the median, first quartile (Q1), third quartile (Q3), mean, and variance of the data. DPP_Q shows a significant improvement in the average deceptiveness of paths compared to π d 1 ~ π d 4 .
We trained for 100,000 episodes on each DPP problem. During the training process, we focused on the current state of the observed, so we set the learning rate $\alpha = 0.999$. Given that we need to consider the deceptiveness at every grid point along the entire path, we set a large discount factor $\gamma = 0.999$, which ensures that the observed did not overly discount rewards obtained later along the path. The value of epsilon ($\varepsilon$) for the epsilon-greedy method was set to 0.1.

6.2. Results of Experiment 2

Figure 6a displays the comparison of the average deceptiveness of paths generated by PGRM-based DPP_Q (paths denoted as $\pi_Q$) and AGRM-based DPP_Q (paths denoted as $\pi_{Q_{o_{|O|}}}$), with the function y = x as a reference. Every blue point represents a comparison between the two solutions to a DPP problem. Most of the points fall near the red dashed line. Figure 6b shows the box plot of the data in Figure 6a. Intuitively, the two sets of data are essentially the same.
There is no significant difference between the results of DPP_Q when the reward function is based on PGRM and when it is based on AGRM (p = 0.907 > 0.05), which supports extending the approach to continuous-domain maps.

6.3. Results of Experiment 3

We set the velocity $|v| = 1$ and the time step $t = 0.1$ s. The time constraint for every DPP problem was set to 55 s. The observed receives a reward every $10t = 1$ s. While the observed could further optimize the average deceptiveness of paths through additional training, 10,000 iterations suffice to demonstrate the effectiveness of DPP_PPO.
For the observed performing a random walk to the real goal, we eliminated the deceptiveness on each grid node, retaining only the reward for reaching the real goal.
Specifically, in continuous domains, the observed was considered to have reached the real goal when the distance from its current position to the coordinate of the real goal is less than 1.
Figure 7a,c,e,g present the comparison results of DPP problems for maps numbered from 1 to 4 in Experiment 1, in which the blue cross represents the real goal, the teal cross represents the false goal, and the orange cross represents the start node. Green lines represent the deceptive path generated by DPP_PPO, while red lines represent that of the random walk. Figure 7b,d,f,h, respectively, show the changes in the posterior probability of the real goal on maps 1 to 4 as the completion of the paths varies. DPP_PPO significantly improves the average deceptiveness of paths.

7. Discussion

In this paper, we first proposed DPP_Q, a DPP method based on count-based Q-learning in discrete path-planning domains, and compared it with the traditional methods $\pi_{d1}$~$\pi_{d4}$. DPP_Q is aimed at solving DPP problems under specific time constraints. In contrast, $\pi_{d1}$~$\pi_{d4}$ generate paths based on “dissimulation”—hiding the truth—without considering any time constraints. The experimental results demonstrate that DPP_Q not only effectively addresses DPP problems under specific time constraints but also improves the average deceptiveness of the deceptive paths compared to those generated by $\pi_{d1}$~$\pi_{d4}$ when utilizing the costs of $\pi_{d1}$~$\pi_{d4}$ as the time constraints. Therefore, DPP_Q offers an improvement over $\pi_{d1}$~$\pi_{d4}$.
Furthermore, to extend DPP_Q to continuous domains, we proposed AGRM based on PGRM and validated the reward functions based on both models in the discrete path-planning domain using DPP_Q. Our findings indicate no significant difference in the solutions to DPP problems between PGRM-based and AGRM-based DPP_Q. This not only facilitates the extension of DPP_Q to continuous domains but also provides evidence for the Markov property of DPP problems; in other words, the deception of the observed depends only on its current state and is independent of its past states. Additionally, even without considering the extension of DPP_Q to continuous domains, this research is meaningful. We drew the statistically supported conclusion that the experimental results of PGRM-based DPP_Q and AGRM-based DPP_Q show no significant difference in this respect. Therefore, in adversarial environments, the observed agents do not need to know whether they are being observed in real time. When they become aware of being observed, they do not need to be concerned about how many traces they left previously. Of course, this assumes that the goal recognition method of the observer is based on the theory of plan recognition as planning.
Finally, we proposed DPP_PPO, a DPP method based on count-based proximal policy optimization in continuous domains, and conducted preliminary tests to demonstrate its feasibility.

8. Conclusions

In general, this paper proposed two innovative methods, namely, DPP_Q (a DPP method based on count-based Q-learning under specific time constraints in discrete path-planning domains) and DPP_PPO (a DPP method based on count-based proximal policy optimization under specific time constraints in continuous path-planning domains). Both DPP_Q and DPP_PPO demonstrate good applicability in path-planning domains with uncomplicated obstacles.
For future work, we provide the following outlook. Firstly, modeling the goal recognition method of the observer is challenging. PGRM and AGRM are just two assumed models of the observer and are likely to be inaccurate. Since we cannot determine the actual recognition method of the observer, it is necessary to incorporate adversarial factors in future work. For example, the observed agents could promptly perceive the recognition results of the observer, or the actions taken by the observer after recognition, and then adjust their planning accordingly to further deceive, making the planning methods more universal. Secondly, for each DPP problem based on a 10 × 10 grid map, using DPP_Q, the agent needs to train for fewer than 100,000 episodes to achieve comparable or even better results than $\pi_{d1}$~$\pi_{d4}$. However, DPP_Q tends to get stuck in local optima for some DPP problems, performing worse than $\pi_{d1}$~$\pi_{d4}$. Through experiments, we have empirically found that this issue can be partially addressed by increasing the number of training iterations (up to 3,000,000), but this is time-consuming. In future research, we plan to test DPP_Q and DPP_PPO on larger maps with more diverse obstacle distributions and a greater number of false goals, which aims to assess the feasibility of addressing more complex DPP problems and to further optimize these methods. We think that, in the future, using imitation learning or inverse reinforcement learning and assigning higher rewards to some specific grids could help the observed escape local optima, which also applies to DPP_PPO.

Author Contributions

Conceptualization, D.C. and K.X.; methodology, D.C. and Y.Z. (Yunxiu Zeng); validation, D.C. and Y.Z. (Yi Zhang); formal analysis, D.C.; investigation, S.L.; data curation, D.C.; writing—original draft preparation, D.C.; writing—review and editing, D.C. and Y.Z. (Yi Zhang); supervision, K.X. and Q.Y.; funding acquisition, K.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of China (grant number 62103420).

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

We thank Peta Masters and Sebastian Sardina of RMIT University for their support with the open-source code. Deceptive path-planning algorithms can be found at GitHub—ssardina-planning/p4-simulator: Python Path Planning Project (P4).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Alloway, T.P.; McCallum, F.; Alloway, R.G.; Hoicka, E. Liar, liar, working memory on fire: Investigating the role of working memory in childhood verbal deception. J. Exp. Child Psychol. 2015, 137, 30–38. [Google Scholar] [CrossRef] [PubMed]
  2. Greenberg, I. The effect of deception on optimal decisions. Oper. Res. Lett. 1982, 1, 144–147. [Google Scholar] [CrossRef]
  3. Matsubara, S.; Yokoo, M. Negotiations with inaccurate payoff values. In Proceedings of the International Conference on Multi Agent Systems (Cat. No. 98EX160), Paris, France, 3–7 July 1998; pp. 449–450. [Google Scholar]
  4. Shieh, E.; An, B.; Yang, R.; Tambe, M.; Baldwin, C.; DiRenzo, J.; Maule, B.; Meyer, G. Protect: A deployed game theoretic system to protect the ports of the United States. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 1, Valencia, Spain, 4–8 June 2012; pp. 13–20. [Google Scholar]
  5. Geib, C.W.; Goldman, R.P. Plan recognition in intrusion detection systems. In Proceedings of the DARPA Information Survivability Conference and Exposition II, DISCEX’01, Anaheim, CA, USA, 12–14 June 2001; pp. 46–55. [Google Scholar]
  6. Kitano, H.; Asada, M.; Kuniyoshi, Y.; Noda, I.; Osawa, E. Robocup: The robot world cup initiative. In Proceedings of the First International Conference on Autonomous Agents, Marina del Rey, CA, USA, 5–8 February 1997; pp. 340–347. [Google Scholar]
  7. Keren, S.; Gal, A.; Karpas, E. Privacy Preserving Plans in Partially Observable Environments. In Proceedings of the IJCAI, New York, NY, USA, 9–15 July 2016; pp. 3170–3176. [Google Scholar]
  8. Masters, P.; Sardina, S. Cost-based goal recognition for path-planning. In Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, Sao Paulo, Brazil, 8–12 May 2017; pp. 750–758. [Google Scholar]
  9. Keren, S.; Gal, A.; Karpas, E. Goal recognition design for non-optimal agents. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015. [Google Scholar]
  10. Masters, P.; Sardina, S. Deceptive Path-Planning. In Proceedings of the IJCAI, Melbourne, Australia, 19–25 August 2017; pp. 4368–4375. [Google Scholar]
  11. Xu, K.; Zeng, Y.; Qin, L.; Yin, Q. Single real goal, magnitude-based deceptive path-planning. Entropy 2020, 22, 88. [Google Scholar] [CrossRef] [PubMed]
  12. Avrahami-Zilberbrand, D.; Kaminka, G.A. Incorporating observer biases in keyhole plan recognition (efficiently!). In Proceedings of the AAAI, Palo Alto, CA, USA, 26–28 March 2007; pp. 944–949. [Google Scholar]
  13. Cohen, P.R.; Perrault, C.R.; Allen, J.F. Beyond question answering. In Strategies for Natural Language Processing; Psychology Press: East Sussex, UK, 2014; pp. 245–274. [Google Scholar]
  14. Albrecht, D.W.; Zukerman, I.; Nicholson, A.E. Bayesian models for keyhole plan recognition in an adventure game. User Model. User-Adapt. Interact. 1998, 8, 5–47. [Google Scholar] [CrossRef]
  15. Kaminka, G.A.; Pynadath, D.V.; Tambe, M. Monitoring teams by overhearing: A multi-agent plan-recognition approach. J. Artif. Intell. Res. 2002, 17, 83–135. [Google Scholar] [CrossRef]
  16. Braynov, S. Adversarial planning and plan recognition: Two sides of the same coin. In Proceedings of the Secure Knowledge Management Workshop, Brooklyn, NY, USA, 28–29 September 2006; pp. 67–70. [Google Scholar]
  17. Xu, K.; Yin, Q. Goal Identification Control Using an Information Entropy-Based Goal Uncertainty Metric. Entropy 2019, 21, 299. [Google Scholar] [CrossRef] [PubMed]
  18. Masters, P.; Vered, M. What’s the context? implicit and explicit assumptions in model-based goal recognition. In Proceedings of the International Joint Conference on Artificial Intelligence 2021, Montreal, QC, Canada, 19–27 August 2021; pp. 4516–4523. [Google Scholar]
  19. Ramírez, M.; Geffner, H. Goal recognition over POMDPs: Inferring the intention of a POMDP agent. In Proceedings of the IJCAI, Barcelona, Spain, 16–22 July 2011; pp. 2009–2014. [Google Scholar]
  20. Charniak, E.; Goldman, R.P. Probabilistic Abduction for Plan Recognition; Brown University, Department of Computer Science: Providence, RI, USA, 1991. [Google Scholar]
  21. Bui, H.H. A general model for online probabilistic plan recognition. In Proceedings of the IJCAI, Acapulco, Mexico, 9–15 August 2003; pp. 1309–1315. [Google Scholar]
  22. Geib, C.W.; Goldman, R.P. A probabilistic plan recognition algorithm based on plan tree grammars. Artif. Intell. 2009, 173, 1101–1132. [Google Scholar] [CrossRef]
  23. Ramírez, M.; Geffner, H. Plan recognition as planning. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, CA, USA, 11–17 July 2009; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2009; pp. 1778–1783. [Google Scholar]
  24. Maithripala, D.H.A.; Jayasuriya, S. Radar deception through phantom track generation. In Proceedings of the American Control Conference, Portland, OR, USA, 8–10 June 2005; pp. 4102–4106. [Google Scholar]
  25. Hajieghrary, H.; Jayasuriya, S. Guaranteed consensus in radar deception with a phantom track. In Proceedings of the Dynamic Systems and Control Conference, Palo Alto, CA, USA, 21–23 October 2013; p. V002T020A005. [Google Scholar]
  26. Lee, I.-H.; Bang, H. Optimal phantom track generation for multiple electronic combat air vehicles. In Proceedings of the 2008 International Conference on Control, Automation and Systems, Seoul, Republic of Korea, 14–17 October 2008; pp. 29–33. [Google Scholar]
  27. Hart, P.E.; Nilsson, N.J.; Raphael, B. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 100–107. [Google Scholar] [CrossRef]
  28. Cai, Z.; Ju, R.; Zeng, Y.; Xie, X. Deceptive Path Planning in Dynamic Environment. In Proceedings of the 2020 3rd International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE), Shenzhen, China, 24–26 April 2020; pp. 203–207. [Google Scholar]
  29. Liu, Z.; Yang, Y.; Miller, T.; Masters, P. Deceptive reinforcement learning for privacy-preserving planning. arXiv 2021, arXiv:2102.03022. [Google Scholar]
  30. Lewis, A.; Miller, T. Deceptive reinforcement learning in model-free domains. In Proceedings of the International Conference on Automated Planning and Scheduling, Prague, Czech Republic, 8–13 July 2023; pp. 587–595. [Google Scholar]
  31. Harlow, H.F. Learning and satiation of response in intrinsically motivated complex puzzle performance by monkeys. J. Comp. Physiol. Psychol. 1950, 43, 289. [Google Scholar] [CrossRef] [PubMed]
  32. Barto, A.; Mirolli, M.; Baldassarre, G. Novelty or surprise? Front. Psychol. 2013, 4, 61898. [Google Scholar] [CrossRef] [PubMed]
  33. Lai, T.L.; Robbins, H. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 1985, 6, 4–22. [Google Scholar] [CrossRef]
  34. Strehl, A.L.; Littman, M.L. An analysis of model-based interval estimation for Markov decision processes. J. Comput. Syst. Sci. 2008, 74, 1309–1331. [Google Scholar] [CrossRef]
  35. Bellemare, M.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; Munos, R. Unifying count-based exploration and intrinsic motivation. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar] [CrossRef]
  36. Ostrovski, G.; Bellemare, M.G.; Oord, A.; Munos, R. Count-based exploration with neural density models. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2721–2730. [Google Scholar]
  37. Van den Oord, A.; Kalchbrenner, N.; Espeholt, L.; Vinyals, O.; Graves, A. Conditional image generation with pixelcnn decoders. Adv. Neural Inf. Process. Syst. 2016, 29. [Google Scholar] [CrossRef]
  38. Tang, H.; Houthooft, R.; Foote, D.; Stooke, A.; Chen, X.; Duan, Y.; Schulman, J.; De Turck, F.; Abbeel, P. A study of count-based exploration for deep reinforcement learning. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4–9. [Google Scholar]
Figure 1. The key elements of PGRM and AGRM.
Figure 2. The process of DPP_Q experiments.
Figure 3. The process of PGRM and AGRM experiments.
Figure 4. Comparison of deceptive paths generated by DPP_Q and π_d1~π_d4.
Figure 5. Comparison of the average deceptiveness (a–d) and box plots (e–h) for paths generated by DPP_Q and π_d1~π_d4.
Figure 6. Significance testing for PGRM-based and AGRM-based DPP_Q.
Figure 7. Visualization and overall deceptiveness comparison between paths generated by DPP_PPO and Random Agent.
Table 1. Significance tests of average deceptiveness of paths generated by DPP_Q (π_Q) and π_d1~π_d4.

Paths | Normality Test (p-Value) | Homogeneity of Variance (p-Value) | Paired Samples t-Test (p-Value) | Wilcoxon Signed-Rank Test (p-Value)
π_d1 | 0.529 | 0.359 | 9.1 × 10^−23 | --
π_Q | 0.896 | | |
π_d2 | 0.094 | 0.125 | 3.3 × 10^−14 | --
π_Q | 0.066 | | |
π_d3 | 0.010 (not passed) | 0.745 | -- | 1.1 × 10^−8
π_Q | 0.324 | | |
π_d4 | 0.002 (not passed) | 0.706 | -- | 1.1 × 10^−5
π_Q | 0.317 | | |
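The table does not name the specific tests or software used for the normality and homogeneity checks. The following minimal Python sketch, assuming scipy.stats with Shapiro–Wilk and Levene tests and hypothetical per-map deceptiveness samples, illustrates the decision pipeline summarized in Table 1: a paired-samples t-test is applied when both samples pass the normality check, and the Wilcoxon signed-rank test otherwise.

```python
# Minimal sketch of the Table 1 test pipeline (hypothetical data, assumed tests).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pi_d = rng.normal(0.67, 0.04, size=30)  # placeholder: baseline policy, e.g. pi_d1
pi_q = rng.normal(0.73, 0.05, size=30)  # placeholder: DPP_Q on the same maps

# 1. Normality of each paired sample (Shapiro-Wilk, assumed choice of test).
normal_d = stats.shapiro(pi_d).pvalue > 0.05
normal_q = stats.shapiro(pi_q).pvalue > 0.05

# 2. Homogeneity of variance (Levene's test, assumed choice of test).
equal_var_p = stats.levene(pi_d, pi_q).pvalue

# 3. Parametric test if both samples look normal, otherwise Wilcoxon signed-rank.
if normal_d and normal_q:
    p = stats.ttest_rel(pi_q, pi_d).pvalue   # paired-samples t-test
else:
    p = stats.wilcoxon(pi_q, pi_d).pvalue    # Wilcoxon signed-rank test

print(f"homogeneity p = {equal_var_p:.3f}, difference significant: {p < 0.05} (p = {p:.2e})")
```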
Table 2. Improvements in average deceptiveness of paths generated by DPP_Q (π_Q) compared to π_d1~π_d4.

Paths | Median | Q1 | Q3 | Average | Std | Improvements/%
π_d1 | 0.671 | 0.644 | 0.701 | 0.672 | 0.043 | 8.33
π_Q | 0.729 | 0.696 | 0.759 | 0.728 | 0.051 |
π_d2 | 0.369 | 0.331 | 0.402 | 0.361 | 0.071 | 28.81
π_Q | 0.461 | 0.414 | 0.510 | 0.465 | 0.085 |
π_d3 | 0.409 | 0.379 | 0.470 | 0.428 | 0.087 | 9.11
π_Q | 0.467 | 0.420 | 0.521 | 0.467 | 0.085 |
π_d4 | 0.448 | 0.408 | 0.516 | 0.466 | 0.081 | 3.86
π_Q | 0.485 | 0.433 | 0.550 | 0.484 | 0.086 |
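As a sketch of how the Table 2 columns can be reproduced: the summary statistics are ordinary medians, quartiles, means, and sample standard deviations over per-map average-deceptiveness values, and each Improvements/% entry is the relative gain of the DPP_Q average over the corresponding baseline average. The sample array below is a hypothetical placeholder; only the averages passed to improvement() (0.672 and 0.728 from the π_d1 row) come from the table itself.

```python
# Sketch of the Table 2 summary statistics and the Improvements/% column.
import numpy as np

def summarize(samples):
    """Median, quartiles, mean, and sample std, i.e. the Table 2 columns."""
    return {
        "median": float(np.median(samples)),
        "Q1": float(np.quantile(samples, 0.25)),
        "Q3": float(np.quantile(samples, 0.75)),
        "average": float(np.mean(samples)),
        "std": float(np.std(samples, ddof=1)),
    }

def improvement(avg_dpp_q, avg_baseline):
    """Relative gain of DPP_Q over a baseline, as a percentage of the baseline average."""
    return (avg_dpp_q - avg_baseline) / avg_baseline * 100.0

rng = np.random.default_rng(0)
print(summarize(rng.normal(0.73, 0.05, size=30)))  # placeholder sample
print(round(improvement(0.728, 0.672), 2))          # 8.33, matching the pi_d1 row
```

The same formula reproduces the other three rows of the Improvements/% column (28.81%, 9.11%, and 3.86%).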
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
