Article

Exploring the Use of Invalid Action Masking in Reinforcement Learning: A Comparative Study of On-Policy and Off-Policy Algorithms in Real-Time Strategy Games

Yueqi Hou, Xiaolong Liang, Jiaqiang Zhang, Qisong Yang, Aiwu Yang and Ning Wang
1 Air Traffic Control and Navigation School, Air Force Engineering University, Xi’an 710051, China
2 Shaanxi Key Laboratory of Meta-Synthesis for Electronic and Information System, Air Force Engineering University, Xi’an 710051, China
3 Xi’an Research Institute of High-Technology, Xi’an 710051, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(14), 8283; https://doi.org/10.3390/app13148283
Submission received: 2 June 2023 / Revised: 13 July 2023 / Accepted: 17 July 2023 / Published: 18 July 2023
(This article belongs to the Section Robotics and Automation)

Abstract

Invalid action masking is a practical technique in deep reinforcement learning to prevent agents from taking invalid actions. Existing approaches rely on action masking during both policy training and policy utilization. This study focuses on developing reinforcement learning algorithms that incorporate action masking during training but can be used without action masking during policy execution. The study begins with a theoretical analysis that elucidates the distinction between the naive policy gradient and the invalid action policy gradient. Based on this analysis, we demonstrate that the naive policy gradient is a valid gradient and is equivalent to the proposed composite objective algorithm, which optimizes both the masked policy and the original policy in parallel. Moreover, we propose an off-policy algorithm for invalid action masking that employs the masked policy for sampling while optimizing the original policy. To compare the effectiveness of these algorithms, experiments are conducted using a simplified real-time strategy (RTS) game simulator called Gym-μRTS. Based on the empirical findings, we recommend the off-policy algorithm for addressing most tasks and the composite objective algorithm for handling more complex tasks.

1. Introduction

Reinforcement Learning (RL) has attracted tremendous attention due to its human-level performance in challenging decision-making tasks [1,2,3,4]. However, in more complex tasks, such as real-time strategy (RTS) games [5,6,7] and multiplayer online battle arena (MOBA) games [8,9,10], the action spaces are high-dimensional, and a subset of actions is usually invalid in specific states. For instance, executing a skill that is on cool-down or buying an item that costs more gold than the agent owns is not allowed. Action availability usually depends on the current state (e.g., whether the skill cool-down has finished or enough gold is available). It is impractical to design variable action spaces for different states [11]. Therefore, finding an effective way to prevent the agent from choosing invalid actions is significant [12,13,14].
A common approach, called invalid action penalty, assigns a negative reward when the agent takes an invalid action, so that the agent learns to avoid invalid actions by minimizing the expected penalty [15,16]. However, this technique cannot guarantee that the agent avoids invalid actions and hardly scales to complex tasks [17,18]. An alternative method, action space shaping, works by reducing the full discrete action space to a simpler one [19], typically through the removal of non-useful actions and the discretization of continuous action spaces. However, tuning action space shaping has been found to be challenging, and it can occasionally hinder the agent’s ability to accomplish the desired tasks [20]. Another popular technique is invalid action masking, which filters invalid actions by setting the probability of all invalid actions to (near) zero [7,8,9]. This technique ensures that the agent avoids selecting impossible or unavailable actions during training. In addition, it reduces the dimension of the exploration space and improves exploration efficiency. Invalid action masking has been widely applied to many domains besides RTS or MOBA games, such as traffic signal control [21], unmanned vehicles [22], and adaptive voltage control [12].
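A minimal sketch of how such a penalty might be wired into an environment loop is given below; the wrapper class and its `valid_action_mask()` helper are illustrative assumptions rather than the interface of any particular library.

```python
class InvalidActionPenaltyEnv:
    """Illustrative invalid-action-penalty wrapper. The wrapped `env` and its
    `valid_action_mask()` helper are assumptions made for this sketch."""

    def __init__(self, env, penalty=-0.1):
        self.env = env
        self.penalty = penalty

    def reset(self):
        return self.env.reset()

    def step(self, action):
        was_valid = self.env.valid_action_mask()[action]  # hypothetical helper
        obs, reward, done, info = self.env.step(action)
        if not was_valid:
            reward += self.penalty  # punish, but do not block, the invalid choice
        return obs, reward, done, info
```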
Huang et al. [23] provide the first closer look at invalid action masking, categorizing it into two types: the invalid action policy gradient (which uses the masked policy for gradient calculation) and the naive invalid action policy gradient (which uses the original policy for gradient calculation). Through a numerical example, they demonstrate that invalid action masking alters the policy gradient and that the invalid action policy gradient corresponds to a valid policy gradient. In addition, they note that employing the naive invalid action policy gradient may lead to instability. However, the empirical results in [23] show that the naive invalid action policy gradient can outperform the invalid action policy gradient when the masking is removed. This finding challenges the existing theoretical analysis in [23] and has inspired our interest in understanding the mechanism behind naive invalid action masking.
Inspired by the aforementioned analysis, this study provides a theoretical analysis that reveals why naive invalid action masking is an effective way to train the agent to explicitly avoid invalid actions. Specifically, regular invalid action masking replaces the logits of invalid actions with a large negative number [23], which we call logit-level masking. To derive the policy gradient, we present a novel masking scheme called action-level masking. We prove that the two types of masking are equivalent, so that conclusions derived for one also apply to the other. We use action-level masking to verify that invalid action masking indeed changes the policy gradient. More interestingly, we find that the difference between the naive invalid action policy gradient and the invalid action policy gradient is a state-dependent term that directly affects the probabilities associated with invalid actions. Based on this, we prove that the naive invalid action policy gradient (taking the masked policy as the behavior policy and using the original policy to calculate the gradient) is actually a valid gradient that concurrently optimizes the masked policy and the original policy. From this conclusion, we design a Composite Objective Invalid Action Masking (CO-IAM) algorithm that concurrently optimizes the masked policy and reduces the probabilities of invalid actions. Furthermore, we present on-policy and off-policy algorithms for invalid action masking, called On-PIAM and Off-PIAM, respectively. The On-PIAM algorithm optimizes the performance of the masked policy, the same objective as the invalid action policy gradient in [23]. Off-PIAM adopts the Importance Sampling (IS) technique and optimizes the original policy. The main contributions of this paper are summarized as follows:
  • We introduce the concept of action-level masking, which is proven to be equivalent to logit-level masking. The theoretical analysis based on action-level masking demonstrates that the difference between the invalid action policy gradient and the naive invalid action policy gradient is a state-dependent gradient that impacts the probability distribution of invalid actions. This contributes to a better understanding of the impact of invalid action masking on gradients.
  • We prove that the naive policy gradient is indeed a valid gradient that optimizes both the masked policy and the original policy concurrently. Although this conclusion is counter-intuitive, it ensures that we can directly use the naive policy gradient for training without causing instability. In addition, we propose an off-policy algorithm for invalid action masking that focuses on optimizing the original policy.
  • We conduct experiments on the Gym-μRTS platform [24] to compare the performance of the proposed algorithms and the invalid action penalty. The results show that Off-PIAM outperforms the other algorithms even when the masking is removed, and that CO-IAM scales well on complex tasks. Based on these findings, we conclude that the Off-PIAM algorithm is suitable for most tasks, while the CO-IAM algorithm is well-suited for more complex tasks.

2. Background

2.1. Reinforcement Learning

Consider a standard reinforcement learning problem in which an agent interacts with a stochastic environment and attempts to maximize cumulative reward [25]. The environment is formalized as a Markov Decision Process (MDP), denoted by the tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, \rho, R, \gamma \rangle$, consisting of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition probability distribution $P: \mathcal{S}\times\mathcal{A}\times\mathcal{S}\to[0,1]$, an initial state distribution $\rho: \mathcal{S}\to[0,1]$, a reward function $R: \mathcal{S}\times\mathcal{A}\to\mathbb{R}$, and a discount factor $\gamma\in[0,1)$. Without prior knowledge of the transition probability distribution, the agent uses its policy $\pi_\theta$ to collect trajectories $\tau = \{s_0, a_0, r_0, s_1, a_1, r_1, \ldots\}$ for training. The agent improves its policy $\pi_\theta$ to maximize the expected discounted return of each episode
$$J(\pi_\theta) = \mathbb{E}_{s_0\sim\rho,\; a_t\sim\pi_\theta,\; s_{t+1}\sim P(\cdot|s_t,a_t)}\left[\sum_{t=0}^{H}\gamma^t r_t\right], \tag{1}$$
where $r_t = R(s_t, a_t)$ and $H$ is the maximal length of one episode. The fundamental result underlying this problem is the Policy Gradient Theorem [25], which derives the policy gradient $\nabla_\theta J(\pi_\theta)$ with respect to the policy parameters $\theta$. Performing gradient ascent $\theta \leftarrow \theta + \nabla_\theta J(\pi_\theta)$ improves the policy and maximizes the expected discounted return. The policy gradient estimate is
$$g = \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{H}\nabla_\theta\log\pi_\theta(a_t|s_t)\, G_t\right], \tag{2}$$
where $G_t = \sum_{t'=t}^{H}\gamma^{t'-t} r_{t'}$ is the discounted return of a trajectory from step $t$ onward. The Generalized Advantage Estimation (GAE) technique [26] reduces the variance of the policy gradient estimate at the cost of some bias
$$\hat{g} = \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^{H}\nabla_\theta\log\pi_\theta(a_t|s_t)\,\hat{A}_t\right], \tag{3}$$
where $\hat{A}_t$ is the GAE estimator of the advantage function [26]. The policy gradient algorithm requires on-policy trajectories for training (i.e., the target policy and the behavior policy must be consistent).
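As a brief illustration of how $\hat{A}_t$ in (3) is typically computed in practice, the sketch below implements GAE for a single trajectory; the reward and value numbers are made up for the example, and termination handling is omitted.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one non-terminating trajectory.
    `values` must contain one extra entry: the value of the state after the last step."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # exponentially weighted sum
        adv[t] = running
    return adv

# The estimate in (3) then weights each grad log pi(a_t|s_t) by adv[t].
rewards = [0.0, 0.0, 1.0]
values = [0.1, 0.2, 0.5, 0.0]
print(gae_advantages(rewards, values))
```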

2.2. Logit-Level Masking

In typical deep reinforcement learning algorithms, a neural network is commonly used to represent the policy. In the case of a discrete action space, the action probability distribution is produced by implementing the Softmax operation on the output logits of the neural network, which is the framework we will assume in the rest of the paper.
Invalid action masking is a practical technique in deep reinforcement learning to prevent the agent from taking impossible or unavailable actions. The most popular way for invalid action masking is to replace the logits of invalid actions with negative infinity (or a large negative number in practice) [23], which we call logit-level masking. Logit-level masking works by applying a state-dependent masking function m ( · | s ) when calculating action probability distribution [23], as shown in Figure 1.
Specifically, let us define an MDP with state space $\mathcal{S}$ and discrete action space $\mathcal{A} = \{a_1, a_2, \ldots, a_n\}$, where $n$ is the number of discrete actions. The policy $\pi_\theta$ is parameterized by the parameters $\theta$ of the neural network. Taking any state $s\in\mathcal{S}$ as input, the output logits of the neural network are $l_\theta(\cdot|s) = [l_\theta(a_1|s), l_\theta(a_2|s), \ldots, l_\theta(a_n|s)]$. The logit-level masking function is defined as
$$m(l_\theta(a_i|s)) = \begin{cases} l_\theta(a_i|s), & \text{if } a_i \text{ is valid in } s,\\ -\infty, & \text{if } a_i \text{ is invalid in } s.\end{cases} \tag{4}$$
Then, we obtain the re-normalized action probability distribution $\pi_\theta^m(\cdot|s)$ by applying the Softmax operation to the masked logits
$$\pi_\theta^m(\cdot|s) = \operatorname{Softmax}\left(m(l_\theta(\cdot|s))\right) = \left[\frac{\exp\left(m(l_\theta(a_1|s))\right)}{\sum_{a_i\in\mathcal{A}}\exp\left(m(l_\theta(a_i|s))\right)}, \ldots, \frac{\exp\left(m(l_\theta(a_n|s))\right)}{\sum_{a_i\in\mathcal{A}}\exp\left(m(l_\theta(a_i|s))\right)}\right]. \tag{5}$$
Since the logits associated with invalid actions are $-\infty$, the sampling probabilities of invalid actions are zero because $\lim_{l\to-\infty}\exp(l) = 0$.
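A minimal NumPy sketch of logit-level masking as defined in (4) and (5) follows; the use of $-10^8$ in place of $-\infty$ mirrors the practical substitution mentioned above, and the example logits are arbitrary.

```python
import numpy as np

def logit_level_mask(logits, valid, neg=-1e8):
    """Replace the logits of invalid actions with a large negative number,
    then apply Softmax, as in Equations (4) and (5)."""
    masked = np.where(valid, logits, neg)
    masked = masked - masked.max()        # numerical stability
    probs = np.exp(masked)
    return probs / probs.sum()

logits = np.array([1.0, 0.5, -0.3, 2.0])
valid = np.array([True, False, True, True])
print(logit_level_mask(logits, valid))    # probability of the invalid action is ~0
```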

3. Invalid Action Masking

This section proposes a novel paradigm for invalid action masking, called action-level masking. Compared with logit-level masking, the key benefit of action-level masking is its suitability for conducting theoretical analyses of action masking. Furthermore, we verify the equivalence of logit-level masking and action-level masking. Thus, the conclusions derived by Huang et al. [23] are also applicable to action-level masking.

3.1. Action-Level Masking

Different from logit-level masking, action-level masking works by replacing the probabilities of invalid actions with zero and re-normalizing the action probability distribution (see Figure 2).
Taking any state $s\in\mathcal{S}$ as input, the original policy $\pi_\theta$ outputs the action probability distribution $[\pi_\theta(a_1|s), \pi_\theta(a_2|s), \ldots, \pi_\theta(a_n|s)]$, where
$$\pi_\theta(a_i|s) = \frac{\exp(l_\theta(a_i|s))}{\sum_{a_j\in\mathcal{A}}\exp(l_\theta(a_j|s))}. \tag{6}$$
The action-level masking function $\varpi(\cdot|s)$ is a state-dependent binary function
$$\varpi(\cdot|s) = [\varpi(a_1|s), \varpi(a_2|s), \ldots, \varpi(a_n|s)], \quad \text{where } \varpi(a_i|s) = \begin{cases}1, & \text{if } a_i \text{ is valid in } s,\\ 0, & \text{if } a_i \text{ is invalid in } s.\end{cases} \tag{7}$$
Then, we re-normalize the Hadamard product of the original policy and the action-level masking function to obtain the masked policy
$$\pi_\theta^\varpi(\cdot|s) = \operatorname{Normalize}\left(\pi_\theta(\cdot|s)\odot\varpi(\cdot|s)\right) = \left[\frac{\pi_\theta(a_1|s)\,\varpi(a_1|s)}{\sum_{a_i\in\mathcal{A}}\pi_\theta(a_i|s)\,\varpi(a_i|s)}, \ldots, \frac{\pi_\theta(a_n|s)\,\varpi(a_n|s)}{\sum_{a_i\in\mathcal{A}}\pi_\theta(a_i|s)\,\varpi(a_i|s)}\right]. \tag{8}$$
We assume that for any state $s\in\mathcal{S}$, at least one action in the action space is valid (i.e., $\sum_{a_i\in\mathcal{A}}\varpi(a_i|s)\neq 0$ for all $s\in\mathcal{S}$). However, it is important to note that this assumption may not hold in certain scenarios, such as when a passive unit is surrounded on all sides. To address this, we propose incorporating a no-operation action into the action space so that at least one action is always legal in any given state. The probability of sampling any invalid action $a_{inv}$ is zero since the corresponding value of the action-level masking function, $\varpi(a_{inv}|s)$, is equal to zero. We design action-level masking because it is conducive to the theoretical analysis of the policy gradient algorithm.
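A short NumPy sketch of action-level masking as defined in Equations (6)–(8) is given below; it assumes the no-operation convention above so that the mask never zeroes out every action.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def action_level_mask(logits, valid):
    """Zero out invalid-action probabilities and re-normalize (Equation (8)).
    `valid` is the binary action-level masking function of Equation (7)."""
    probs = softmax(logits) * valid.astype(float)  # Hadamard product with the mask
    return probs / probs.sum()

logits = np.array([1.0, 0.5, -0.3, 2.0])
valid = np.array([1, 0, 1, 1])
print(action_level_mask(logits, valid))            # invalid action gets probability 0
```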

3.2. Equivalence Verification

This section verifies the equivalence of logit-level masking and action-level masking, so that the conclusions derived by Huang et al. [23] also apply to action-level masking. The two types of invalid action masking are equivalent if
$$\pi_\theta^m(a|s) = \pi_\theta^\varpi(a|s), \quad \forall s\in\mathcal{S},\ \forall a\in\mathcal{A}. \tag{9}$$
For any state $s\in\mathcal{S}$, the action space can be divided into two parts, the valid action space $\mathcal{A}_v(s)$ and the invalid action space $\mathcal{A}_{inv}(s)$. We have
$$\mathcal{A} = \mathcal{A}_v(s)\cup\mathcal{A}_{inv}(s), \quad \mathcal{A}_v(s)\cap\mathcal{A}_{inv}(s) = \emptyset. \tag{10}$$
According to (5), for any valid action $a_v\in\mathcal{A}_v(s)$, the action probability under the logit-level masked policy is
$$\pi_\theta^m(a_v|s) = \frac{\exp\left(m(l_\theta(a_v|s))\right)}{\sum_{a_i\in\mathcal{A}}\exp\left(m(l_\theta(a_i|s))\right)} = \frac{\exp(l_\theta(a_v|s))}{\sum_{a_i\in\mathcal{A}_v(s)}\exp(l_\theta(a_i|s)) + \sum_{a_j\in\mathcal{A}_{inv}(s)}\exp\left(m(l_\theta(a_j|s))\right)} = \frac{\exp(l_\theta(a_v|s))}{\sum_{a_i\in\mathcal{A}_v(s)}\exp(l_\theta(a_i|s))}. \tag{11}$$
According to (8), for any valid action $a_v\in\mathcal{A}_v(s)$, the action probability under the action-level masked policy is
$$\pi_\theta^\varpi(a_v|s) = \frac{\pi_\theta(a_v|s)\,\varpi(a_v|s)}{\sum_{a_i\in\mathcal{A}}\pi_\theta(a_i|s)\,\varpi(a_i|s)} = \frac{\pi_\theta(a_v|s)}{\sum_{a_i\in\mathcal{A}_v(s)}\pi_\theta(a_i|s)} = \frac{\exp(l_\theta(a_v|s))/\sum_{a_j\in\mathcal{A}}\exp(l_\theta(a_j|s))}{\sum_{a_i\in\mathcal{A}_v(s)}\exp(l_\theta(a_i|s))/\sum_{a_j\in\mathcal{A}}\exp(l_\theta(a_j|s))} = \frac{\exp(l_\theta(a_v|s))}{\sum_{a_i\in\mathcal{A}_v(s)}\exp(l_\theta(a_i|s))}. \tag{12}$$
In light of (11) and (12), one has
$$\pi_\theta^m(a_v|s) = \pi_\theta^\varpi(a_v|s), \quad \forall s\in\mathcal{S},\ \forall a_v\in\mathcal{A}_v(s). \tag{13}$$
Obviously, the probabilities of invalid actions under both policies are equal to zero:
$$\pi_\theta^m(a_{inv}|s) = \pi_\theta^\varpi(a_{inv}|s) = 0, \quad \forall s\in\mathcal{S},\ \forall a_{inv}\in\mathcal{A}_{inv}(s). \tag{14}$$
Combining (13) with (14), we obtain
$$\pi_\theta^m(a|s) = \pi_\theta^\varpi(a|s), \quad \forall s\in\mathcal{S},\ \forall a\in\mathcal{A}. \tag{15}$$
Thus, logit-level masking and action-level masking are equivalent. This is a significant result because it shows that the conclusions about logit-level masking [23] also apply to action-level masking. Note that substituting a large negative value for negative infinity may introduce minor rounding errors, but these errors do not affect the equivalence in practice. While the two masking schemes are practically equivalent, action-level masking enables the explicit derivation of policy gradients, whereas logit-level masking does not. In the following, action-level masking is used to derive policy gradient algorithms for invalid action masking.
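The equivalence (15) can be checked numerically, as in the small sketch below; the random logits and masking pattern are arbitrary assumptions, and $-10^8$ stands in for $-\infty$ as in practice.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=6)
valid = np.array([1, 0, 1, 1, 0, 1], dtype=float)

# Logit-level masking: large negative logits for invalid actions.
pi_logit = softmax(np.where(valid > 0, logits, -1e8))

# Action-level masking: zero the probabilities and re-normalize.
p = softmax(logits) * valid
pi_action = p / p.sum()

print(np.allclose(pi_logit, pi_action, atol=1e-6))  # True up to rounding error
```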

3.3. Masking Changes Policy Gradient

This section focuses on the impact of invalid action masking on policy updates. Huang et al. [23] designed a numerical example to illustrate that the gradient generated by invalid action masking differs from the naive policy gradient. As a supplement, we further investigate this finding and reveal the difference between the naive policy gradient and the invalid action masking policy gradient.
In regular policy-based algorithms, the original policy $\pi_\theta$ is employed to collect samples, and the policy gradient is calculated using the Policy Gradient Theorem [25]. To avoid invalid actions, we apply invalid action masking to the policy $\pi_\theta$, resulting in a masked policy denoted as $\pi_\theta^\varpi$. The straightforward approach to policy updating is to directly use the original policy gradient, called the naive policy gradient:
$$g_{naive} = \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\sum_{t=0}^{H}\nabla_\theta\log\pi_\theta(a_t|s_t)\,\hat{A}_t\right]. \tag{16}$$
The naive policy gradient is intuitively improper because the policy gradient algorithm is on-policy, yet the target policy ($\pi_\theta$) and the behavior policy ($\pi_\theta^\varpi$) differ. Huang et al. [23] proposed the invalid action policy gradient, which is the policy gradient of the masked policy $\pi_\theta^\varpi$:
$$g_{mask} = \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\sum_{t=0}^{H}\nabla_\theta\log\pi_\theta^\varpi(a_t|s_t)\,\hat{A}_t\right]. \tag{17}$$
The distinction between the two policy gradients lies in which policy is used for the log-probability calculation. Next, we provide a theoretical analysis that reveals the difference between the naive policy gradient (16) and the invalid action policy gradient (17).
Theorem 1. 
Consider a masked policy $\pi_\theta^\varpi$ produced by applying invalid action masking to the original policy $\pi_\theta$. The difference between the naive policy gradient (16) and the invalid action policy gradient (17) is a state-dependent gradient
$$g_{mask} - g_{naive} = \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\sum_{t=0}^{H}\frac{\hat{A}_t}{I(s_t;\theta)}\sum_{a_i\in\mathcal{A}_{inv}(s_t)}\nabla_\theta\pi_\theta(a_i|s_t)\right], \tag{18}$$
where $I(s_t;\theta) = \sum_{a_j\in\mathcal{A}_v(s_t)}\pi_\theta(a_j|s_t)$.
Proof of Theorem 1. 
According to (8), the gradient term in $g_{mask}$ can be written as
$$\nabla_\theta\log\pi_\theta^\varpi(a_t|s_t) = \nabla_\theta\log\frac{\pi_\theta(a_t|s_t)\,\varpi(a_t|s_t)}{\sum_{a_i\in\mathcal{A}}\pi_\theta(a_i|s_t)\,\varpi(a_i|s_t)}. \tag{19}$$
Since $a_t$ is sampled from the masked policy $\pi_\theta^\varpi$, $a_t$ must be a valid action and $\varpi(a_t|s_t) = 1$. Then, the gradient term (19) can be written as
$$\nabla_\theta\log\pi_\theta^\varpi(a_t|s_t) = \nabla_\theta\log\frac{\pi_\theta(a_t|s_t)}{\sum_{a_i\in\mathcal{A}}\pi_\theta(a_i|s_t)\,\varpi(a_i|s_t)} = \nabla_\theta\log\pi_\theta(a_t|s_t) - \nabla_\theta\log\sum_{a_i\in\mathcal{A}}\pi_\theta(a_i|s_t)\,\varpi(a_i|s_t) = \nabla_\theta\log\pi_\theta(a_t|s_t) - \frac{\nabla_\theta\sum_{a_i\in\mathcal{A}}\pi_\theta(a_i|s_t)\,\varpi(a_i|s_t)}{\sum_{a_j\in\mathcal{A}}\pi_\theta(a_j|s_t)\,\varpi(a_j|s_t)}. \tag{20}$$
Since $\nabla_\theta\sum_{a_i\in\mathcal{A}}\pi_\theta(a_i|s_t) = \nabla_\theta 1 = 0$, (20) can be written as
$$\nabla_\theta\log\pi_\theta^\varpi(a_t|s_t) = \nabla_\theta\log\pi_\theta(a_t|s_t) + \frac{\sum_{a_i\in\mathcal{A}}\nabla_\theta\pi_\theta(a_i|s_t)\,\left(1-\varpi(a_i|s_t)\right)}{\sum_{a_j\in\mathcal{A}}\pi_\theta(a_j|s_t)\,\varpi(a_j|s_t)}. \tag{21}$$
According to the definition of the masking function (7), we have $1-\varpi(a_i|s_t) = 0$ for $a_i\in\mathcal{A}_v(s_t)$ and $\varpi(a_i|s_t) = 0$ for $a_i\in\mathcal{A}_{inv}(s_t)$. Substituting these relations into (21), we obtain
$$\nabla_\theta\log\pi_\theta^\varpi(a_t|s_t) = \nabla_\theta\log\pi_\theta(a_t|s_t) + \frac{\sum_{a_i\in\mathcal{A}_{inv}(s_t)}\nabla_\theta\pi_\theta(a_i|s_t)}{\sum_{a_j\in\mathcal{A}_v(s_t)}\pi_\theta(a_j|s_t)} = \nabla_\theta\log\pi_\theta(a_t|s_t) + \frac{1}{I(s_t;\theta)}\sum_{a_i\in\mathcal{A}_{inv}(s_t)}\nabla_\theta\pi_\theta(a_i|s_t), \tag{22}$$
where $I(s_t;\theta) = \sum_{a_j\in\mathcal{A}_v(s_t)}\pi_\theta(a_j|s_t)$. Substituting (22) into (17), we obtain
$$g_{mask} = \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\sum_{t=0}^{H}\left(\nabla_\theta\log\pi_\theta(a_t|s_t) + \frac{1}{I(s_t;\theta)}\sum_{a_i\in\mathcal{A}_{inv}(s_t)}\nabla_\theta\pi_\theta(a_i|s_t)\right)\hat{A}_t\right] = g_{naive} + \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\sum_{t=0}^{H}\frac{\hat{A}_t}{I(s_t;\theta)}\sum_{a_i\in\mathcal{A}_{inv}(s_t)}\nabla_\theta\pi_\theta(a_i|s_t)\right].$$
This completes the Proof of Theorem 1.    □
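As a numerical sanity check of the per-step identity (22), the sketch below compares both sides with automatic differentiation, treating raw logits as the policy parameters $\theta$; the five-action example and the masking pattern are arbitrary assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)  # stand-in for the policy parameters theta
mask = torch.tensor([1., 0., 1., 1., 0.])    # 1 = valid, 0 = invalid
a = 2                                        # a valid action sampled from the masked policy

def grad_of(scalar):
    g, = torch.autograd.grad(scalar, logits, retain_graph=True)
    return g

pi = torch.softmax(logits, dim=0)                 # original policy pi_theta(.|s)
pi_masked = (pi * mask) / (pi * mask).sum()       # masked policy pi_theta^varpi(.|s)

lhs = grad_of(torch.log(pi_masked[a]))            # grad log pi^varpi(a|s)

I = (pi * mask).sum().item()                      # I(s; theta): probability mass of valid actions
residual = grad_of((pi * (1 - mask)).sum()) / I   # (1/I) * sum over invalid actions of grad pi(a_i|s)
rhs = grad_of(torch.log(pi[a])) + residual        # right-hand side of Equation (22)

print(torch.allclose(lhs, rhs, atol=1e-6))        # True
```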
Theorem 1 illustrates that the gradient produced by invalid action masking differs from the naive policy gradient. Specifically, the residual term is a scaled sum of the gradients of the invalid-action probabilities.
The experimental results of Huang et al. [23] show that once the masking is removed, the performance of the original policy $\pi_\theta$ becomes unstable. We infer that the residual term in the invalid action policy gradient is responsible for the poor performance of the original policy. In the rest of this paper, we propose novel algorithms for invalid action masking based on Theorem 1.

4. Practical Algorithms for Action Masking

Invalid action masking transforms the original policy $\pi_\theta$ into the masked policy $\pi_\theta^\varpi$, which is used to collect samples. The policy gradient algorithm is an on-policy algorithm in which the target policy and the behavior policy should be consistent. Intuitively, the naive policy gradient is improper because the behavior policy ($\pi_\theta^\varpi$) differs from the target policy ($\pi_\theta$). The design of the algorithms depends on which policy we wish to optimize.
This section presents three policy gradient algorithms for invalid action masking: the On-Policy Invalid Action Masking (On-PIAM) algorithm [23], the Off-Policy Invalid Action Masking (Off-PIAM) algorithm, and the Composite Objective Invalid Action Masking (CO-IAM) algorithm.
Each algorithm uses the masked policy for sampling but optimizes a different policy. The On-PIAM algorithm optimizes the performance of the masked policy ($\pi_\theta^\varpi$), the Off-PIAM algorithm optimizes the original policy ($\pi_\theta$), and the CO-IAM algorithm optimizes the masked policy and the original policy in parallel. Before introducing these algorithms, we provide a brief summary of them in Table 1.
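All three algorithms share the same sampling step; they differ only in which log probabilities they store and optimize. A minimal PyTorch-style sketch of that shared step is given below; the function name and tensor shapes are illustrative assumptions rather than the authors' implementation.

```python
import torch

def sample_masked(logits, valid_mask):
    """Sample from the masked policy pi^varpi and return the log probabilities the
    objectives below need: On-PIAM uses the masked log prob, while Off-PIAM and
    CO-IAM also need the log prob under the original (unmasked) policy."""
    probs = torch.softmax(logits, dim=-1)
    masked = probs * valid_mask
    masked = masked / masked.sum(dim=-1, keepdim=True)
    action = torch.distributions.Categorical(probs=masked).sample()
    logp_masked = torch.log(masked.gather(-1, action.unsqueeze(-1))).squeeze(-1)
    logp_orig = torch.log(probs.gather(-1, action.unsqueeze(-1))).squeeze(-1)
    return action, logp_masked, logp_orig
```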

4.1. On-Policy Algorithm

In the On-Policy Invalid Action Masking (On-PIAM) algorithm [23], the target policy and the behavior policy are both the masked policy $\pi_\theta^\varpi$. Since $\pi_\theta^\varpi$ is differentiable with respect to its parameters $\theta$, the policy gradient algorithm can be directly applied to train $\pi_\theta^\varpi$:
$$\nabla_\theta J(\pi_\theta^\varpi) = \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\sum_{t=0}^{H}\nabla_\theta\log\pi_\theta^\varpi(a_t|s_t)\,\hat{A}_t\right]. \tag{23}$$
To train the agent more robustly, we can adopt the Proximal Policy Optimization (PPO) algorithm [27] as the training algorithm:
$$L_{On\text{-}P}^{CLIP}(\theta) = \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\min\left(r_t^\varpi(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t^\varpi(\theta),\epsilon\right)\hat{A}_t\right)\right], \tag{24}$$
where
$$r_t^\varpi(\theta) = \frac{\pi_\theta^\varpi(a_t|s_t)}{\pi_{\theta_{old}}^\varpi(a_t|s_t)}, \qquad \operatorname{clip}(r,\epsilon) = \begin{cases}1+\epsilon, & r > 1+\epsilon,\\ r, & 1-\epsilon\le r\le 1+\epsilon,\\ 1-\epsilon, & r < 1-\epsilon.\end{cases} \tag{25}$$
During training, the agent also maximizes the policy's entropy to encourage exploration. Combining these optimization terms yields the optimization objective
$$L_{On\text{-}P}^{PPO}(\theta) = L_{On\text{-}P}^{CLIP}(\theta) + c_e H(\pi_\theta^\varpi), \tag{26}$$
where $H(\cdot)$ is the entropy bonus and $c_e$ is the coefficient of the entropy term.
After training, we obtain a trained policy $\pi_\theta^\varpi$ rather than $\pi_\theta$. Note that we cannot remove the masking when deploying the policy, because the target policy is the masked policy $\pi_\theta^\varpi$.
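One possible PyTorch-style rendering of the On-PIAM objective (24)–(26) is sketched below; the argument names, the sign convention (a loss to be minimized), and the default coefficients are assumptions, not the authors' implementation.

```python
import torch

def on_piam_loss(new_masked_logprobs, old_masked_logprobs, advantages,
                 masked_entropy, clip_eps=0.2, c_e=0.01):
    """PPO clipped objective for On-PIAM: both the ratio and the entropy bonus
    are computed on the masked policy."""
    ratio = torch.exp(new_masked_logprobs - old_masked_logprobs)  # pi^varpi / pi^varpi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_obj = torch.min(unclipped, clipped).mean()
    return -(policy_obj + c_e * masked_entropy.mean())            # negate to minimize
```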

4.2. Off-Policy Algorithm

In the Off-Policy Invalid Action Masking (Off-PIAM) algorithm, we treat $\pi_\theta^\varpi$ as the behavior policy while still optimizing the original policy $\pi_\theta$. We can adopt the Importance Sampling (IS) technique to derive an unbiased estimator of the policy gradient from off-policy trajectories, known as the off-policy policy gradient [28]. The gradient of $J(\pi_\theta)$ with respect to its parameters $\theta$ is [29]:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_{\theta_{old}}^\varpi}\left[\left(\prod_{t=0}^{H}\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}^\varpi(a_t|s_t)}\right)\sum_{t=0}^{H}\nabla_\theta\log\pi_\theta(a_t|s_t)\,\hat{A}_t\right]. \tag{27}$$
Although this policy gradient is unbiased, the product of importance sampling ratios results in high variance [30]. We can clip the per-step ratio in the same way as PPO [27,31]:
$$L_{Off\text{-}P}^{CLIP}(\theta) = \mathbb{E}_{\tau\sim\pi_{\theta_{old}}^\varpi}\left[\min\left(r_t^{IS}(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t^{IS}(\theta),\epsilon\right)\hat{A}_t\right)\right], \tag{28}$$
where
$$r_t^{IS}(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}^\varpi(a_t|s_t)}. \tag{29}$$
Adding the entropy term, the optimization objective is
$$L_{Off\text{-}P}^{PPO}(\theta) = L_{Off\text{-}P}^{CLIP}(\theta) + c_e H(\pi_\theta^\varpi). \tag{30}$$
Different from the On-PIAM algorithm, the Off-PIAM algorithm focuses on training the original policy $\pi_\theta$. After training, the agent performs well even if the masking is removed. That is, an agent trained by the Off-PIAM algorithm explicitly learns to avoid invalid actions.
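A hedged PyTorch-style sketch of the Off-PIAM objective (28)–(30) follows; the only structural difference from the On-PIAM sketch above is that the numerator of the ratio uses the unmasked policy. The argument names and coefficients are assumptions.

```python
import torch

def off_piam_loss(new_unmasked_logprobs, old_masked_logprobs, advantages,
                  masked_entropy, clip_eps=0.2, c_e=0.01):
    """Off-PIAM clipped objective: the importance ratio compares the *unmasked*
    current policy with the *masked* behavior policy."""
    ratio = torch.exp(new_unmasked_logprobs - old_masked_logprobs)  # pi_theta / pi^varpi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_obj = torch.min(unclipped, clipped).mean()
    return -(policy_obj + c_e * masked_entropy.mean())
```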

4.3. Composite Objective Algorithm

The On-PIAM algorithm aims to improve the performance of the masked policy, and the Off-PIAM algorithm focuses on optimizing the original policy. A question naturally arises: can we optimize both policies concurrently? The solution is to design a composite objective that balances the performance of the masked policy and the original policy. In the following, we present the Composite Objective Invalid Action Masking (CO-IAM) algorithm, which is an on-policy algorithm with an auxiliary optimization objective. Inspired by Theorem 1, we design the composite optimization objective
$$L^{CO}(\theta) = J(\pi_\theta^\varpi) - c_\phi\,\Phi(\pi_\theta), \tag{31}$$
where
$$J(\pi_\theta^\varpi) = \mathbb{E}_{s_0\sim\rho,\;a_t\sim\pi_\theta^\varpi,\;s_{t+1}\sim P(\cdot|s_t,a_t)}\left[\sum_{t=0}^{H}\gamma^t r_t\right], \tag{32}$$
$$\Phi(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\sum_{t=0}^{H}\sum_{a_i\in\mathcal{A}_{inv}(s_t)}\pi_\theta(a_i|s_t)\right]. \tag{33}$$
The first term, $J(\pi_\theta^\varpi)$, is the cumulative reward obtained by following the masked policy $\pi_\theta^\varpi$. The second term, $\Phi(\pi_\theta)$, is the cumulative probability mass assigned to invalid actions, and $c_\phi$ is the coefficient that balances the two terms.
The second term reduces the probability of invalid actions when $c_\phi > 0$. In short, the composite objective optimizes the cumulative reward and adjusts the probabilities of invalid actions concurrently. Next, we derive the gradient of the composite objective:
$$\nabla_\theta L^{CO}(\theta) = \nabla_\theta J(\pi_\theta^\varpi) - c_\phi\,\nabla_\theta\Phi(\pi_\theta). \tag{34}$$
The gradient of the first term follows directly from the Policy Gradient Theorem [25]:
$$\nabla_\theta J(\pi_\theta^\varpi) = \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\sum_{t=0}^{H}\nabla_\theta\log\pi_\theta^\varpi(a_t|s_t)\,\hat{A}_t\right]. \tag{35}$$
The gradient of the second term is
$$c_\phi\,\nabla_\theta\Phi(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\sum_{t=0}^{H}c_\phi\sum_{a_i\in\mathcal{A}_{inv}(s_t)}\nabla_\theta\pi_\theta(a_i|s_t)\right]. \tag{36}$$
Inspired by Theorem 1, we set $c_\phi = \hat{A}_t / I(s_t;\theta)$ and obtain
$$c_\phi\,\nabla_\theta\Phi(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\sum_{t=0}^{H}\frac{\hat{A}_t}{I(s_t;\theta)}\sum_{a_i\in\mathcal{A}_{inv}(s_t)}\nabla_\theta\pi_\theta(a_i|s_t)\right]. \tag{37}$$
Substituting (35) and (37) into (34), one has
$$\nabla_\theta L^{CO}(\theta) = \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\sum_{t=0}^{H}\nabla_\theta\log\pi_\theta^\varpi(a_t|s_t)\,\hat{A}_t\right] - \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\sum_{t=0}^{H}\frac{\hat{A}_t}{I(s_t;\theta)}\sum_{a_i\in\mathcal{A}_{inv}(s_t)}\nabla_\theta\pi_\theta(a_i|s_t)\right] = g_{mask} - \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\sum_{t=0}^{H}\frac{\hat{A}_t}{I(s_t;\theta)}\sum_{a_i\in\mathcal{A}_{inv}(s_t)}\nabla_\theta\pi_\theta(a_i|s_t)\right]. \tag{38}$$
According to Theorem 1, we have
$$g_{mask} - g_{naive} = \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\sum_{t=0}^{H}\frac{\hat{A}_t}{I(s_t;\theta)}\sum_{a_i\in\mathcal{A}_{inv}(s_t)}\nabla_\theta\pi_\theta(a_i|s_t)\right]. \tag{39}$$
In light of (16), (38) and (39), we obtain the interesting finding that the gradient of the composite objective is equal to the naive policy gradient:
$$\nabla_\theta L^{CO}(\theta) = g_{naive} = \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\sum_{t=0}^{H}\nabla_\theta\log\pi_\theta(a_t|s_t)\,\hat{A}_t\right]. \tag{40}$$
In Section 3.3, we noted that the naive policy gradient is intuitively improper because the policy gradient algorithm is on-policy while the behavior policy is not consistent with the target policy. However, the above derivation shows that the naive policy gradient is indeed a valid gradient: it is equivalent to the gradient of the Composite Objective Invalid Action Masking (CO-IAM) algorithm, which optimizes the masked policy and the original policy concurrently. Although this conclusion is counter-intuitive, it ensures that we can directly use the naive policy gradient (16) for training without causing instability.
Based on the above discussion, we suggest implementing the CO-IAM algorithm in practice simply as the naive policy gradient. That is, we only need to use the masked policy for sampling, without changing the structure of the policy gradient algorithm. Similarly, we can employ PPO as the training algorithm:
$$L_{CO}^{CLIP}(\theta) = \mathbb{E}_{\tau\sim\pi_\theta^\varpi}\left[\min\left(r_t^{CO}(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t^{CO}(\theta),\epsilon\right)\hat{A}_t\right)\right], \tag{41}$$
where
$$r_t^{CO}(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}. \tag{42}$$
Adding the entropy term, the optimization objective is
$$L_{CO}^{PPO}(\theta) = L_{CO}^{CLIP}(\theta) + c_e H(\pi_\theta^\varpi). \tag{43}$$
The value functions of all algorithms are updated by the squared error loss [27], which is not listed in the optimization objectives.
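In the same hedged style as the previous sketches, the CO-IAM objective (41)–(43) can be written as below; the only difference from a standard PPO loss is that the actions in the batch were drawn from the masked policy, while both log probabilities come from the unmasked policy. The argument names and coefficients are assumptions.

```python
import torch

def co_iam_loss(new_orig_logprobs, old_orig_logprobs, advantages,
                masked_entropy, clip_eps=0.2, c_e=0.01):
    """CO-IAM in its practical (naive policy gradient) form: the PPO ratio uses the
    unmasked policy in both numerator and denominator, so a standard PPO loss
    needs no structural change."""
    ratio = torch.exp(new_orig_logprobs - old_orig_logprobs)   # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_obj = torch.min(ratio * advantages, clipped * advantages).mean()
    return -(policy_obj + c_e * masked_entropy.mean())
```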

5. Empirical Analysis

We evaluate how well the proposed algorithms deal with invalid actions using the Gym-μRTS environment [24]. It is a suitable testbed for our experiments because the space of invalid actions grows exponentially with the map size. We use four maps of different sizes for the experiments (see Figure 3), and the environment setup is the same as that of Huang et al. [23].
In the Gym-μRTS environments, the agent attempts to control the unit at the top left of the map to harvest resources as fast as possible. The agent receives a reward of +1 when the unit harvests a resource or returns a resource to the base. The action space is designed as a MultiDiscrete action space [19]. A composite action comprises eight discrete components: Source Unit, Action Type, Move Parameter, Harvest Parameter, Return Parameter, Produce Direction Parameter, Produce Type Parameter, and Attack Target Unit. The first component is the location of the unit selected to perform an action. Since only the worker and the base can be selected as the Source Unit, selecting any other grid cell on the map is usually invalid. This action space enables us to verify the scalability of invalid action masking.
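For concreteness, a hypothetical construction of such a composite action space is sketched below; the component sizes are illustrative assumptions and do not reproduce the exact Gym-μRTS configuration.

```python
from gym.spaces import MultiDiscrete

# Assumed sizes for an h x w map; the real Gym-μRTS dimensions may differ.
h, w = 16, 16
action_space = MultiDiscrete([
    h * w,  # Source Unit: grid cell of the unit to command
    6,      # Action Type: no-op / move / harvest / return / produce / attack
    4,      # Move Parameter (direction)
    4,      # Harvest Parameter (direction)
    4,      # Return Parameter (direction)
    4,      # Produce Direction Parameter
    7,      # Produce Type Parameter
    h * w,  # Attack Target Unit
])
print(action_space.sample())  # one composite action: eight discrete components
```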
We compare the On-PIAM, Off-PIAM, and CO-IAM algorithms on the different maps. We further evaluate the performance of the original policies with the Masking Removed (i.e., On-PIAM-MR, Off-PIAM-MR, and CO-IAM-MR). To illustrate the superiority of invalid action masking, we also compare it with the invalid action penalty, which assigns a negative reward $r_{inv}\in\{-0.01, -0.1\}$ for choosing an invalid action (denoted Penalty-0.01 and Penalty-0.1). All agents are trained for 1,000,000 time steps, and the maximal length of each episode is 200 steps. All experiments were run with four random seeds.

5.1. Evaluation Metrics

Following Huang et al. [23], we employ six metrics to measure the algorithms' performance. Specifically, $r_{episode}$ is the episode return; $a_{null}$ is the number of actions that select an invalid Source Unit; $a_{busy}$ is the number of actions that select a Source Unit that is already executing another action; $a_{owner}$ is the number of actions that select a Source Unit that does not belong to the agent; $t_{first}$ is the percentage of the total time steps required for the agent to obtain the first positive reward; and $t_{solve}$ is the percentage of the total time steps required for the episode return to exceed 40.

5.2. Results

The experimental results are presented in Figure 4 and Figure 5 and Table 2. Here, we provide some important observations.
The invalid action penalty is not a robust way to prevent the agent from taking invalid actions. In Figure 4, we observe that Penalty-0.01 achieves more than 30 reward in the 4 × 4 and 10 × 10 maps. It even surpasses On-PIAM in the 4 × 4 map but can hardly extend to the other maps. Furthermore, the performance of Penalty-0.1 is clearly worse than that of Penalty-0.01. However, Table 2 shows that the $a_{null}$, $a_{busy}$, and $a_{owner}$ of Penalty-0.1 are better than those of Penalty-0.01. This demonstrates that a larger penalty lowers the probabilities of invalid actions but fails to obtain more positive reward. Thus, more effort must be spent tuning the value of the negative reward when adopting the invalid action penalty.
On-PIAM solves all tasks within fewer time steps than Off-PIAM and CO-IAM, but it fails to achieve the expected asymptotic performance. As shown in Table 2, On-PIAM's $t_{solve}$ is roughly 3% to 7% and very similar across the different map sizes. However, the episode return of On-PIAM reaches 34 on average, whereas Off-PIAM and CO-IAM reach almost 40. The green curves in Figure 4 confirm that the episode return curves of On-PIAM are not stable and that its asymptotic performance is worse than that of Off-PIAM and CO-IAM. Furthermore, Figure 5 shows that once the masking is removed, On-PIAM-MR only barely works in the 4 × 4 map and cannot obtain rewards on the other maps. In Table 2, we also notice that On-PIAM-MR has larger counts of $a_{null}$ and $a_{owner}$. This is expected, since On-PIAM aims to optimize the masked policy rather than the original policy.
CO-IAM outperforms On-PIAM and scales well on complex tasks. As shown in Figure 5 and Table 2, CO-IAM is better than On-PIAM with or without masking. When the masking is removed, the episode return of CO-IAM-MR still remains around 30. This verifies our claim that although $g_{naive}$ appears to be an improper policy gradient, it indeed optimizes the masked policy and the original policy concurrently. CO-IAM's $t_{solve}$ is better than Off-PIAM's. In general, CO-IAM behaves like a combination of On-PIAM and Off-PIAM. In the 24 × 24 map, CO-IAM obtains 37.2 reward on average and outperforms Off-PIAM and On-PIAM, which demonstrates that CO-IAM has better exploration ability on complex tasks.
Off-PIAM achieves the best performance when the masking is removed. In Figure 5 and Table 2, we observe that Off-PIAM-MR obtains better asymptotic performance than On-PIAM-MR and CO-IAM-MR. In addition, Off-PIAM also achieves strong performance during training. Figure 4 shows that Off-PIAM surpasses the other algorithms in the 10 × 10 and 16 × 16 maps and achieves asymptotic performance similar to CO-IAM in the other maps. This demonstrates that the masked policy of Off-PIAM works well even though the algorithm focuses on optimizing the original policy. Moreover, the second row of Figure 4 shows the Kullback–Leibler (KL) divergence between the target policy and the current policy. The target policy of Off-PIAM is $\pi_\theta$, and the behavior policy is $\pi_\theta^\varpi$. We observe that the KL divergence of Off-PIAM is large at the beginning and gradually decreases during training. That is, $\pi_\theta$ and $\pi_\theta^\varpi$ become closer during training, which illustrates that Off-PIAM guides the agent to explicitly learn to avoid invalid actions.

6. Conclusions

In this study, we compared the performance of three invalid action masking algorithms, On-PIAM, Off-PIAM, and CO-IAM, which optimize the masked policy, the original policy, and both policies, respectively. Our main contribution lies in the successful performance of the proposed Off-PIAM and CO-IAM algorithms when the masking is removed after training, a setting in which On-PIAM fails to work. Through our work, we have discovered several key findings. Firstly, while On-PIAM solves all tasks within fewer time steps, it fails to achieve the expected return when masking is removed. Secondly, the naive policy gradient is equivalent to the CO-IAM algorithm, which optimizes the masked policy and the original policy in parallel. Thirdly, CO-IAM demonstrates superior performance compared to On-PIAM and scales well on complex tasks. Lastly, Off-PIAM outperforms the other algorithms regardless of whether masking is present. Based on our empirical findings, we recommend the Off-PIAM algorithm for most tasks, as it consistently delivers strong performance, and the CO-IAM algorithm for more complex tasks. One limitation of this study is its focus on the effects of invalid action masking on RL agents in RTS games; therefore, the findings cannot be generalized to all solutions, such as rule-based or optimization-based approaches. As part of our future work, we would like to extend our experiments to more complex game environments, such as StarCraft II, to assess the effectiveness and adaptability of our method in those settings.

Author Contributions

Conceptualization, X.L. and J.Z.; methodology, Y.H., Q.Y. and A.Y.; software, Y.H.; validation, Y.H. and A.Y.; formal analysis, Y.H., X.L. and J.Z.; investigation, Y.H. and N.W.; resources, Y.H.; data curation, Y.H. and A.Y.; writing—original draft preparation, Y.H. and Q.Y.; writing—review and editing, Y.H., Q.Y., A.Y. and N.W.; visualization, Y.H.; supervision, X.L.; project administration, X.L.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China grant number 61703427.

Data Availability Statement

The online supplementary material is available at https://github.com/afeuhyq/invalid-action-masking (accessed on 16 July 2023), as well as all the metrics, logs, and recorded videos at https://wandb.ai/drhou/invalid-action-masking (accessed on 16 July 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xu, Y.; Wei, Y.; Jiang, K.; Wang, D.; Deng, H. Multiple UAVs Path Planning Based on Deep Reinforcement Learning in Communication Denial Environment. Mathematics 2023, 11, 405. [Google Scholar] [CrossRef]
  2. Li, K.; Zhang, T.; Wang, R.; Wang, Y.; Han, Y.; Wang, L. Deep Reinforcement Learning for Combinatorial Optimization: Covering Salesman Problems. IEEE Trans. Cybern. 2022, 52, 13142–13155. [Google Scholar] [CrossRef]
  3. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
  4. Song, J.; Luo, Y.; Zhao, M.; Hu, Y.; Zhang, Y. Fault-Tolerant Integrated Guidance and Control Design for Hypersonic Vehicle Based on PPO. Mathematics 2022, 10, 3401. [Google Scholar] [CrossRef]
  5. Liu, X.; Tan, Y. Attentive Relational State Representation in Decentralized Multiagent Reinforcement Learning. IEEE Trans. Cybern. 2022, 52, 252–264. [Google Scholar] [CrossRef] [PubMed]
  6. Liang, W.; Wang, J.; Bao, W.; Zhu, X.; Wu, G.; Zhang, D.; Niu, L. Qauxi: Cooperative multi-agent reinforcement learning with knowledge transferred from auxiliary task. Neurocomputing 2022, 504, 163–173. [Google Scholar] [CrossRef]
  7. Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A.S.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J.; Schrittwieser, J.; et al. Starcraft II: A new challenge for reinforcement learning. arXiv 2017, arXiv:1708.04782. [Google Scholar]
  8. Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Debiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. Dota 2 with large scale deep reinforcement learning. arXiv 2019, arXiv:1912.06680. [Google Scholar]
  9. Ye, D.; Liu, Z.; Sun, M.; Shi, B.; Zhao, P.; Wu, H.; Yu, H.; Yang, S.; Wu, X.; Guo, Q.; et al. Mastering complex control in moba games with deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 6672–6679. [Google Scholar]
  10. Csereoka, P.; Roman, B.I.; Micea, M.V.; Popa, C.A. Novel Reinforcement Learning Research Platform for Role-Playing Games. Mathematics 2022, 10, 4363. [Google Scholar] [CrossRef]
  11. Kalweit, G.; Huegle, M.; Werling, M.; Boedecker, J. Q-learning with Long-term Action-space Shaping to Model Complex Behavior for Autonomous Lane Changes. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 5641–5648. [Google Scholar]
  12. Li, P.; Wei, M.; Ji, H.; Xi, W.; Yu, H.; Wu, J.; Yao, H.; Chen, J. Deep Reinforcement Learning-Based Adaptive Voltage Control of Active Distribution Networks with Multi-terminal Soft Open Point. Int. J. Electr. Power Energy Syst. 2022, 141, 108138. [Google Scholar] [CrossRef]
  13. Mercado, R.; Rastemo, T.; Lindelöf, E.; Klambauer, G.; Engkvist, O.; Chen, H.; Bjerrum, E.J. Graph networks for molecular design. Mach. Learn. Sci. Technol. 2021, 2, 25023. [Google Scholar] [CrossRef]
  14. Zhao, Z.; Wang, Q.; Li, X. Deep reinforcement learning based lane detection and localization. Neurocomputing 2020, 413, 328–338. [Google Scholar] [CrossRef]
  15. Zahavy, T.; Haroush, M.; Merlis, N.; Mankowitz, D.J.; Mannor, S. Learn what not to learn: Action elimination with deep reinforcement learning. Adv. Neural Inf. Process. Syst. 2018, 31. Available online: https://proceedings.neurips.cc/paper_files/paper/2018/file/645098b086d2f9e1e0e939c27f9f2d6f-Paper.pdf (accessed on 16 July 2023).
  16. Dietterich, T.G. Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artif. Intell. Res. 2000, 13, 227–303. [Google Scholar] [CrossRef]
  17. Tang, C.Y.; Liu, C.H.; Chen, W.K.; You, S.D. Implementing action mask in proximal policy optimization (PPO) algorithm. ICT Express 2020, 6, 200–203. [Google Scholar] [CrossRef]
  18. Yang, Q.; Zhu, Y.; Zhang, J.; Qiao, S.; Liu, J. UAV air combat autonomous maneuver decision based on DDPG algorithm. In Proceedings of the 2019 IEEE 15th international conference on control and automation (ICCA), Edinburgh, UK, 16–19 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 37–42. [Google Scholar]
  19. Kanervisto, A.; Scheller, C.; Hautamäki, V. Action Space Shaping in Deep Reinforcement Learning. In Proceedings of the 2020 IEEE Conference on Games (CoG), Osaka, Japan, 24–27 August 2020; pp. 479–486. [Google Scholar] [CrossRef]
  20. Dulac-Arnold, G.; Evans, R.; Sunehag, P.; Coppin, B. Reinforcement Learning in Large Discrete Action Spaces. arXiv 2015, arXiv:1512.07679. [Google Scholar]
  21. Long, M.; Zou, X.; Zhou, Y.; Chung, E. Deep reinforcement learning for transit signal priority in a connected environment. Transp. Res. Part C Emerg. Technol. 2022, 142, 103814. [Google Scholar] [CrossRef]
  22. Xiaofei, Y.; Yilun, S.; Wei, L.; Hui, Y.; Weibo, Z.; Zhengrong, X. Global path planning algorithm based on double DQN for multi-tasks amphibious unmanned surface vehicle. Ocean. Eng. 2022, 266, 112809. [Google Scholar] [CrossRef]
  23. Huang, S.; Ontañón, S. A Closer Look at Invalid Action Masking in Policy Gradient Algorithms. In Proceedings of the Thirty-Fifth International Florida Artificial Intelligence Research Society Conference, FLAIRS 2022, Jensen Beach, FL, USA, 15–18 May 2022. [Google Scholar] [CrossRef]
  24. Huang, S.; Ontañón, S.; Bamford, C.; Grela, L. Gym-μRTS: Toward Affordable Full Game Real-time Strategy Games Research with Deep Reinforcement Learning. In Proceedings of the 2021 IEEE Conference on Games (CoG), Copenhagen, Denmark, 17–20 August 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–8. [Google Scholar]
  25. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inf. Process. Syst. 1999, 12, 1057–1063. [Google Scholar]
  26. Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv 2015, arXiv:1506.02438. [Google Scholar]
  27. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  28. Degris, T.; White, M.; Sutton, R.S. Off-policy actor-critic. arXiv 2012, arXiv:1205.4839. [Google Scholar]
  29. Levine, S.; Kumar, A.; Tucker, G.; Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv 2020, arXiv:2005.01643. [Google Scholar]
  30. Wang, Z.; Bapst, V.; Heess, N.; Mnih, V.; Munos, R.; Kavukcuoglu, K.; de Freitas, N. Sample efficient actor-critic with experience replay. arXiv 2016, arXiv:1611.01224. [Google Scholar]
  31. Huang, S.; Ontañón, S. Action guidance: Getting the best of sparse rewards and shaped rewards for real-time strategy games. arXiv 2020, arXiv:2010.03956. [Google Scholar]
Figure 1. The overview of logit-level masking. Logit-level masking calculates the action probability distribution by applying a state-dependent masking function $m(\cdot|s)$ to the output logits of the policy network. $m(\cdot|s)$ is a mapping that replaces the logits of invalid actions with a large negative number.
Figure 2. The overview of action-level masking. The action-level masking function $\varpi(\cdot|s)$ is a binary function whose values for invalid actions are zero. Action-level masking calculates the action probabilities by replacing the probabilities of invalid actions with zero and re-normalizing the probability distribution.
Figure 3. Four screenshots of Gym-μRTS with different map sizes. The grey circular units are workers. Green square grids are resource mines from which workers can extract resources to produce more units. White square grids are bases that can produce workers. We train agents to control the units at the top left to harvest resources, while the units at the bottom right remain stationary.
Figure 4. Comparison of CO-IAM, Off-PIAM, On-PIAM, Penalty-0.01, and Penalty-0.1 during training on the different maps. The top four panels show the episode return over the time steps, and the remaining four show the Kullback–Leibler (KL) divergence between the target and current policy of PPO over the time steps. The lines are the average over all runs, and the shaded area is the standard deviation.
Figure 5. Comparison of CO-IAM-MR, Off-PIAM-MR, and On-PIAM-MR, which are evaluated without providing the mask. The lines are the average over all runs, and the shaded area is the standard deviation.
Table 1. Summary of On-PIAM, Off-PIAM, and CO-IAM algorithms.
Algorithms | Behavior Policy | Target Policy | Objective | Policy Gradient | PPO Ratio
On-PIAM [23] | $\pi_\theta^\varpi$ | $\pi_\theta^\varpi$ | $J(\pi_\theta^\varpi)$ | Equation (23) | $r_t^\varpi(\theta) = \pi_\theta^\varpi(a_t|s_t)/\pi_{\theta_{old}}^\varpi(a_t|s_t)$
Off-PIAM | $\pi_\theta^\varpi$ | $\pi_\theta$ | $J(\pi_\theta)$ | Equation (27) | $r_t^{IS}(\theta) = \pi_\theta(a_t|s_t)/\pi_{\theta_{old}}^\varpi(a_t|s_t)$
CO-IAM | $\pi_\theta^\varpi$ | $\pi_\theta^\varpi$, $\pi_\theta$ | $J(\pi_\theta^\varpi) - c_\phi\Phi(\pi_\theta)$ | Equation (40) | $r_t^{CO}(\theta) = \pi_\theta(a_t|s_t)/\pi_{\theta_{old}}(a_t|s_t)$
Table 2. Results averaged over four random seeds.
Algorithms | Map | $r_{episode}$ | $a_{null}$ | $a_{busy}$ | $a_{owner}$ | $t_{solve}$ | $t_{first}$
CO-IAM | 4 × 4 | 38.50 | 0 | 0 | 0 | 4.17% | 0.03%
CO-IAM | 10 × 10 | 39.30 | 0 | 0 | 0 | 5.46% | 0.04%
CO-IAM | 16 × 16 | 39.30 | 0 | 0 | 0 | 6.21% | 0.06%
CO-IAM | 24 × 24 | 37.20 | 0 | 0 | 0 | 10.10% | 0.05%
Off-PIAM | 4 × 4 | 39.98 | 0 | 0 | 0 | 4.57% | 0.04%
Off-PIAM | 10 × 10 | 39.67 | 0 | 0 | 0 | 9.66% | 0.02%
Off-PIAM | 16 × 16 | 39.55 | 0 | 0 | 0 | 9.16% | 0.04%
Off-PIAM | 24 × 24 | 35.20 | 0 | 0 | 0 | 16.15% | 0.04%
On-PIAM [23] | 4 × 4 | 36.55 | 0 | 0 | 0 | 3.49% | 0.04%
On-PIAM [23] | 10 × 10 | 35.40 | 0 | 0 | 0 | 4.50% | 0.02%
On-PIAM [23] | 16 × 16 | 33.58 | 0 | 0 | 0 | 5.40% | 0.04%
On-PIAM [23] | 24 × 24 | 31.85 | 0 | 0 | 0 | 7.09% | 0.04%
Penalty-0.01 | 4 × 4 | 39.33 | 59.62 | 1.70 | 19.38 | 6.01% | 0.07%
Penalty-0.01 | 10 × 10 | 32.00 | 108.82 | 0.12 | 4.62 | 13.98% | 1.34%
Penalty-0.01 | 16 × 16 | 8.89 | 30.68 | 0.00 | 0.60 | - | 1.10%
Penalty-0.01 | 24 × 24 | 11.44 | 163.05 | 0.30 | 1.45 | - | 0.93%
Penalty-0.1 | 4 × 4 | 29.97 | 18.55 | 0.78 | 5.80 | 34.48% | 0.07%
Penalty-0.1 | 10 × 10 | 0.00 | 0.00 | 0.00 | 0.00 | - | 3.74%
Penalty-0.1 | 16 × 16 | 7.97 | 27.72 | 0.28 | 0.75 | - | 1.29%
Penalty-0.1 | 24 × 24 | 0.64 | 94.70 | 0.22 | 0.88 | - | 0.57%
CO-IAM-MR | 4 × 4 | 21.37 | 0.27 | 0.00 | 152.59 | - | 0.03%
CO-IAM-MR | 10 × 10 | 33.00 | 131.89 | 0.00 | 0.01 | - | 0.04%
CO-IAM-MR | 16 × 16 | 32.25 | 133.34 | 0.00 | 0.00 | - | 0.06%
CO-IAM-MR | 24 × 24 | 25.10 | 149.11 | 0.00 | 0.15 | - | 0.05%
Off-PIAM-MR | 4 × 4 | 39.88 | 9.53 | 17.81 | 1.31 | 7.22% | 0.04%
Off-PIAM-MR | 10 × 10 | 39.33 | 9.86 | 4.56 | 1.59 | 34.92% | 0.02%
Off-PIAM-MR | 16 × 16 | 38.65 | 10.19 | 1.95 | 1.01 | 25.91% | 0.04%
Off-PIAM-MR | 24 × 24 | 35.23 | 15.90 | 5.97 | 0.18 | - | 0.04%
On-PIAM-MR | 4 × 4 | 7.13 | 49.33 | 0.00 | 133.20 | - | 0.04%
On-PIAM-MR | 10 × 10 | 0.23 | 188.20 | 0.00 | 10.89 | - | 0.02%
On-PIAM-MR | 16 × 16 | 0.15 | 195.66 | 0.00 | 3.90 | - | 0.04%
On-PIAM-MR | 24 × 24 | 0.48 | 197.64 | 0.00 | 1.37 | - | 0.04%