Article

Enhancing Efficiency in Hierarchical Reinforcement Learning through Topological-Sorted Potential Calculation

Ziyun Zhou, Jingwei Shang and Yimang Li
1 School of Mechanical Engineering and Rail Transit, Changzhou University, Changzhou 213164, China
2 China Electronic Product Reliability and Environmental Testing Research Institute, Guangzhou 510610, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(17), 3700; https://doi.org/10.3390/electronics12173700
Submission received: 25 July 2023 / Revised: 19 August 2023 / Accepted: 27 August 2023 / Published: 1 September 2023
(This article belongs to the Special Issue Advances in Theories and Applications of Multi-Agent Systems)

Abstract

Hierarchical reinforcement learning (HRL) offers a hierarchical structure for organizing tasks, enabling agents to learn and make decisions autonomously in complex environments. However, traditional HRL approaches face limitations in effectively handling complex tasks. Reward machines, which specify high-level goals and associated rewards for sub-goals, have been introduced to address these limitations by facilitating the agent’s understanding and reasoning with respect to the task hierarchy. In this paper, we propose a novel approach to enhance HRL performance through topologically sorted potential calculation for reward machines. By leveraging the topological structure of the task hierarchy, our method efficiently determines potentials for different sub-goals. This topological sorting enables the agent to prioritize actions leading to the accomplishment of higher-level goals, enhancing the learning process. To assess the efficacy of our approach, we conducted experiments in the grid-world environment with OpenAI-Gym. The results showcase the superiority of our proposed method over traditional HRL techniques and reward machine-based reinforcement learning approaches in terms of learning efficiency and overall task performance.

1. Introduction

Reinforcement Learning (RL) has gained significant attention as a transformative machine learning paradigm, showcasing its prowess across diverse domains such as game playing, robotics manipulation, and motion planning [1,2,3]. However, as real-world tasks become increasingly intricate, traditional RL methodologies struggle to address their complexity effectively. Complex tasks in motion planning encompass scenarios where an agent is required to navigate through intricate environments while adhering to a set of challenging objectives. One illustrative example is a mobile robot visiting multiple locations in a specific temporal order. In such a scenario, the robot’s mission involves more than just reaching predefined locations; it must consider intricate dependencies, temporal constraints, and the sequence of actions required for successful task completion. Consider a scenario where a delivery robot is entrusted with transporting packages to different destinations within a sprawling urban environment. The robot is tasked with picking up packages from a warehouse and delivering them to various locations across the city. The complexity of this task is further heightened by the requirement to follow a specific temporal order for deliveries, reflecting real-world constraints such as time-sensitive deliveries or specific routes to optimize efficiency.
Hierarchical Reinforcement Learning (HRL) emerges as a promising approach to surmount these challenges by offering a systematic methodology for agents to navigate intricate environments. HRL structures tasks hierarchically, enabling agents to break down complex objectives into manageable sub-tasks or sub-goals [4]. This decomposition facilitates efficient learning and empowers agents to make informed decisions across different levels of abstraction. However, as the complexity of real-world tasks escalates, traditional HRL methods have difficulty capturing and learning the intricate dependencies and long-term objectives that define these tasks [5]. In particular, HRL suffers from several limitations. First, typical HRL algorithms are myopic and focus on immediate sub-tasks, neglecting long-term implications. Taking motion planning as an example, strategies might focus on targets that are easy to reach without considering longer-term objectives. Moreover, the inherent complexity of tasks such as motion planning often entails intricate dependencies that span different levels of abstraction. Traditional HRL methods struggle to capture and model these dependencies effectively.
To overcome these constraints, reward machines have emerged as a pivotal tool for shaping hierarchical task structures and guiding learning agents [6]. A notable example of reward machines is seen in robotics, where they provide a formal framework for specifying high-level goals and corresponding rewards for sub-goals, thereby aiding the agent’s understanding of task hierarchies. For instance, in a warehouse setting a reward machine can be established to represent the overarching goal of efficiently stocking shelves. This involves sub-goals such as navigating to a specific shelf, grasping items, and placing them accurately. Each sub-goal is assigned rewards and transitions, enabling the agent to acquire intelligent strategies that lead to the accomplishment of higher-level objectives. This hierarchical approach enhances exploration and learning within complex environments while maintaining interpretability and adaptability. Moreover, reward machines empower agents to master non-Markovian properties, enabling them to retain information about previously accomplished sub-tasks [6,7].
This paper introduces and assesses a novel HRL algorithm incorporating topologically sorted reward machines. Drawing upon the task hierarchy’s topological structure, our approach efficiently estimates potential values for distinct sub-goals. This topological sorting empowers the agent to prioritize actions that lead to attaining higher-level goals, thereby optimizing the learning trajectory. We validate the efficacy of our method through experiments conducted within the grid-world environment using the OpenAI Gym framework. The experimental results unequivocally demonstrate that our proposed approach surpasses traditional HRL techniques and state-of-the-art reward machine-based reinforcement learning methods regarding learning efficiency and overall task performance. Our algorithm paves the way for more adept and adaptive decision-making within hierarchical reinforcement learning by harnessing the power of topological sorting and reward machines.
The remaining sections of this paper are organized as follows: Section 2 reviews related work on reinforcement learning, including typical reinforcement learning techniques, hierarchical reinforcement learning techniques, and reward machine-based approaches. Section 3 provides an overview of the preliminaries of reinforcement learning, including typical Q-Learning and Double Deep Q Networks. Section 4 presents our main framework, which incorporates reward machines to guide hierarchical reinforcement learning using Q-learning techniques. We introduce a topological sorting-based reward-shaping algorithm to enhance learning efficiency in circular tasks. Section 5 evaluates the proposed method in discrete state spaces and compares its performance against typical reward machine-based approaches. The experimental results demonstrate the faster convergence and higher rewards achieved by our approach. Finally, Section 6 concludes our work and discusses potential future research directions.

2. Related Work

The application of RL in robotics has experienced substantial growth, demonstrating its effectiveness in intricate real-world scenarios such as multi-constraint, multi-scale motion planning [8], and multi-step object sorting [9] tasks. These successes underscore the potential of RL in addressing challenging environments and tasks. However, a more comprehensive examination of the existing approaches highlights both their advancements and limitations, providing a nuanced understanding of the current state of the field.
The advancements in deep reinforcement learning (DRL) enabling control policies to be learned directly from high-dimensional sensory input [2] are undoubtedly groundbreaking. The introduction of lightweight frameworks for DRL employing asynchronous gradient descent for optimizing deep neural network controllers [10] holds the promise of more efficient learning algorithms. Adapting DQN algorithms to address overestimation issues, leading to improved performance across various games [11], is commendable in addressing known challenges. However, while these advancements are substantial, a more critical analysis could delve into potential pitfalls, such as biases in learning from high-dimensional sensory input and challenges in generalizing policies to diverse environments.
Traditionally, RL algorithms design reward functions by optimizing accumulated rewards; however, this approach fails to capture the history-dependent temporal properties intrinsic to complex tasks [12]. Recent efforts to integrate RL with Linear Temporal Logic (LTL) rewards reflect a shift toward addressing these temporal intricacies. While using synthesized LTL strategies to construct non-Markovian reward functions is promising, translating formal logic into effective reward design poses a challenge. Several studies have focused on integrating RL with LTL rewards to enable agents to make multi-stage sequential decisions through reinforcement learning within a stochastic environment modeled as a Markov Decision Process (MDP) while satisfying the corresponding properties [13,14]. Broadly, these approaches involve the conversion of LTL formulas into automata, such as Deterministic Rabin Automata (DRA) or Limit-Deterministic Büchi Automata (LDBA), and the integration of these automata with the MDP of the system. This creates a product MDP encompassing behaviors that meet both the MDP and LTL specifications. Approaches utilizing DRA and LDBA primarily employ the reachability probability of the product MDP to design the rewards [15,16]. However, there remains a need for a comprehensive evaluation of these approaches in various scenarios in order to ascertain their effectiveness, especially considering the inherent complexities of real-world environments.
Moreover, the Q-learning with Reward Machines (QRM) approach proposed by Icarte et al. [6] presents a constructive direction for task decomposition and reward structure. However, its reliance on transitions within reward machines for reward design raises questions about its effectiveness in capturing nuanced experiences. The introduction of the Counterfactual Experiences for Reward Machine (CRM) approach addresses this limitation by incorporating counterfactual experiences into experience replay. Nevertheless, the applicability of CRM across a broader spectrum of RL algorithms and environments necessitates deeper investigation in order to assess its generalizability.
In the context of multi-agent RL, cooperative task decomposition combined with learning reward machines [17] and individual reward machine decomposition for enhanced learning efficiency [7] display promising potential. However, these approaches should be evaluated in complex multi-agent scenarios in order to determine their resilience, scalability, and potential shortcomings.
In conclusion, the presented literature showcases remarkable advancements in RL applications in robotics, addressing intricate challenges with varying degrees of success. However, a comprehensive and critical analysis reveals that while these approaches are commendable, a deeper understanding of their limitations, potential biases, and practical feasibility across diverse environments is warranted. The identified gaps present valuable opportunities for further research and innovation in refining RL methodologies for complex tasks in robotics.

3. Preliminaries

In reinforcement learning (RL), an agent learns to interact with a dynamic environment through trial and error [18]. The environment is typically modeled as an MDP represented by $\mathrm{MDP} = \langle S, s_0, A, P, R, \gamma \rangle$, where $S$ is a finite set of states, $s_0 \in S$ denotes the initial state, $A$ represents a finite set of actions, $P : S \times A \to \mathrm{Dist}(S)$ is the transition probability distribution, $R : S \times A \times S \to \mathbb{R}$ denotes the rewards assigned to state transitions, and $\gamma$ is the discount factor. The policy of the agent, $\lambda : S \to \mathrm{Dist}(A)$, represents the probability distribution of actions $a \in A$ given a current state $s \in S$. At each step, the agent selects an action $a \sim \lambda(s)$ based on the current state $s$ and receives a reward $R(s, a, s')$, where $s' \sim P(s, a)$. The agent’s goal is to find the optimal policy $\lambda^*$ that maximizes the expected discounted reward. Q-learning and Deep Q-Learning are two popular algorithms in RL.
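For readers who prefer code to notation, the MDP tuple above can be captured in a small data structure. The following Python sketch is purely illustrative; the class and field names are assumptions and do not come from the paper.
```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = int
Action = int

@dataclass
class MDP:
    """Minimal container for the tuple <S, s0, A, P, R, gamma> (illustrative)."""
    states: List[State]                                  # S, finite set of states
    s0: State                                            # initial state
    actions: List[Action]                                # A, finite set of actions
    P: Callable[[State, Action], Dict[State, float]]     # P : S x A -> Dist(S)
    R: Callable[[State, Action, State], float]           # R : S x A x S -> reward
    gamma: float                                         # discount factor
```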

3.1. Q-Learning

Q-learning is a model-free RL algorithm that enables the agent to learn an optimal policy λ * for sequential decision-making tasks [19]. It is based on the concept of a Q-value, which represents the expected return or cumulative reward an agent receives by taking a specific action in a given state. The Q-value is updated iteratively using the Bellman equation, which establishes a relationship between the current state-action pair and the expected future rewards.
The Q-learning algorithm utilizes the Q-function, denoted as $Q : S \times A \to \mathbb{R}$, to assess the value of state–action pairs. Initially, the Q-function is initialized with arbitrary values. At each time step $t$, the agent selects an action $a_t \in A$, transitions to the next state $s_{t+1}$, and receives the corresponding reward $R(s_t, a_t, s_{t+1})$. The Q-function is then updated using Equation (1), where $\alpha \in [0, 1]$ represents the learning rate. This update rule incorporates the immediate reward, the discounted maximum Q-value of the next state–action pair, and the current estimate of the Q-value.
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R(s_t, a_t, s_{t+1}) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \qquad (1)$$
This iterative process of updating the Q-values converges towards the optimal Q-values, which follow the Bellman equation shown in Equation (2). The Bellman equation describes the relationship between the Q-values of a state–action pair and the expected discounted future rewards of the subsequent state–action pairs.
$$Q^*(s, a) = R(s, a, s') + \gamma \sum_{s' \in S} P(s' \mid s, a) \max_{a'} Q^*(s', a') \qquad (2)$$
In this equation, $Q^*(s, a)$ represents the optimal Q-value of state $s$ and action $a$, $R(s, a, s')$ denotes the immediate reward received from transitioning from state $s$ to state $s'$ by taking action $a$, $P(s' \mid s, a)$ represents the transition probability from state $s$ to state $s'$ given action $a$, and $\gamma$ represents the discount factor indicating the importance of future rewards. By iteratively solving the Bellman equation, the agent determines the optimal Q-values, which in turn define the optimal policy for maximizing the cumulative expected discounted reward.
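To make the update in Equation (1) concrete, the following is a minimal tabular Q-learning sketch. It assumes the classic OpenAI Gym interface in which reset() returns an integer state and step() returns (state, reward, done, info); all names are illustrative and this is not the authors' implementation.
```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning using the update rule of Equation (1)."""
    Q = np.zeros((n_states, n_actions))            # arbitrary initialization
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            td_target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```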

3.2. Deep Q-Learning

Deep Q-Learning is an extension of Q-learning that incorporates deep neural networks to handle complex and high-dimensional state spaces. Traditional Q-learning methods may struggle to represent and learn from large state spaces efficiently. Deep Q-learning addresses this limitation by using a function approximator, usually a deep neural network, to estimate the Q-values.
In Deep Q-learning, two neural networks are utilized as function approximators to estimate the Q-function: the Q-network, denoted as $Q(s, a; \theta_i)$, and the target network, represented as $\hat{Q}(s, a; \theta_{i-1})$ [2]. Here, $\theta_i$ and $\theta_{i-1}$ denote the network parameters at training iteration $i$. The Q-network and target network are initialized such that $\hat{Q} = Q$. At each step $t$, the agent in state $s_t \in S$ selects an action $a_t \in A$, transitions to state $s_{t+1}$, and receives a reward $R(s_t, a_t, s_{t+1})$. This transition is then stored in the replay buffer $\mathcal{D}$. Subsequently, a random minibatch of transitions is sampled from $\mathcal{D}$ to compute the target values in Equation (3). The Q-network with parameters $\theta_i$ is then trained by minimizing the loss function defined in Equation (4) [2]. After a certain number of steps $\tau$, the parameters of the Q-network are copied to the target network.
$$y_i = R_i + \gamma \max_{a_{i+1}} \hat{Q}(s_{i+1}, a_{i+1}; \theta_{i-1}) \qquad (3)$$
$$L_i(\theta_i) = \mathbb{E}\left[ \left( R_i + \gamma \max_{a_{i+1}} \hat{Q}(s_{i+1}, a_{i+1}; \theta_{i-1}) - Q(s, a; \theta_i) \right)^2 \right] \qquad (4)$$
Random estimation errors can lead to overestimation in Q-learning, which in turn introduces an upward bias into DQN [2]. To address this issue, Van Hasselt et al. proposed Double DQN (2DQN), which mitigates overestimation by decoupling the max operator into action selection and action evaluation [11]. In 2DQN, the Q-network is used to determine the action $a$ that maximizes the Q-value instead of relying on the target network. Consequently, the loss function in Equation (4) is modified to Equation (5).
$$L_i(\theta_i) = \mathbb{E}\left[ \left( R_i + \gamma \hat{Q}\bigl(s_{i+1}, \operatorname*{argmax}_{a} Q(s_{i+1}, a; \theta_i); \theta_{i-1}\bigr) - Q(s, a; \theta_i) \right)^2 \right] \qquad (5)$$
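The difference between the targets in Equations (4) and (5) can be made concrete with a short PyTorch-style fragment. This is an illustrative sketch under the assumption that q_net and target_net are callable networks returning per-action Q-values for a minibatch; it is not the authors' implementation, and terminal-state masking is omitted for brevity.
```python
import torch

def dqn_targets(q_net, target_net, rewards, next_states, gamma, double=True):
    """Compute bootstrapped targets y_i for a sampled minibatch.

    double=False follows Equations (3)/(4): the target network both selects and
    evaluates the next action. double=True follows Equation (5): the online
    Q-network selects the action and the target network evaluates it.
    """
    with torch.no_grad():
        next_q_target = target_net(next_states)          # Q_hat(s', . ; theta_{i-1})
        if double:
            next_a = q_net(next_states).argmax(dim=1, keepdim=True)  # argmax_a Q(s',a;theta_i)
            next_q = next_q_target.gather(1, next_a).squeeze(1)
        else:
            next_q = next_q_target.max(dim=1).values
    return rewards + gamma * next_q
```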

3.3. Reward Shaping

In many RL scenarios, the rewards provided by the environment are sparse, meaning they are only granted upon achieving specific goals or completing tasks. Sparse rewards can make learning difficult and result in slow convergence for RL agents. Reward shaping is a technique in RL that involves modifying or shaping the environment’s reward function to provide additional intermediate rewards that help guide the learning process. These shaped rewards can make learning more efficient, improve exploration, and incorporate prior knowledge into the RL agent’s decision-making process. One popular approach is potential-based reward shaping, which assigns a potential function $\phi$ to the states of the MDP. The potential function captures the desirability or value of each state based on domain knowledge or prior understanding of the problem. The reward function is then designed to encourage the agent to transition from states with lower potential values to states with higher potential values. By designing the reward function according to Equation (6), transitions from states with high potential values to those with low potential values are discouraged.
$$R(s_t, a_t, s_{t+1}) \leftarrow R(s_t, a_t, s_{t+1}) + \gamma \phi(s_{t+1}) - \phi(s_t) \qquad (6)$$
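The shaping rule of Equation (6) can be applied as a thin wrapper around any base reward function. The sketch below is a minimal illustration; the function names and the potential function phi are assumptions supplied by the caller.
```python
def shaped_reward(R, phi, gamma):
    """Wrap a base reward function with potential-based shaping (Equation (6)).

    R     : base reward function R(s, a, s_next)
    phi   : potential function over states
    gamma : discount factor
    """
    def R_shaped(s, a, s_next):
        # F(s, s') = gamma * phi(s') - phi(s) is the shaping term
        return R(s, a, s_next) + gamma * phi(s_next) - phi(s)
    return R_shaped
```
Because the shaping term telescopes along trajectories, this form of shaping is known to leave the optimal policy of the original MDP unchanged while providing denser feedback.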

4. Methodology

In traditional reinforcement learning, rewards are typically designed as scalar values that provide immediate feedback to the agent. However, in real-world scenarios rewards may be more complex and depend on multiple factors or sequential patterns.
One advantage of using reward machines for reinforcement learning is their ability to represent complex and structured reward functions. Reward machines can capture these complex reward structures by providing a more expressive representation. They allow the designer to define rewards based on the agent’s interaction with the environment, taking into account different states, actions, and temporal dependencies. This flexibility enables the formulation of more nuanced and powerful reward functions that can capture long-term goals or optimize for specific behaviors in a more explicit manner.
Another advantage of reward machines is their ability to handle non-Markovian reward functions. In traditional reinforcement learning rewards are usually assumed to be Markovian, meaning that they depend only on the current state of the environment. However, in many real-world scenarios, rewards may depend on historical states or previous actions. Reward machines can address this challenge by incorporating memory and allowing the agent to reason about past experiences, making them suitable for tasks in which rewards have a temporal or sequential component.
Furthermore, reward machines provide a modular and hierarchical framework for defining rewards. They can be composed and combined to create more complex reward functions from simpler components. This modular design facilitates the reusability and composition of reward structures, enabling a more efficient and systematic approach to reward specification.
Consider the grid-world depicted in Figure 1, which exhibits random task locations denoted as A, B, C, and D. In this environment, agents are capable of traversing in four cardinal directions to access the designated task locations. The objective of the agent entails sequentially visiting A, B, C, and D in accordance with the high-level plan provided above. Drawing inspiration from prior research conducted by Icarte et al. [6], we propose a hierarchical reinforcement learning approach that employs topologically sorted potential calculation. This methodology serves to effectively acquire policies conducive to motion planning.

4.1. Framework

In this study, we employ the provided high-level plan as the foundation for constructing a reward machine that facilitates the learning of non-Markovian goals using RL. The architectural depiction of the overall framework for hierarchical reinforcement learning employing topologically sorted potential calculation is visually represented in Figure 1.
The reward machine serves as an abstraction of the environment states and offers guidance on the transitions between these abstract states. This characteristic is instrumental in enabling the agents to effectively account for non-Markovian properties inherent to the task. By following the reward machine, the agents are equipped to complete the tasks in accordance with the specific temporal order prescribed.
To utilize the high-level plan, a potential-based reward machine is initially constructed. By integrating the MDP framework with this reward machine, we can proceed with hierarchical reinforcement learning, thereby training the agents in a more efficient manner. This approach allows for the capture of experiences from the abstract reward machine, enabling the agents to effectively learn and optimize their actions.

4.2. Topological Sorting-Based Reward Machine

Inspired by [6], we initially employ high-level plans to construct the reward machine as described in Definition 1. In the definition, the reward machine is represented by a six-item tuple consisting of $E$, $e_0$, $F$, $A$, $\delta_e$, and $\delta_a$. Here, $E$ denotes a finite set of states, $e_0$ represents the initial state, $F$ denotes the set of accepting states, $A$ denotes the set of actions, $\delta_e$ denotes the transition function between states, and $\delta_a$ denotes the state output function responsible for outputting the actions.
Definition 1.
The reward machine $\mathcal{M}$ is defined as a tuple $\langle E, e_0, F, A, \delta_e, \delta_a \rangle$, where $E$ is a finite set of states, $e_0 \in E$ is the initial state, $F$ is the set of accepting states, $A$ is the set of actions, $\delta_e : E \times A \to E$ is the transition function between states, and $\delta_a : E \to A$ is the state output function that outputs the actions.
We then proceed to extend the reward machine by assigning potential values to each state, as outlined in Definition 2 and in accordance with [3]. By introducing the reward function $\delta_r$ and the potential function $\phi$ into $\mathcal{M}$, denoted as $\mathcal{N}$, we can calculate $\delta_r[e][\delta_a[e]]$ and $\phi[e]$ based on the status of the current state $e$ and the associated atomic proposition $\delta_a[e]$. In the case that the next state in the reward machine belongs to the set of accepting states, formally denoted as $\delta_e[e][\delta_a[e]] \in F$, we assign $\delta_r[e][\delta_a[e]]$ a constant reward $\mathcal{R}$, and $\phi[e]$ is assigned the same constant reward $\mathcal{R}$. It is worth noting that the reward for the state transition and the potential function are represented by the same scalar. The accepting states serve as indicators that the corresponding tasks have been satisfied. Conversely, if the next state does not belong to the accepting states, we assign $\delta_r[e][\delta_a[e]]$ a reward of 0 and $\phi[e]$ a value within the range $(0, \mathcal{R})$, as shown in Equations (7) and (8).
Definition 2.
Given the reward machine $\mathcal{M}$, the potential-based reward machine $\mathcal{N}$ can be defined as a seven-item tuple $\langle E, e_0, F, A, \delta_e, \delta_r, \phi \rangle$, where $E$ is the finite set of states, $e_0$ is the initial state, $F$ is the set of accepting states, $A$ is the set of actions, $\delta_e : E \times A \to E$ is the transition function between states, $\delta_r : E \times A \to \mathbb{R}$ is the reward function of the state with respect to the transition function, $\phi : E \to \mathbb{R}$ is the potential function of the states, and $\delta_r$ and $\phi$ satisfy the following conditions:
$$\delta_r[e][\delta_a[e]] = \begin{cases} 0 & \text{if } \delta_e[e][\delta_a[e]] \notin F \\ \mathcal{R} & \text{if } \delta_e[e][\delta_a[e]] \in F \end{cases} \qquad (7)$$
$$\phi[e] = \begin{cases} (0, \mathcal{R}) & \text{if } \delta_e[e][\delta_a[e]] \notin F \\ \mathcal{R} & \text{if } \delta_e[e][\delta_a[e]] \in F \end{cases} \qquad (8)$$
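Definitions 1 and 2, together with Equations (7) and (8), can be summarized in a small data structure. The sketch below is one possible encoding; the class and field names are assumptions, and the constant reward value is a placeholder.
```python
class PotentialRewardMachine:
    """Minimal encoding of the potential-based reward machine N (Definition 2)."""

    def __init__(self, states, e0, accepting, delta_e, delta_a, R=100.0):
        self.states = states          # E, finite set of machine states
        self.e0 = e0                  # initial state
        self.accepting = accepting    # F, set of accepting states
        self.delta_e = delta_e        # dict: (e, proposition) -> next state e'
        self.delta_a = delta_a        # dict: e -> atomic proposition pursued in e
        self.R = R                    # constant reward R (value assumed)
        self.delta_r = {}             # dict: (e, proposition) -> reward, Equation (7)
        self.phi = {}                 # dict: e -> potential in (0, R], Equation (8)

    def init_rewards(self):
        """Assign transition rewards per Equation (7); phi is filled by Algorithm 1."""
        for e in self.states:
            prop = self.delta_a[e]
            e_next = self.delta_e.get((e, prop), e)
            self.delta_r[(e, prop)] = self.R if e_next in self.accepting else 0.0

    def step(self, e, proposition):
        """Advance the machine; stay in e if the proposition is outside delta_e's domain."""
        return self.delta_e.get((e, proposition), e)
```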
We calculate the potential function, denoted as $\phi$, for each state in the reward machine using a topological sorting approach. This process is described in Algorithm 1. The algorithm begins by transforming the reward machine, denoted as $\mathcal{N}$, into a graph, denoted as $G$. The variable loopSize represents the number of vertices within a loop in the graph. Additionally, the algorithm initializes an empty sorted list, referred to as sortedList, to store the vertices sorted in topological order. It makes use of a set, denoted as visited, to keep track of visited vertices, and a dictionary, denoted as $\phi$, to store potential values for each vertex.
To assist in assigning potential values, the algorithm includes a helper function called AssignPotential. This function assigns potential values to a vertex based on its position within the loop and its distance from the end of the loop. The algorithm also includes a recursive helper function, referred to as TopologicalSortUtil, which performs the topological sorting and assigns potential values accordingly. This helper function begins by marking the current vertex as visited and checks whether the vertex is part of a loop. If it is, the AssignPotential function is called to assign a potential value. Then, the helper function recursively calls itself for each unvisited adjacent vertex. Finally, the helper function inserts the current vertex at the beginning of the sorted list.
Algorithm 1: Pseudocode for topologically sorted potential calculation of $\mathcal{N}$
The main topological sort function is referred to as TopologicalSort. This function iterates through all the vertices in the graph and calls the TopologicalSortUtil function for each unvisited vertex.
Finally, the function TopologicalSortWithPotential is defined, which calls the TopologicalSort function and returns the sorted list of vertices and the potential dictionary. This function enables the calculation of the potential function for each state in the reward machine.
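Since Algorithm 1 is provided only as pseudocode, the following Python sketch reconstructs its structure from the description above (TopologicalSortUtil, AssignPotential, TopologicalSortWithPotential). The exact potential values assigned inside loops and to off-loop vertices are assumptions, here spaced evenly in $(0, \mathcal{R})$; this is an illustration, not the authors' implementation.
```python
def topological_sort_with_potential(graph, loop_vertices, accepting, R=100.0):
    """Reconstruction of Algorithm 1: topologically sort the reward-machine graph
    and assign a potential phi to every vertex.

    graph         : dict mapping vertex -> list of adjacent vertices (derived from N)
    loop_vertices : ordered list of vertices lying on a loop in the graph
    accepting     : set of accepting states F (assigned potential R, Equation (8))
    """
    loop_size = len(loop_vertices)
    sorted_list, visited, phi = [], set(), {}

    def assign_potential(v):
        # Per the text: potential depends on the vertex's position within the loop
        # and its distance from the loop's end; even spacing in (0, R) is assumed.
        position = loop_vertices.index(v)
        phi[v] = R * (position + 1) / (loop_size + 1)

    def topological_sort_util(v):
        visited.add(v)
        if v in accepting:
            phi[v] = R                       # accepting case of Equation (8)
        elif v in loop_vertices:
            assign_potential(v)              # vertex lies on a loop
        for w in graph.get(v, []):
            if w not in visited:
                topological_sort_util(w)
        sorted_list.insert(0, v)             # prepend: reverse post-order

    for v in graph:
        if v not in visited:
            topological_sort_util(v)

    # Remaining non-accepting vertices also need phi in (0, R) per Equation (8);
    # spacing them by topological rank is an assumption.
    for rank, v in enumerate(sorted_list):
        phi.setdefault(v, R * (rank + 1) / (len(sorted_list) + 1))
    return sorted_list, phi
```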

4.3. Integrating Reward Machine with MDP

The application of MDP within reinforcement learning is a common means of modeling the environment. However, the inherent Markov property of MDP presents limitations to reinforcement learning by restricting future states to depend solely on the current state and disregarding any historical information. In this paper, we aim to extend the MDP framework, denoted as P , by incorporating reward machines as outlined in Definition 2 to provide rewards to agents as specified in Definition 3.
To integrate the reward machines into the MDP, it is assumed that both $\mathcal{P}$ and $\mathcal{N}$ share the same labeling function, denoted as $L : S \times A \times S \to A$. Whenever the tuple $\langle e, L(s, a, s') \rangle$ is within the domain of $\delta_e$, $\mathcal{N}$ transitions from state $e$ to $\delta_e(e, L(s, a, s'))$. Conversely, if the tuple is not within the domain, the reward machine remains in state $e$.
The transition probability distribution, denoted as $\hat{P}$, remains unchanged, as it continues to be defined over $S \times A$. Similarly, the extended reward function, denoted as $\hat{R}$, undergoes modification when the next state in the reward machine is evaluated. In cases where the next state is an accepting state, $\hat{R}$ is updated solely with the reward function $\delta_r$, as indicated in Equation (9). However, if the next state is not an accepting state, then $\hat{R}$ is adjusted using both $\delta_r$ and $\phi$. In order to balance the impact of the reward function and the potential function, it is assumed that $\delta_r$ and $\phi$ are on the same scale.
To summarize, the proposed extension of integrating reward machines with the MDP framework facilitates the modeling of complex reward structures and incorporates additional historical context beyond the current state. This integration enhances the capabilities of reinforcement learning algorithms for various applications.
Definition 3.
Given the MDP $\mathcal{P}$ and the reward machine $\mathcal{N}$, we say that $\mathcal{P}$ is extended with $\mathcal{N}$ as $\mathcal{P}_{\mathcal{N}} = \langle \hat{S}, \hat{s}_0, A, \hat{P}, \hat{R}, \gamma \rangle$ when $\mathcal{P}$ and $\mathcal{N}$ share the same labeling function $L : S \times A \times S \to A$, where
  • $\hat{S} = S \times E$
  • $\hat{s}_0 = \langle s_0, e_0 \rangle$
  • $\hat{P}(\langle s', e' \rangle \mid \langle s, e \rangle, a) = P(s' \mid s, a)$
  • $\hat{R}(\langle s, e \rangle, a, \langle s', e' \rangle)$ is defined for all $\langle s, e \rangle \in S \times E$ and $a \in A$ such that
    $$\hat{R}(\langle s, e \rangle, a, \langle s', e' \rangle) = \begin{cases} \delta_r(e, L(s, a, s')) & \text{if } e' \in F \\ \delta_r(e, L(s, a, s')) + \gamma \phi(e') - \phi(e) & \text{if } e' \notin F \end{cases} \qquad (9)$$
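A minimal sketch of the product construction in Definition 3 is given below: the environment state is paired with the reward-machine state, and the extended reward $\hat{R}$ combines $\delta_r$ with the potential-shaping term. It reuses the PotentialRewardMachine sketch above, assumes the classic Gym step signature, and treats the labeling function L and all names as illustrative assumptions.
```python
def product_step(env, rm, s, e, a, L, gamma):
    """One transition of the product MDP P_N (Definition 3).

    env : underlying environment (P-hat keeps the original dynamics)
    rm  : potential-based reward machine N (delta_e, delta_r, phi, accepting)
    L   : labeling function L(s, a, s_next) -> atomic proposition
    """
    s_next, _, done, info = env.step(a)       # environment reward is ignored;
    prop = L(s, a, s_next)                    # rewards come from the machine
    e_next = rm.step(e, prop)                 # delta_e, or stay in e if undefined
    r = rm.delta_r.get((e, prop), 0.0)
    if e_next not in rm.accepting:
        # shaping term gamma*phi(e') - phi(e) is added only for non-accepting e'
        r += gamma * rm.phi[e_next] - rm.phi[e]
    return (s_next, e_next), r, done, info
```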
Based on $\mathcal{P}_{\mathcal{N}}$, we propose a hierarchical reinforcement learning (HRL) approach to enhance the learning efficiency of the agents, as illustrated in Algorithm 2. The algorithm takes several inputs, including the Markov decision process $\mathcal{P}_{\mathcal{N}}$, the discount factor $\gamma \in (0, 1]$, the learning rate $\alpha \in (0, 1]$, and the exploration rate $\epsilon \in (0, 1]$.
The Q-values $Q(s, e, a)$ and $Q(e, a)$ are initialized arbitrarily. Subsequently, the algorithm employs a policy derived from the Q-values $Q(e, a)$ to select a high-level option following an $\epsilon$-greedy exploration strategy. Similarly, a low-level action $a$ is chosen using a policy derived from the Q-values $Q(s, e, a)$, again employing an $\epsilon$-greedy approach.
After selecting action $a$, the algorithm executes it and observes the resulting next state $\langle s', e' \rangle$. Subsequently, the algorithm computes the reward $\delta_r(e, a)$ for the current option–action pair $(e, a)$ and evaluates the extended reward $\hat{R}(\langle s, e \rangle, a, \langle s', e' \rangle)$ for the extended state–action–next-state tuple.
Furthermore, the algorithm checks whether the next option $e'$ is an accepting state (i.e., $e' \in F$), which signifies a terminal state for the current option. In such cases, the algorithm updates the Q-value $Q(s, e, a)$ with the extended reward $\hat{R}$ and updates the Q-value $Q(e, a)$ with the reward $\delta_r$. On the other hand, if the next option $e'$ is not an accepting state, the algorithm proceeds to update the Q-values accordingly.
Algorithm 2: Pseudocode for hierarchical reinforcement learning with $\mathcal{P}_{\mathcal{N}}$
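Because Algorithm 2 is likewise provided as pseudocode, the sketch below reconstructs its main loop from the description: a high-level Q-table $Q(e, a)$ over reward-machine options and a low-level Q-table $Q(s, e, a)$ over environment actions, both updated with $\epsilon$-greedy exploration. It reuses product_step from the previous sketch; the termination handling, helper names, and the indexing of reward-machine states as 0..n_options-1 are assumptions, so this should be read as an illustration rather than the authors' code.
```python
import numpy as np

def hrl_with_reward_machine(env, rm, L, n_states, n_actions,
                            episodes=1000, gamma=0.95, alpha=0.1, epsilon=0.1):
    """Reconstruction of Algorithm 2: hierarchical Q-learning on the product MDP P_N."""
    n_options = len(rm.states)                            # assumes states are 0..n_options-1
    Q_low = np.zeros((n_states, n_options, n_actions))    # Q(s, e, a)
    Q_high = np.zeros((n_options, n_actions))             # Q(e, a) over option-level choices

    def eps_greedy(q_row, n):
        return np.random.randint(n) if np.random.rand() < epsilon else int(np.argmax(q_row))

    for _ in range(episodes):
        s, e, done = env.reset(), rm.e0, False
        while not done:
            a = eps_greedy(Q_low[s, e], n_actions)         # low-level action
            (s_next, e_next), r_hat, done, _ = product_step(env, rm, s, e, a, L, gamma)
            r_opt = rm.delta_r.get((e, rm.delta_a[e]), 0.0)

            # Low-level update uses the extended reward R_hat (Definition 3)
            target_low = r_hat + gamma * np.max(Q_low[s_next, e_next]) * (not done)
            Q_low[s, e, a] += alpha * (target_low - Q_low[s, e, a])

            # High-level update uses delta_r; accepting states terminate the task
            if e_next in rm.accepting:
                Q_high[e, a] += alpha * (r_opt - Q_high[e, a])
                done = True
            elif e_next != e:
                target_high = r_opt + gamma * np.max(Q_high[e_next])
                Q_high[e, a] += alpha * (target_high - Q_high[e, a])

            s, e = s_next, e_next
    return Q_low, Q_high
```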

5. Experiment and Evaluation

5.1. Experiment Setup

This section presents empirical evaluations of the proposed algorithms in two distinct experimental settings. The implemented algorithms, Algorithms 1 and 2, were integrated into the OpenAI Gym environment [20] for evaluation. OpenAI Gym is designed to provide a standardized environment for developing and testing reinforcement learning algorithms, and offers a collection of environments that simulate various tasks and scenarios. The experiments were conducted using a grid-world task structure, where the size of the map was set to 50 × 50 . The task locations within the grid world were generated randomly, resulting in four different tasks with unique high-level goals, as illustrated in Figure 2.
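The experimental environment is not released with the paper, so the following is a hedged sketch of how a 50 × 50 grid world with randomly placed task locations A–D could be set up on top of OpenAI Gym's interface. The class, the labeling function, and the zero environment reward (all task rewards come from the reward machine) are assumptions; the classic Gym API with step() returning (obs, reward, done, info) is assumed as well.
```python
import numpy as np
import gym
from gym import spaces

class GridWorldEnv(gym.Env):
    """Illustrative 50x50 grid world with randomly placed task locations A-D."""

    def __init__(self, size=50, seed=0):
        super().__init__()
        self.size = size
        self.rng = np.random.default_rng(seed)
        self.action_space = spaces.Discrete(4)                  # up, down, left, right
        self.observation_space = spaces.Discrete(size * size)
        # Random, distinct cells for the propositions A, B, C, D
        cells = self.rng.choice(size * size, size=4, replace=False)
        self.tasks = dict(zip("ABCD", [(c // size, c % size) for c in cells]))

    def reset(self):
        self.pos = (0, 0)
        return self._obs()

    def step(self, a):
        dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        return self._obs(), 0.0, False, {}                      # rewards come from the RM

    def _obs(self):
        return self.pos[0] * self.size + self.pos[1]

    def label(self, s, a, s_next):
        """Labeling function L(s, a, s'): the task letter at the new cell, or ''."""
        cell = divmod(s_next, self.size)
        for letter, loc in self.tasks.items():
            if loc == cell:
                return letter
        return ""
```
An instance of this environment, combined with the PotentialRewardMachine sketch and env.label as the labeling function L, would plug into the hierarchical learning loop sketched after Algorithm 2.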
The primary objective of these experiments was to assess the performance of a single agent navigating within the grid-world and successfully reaching the specified tasks associated with their respective high-level goals. It is worth noting that based on the visual representation provided in Figure 2 the first two tasks do not contain any loops, while the third and fourth tasks feature internal loops within the task structure.
In order to evaluate the effectiveness of the proposed Hierarchical Reinforcement Learning (HRL) approach with topologically sorted reward machine (HRM-TS), we compared it against three state-of-the-art baseline algorithms. These baseline algorithms included the typical Q-learning (QL) algorithm [19], the typical HRL with reward machine (HRM) algorithm [6], and QL with topologically sorted reward machine (QL-TS).
To conduct a comprehensive evaluation, we compared the learning curves and the optimal rewards achieved by these algorithms in two different settings.
The first setting involved different tasks implemented on the same map, with identical task locations. The purpose of this setting was to assess how effectively the algorithms can adapt to different tasks while operating in a fixed environment. In the second setting, we considered a scenario in which the task locations varied while the tasks themselves remained the same, allowing us to investigate the algorithms’ ability to generalize their learned policies across different task configurations.

5.2. Simulation Results under Different High-Level Task Goals

A comprehensive evaluation of the proposed methodologies was undertaken in the conducted experiments, shedding light on their performance across various scenarios. The first experiment aimed to compare the effectiveness of the four algorithms against four distinct high-level goals. Figure 3 provides a visual representation of the learning curves attained by each algorithm when confronted with different task goals. Notably, all experiments were executed on an identical map with consistent task locations, ensuring a controlled environment for comparison. We mainly examine the average reward per episode as the rewards attain convergence throughout the training iterations.
Evidently, the HRM and HRM-TS approaches outshine the QL and QL-TS algorithms in terms of learning speed and efficiency. This disparity underscores the superiority of HRL over the traditional Q-Learning approach. The ability of HRL to capitalize on the high-level reward machine for accelerated learning becomes evident through these outcomes. As exemplified in the initial depiction in Figure 3, both the HRM and HRM-TS methodologies demonstrate the capacity to attain an average reward per episode nearing 1400. In contrast, the QL approach merely attains 600, and the QL-TS approach reaches only 1200 upon convergence. It is noteworthy that this trend extends uniformly across all the considered task objectives.
Furthermore, the utilization of topological sorting techniques introduces a significant advantage. The HRM-TS algorithm consistently exhibits superior performance in comparison to the HRM algorithm that lacks such sorting techniques. This showcases the substantial impact of incorporating topological sorting in enhancing learning efficiency within hierarchical reward machines.
An intriguing aspect of the results pertains to scenarios in which high-level goals involve loops. In instances where loops are absent, the HRM and HRM-TS approaches yield comparable optimal rewards. For example, in the first two figures of Figure 3 HRM and HRM-TS converge to the same average reward; however, when loops are present within the high-level goals, the HRM approach exhibits a tendency to converge towards suboptimal rewards. In striking contrast, the HRM-TS algorithm excels by attaining higher rewards in scenarios characterized by loop-containing high-level objectives. This observation underscores the added value of incorporating topological sorting techniques, as the HRM-TS algorithm manages to circumvent the suboptimal convergence observed in the HRM approach.
In summary, the results of the first experiment underscore the prowess of HRL compared to traditional Q-Learning, with the HRM and HRM-TS approaches demonstrating superior learning speed. The implementation of topological sorting techniques further amplifies the advantages of hierarchical reward machines, ensuring more robust and efficient learning, especially in cases involving complex high-level goals with loops. These findings substantiate the effectiveness of the proposed methodologies and highlight the potential of HRL with topological sorting to surmount the challenges posed by intricate task structures.

5.3. Simulation Results on Different Maps

In the second experiment, we focused on the third task as the target task and compared the performance of the four algorithms with different task locations. This choice of experiment stemmed from the consideration that the grid-world environment settings can significantly impact the learning efficiency of the algorithms.
For instance, if a task requires the agent to reach point B after reaching point A, and if A and B are located in close proximity on the map, it tends to be relatively easier for the agents to learn the policy to navigate from A to B. However, if A and B are situated far apart on a different map, the learning process may become more challenging due to the complexities introduced by distant task locations and intricate task goals.
The comprehensive evaluation of the algorithms, as depicted in Figure 4, corroborates the trends observed in the initial experiment. Notably, the HRM-TS approach maintains its superiority across diverse task locations compared to the other three algorithms. The consistency of these outcomes underscores the robustness and effectiveness of the HRM-TS approach. A notable observation is the persistent pattern of faster learning rates exhibited by the HRL-based algorithms when juxtaposed against their QL-based counterparts. For example, on the third map HRM-TS can converge to the average reward of 170, while HRM can only converge to 125. This consistency in accelerated learning demonstrates the inherent advantages of HRL in expediting the acquisition of optimal strategies.
An intriguing facet of these results is the HRM algorithm’s susceptibility to converging towards suboptimal strategies, particularly when tasks involve high-level goals characterized by loops. This observation reinforces the issue of the HRM algorithm’s efficacy being compromised when confronted with intricate task structures, especially those involving cyclic dependencies.
In summary, as illustrated in Figure 4, the comparison outcomes not only substantiate the preeminence of the HRM-TS approach, they underline the broader advantages of HRL in augmenting learning efficiency. Concurrently, the limitations exhibited by the HRM algorithm in specific scenarios underscore the significance of employing sophisticated techniques such as topological sorting to navigate challenges arising from complex task structures. These findings collectively emphasize the potential of the HRM-TS approach and contribute to a nuanced understanding of the interplay between hierarchical reinforcement learning algorithms and intricate task environments.

6. Conclusions and Future Work

In conclusion, this study has introduced a novel approach that significantly enhances the performance of HRL by leveraging topologically sorted potential calculation within the framework of reward machines. By exploiting the inherent topological structure of the task hierarchy, our method offers a sophisticated means of determining potentials for distinct sub-goals. This empowers agents to make more informed decisions, efficiently allocating their actions toward accomplishing higher-level objectives.
Our thorough experimental evaluations conducted in the grid-world environment using the OpenAI-Gym framework substantiate the efficacy of the proposed approach. The obtained results serve as compelling evidence unequivocally showcasing the superiority of our method when compared to conventional HRL techniques and reward machine-based reinforcement learning approaches. The observed improvements encompass learning efficiency and overall performance across various complex tasks.
These noteworthy outcomes underscore the pivotal role of integrating topological sorting techniques within hierarchical reinforcement learning. Our approach offers a means to transcend the limitations inherent in traditional HRL methodologies, and signifies a significant step forward in the evolution of this field. By harnessing the power of task hierarchy and judiciously allocating resources, our method equips agents to navigate intricate environments and effectively achieve high-level objectives.
In the broader context, our proposed methodology not only addresses the challenges posed by traditional HRL, it sheds light on the potential for further advancements in this domain. Moreover, the applicability of our approach to a spectrum of diverse and complex environments could be explored by further assessing its adaptability and robustness. Furthermore, future investigations may delve into extensions and optimizations to enhance the method’s scalability and resilience, thereby continuing to push the boundaries of hierarchical reinforcement learning.

Author Contributions

Conceptualization, Z.Z. and Y.L.; methodology, Z.Z. and Y.L.; software, J.S.; validation, J.S.; investigation, Z.Z.; writing—original draft preparation, Z.Z.; writing—review and editing, J.S. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Thrun, S.; Littman, M.L. A Review of Reinforcement Learning. AI Mag. 2000, 21, 103. [Google Scholar] [CrossRef]
  2. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep Reinforcement Learning: A Brief Survey. IEEE Signal Process. Mag. 2017, 34, 26–38. [Google Scholar] [CrossRef]
  3. Zhu, C.; Cai, Y.; Zhu, J.; Hu, C.; Bi, J. GR (1)-Guided Deep Reinforcement Learning for Multi-Task Motion Planning under a Stochastic Environment. Electronics 2022, 11, 3716. [Google Scholar] [CrossRef]
  4. Botvinick, M.M. Hierarchical reinforcement learning and decision making. Curr. Opin. Neurobiol. 2012, 22, 956–962. [Google Scholar] [CrossRef] [PubMed]
  5. Hutsebaut-Buysse, M.; Mets, K.; Latré, S. Hierarchical reinforcement learning: A survey and open research challenges. Mach. Learn. Knowl. Extr. 2022, 4, 172–221. [Google Scholar] [CrossRef]
  6. Icarte, R.T.; Klassen, T.Q.; Valenzano, R.; McIlraith, S.A. Reward machines: Exploiting reward function structure in reinforcement learning. J. Artif. Intell. Res. 2022, 73, 173–208. [Google Scholar] [CrossRef]
  7. Zhu, C.; Zhu, J.; Cai, Y.; Wang, F. Decomposing Synthesized Strategies for Reactive Multi-agent Reinforcement Learning. In Proceedings of the International Symposium on Theoretical Aspects of Software Engineering, Bristol, UK, 4–6 July 2023; pp. 59–76. [Google Scholar] [CrossRef]
  8. Gu, S.; Chen, G.; Zhang, L.; Hou, J.; Hu, Y.; Knoll, A. Constrained reinforcement learning for vehicle motion planning with topological reachability analysis. Robotics 2022, 11, 81. [Google Scholar] [CrossRef]
  9. An, Q.; Chen, Y.; Zeng, H.; Wang, J. Sorting operation method of manipulator based on deep reinforcement learning. Int. J. Model. Simul. Sci. Comput. 2023, 14, 2341007. [Google Scholar] [CrossRef]
  10. Wu, N.; Xie, Y. A survey of machine learning for computer architecture and systems. ACM Comput. Surv. (CSUR) 2022, 55, 1–39. [Google Scholar] [CrossRef]
  11. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar] [CrossRef]
  12. Zhu, C.; Cai, Y.; Hu, C.; Bi, J. Efficient Reinforcement Learning with Generalized-Reactivity Specifications. In Proceedings of the 2022 29th Asia-Pacific Software Engineering Conference (APSEC), IEEE, Virtual Event, 6–9 December 2022; pp. 31–40. [Google Scholar] [CrossRef]
  13. Ding, X.; Smith, S.L.; Belta, C.; Rus, D. Optimal control of Markov decision processes with linear temporal logic constraints. IEEE Trans. Autom. Control 2014, 59, 1244–1257. [Google Scholar] [CrossRef]
  14. Zhu, C.; Butler, M.; Cirstea, C.; Hoang, T.S. A fairness-based refinement strategy to transform liveness properties in Event-B models. Sci. Comput. Program. 2023, 225, 102907. [Google Scholar] [CrossRef]
  15. Gao, Q.; Hajinezhad, D.; Zhang, Y.; Kantaros, Y.; Zavlanos, M.M. Reduced variance deep reinforcement learning with temporal logic specifications. In Proceedings of the 10th ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS), Montreal, QC, Canada, 16–18 April 2019; pp. 237–248. [Google Scholar] [CrossRef]
  16. Hasanbeig, M.; Kantaros, Y.; Abate, A.; Kroening, D.; Pappas, G.J.; Lee, I. Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees. In Proceedings of the 2019 IEEE 58th Conference on Decision and Control (CDC), Nice, France, 11–13 December 2019; pp. 5338–5343. [Google Scholar] [CrossRef]
  17. Neary, C.; Xu, Z.; Wu, B.; Topcu, U. Reward Machines for Cooperative Multi-Agent Reinforcement Learning. In Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, Virtual Event, 3–7 May 2021; pp. 934–942. [Google Scholar] [CrossRef]
  18. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement learning: A survey. J. Artif. Intell. Res. 1996, 4, 237–285. [Google Scholar] [CrossRef]
  19. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  20. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym: A Toolkit for Developing and Comparing Reinforcement Learning Algorithms. Available online: https://github.com/openai/gym (accessed on 18 May 2023).
Figure 1. Framework of Hierarchical Reinforcement Learning through Topologically Sorted Potential Calculation.
Figure 2. High-level goals of the four different tasks.
Figure 3. Comparison of learning curves of the four algorithms on the same map with four different task goals.
Figure 4. Comparison of learning curves for the four algorithms on the same map with different task locations.

