Article

Deep Q-Learning Network with Bayesian-Based Supervised Expert Learning

College of Liberal Arts and Interdisciplinary Studies, Kyonggi University, 154-42 Gwanggyosan-ro, Yeongtong-gu, Suwon-si 16227, Korea
Symmetry 2022, 14(10), 2134; https://doi.org/10.3390/sym14102134
Submission received: 9 September 2022 / Revised: 26 September 2022 / Accepted: 9 October 2022 / Published: 13 October 2022
(This article belongs to the Section Computer)

Abstract

Deep reinforcement learning (DRL) algorithms interact with the environment and have achieved considerable success in several decision-making problems. However, DRL requires a significant amount of data before it can achieve adequate performance. Moreover, its applicability can be limited when DRL agents must learn in a real-world environment. Therefore, some algorithms combine DRL agents with supervised learning to leverage additional prior knowledge; for example, some have integrated a deep Q-learning network with a behavioral cloning model that exploits supervised learning as prior learning. The algorithm proposed in this study builds on these methods and updates the loss function of the existing technique into a Bayesian approach. The supervised loss function used in existing algorithms and the Bayesian loss function proposed in this study differ in whether prior knowledge is utilized; the two choices are otherwise symmetric, with the existing cross-entropy loss not exploiting the prior and the proposed loss exploiting it. In experiments on various OpenAI Gym environments, such as Cart-Pole and MountainCar, the proposed method improved learning convergence. In particular, it achieves fairly stable learning during the early stage of training, when learning in a sparse environment is otherwise highly uncertain.

1. Introduction

Deep reinforcement learning (DRL) has achieved significant success over the past few years in learning policies for sequential decision-making problems. Notable examples include the deep Q-learning network (DQN) for Atari games [1], end-to-end policy searches for robot motor control [2], and the strategy game Go, in which deep networks are combined with a Monte Carlo tree search [3]. These results are encouraging because they leverage the scalability and performance of deep learning [1,4], building datasets from previous experiences and training large-scale convolutional neural networks. Sampling from a dataset of previous experiences mitigates the correlations and bias in the state distribution and can lead to super-human control policies. However, it is still difficult to apply these algorithms to real-world environments such as autonomous driving scenarios [5,6]. They require millions of simulation steps to learn a good control policy, which means that a reliable and accurate simulator is needed. This gap between the real world and a perfectly informative simulated environment restricts the widespread application of DRL to learning valuable policies. Due to the uncertainty of real-world circumstances, building a simulation environment that corresponds to the real world for implementing a DRL model is a significant challenge. Without a reliable and informative simulation environment, DRL agents cannot obtain sufficient experience to improve their policies. Because it cannot fully interact with the real world, the DRL model is forced during training to select random actions sampled from random policies [7], which are essential for the agent to explore the entire state-action space. This random action selection is prohibited in many practical, real-world situations because it has the potential to seriously damage the system. For example, in autonomous driving scenarios [6], randomly selected policies can lead to traffic accidents, making it impossible to learn while avoiding serious consequences. Even with such limited simulation environments, bringing training closer to real-world circumstances while ensuring agent reliability is one of the most pressing tasks in the use of DRL models. Recent studies suggested integrating human experience as a potential solution to improve the adaptability of DRL models [8]. Many supervised learning algorithms based on deep neural networks have demonstrated powerful feature-pattern extraction capabilities and can incorporate expert decisions derived from human experience and trajectory data, also referred to as demonstrations [9]. Therefore, most studies suggested methods that work reasonably well in exploiting such data. With these methods, demonstrations can be exploited to properly pre-train the agent from the start of training, after which learning continues to improve on self-generated data. This helps the agent gain sufficient experience to improve its policy. If agent learning is enabled in this way, DRL can be applied to many real-world problems for which perfectly reliable simulators do not exist.
Deep Q-learning from demonstration (DQfD) [8] uses a double DQN algorithm [10] to significantly accelerate learning using small numbers of demonstration data. DQfD [8] is pre-trained only on demonstration data using a hybrid combination of the time difference loss and supervised losses. The large-margin supervised loss function [11] enables DQfD [8] to support agents in learning expert knowledge by continuously enhancing the agent learning policy, which is closely related to the policy of the demonstration. After pre-training, the agent learns a self-consistent value function with reinforcement learning through the time difference loss and starts interacting within the domain. Agents can update the deep learning network by combining demonstrations and self-generated data. However, DQfD [8] has some issues that need to be addressed. The transition trajectory of the demonstration is a single data source that contributes to an expert loss during the learning process. When self-generated transitions are sampled from the experience replay buffer, the demonstration data are idle at the same time, and thus the demonstration utilization is inefficient. Moreover, an incomplete demonstration can be detrimental to the learned policy when it indicates a policy that is inferior to the learned policy. A supervised-assisted DRL framework [12], referred to as deep Q-learning of specific algorithms in dynamic demonstrations, was recently proposed based on the behavioral cloning model [13,14]. The algorithm [12] applies a hybrid framework, i.e., a supervised-assisted DRL for merging machine intelligence and human experience and uses a supervised learning algorithm to support DRL. It [14] also includes an auto-update mechanism to adaptively enhance the demonstration data and behavioral cloning (BC) [13,14] models. A BC model can include high-quality transition samples to alleviate the potential adverse effects of incomplete demonstrations.
Although this study is also based on the supervised-assisted DRL model [12], we propose a newly updated loss function based on a Bayesian approach rather than the cross-entropy function proposed as an expert loss function in that model. The supervised-assisted DRL model [12] uses the cross-entropy function for the following reasons: "for smoothing the output of the total loss function, stabilizing the training process, and improving the sample efficiency" [12]. The reason for using the Bayesian expert loss function applied in this study is to reduce the initial instability through the utilization of the prior, particularly when learning in a sparse environment is unstable. Therefore, the BC model can be utilized to provide reasonable actions and stabilize the training process. The remainder of this paper is structured as follows. Section 2 discusses existing studies on imitation learning algorithms and on improvements to DRL algorithms. Section 3 presents the problem definition, and Section 4 provides details of the algorithm proposed in this paper. In Section 5, experiments and analyses conducted to validate the proposed algorithm are described. Section 6 presents some concluding remarks and directions for future research.

2. Background

Imitative learning [15] is primarily concerned with matching the performance of the demonstrator. Demonstrations have “experts” that can provide useful information [16] on a number of difficult problems, such as autonomous driving [17] or robotic control [16]. Imitative learning uses demonstrations to imitate the behavior of such experts; the BC model [13,14] is a type of imitative learning and has received significant attention due to such advantages as high speed. The BC model, which effectively utilizes learning speed, simplicity, and demonstration, models the policy represented by the demonstration through a mapping from a state into action in a supervised manner and uses the supervised loss function from the demonstration to achieve the learning target. Dagger [18] proposed solving the problem of covariate shifts in BC models. In [18], an expert on Dagger models provides additional feedback, allowing a completely new and valuable transition to be added to the demo data. Moreover, because Dagger models do not combine imitative learning and reinforcement learning, the models cannot learn how to improve beyond the expert, and the expert is therefore always needed, making it less practical. AggreVaTeD [19] extends Dagger to work with deep neural networks and continuous workspaces. However, like Dagger, Deeply AggreVaTeD only applies imitative learning and cannot learn how to improve beyond the expert.
Another large field of imitative learning is generative adversarial imitative learning [20]. Instead of mapping directly from a state into action, generative adversarial imitative learning introduces generators and discriminators to learn from the demonstrations. A generator learns the policies contained in the demo and generates state-action pairs, whereas the discriminator is trained to distinguish whether a state-action pair is a policy of an expert or a learned policy. The goal of policy improvement is achieved through adversarial learning. Generative adversarial imitative learning performs well in high-order continuous control issues [21].
Demonstrations were recently shown to be helpful in the difficult exploration of reinforcement learning [22]. Therefore, studies that converge reinforcement learning and imitative learning, where the demonstration can be used in the field of DRL training, have been actively conducted. Because the reward of DRL is immediate feedback from the environment for evaluating the performance of the policy, restructuring the reward function through a demonstration is an effective way to improve the performance of DRL [23]. The algorithm [24] transfers knowledge directly from human policy and has shown how expert advice or demonstrations can be used to structure rewards for DRL issues [23]. Another approach is to structure policies that are used to sample experiences [25] or iterate policies from demonstrations [26]. The framework in [11] allows demonstrators to work in scenarios in which they are rewarded by the environment and has been called reinforcement learning with expert demonstrations.
Most of these algorithms use demonstrations in which they imitate or create rewards. However, a more advanced approach is to enhance the policy directly on demonstration data. DQfD [8] is a representative example, where both the samples from the experience replay buffer and the demonstration extracted by the DRL agent are used for training. Algorithms such as DQfD are hybrids that place the demonstration into the experience replay buffer and sample from this buffer. DQfD is similar to that described in [11] in that it combines time difference and classification losses in a model-free batch algorithm. However, in DQfD, the agents are initially pre-trained on a demonstration, and batches of self-generated data change over time and are used as experiential replays to train the DQN. A prioritized replay method is also used to balance the number of demonstrations in each mini-batch. Combining time difference loss with supervised classification loss has also been shown to improve imitative learning [11]. The most similar approach to DQfD is the accelerated DQN with expert trajectories [26], which also combines time difference and classification losses in deep Q-learning. A trained DQN agent is used to generate demonstrations that are better than human data. Accelerated DQN with expert trajectories uses a cross-entropy classification loss rather than the large margin loss used by DQfD and does not pre-train the agent to perform well on the first interaction with the environment. The self-imitation learning algorithm selects various loss functions and adds new trajectories to the demonstration. The self-imitation learning updates the same neural network twice with the actor-critic and self-imitation learning loss functions [27]. However, most previous algorithms have various problems that do not fully exploit this demonstration [21].
Supervised assisted DRL [12] is a double DQfD BC model designed to overcome the shortcomings of the aforementioned algorithms. Supervised learning models and expert loss can fully extract a demonstration and mimic expert policies. In supervised-assisted DRL, dynamic demonstration updates might help overcome the negative effects of static demonstrations. However, although supervised assisted DRL uses cross-entropy classification loss, accelerated DQN with expert trajectories applied it first. The difference between supervised-assisted DRL and accelerated DQN with expert trajectories is whether the agent is pre-trained. The algorithm proposed in this study is based on supervised-assisted DRL. However, it is based on a simple DQN, not a double DQN. Moreover, it does not use a cross-entropy function as an expert loss function. A newly updated expert-supervised loss function based on a Bayesian approach is therefore proposed. The proposed algorithm reduces the initial instability by applying prior knowledge when learning is initially extremely unstable. Therefore, the BC model can be used to provide reasonable actions and more stable training processes.

3. Proposed Algorithm

3.1. Preliminaries

DQNs are designed using a Markov decision process [28], which is defined based on the following tuples: state space S, action space A, reward function R(s, a), state transition distribution T(s, a, s′) = P(s′| s, a), and discount factor γ. The state transition probability distribution P(s’| s, a) describes the process of selecting action a in the current state s, where s’ represents the next state, and the reward function R(s, a) provides feedback. Policy π designs the actions the agent will select in each state. The goal of the DRL agent is to find policy π that maps states to actions that maximize the expected total discounted rewards over the lifetime of the agent. The value function Qπ(s, a) for a given state-action pair (s, a) is an estimate of the expected future reward that can be obtained from (s, a) when the DRL agents follow policy π. The optimal value function Q*(s, a) provides the maximum in all states and is determined by solving the Bellman equation:
$$Q^{*}(s,a) = \mathbb{E}\!\left[\, r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q^{*}(s',a') \right] \tag{1}$$
A DQN [1] is one of the most famous DRL methods. The DQN utilizes deep neural networks to approximate the value function Q(s, a) for each state-action pair (s, a) and outputs the action values Q(s, ·; θ), where θ denotes the deep neural network parameters. Transitions are uniformly sampled from the experience replay buffer to update the network parameters by minimizing the loss function denoted by (2). A DQN uses two main components to make training more stable. The first is a separate target Q-network whose weights are copied from the Q-network every τ steps; the current Q-network uses θ to compute the value function, while the target Q-network parameters are denoted by θ′. The second adds all experiences of the DRL agent to the experience replay buffer and then uniformly samples from it to update the current network.
$$L(Q) = \Big( r(s,a) + \gamma \max_{a'} Q(s_{t+1}, a'; \theta') - Q(s,a;\theta) \Big)^{2} \tag{2}$$
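As a minimal illustrative sketch (not the exact implementation used in this paper), the loss in Equation (2) can be computed with tf.keras as follows; the network architecture, layer sizes, and the replay-buffer batch format are assumptions for the example.

```python
import numpy as np
import tensorflow as tf

def build_q_network(state_dim, n_actions):
    # Small fully connected Q-network; the layer sizes are illustrative only.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_actions),  # outputs Q(s, a; theta) for every action
    ])

def td_loss(q_net, target_net, batch, gamma=0.99):
    # batch: dict of numpy arrays ("s", "a", "r", "s_next", "done") sampled uniformly
    # from the experience replay buffer (assumed buffer interface).
    s, a, r, s_next, done = (batch[k] for k in ("s", "a", "r", "s_next", "done"))
    q_all = q_net(s)                                      # Q(s, .; theta)
    q_sa = tf.reduce_sum(q_all * tf.one_hot(a, q_all.shape[-1]), axis=1)
    q_next = tf.reduce_max(target_net(s_next), axis=1)    # max_a' Q(s_{t+1}, a'; theta')
    target = r + gamma * (1.0 - done) * tf.stop_gradient(q_next)
    return tf.reduce_mean(tf.square(target - q_sa))       # Equation (2), averaged over the batch
```

Every τ training steps the target network is refreshed with the current weights, e.g., target_net.set_weights(q_net.get_weights()).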
Combining supervised learning and DRL is extremely important, as shown by the previous DQfD [8] and DQfD-BC [12] algorithms. Supervised learning allows the DRL agent to learn from the experience and decision-making policies of experts. However, supervised learning is limited by the demonstration data and has the disadvantage of being unable to discover new knowledge. Conversely, the learning rate of a DRL agent is relatively slower than that of supervised learning, and it is expensive to achieve perfect results with DRL alone. Therefore, various models have been proposed that realize the full potential of the machine intelligence found in DRL while also using the advantages provided by supervised learning, allowing the DRL agent to continuously and appropriately learn new knowledge.
Prioritized experience replay [29] modifies the DQN agent to frequently sample more significant transitions from the experience replay buffer. Prioritized experience replay is used in both DQfD and DQfD-BC; the algorithm proposed in this study is also based on both DQfD and DQfD-BC. In order to avoid the problem of overestimating the action value of a DQN, a double DQN was used in both DQfD and DQfD-BC, and the loss function was also changed when considering the DDQN [10], such as through Equation (3). However, because this study focused on the initial unstable state of a DQN, we used a DQN-based loss function. DQfD proposes a hybrid loss function with a supervised loss. In DQfD, the complete loss function is denoted by (4), which consists of four parts: double Q-learning time difference loss, n-step double Q-learning TD loss, expert loss, and L2 normalized loss. The parameters λ1, λ2, and λ3 are the weights of each loss. Expert loss E in Equation (5) is considered to be the most significant DQfD loss.
$$L_{DQ}(Q) = \Big( r(s,a) + \gamma\, Q\big(s_{t+1}, a^{\max}_{t+1}; \theta'\big) - Q(s,a;\theta) \Big)^{2}, \qquad a^{\max}_{t+1} = \operatorname*{arg\,max}_{a} Q(s_{t+1}, a; \theta) \tag{3}$$
$$L(Q) = L_{DQ}(Q) + \lambda_{1} L_{n}(Q) + \lambda_{2} L_{E}(Q) + \lambda_{3} L_{L2}(Q) \tag{4}$$
$$L_{E}(Q) = \max_{a \in A}\big[\, Q(s,a) + \ell(a_{E}, a)\,\big] - Q(s, a_{E}) \tag{5}$$
In the large-margin supervised loss ℓ(aE, a) of LE(Q), aE represents the action of the expert in the demonstration data, and a is the action of the agent. The margin ℓ(aE, a) is zero when a = aE and positive otherwise, which forces the values of the other actions to be at least a margin lower than the value of the demonstrated action [8]. If the sampled transition is not from the demonstration data, the expert loss term weighted by λ2 is not applied.
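A sketch of the large-margin expert loss in Equation (5), under the same illustrative tf.keras setup as above; the margin value and the demonstration mask are assumptions for the example, since DQfD treats the margin as a hyperparameter.

```python
import tensorflow as tf

def large_margin_expert_loss(q_net, states, expert_actions, is_demo, margin=0.8):
    # L_E(Q) = max_a [Q(s, a) + l(a_E, a)] - Q(s, a_E), applied only to demonstration samples.
    q_all = q_net(states)                                  # shape: (batch, n_actions)
    n_actions = q_all.shape[-1]
    a_e_onehot = tf.one_hot(expert_actions, n_actions)
    margin_term = margin * (1.0 - a_e_onehot)              # l(a_E, a): 0 for a = a_E, positive otherwise
    max_term = tf.reduce_max(q_all + margin_term, axis=1)
    q_expert = tf.reduce_sum(q_all * a_e_onehot, axis=1)
    # is_demo: 1.0 for transitions taken from the demonstration buffer, 0.0 otherwise,
    # so the expert loss vanishes for self-generated transitions.
    return tf.reduce_mean((max_term - q_expert) * is_demo)
```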
In order to complement the potential low results of the large-margin loss function of DQfD by directly utilizing a demonstration, DQfD-BC, which uses the BC model to extract useful information from the demonstration, was proposed. For the dynamic demonstration used in DQfD-BC, when the final cumulative reward for each episode reaches a new high score, transition samples are automatically generated and inserted into the demonstration dataset. Thus, the diversity and quantity of the demonstrations can be continuously improved, and positive effects on the BC model performance can be generated. The DQfD-BC approach is similar to that of Dagger [18]. DQfD-BC includes a DNN-based BC model and maintains two types of buffers for an experience replay and demonstration, which are commonly used by DRL models. With DQfD-BC, the BC model pre-trains the DRL agent using the buffer for a demonstration to obtain the initial decision-making capabilities, and the prioritized experience replay [29] is adapted in both the experience replay buffer and the buffer used for the demonstration to improve the sample efficiency.
DQfD-BC exploits expert losses for self-generated transitions, and to solve the multi-label classification problem, the cross-entropy loss function H (output, target), as shown in (6), was used. During the self-learning process with the BC model, expert loss with cross entropy over the entire loss function is always available. In Equation (6), aE indicates the actions of the expert regarding the demonstration, and πbc(st) denotes the policy learned from the BC model. When the cross-entropy loss function is used in a self-learning state, it can also be expressed as in Equation (7), where a is the output of the Q-network, and πbc(st) is the output of the most current state of the BC model.
$$\ell_{BC} = H\big(\pi_{bc}(s_{t}),\, a_{E}\big) \tag{6}$$
$$\ell\big(a, \pi_{bc}(s)\big) = H\big(a,\, \pi_{bc}(s)\big) \tag{7}$$
DQfD-BC consists of three main steps: BC model pre-training, agent pre-training, and joint-model self-learning. The BC model exploits the cross entropy for initial decision making with a demonstration and for pre-training the Q-network. Expert actions are generated by the BC model and used to construct the expert loss function proposed by DQfD-BC. The supervised expert loss function of DQfD-BC [12] is therefore as described in Equation (8), which differs from the expert loss function of DQfD. The self-learning ability is maintained through the time difference error, and over-fitting of the DNN weights is prevented with the L2 regularization loss. The final form of the loss in DQfD-BC is therefore the same as in Equation (4).
$$L_{E}(Q) = \max_{a \in A}\big[\, Q(s,a) + \ell\big(a, \pi_{bc}(s)\big)\,\big] - Q(s, a_{E}) \tag{8}$$
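One plausible reading of Equations (7) and (8), shown here only as a hedged sketch rather than the authors' exact implementation: the BC model outputs an action distribution π_bc(s), the per-action margin is the cross entropy of that distribution with each candidate action (which reduces to −log π_bc(a | s)), and the BC action stands in for the expert action a_E on self-generated transitions.

```python
import tensorflow as tf

def bc_expert_loss(q_net, bc_model, states, margin_scale=1.0):
    # Sketch of Equation (8); margin_scale is an illustrative hyperparameter.
    q_all = q_net(states)                                   # (batch, n_actions)
    pi_bc = bc_model(states)                                # softmax action probabilities from the BC model
    margin = -margin_scale * tf.math.log(pi_bc + 1e-8)      # l(a, pi_bc(s)) = H(one_hot(a), pi_bc(s))
    max_term = tf.reduce_max(q_all + margin, axis=1)
    a_bc = tf.argmax(pi_bc, axis=1)                         # BC action used in place of a_E here (assumption)
    q_bc = tf.reduce_sum(q_all * tf.one_hot(a_bc, q_all.shape[-1]), axis=1)
    return tf.reduce_mean(max_term - q_bc)
```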
In order to describe the difference between the Bayesian-based supervised expert loss function used in this study and the cross entropy used in DQfD-BC, we first describe the cross entropy, which measures the difference between two probability distributions P and Q and is expressed as follows:
$$H(P, Q) = -\sum_{x} P(x)\, \log Q(x) \tag{9}$$
The cross-entropy loss function takes a tuple (output, target) as a pair of inputs and computes an estimate of how far the output is from the target [12]. The cross-entropy loss arises when p is the actual data distribution P(Y|X), used as the target, and q is the distribution P(Y|X; θ) predicted by the model, used as the output; it is thus defined as the difference between P(x) and the Q(x) estimated by the model [12]. In addition, it can be interpreted as the expected value of the negative log-likelihood (−log Q(x)) under the parameters θ of the deep neural network and is expressed as follows:
$$-\sum_{x} P(y \mid x)\, \log P(y \mid x;\, \theta) \tag{10}$$
Therefore, the goal is to determine the parameters θ of the neural network that minimize the above expression. Without the negative sign, this expression is the log-likelihood [30]; minimizing the cross entropy is therefore equivalent to maximizing the log-likelihood. Thus, finding the model with the maximum likelihood is equivalent to minimizing the cross entropy, which is the difference between the distribution of the training data and the distribution of the outputs predicted by the model. For this reason, the negative log-likelihood is generally used as the loss function for deep learning models [30]. The reason for using the cross entropy, that is, the log-likelihood, was stated as follows: "the goal of introducing a different expert loss function is to smooth the output of the total loss function and stabilize the training process. More importantly, the BC model was utilized to provide reasonable actions and generate supervised losses for all self-generated transitions in the experience replay buffer. Compared with the DQfD method, the sample efficiency is significantly improved" [12]. However, in this study, we exploited a Bayesian-based expert loss function that can utilize prior knowledge instead of the cross entropy. Further details are described in Section 3.2.
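A small numerical illustration of this equivalence, with made-up targets and predictions: the cross entropy between one-hot labels and the model's predicted distribution equals the mean negative log-likelihood of the labels under the model.

```python
import numpy as np

# Three samples, two classes; one-hot targets and a model's predicted probabilities.
targets = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)
predictions = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])

cross_entropy = -np.mean(np.sum(targets * np.log(predictions), axis=1))
neg_log_likelihood = -np.mean(np.log(predictions[np.arange(3), targets.argmax(axis=1)]))

print(cross_entropy, neg_log_likelihood)  # both ≈ 0.2798: minimizing one minimizes the other
```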

3.2. Proposed Bayesian-Based Supervised Expert Loss

This section describes how the proposed Bayesian-based supervised expert loss differs from the loss functions of the existing models and presents the pseudo-code of the algorithm proposed in this paper.
The DQfD with the BC model uses a Q-network and target Q-network and includes an experience replay buffer as shown in Figure 1. The BC model was added [12,13,14] to utilize the demonstration data in the replay buffer. The model also computes the expert loss based on a Bayesian approach, considering both the outputs of the trained Q-network and the BC model.
We exploited a Bayesian approach instead of the maximum log-likelihood, such as cross-entropy minimization. First, we compared the Bayesian equation with the log-likelihood and determined the advantages of using the Bayesian equation over the maximum log-likelihood. The log-likelihood and Bayesian equation can be expressed as follows [31,32,33]:
$$P(X \mid \Theta)\ (\text{likelihood}), \qquad P(\Theta \mid X) = \frac{P(X \mid \Theta)\, P(\Theta)}{P(X)} \tag{11}$$
  • X: Observation. These are the actual data that we have, and in machine learning and deep learning, they are also called the training data;
  • Θ: Hypothesis. This is the final target we want to estimate through observation. In the classification, each discrete class can be used, and in the case of linear regression, the weights can be estimated. Therefore, it is a target that we want to estimate for both deep learning and machine learning;
  • P(X): Marginal probability. This is the distribution of data X;
  • P(Θ): Prior probability. This refers to the probability we assume in advance;
  • P(X|Θ): Likelihood. This refers to the distribution of the data given the hypothesis;
  • P(Θ|X): Posterior probability. This is the distribution of the hypothesis given the observation.
This equation shows how the posterior probability is affected by the data X.
The relationship between the maximum likelihood estimation and the maximum a posteriori estimation is shown in Equation (12) [31,32,33]. Both are used to estimate a certain variable (Θ) using a probabilistic model.
$$\hat{\Theta}_{MAP} = \operatorname*{arg\,max}_{\Theta} P(\Theta \mid X) = \operatorname*{arg\,max}_{\Theta} \frac{P(X \mid \Theta)\, P(\Theta)}{P(X)} \;\propto\; \operatorname*{arg\,max}_{\Theta} P(X \mid \Theta)\, P(\Theta) \tag{12}$$
The final target is to find the value of Θ that maximizes the posterior probability. However, the left-hand side cannot be evaluated directly. Therefore, the Bayesian approach searches for Θ in the direction in which the right-hand side, which equals the left-hand side, becomes larger. Because P(X) in the denominator depends only on the actual data, it is a constant and can be excluded from the proportional relationship. Hence, the equation above can be read as follows. First, maximizing the entire right-hand side can be thought of as maximizing the posterior probability, i.e., the maximum a posteriori estimation. Second, if P(Θ) is excluded, this reduces to the maximum likelihood estimation. In other words, if we want to find our target using only the actual data, we can apply the maximum likelihood estimation alone, and if we want to reflect our prior knowledge along with the actual data, we can use the maximum a posteriori estimation. We now examine the advantage of using prior knowledge in the supervised expert loss function. Taking the logarithm, the above equation is rewritten as follows [32,33,34,35]:
$$\ln P(\theta \mid X) = \ln P(X \mid \theta) + \ln P(\theta) - \ln P(X) \tag{13}$$
In order to find the θ that maximizes the posterior, we take the partial derivative and set it to zero, as follows [32,33,34,35]:
$$\frac{\partial}{\partial \theta} \ln P(\theta \mid X) = \frac{\partial}{\partial \theta} \ln P(X \mid \theta) + \frac{\partial}{\partial \theta} \ln P(\theta) = 0 \tag{14}$$
The second term represents the "prior knowledge of the model parameters" and is the part that gives the Bayes theorem used in this study its meaning. In order to interpret this second term, the first term, the likelihood, is replaced with a simple linear regression equation that is easy to interpret. First, we show that minimizing the loss function of simple linear regression and computing, through maximum likelihood estimation, the probability distribution of the parameter θ that best fits the data are equivalent. Second, replacing the likelihood with the minimized loss function of the linear regression, we examine the role and meaning of the regression loss function and the prior knowledge when they are used simultaneously. In this way, we determine, using a more comprehensible equation, the meaning of the second term, the "prior knowledge of the model parameters", and what it means when used with the BC model.
$$f(X, \theta) = \theta_{0} + \theta_{1} x \tag{15}$$
The linear regression Equation (15) [32,33,34,35] does not accurately predict the actual “y” and is accompanied by a specific error, as in (16) [32,33,34,35].
$$y_{i} = f(X_{i}, \theta) + \varepsilon_{i} \tag{16}$$
The error has a mean of 0 and a variance of σ². That is, the error εᵢ follows a zero-mean Gaussian distribution [32,33,34,35]. The goal is to minimize this error term, that is, the difference between the actual y and the estimated ŷ.
$$\frac{1}{n} \sum_{i=1}^{n} \big( y_{i} - f(x_{i}, \Theta) \big)^{2} \tag{17}$$
Therefore, we attempt to minimize (17) [32,33,34,35]. This loss function is called the mean squared error. We next show the equivalence between minimizing this loss function and computing, through maximum likelihood estimation, the probability distribution of the parameter θ that best fits the data. To achieve this, we write the likelihood generated by the linear regression model and take its logarithm, as described in Equation (18) [32,33,34,35].
$$\ln p(X, Y \mid \Theta) = \ln p(X) + \sum_{i=1}^{n} \ln p(y_{i} \mid X, \Theta) \tag{18}$$
Examining the second term on the right-hand side, we can treat X and θ as fixed, so that yᵢ follows the shape of the distribution of the error term. That is, because the error follows N(0, σ²), yᵢ follows a Gaussian distribution with mean f(xᵢ, θ) and variance σ². Thus, given the data X and the parameters, the conditional probability of yᵢ can be written and then logged as indicated in Equation (19) [32,33,34,35].
$$\ln p(X, Y \mid \Theta) = \ln p(X) - \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} \big( y_{i} - f(X_{i}, \Theta) \big)^{2} + \mathrm{const.} \tag{19}$$
We determined that the maximum likelihood estimation can be achieved by minimizing the sum of squares shown on the right-hand side of (19). The term ln(p(x)) in Equation (19) disappears when differentiating with respect to θ. It was therefore confirmed that the maximum likelihood estimation and mean squared error loss minimizations were equivalent. Therefore, it was proven that the maximum likelihood estimation is the ideal loss function in many cases in both deep learning and machine learning [31,32,33,34]. We can replace the maximum likelihood estimation, which is an extremely reasonable loss function, with “the sum of squares”. We can also determine what it means to use “prior knowledge” if we arrange Equation (19) according to Equation (14) into Equation (20) [31,32,33,34]:
$$\frac{\partial}{\partial \theta}\!\left[ -\frac{1}{2\sigma^{2}} \sum_{i=1}^{n} \big( y_{i} - f(x_{i}, \theta) \big)^{2} + \ln p(\theta) \right] = 0 \tag{20}$$
From a Bayesian point of view, to find the model parameters that explain the observed data with the highest probability, we need to consider the extra term ln(p(θ)) in addition to the maximum likelihood estimation, here replaced by the "sum of squares". This extra term is the "prior knowledge of the model parameters". The part of Equation (20) to focus on is σ, the standard deviation of the prediction. If the variance of the predictions is unstable, the parameters cannot be properly estimated due to insufficient data at the beginning of training. Equation (20) indicates that as σ increases, the first term becomes extremely small and the influence of ln(p(θ)) increases. Therefore, prior knowledge contributes precisely when learning is unstable. When the training of an agent is unstable, the large variance at the beginning of training keeps the parameters small. If the variance is small, it does not matter whether a parameter takes a meaningfully large value; however, if the variance is large, it is better to shrink the parameter so that it has less influence on the learning model. In this way, the Bayesian approach reduces the initial instability by applying prior knowledge, showing that prior knowledge has a significant effect on model training. Hence, the supervised expert loss function proposed in this paper uses "prior knowledge", and its meaning has been examined; a small numerical sketch of this effect is given below. The algorithm proposed in this paper is presented in Algorithm 1.
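The following numerical sketch, with made-up data and an assumed zero-mean Gaussian prior on θ (which turns the ln p(θ) term into an L2 penalty, i.e., ridge regression), illustrates the point above: with very few, nearly uninformative samples, the pure maximum-likelihood (least-squares) estimate can swing wildly, whereas the MAP estimate stays small and stable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Very small, noisy dataset from y = 1.0 + 2.0 x + noise; the inputs are nearly
# collinear, so least squares is poorly conditioned (high prediction variance).
x = rng.uniform(-0.1, 0.1, size=4)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=4)
X = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]

# MLE: ordinary least squares, theta = (X^T X)^{-1} X^T y.
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP with a zero-mean Gaussian prior on theta: ln p(theta) contributes -lam * ||theta||^2,
# giving theta = (X^T X + lam I)^{-1} X^T y; lam is an illustrative prior strength.
lam = 1.0
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print("MLE estimate:", theta_mle)  # the slope can be far from 2 because the data are uninformative
print("MAP estimate:", theta_map)  # the prior keeps the parameters small and stable
```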
Algorithm 1: DQfD-BC with supervised expert loss based on Bayesian approach
1. Initialize Dreplay and Ddemo with the demonstration dataset; θ, the weights of the Q-network; θ′, the weights of the target network; τ, the frequency at which the target network is updated; k, the number of pre-training gradient updates.
2. for t = 1 to k do
   A. Sample a mini-batch of transitions from Dreplay
   B. Calculate the supervised expert loss LE(Q) using the Bayesian-based BC model
   C. Calculate the loss L(Q) using the target network
   D. Apply a gradient descent step to update θ
   E. if t mod τ = 0 then θ′ ← θ end if
3. end for
4. for e = 1 to E do
   A. for t = 1 to T do
      • Select action at ε-greedily with respect to Q(st, ·; θ)
      • Execute at and observe rt and the new state st+1
      • Store (st, at, rt, st+1) in Dreplay
      • Sample a mini-batch of transitions from Dreplay
      • Calculate the supervised expert loss LE(Q) using the Bayesian-based BC model
      • Calculate the loss L(Q) using the target network
      • Apply a gradient descent step to update θ
      • if t mod τ = 0 then θ′ ← θ end if
      • st ← st+1
   B. end for
   C. if the average best episode score is reached then
      I. for i = 1 to M do
         • Let the agent interact with the environment
         • Store the trajectory in Ddemo
      II. end for
      III. Retrain the Bayesian-based BC model with the newly updated Ddemo
   D. end if
5. end for
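A compressed sketch of the main loop of Algorithm 1 in the TensorFlow/Keras and OpenAI Gym setting used in Section 4, reusing the helper functions sketched in Section 3.1. The buffer interfaces (replay, demo), the BC model methods, the Bayesian expert loss function, and all hyperparameters (E, T, τ, batch size, ε, λ) are assumptions for illustration, not the exact implementation behind the reported results.

```python
import numpy as np
import tensorflow as tf

def train(env, q_net, target_net, bc_model, replay, demo,
          td_loss_fn, expert_loss_fn, optimizer,
          episodes=500, max_steps=200, tau=100, batch_size=64,
          epsilon=0.1, lambda_e=1.0):
    step, best_score = 0, -np.inf
    for _ in range(episodes):
        s = env.reset()                                   # classic Gym API: reset() returns the observation
        score = 0.0
        for _ in range(max_steps):
            # epsilon-greedy action selection from the current Q-network
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(q_net(s[None, :].astype(np.float32))))
            s_next, r, done, _ = env.step(a)              # classic Gym API: 4 return values
            replay.add(s, a, r, s_next, done)             # assumed replay-buffer interface
            batch = replay.sample(batch_size)
            with tf.GradientTape() as tape:
                loss = td_loss_fn(q_net, target_net, batch) \
                       + lambda_e * expert_loss_fn(q_net, bc_model, batch["s"])
            grads = tape.gradient(loss, q_net.trainable_variables)
            optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
            step += 1
            if step % tau == 0:
                target_net.set_weights(q_net.get_weights())   # theta' <- theta
            s, score = s_next, score + r
            if done:
                break
        if score > best_score:                            # new best episode: refresh the demonstrations
            best_score = score
            demo.add_trajectory(env, q_net)               # assumed helper: roll out and store a trajectory
            bc_model.refit(demo)                          # assumed helper: retrain the Bayesian-based BC model
```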

4. Results

4.1. Cart-Pole

The Cart-Pole environment in Figure 2 [36] has four observations and two discrete actions. The pole is attached to a cart that can move back and forth toward the left and right. The goal of Cart-Pole is to keep the pole from falling past a certain angle while the cart speeds up or slows down. In other words, as long as the pole remains almost vertical, the agent receives a reward of +1 from the environment at every step. An episode ends when the pole angle leaves the range between −12° and +12° or the cart position leaves the range between −2.4 and +2.4. In general, we did not wait until the end of an episode; the following modifications were used to obtain better solutions [37]. When the pole falls before the end of the episode, the agent receives −100 from the environment. Moreover, training was stopped early when the average reward was 490 or more for 10 episodes in a row, even before the episode limit was reached. The DQN that applies the supervised expert loss function proposed in this paper was implemented using Tensorflow [38] and Keras [39] in OpenAI Gym [40].
We considered the case in which training the agent is extremely unstable at the beginning of learning, and to show experimentally that prior knowledge contributes to reducing this initial instability, we changed the maximum length of an episode from 500 to 200 steps. In the best case, episodes ended without perturbation within a short period of time. As shown in the best case of Figure 3 and the average case of Figure 4, the results of the model using the loss proposed in this paper (DQfD with Bayesian-based BC) and the model without this loss (DQfD with BC based on cross entropy) were not significantly different. However, when the maximum episode length was limited to 200, the worst cases in Figure 5 and Figure 6 showed frequent perturbations. Although it did not happen often, there were quite a few cases in which the episode did not end at the maximum length, and the numbers of such cases were almost the same for both models. Moreover, Cart-Pole has wrapper code that artificially terminates an episode after a certain time step [41]. The time limits for CartPole-v0 and CartPole-v1 are 200 and 500, respectively. This wrapper code helps the episode end but may cause some issues. If the pole stays upright and the agent receives +1 at every step, the reward of a permanently upright pole would be infinite. However, at the time limit of 500, the agent times out and a terminal flag is passed to it; therefore, it bootstraps to zero. This can make it difficult for agents to learn. Several methods have been proposed for handling this issue. One method is to bootstrap to the value of the next state predicted by the time-sensitive network, instead of bootstrapping to zero, when such a timeout occurs [41]. Figure 7 shows the comparison of the average rewards based on this method. As can be seen in the figure, the results of the model using the proposed loss (DQfD with Bayesian-based BC) and the model without it (DQfD with BC based on cross entropy) were the same. Therefore, it is difficult to determine a significant difference between the two methods for Cart-Pole.
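The time-limit handling described above can be sketched as follows. When an episode ends only because the wrapper's step limit was reached, rather than because the pole actually fell, the target bootstraps to the predicted value of the next state instead of to zero. The terminated/truncated distinction shown here is how newer Gym versions expose it; with older versions it can be recovered from info['TimeLimit.truncated']. The snippet is illustrative only.

```python
import numpy as np

def td_target(target_net, r, s_next, terminated, truncated, gamma=0.99):
    # One-step target that distinguishes a real terminal state from a time-limit truncation.
    q_next = float(np.max(target_net(s_next[None, :].astype(np.float32))))
    if terminated and not truncated:
        return r                       # the pole actually fell: do not bootstrap
    return r + gamma * q_next          # truncated by the time limit (or not done): bootstrap to the predicted value
```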

4.2. MountainCar

The MountainCar environment in Figure 8 [42] has two observations and three discrete actions. A small car moves back and forth in the valley between the mountains on the left and right. The goal is to reach the top of the right mountain. However, the car's engine is not strong enough to climb the right mountain without momentum, so it must drive back and forth between the left and right slopes to build up momentum. The agent receives a reward of −1 at every step until the car reaches the goal, starting from a position roughly halfway between the left and right mountains. In general, an episode ends after 200 steps. As in the case of Cart-Pole, even before the episode limit, the task is considered solved if the agent obtains an average reward of −110.0 over 100 consecutive episodes. However, unlike Cart-Pole, MountainCar is an extremely sparse environment: because a reward is received only when the car climbs to the top of the mountain, there is no positive signal even if it gets close to the top. Therefore, whether the method proposed in this study works well in such a sparse environment is another issue.
As in the case of Cart-Pole, the number of episodes in MountainCar was changed to 30. Because MountainCar is an extremely sparse environment, we evaluated "how often the car reached the top of the right mountain within a limited number of steps during an episode" rather than only the result "at the end of the episode" [43]. Each time the car reached the top of the right mountain, a reward of 100 was given, so the agent's reward per episode was between 0 and 100. The model using the supervised expert loss proposed in this study (DQfD with Bayesian-based BC) and the previous model (DQfD with BC based on cross entropy) did not show any difference in the best case in Figure 9 or the worst case in Figure 10. However, in a sparse environment such as MountainCar, the results in Figure 11 show how useful the method proposed in this paper (DQfD with Bayesian-based BC) can be. In Figure 11, it can be seen that the proposed method allowed the top of the right mountain to be reached many more times than the previous method (DQfD with BC based on cross entropy). Moreover, this approach can also be used to improve performance in sparse, time-sensitive settings. It can therefore be concluded that prior knowledge is extremely useful for training agents in sparse environments.
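The evaluation metric used here, i.e., how many times the goal is reached within an episode's step budget rather than the raw return, can be sketched as below; the goal position of 0.5 is the standard MountainCar-v0 value, and the greedy rollout with a reset after each success is an illustrative stand-in for the evaluation of the trained agents.

```python
import numpy as np

def count_goal_reaches(env, q_net, max_steps=200, goal_position=0.5):
    # Counts how often the car reaches the top of the right mountain within one
    # episode's step budget, restarting the car each time the goal is reached.
    reaches, s = 0, env.reset()
    for _ in range(max_steps):
        a = int(np.argmax(q_net(s[None, :].astype(np.float32))))  # greedy action
        s, _, done, _ = env.step(a)                                # classic Gym API
        if s[0] >= goal_position:                                  # s[0] is the car position
            reaches += 1
            s = env.reset()
        elif done:
            break
    return reaches
```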

5. Discussion

When comparing the results of MountainCar representing a sparse environment, it can be seen that the results of MountainCar are significantly improved compared to Cart-Pole. A study [9,10] conducted on various supervised expert loss functions showed that DQNs with such loss functions gradually improved. Moreover, it can be seen that they can also be improved using various other methods [29]. In particular, the supervised expert loss function proposed in this study yields better results in a sparse environment because, in a sparse environment, the more the agent explores, the greater the variation. It builds prior knowledge through the BC model and learns such knowledge in an environment in which the variation can be controlled to a certain extent. Although this study was based on a well-known algorithm such as a DQN, various studies on sparse environments were conducted. These studies can be used as other proposals for performance improvement in various environments and can be utilized as additional components. Previously, the author showed that various loss functions proposed in a discrete environment could be provided with enormous value, even in a continuous domain [44]. Therefore, the next step is to study the application of a continuous environment to various loss functions proposed in a DQN, allowing it to be applied to algorithms such as a deep deterministic policy gradient [45], which has a representative continuous environment.

6. Conclusions

We presented a newly updated loss function based on a Bayesian approach rather than a cross-entropy function, which is proposed for use as the expert loss function in the applied model (DQfD-BC). When training the agent, if the supervised expert loss function based on Bayesian information is utilized, it is possible to reduce the instability based on prior knowledge during the initial training process within a sparse environment. The results of the experiments conducted in this study, particularly in a sparse environment, were stable because in a sparse environment, the more exploration that is conducted by the agent, the greater the variation. This differs significantly from the results of previous studies. Therefore, the supervised expert loss function proposed based on a Bayesian approach can be used in environments where it is necessary to control the variance to a certain extent. However, if extensive studies on improvements to a DQN provide much better results, additional studies will be needed to adjust these learning results to work well within a continuous environment.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.; Veness, J.; Bellemare, M.; Graves, A.; Riedmiller, M.; Fidjeland, A.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
  2. Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. (JMLR) 2016, 17, 1–40.
  3. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Driessche, G.V.D.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489.
  4. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
  5. Hester, T.; Stone, P. TEXPLORE: Real-time sample-efficient reinforcement learning for robots. Mach. Learn. 2013, 90, 385–429.
  6. Zhao, Z.; Wang, Q.; Li, X. Deep reinforcement learning based lane detection and localization. Neurocomputing 2020, 413, 328–338.
  7. Li, J.; Shi, X.; Li, J.; Zhang, X.; Wang, J. Random curiosity-driven exploration in deep reinforcement learning. Neurocomputing 2020, 418, 139–147.
  8. Hester, T.; Vecerik, M.; Pietquin, O.; Lanctot, M.; Schaul, T.; Piot, B.; Gruslys, A. Deep Q-learning from demonstrations. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
  9. Behbahani, F.; Shiarlis, K.; Chen, X.; Kurin, V.; Kasewa, S.; Stirbu, C.; Gomes, J.; Paul, S.; Oliehoek, F.A.; Messias, J.; et al. Learning from demonstration in the wild. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 775–781.
  10. Hasselt, H.v.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100.
  11. Piot, B.; Geist, M.; Pietquin, O. Boosted and reward-regularized classification for apprenticeship learning. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems, Paris, France, 5–9 May 2014; pp. 1249–1256.
  12. Li, X.; Wang, X.; Zheng, X.; Jin, J.; Huang, Y.; Zhang, J.J.; Wang, F.-Y. SADRL: Merging human experience with machine intelligence via supervised assisted deep reinforcement learning. Neurocomputing 2022, 467, 300–309.
  13. Torabi, F.; Warnell, G.; Stone, P. Behavioral cloning from observation. arXiv 2018, arXiv:1805.01954.
  14. Bühler, A.; Gaidon, A.; Cramariuc, A.; Ambrus, R.; Rosman, G.; Burgard, W. Driving through ghosts: Behavioral cloning with false positives. arXiv 2020, arXiv:2008.12969.
  15. Hussein, A.; Gaber, M.M.; Elyan, E.; Jayne, C. Imitation learning: A survey of learning methods. ACM Comput. Surv. (CSUR) 2017, 50, 1–35.
  16. Ravichandar, H.; Polydoros, A.S.; Chernova, S.; Billard, A. Recent advances in robot learning from demonstration. Annu. Rev. Control Robot. Auton. Syst. 2020, 3, 297–330.
  17. Bojarski, M.; Del Testa, D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L.D.; Monfort, M.; Muller, U.; Zhang, J.; et al. End to end learning for self-driving cars. arXiv 2016, arXiv:1604.07316.
  18. Ross, S.; Gordon, G.J.; Bagnell, J.A. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), Ft. Lauderdale, FL, USA, 11–13 April 2011.
  19. Sun, W.; Venkatraman, A.; Gordon, G.J.; Boots, B.; Bagnell, J.A. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017.
  20. Ho, J.; Ermon, S. Generative adversarial imitation learning. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 4565–4573.
  21. Kang, B.; Jie, Z.; Feng, J. Policy optimization with demonstrations. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 2469–2478.
  22. Subramanian, K.; Isbell, C.L., Jr.; Thomaz, A. Exploration from demonstration for interactive reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Singapore, 9–13 May 2016.
  23. Brys, T.; Harutyunyan, A.; Suay, H.B.; Chernova, S.; Taylor, M.E.; Nowé, A. Reinforcement learning from demonstration through shaping. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015.
  24. Taylor, M.; Suay, H.; Chernova, S. Integrating reinforcement learning with human demonstrations of varying ability. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Taipei, Taiwan, 2–6 May 2011.
  25. Cederborg, T.; Grover, I.; Isbell, C.; Thomaz, A. Policy shaping with human teachers. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2015), Buenos Aires, Argentina, 25–31 July 2015.
  26. Lakshminarayanan, A.S.; Ozair, S.; Bengio, Y. Reinforcement learning with few expert demonstrations. In Proceedings of the NIPS Workshop on Deep Learning for Action and Interaction, Barcelona, Spain, 10 December 2016.
  27. Oh, J.; Guo, Y.; Singh, S.; Lee, H. Self-imitation learning. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 3878–3887.
  28. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998; Volume 1.
  29. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. In Proceedings of the International Conference on Learning Representations, San Juan, PR, USA, 2–4 May 2016.
  30. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
  31. MLE vs. MAP. Available online: https://agustinus.kristia.de/techblog/2017/01/01/mle-vs-map/ (accessed on 5 September 2022).
  32. Linear Regression: A Bayesian Point of View. Available online: https://agustinus.kristia.de/techblog/2017/01/05/bayesian-regression/ (accessed on 5 September 2022).
  33. Bayes Theorem, MLE/MAP. Available online: https://sanghyu.tistory.com/10 (accessed on 5 September 2022).
  34. Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012.
  35. Simple Mathematical Proof of Maximum Likelihood Estimation and Bayes Theorem. Available online: https://zzaebok.github.io/machine_learning/mle-bayes-mathematical-proof/ (accessed on 5 September 2022).
  36. Cart-Pole. Available online: https://www.gymlibrary.dev/environments/classic_control/cart_pole/ (accessed on 5 September 2022).
  37. Cart-Pole_DQN. Available online: https://github.com/rlcode/reinforcement-learning-kr/blob/master/2-cartpole/1-dqn/cartpole_dqn.py (accessed on 5 September 2022).
  38. Tensorflow. Available online: https://github.com/tensorflow/tensorflow (accessed on 5 September 2022).
  39. Keras. Available online: https://keras.io/ (accessed on 5 September 2022).
  40. OpenAI Gym. Available online: https://gym.openai.com/ (accessed on 5 September 2022).
  41. Available online: https://books.google.co.kr/books?id=_zczEAAAQBAJ&pg=PA262&lpg=PA262&dq=cart-pole+time+sensitive+networking&source=bl&ots=Lq_M7jueQb&sig=ACfU3U0DgRydx4RvvOuMIX45ssoNuTfaIw&hl=ko&sa=X&ved=2ahUKEwjrnfmTsqX6AhXPMHAKHdpaCkAQ6AF6BAgbEAM#v=onepage&q=cart-pole%20time%20sensitive%20networking&f=false (accessed on 5 September 2022).
  42. MountainCar. Available online: https://www.gymlibrary.dev/environments/classic_control/mountain_car/ (accessed on 5 September 2022).
  43. MountainCar-v0. Available online: https://github.com/shivaverma/OpenAIGym/blob/master/mountain-car/MountainCar-v0.py (accessed on 5 September 2022).
  44. Kim, C. Temporal Consistency-Based Loss Function for Both Deep Q-Networks and Deep Deterministic Policy Gradients for Continuous Actions. Symmetry 2021, 13, 2411.
  45. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the 4th ICLR 2016, San Diego, CA, USA, 2–4 May 2016. Available online: https://arxiv.org/abs/1509.02971 (accessed on 5 September 2022).
Figure 1. DQfD with the BC model proposed within the framework of this study.
Figure 2. Cart-Pole [36] in OpenAI Gym [40].
Figure 3. Best case comparison between the previous model (DQfD-BC with cross entropy) and the proposed model (DQfD-BC with Bayesian approach), which finished the shortest episode in time before reaching 200 steps.
Figure 4. Average case comparison between the previous model (DQfD-BC with cross entropy) and the proposed model (DQfD-BC with Bayesian approach), which finished the episode within the average amount of time before reaching 200 steps.
Figure 5. A worst-case comparison between the previous model (DQfD-BC with cross entropy) and the proposed model (DQfD-BC with Bayesian approach), which could not finish the episode because the reward was not achieved, even when it reached 200 steps.
Figure 6. Another worst-case comparison between the previous model (DQfD-BC with cross entropy) and the proposed model (DQfD-BC with Bayesian approach), which could not finish the episode because the reward was not satisfied, even when it reached 200 steps.
Figure 7. Average reward comparison according to the time-sensitive networking between the previous model (DQfD-BC with cross entropy) and proposed model (DQfD-BC with Bayesian approach) based on bootstrapping to the value of the next state predicted by the time-sensitive network.
Figure 8. MountainCar [42] in OpenAI Gym [40].
Figure 9. A best-case comparison between the previous model (DQfD-BC with cross entropy) and the proposed model (DQfD-BC with Bayesian approach), which reached the goals the greatest number of times during a single episode.
Figure 10. A worst-case comparison between the previous model (DQfD-BC with cross entropy) and the proposed model (DQfD-BC with Bayesian approach), which achieved the slowest learning in one episode.
Figure 11. A comparison between the previous model (DQfD-BC with cross entropy) and the proposed model (DQfD-BC with Bayesian approach), which showed “how many times the goals” were reached during each episode. The results of this comparison showed that the method (DQfD-BC with Bayesian approach) proposed in this paper could be suitable for a sparse environment.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
