Article

Analysis of Performance Measure in Q Learning with UCB Exploration

Weicheng Ye 1,* and Dangxing Chen 2,*
1 Credit Suisse Securities, New York, NY 10010-3698, USA
2 Zu Chongzhi Center for Mathematics and Computational Sciences, Duke Kunshan University, Kunshan 215316, China
* Authors to whom correspondence should be addressed.
Most of this work was done while the author was at Carnegie Mellon University. The opinions expressed in this paper are those of the author and do not reflect the views of Credit Suisse.
Mathematics 2022, 10(4), 575; https://doi.org/10.3390/math10040575
Submission received: 3 January 2022 / Revised: 27 January 2022 / Accepted: 5 February 2022 / Published: 12 February 2022
(This article belongs to the Special Issue Mathematical Method and Application of Machine Learning)

Abstract

Compared to model-based Reinforcement Learning (RL) approaches, model-free RL algorithms, such as Q-learning, require less space and are more expressive, since specifying value functions or policies is more flexible than specifying a model of the environment. This makes model-free algorithms more prevalent in modern deep RL. However, model-based methods can extract the information from available data more efficiently. The Upper Confidence Bound (UCB) bandit can improve the exploration bonuses, and hence increase the data efficiency, in the Q-learning framework. The cumulative regret of the Q-learning algorithm with a UCB exploration policy in the episodic Markov Decision Process has recently been studied for underlying environments with finite state-action space. In this paper, we study the regret bound of the Q-learning algorithm with UCB exploration in the scenario of a compact state-action metric space. We present an algorithm that adaptively discretizes the continuous state-action space and iteratively updates Q-values. The algorithm is able to efficiently optimize rewards and minimize cumulative regret.

1. Background

1.1. Markov Decision Process

The Markov Decision Process (MDP) is a discrete-time sequential decision-making problem. In an MDP, there is a finite set of states $S$ and a finite set of actions $A$. At time-step $t$, the agent is at some state $s \in S$ and takes some action $a \in A$. According to the state transition probability function $P$, the agent moves to a new state $s' \in S$ at time $t+1$, where $s' \sim P(\cdot \mid s, a)$. The agent then receives an immediate reward $r(s', a)$, the reward for transitioning to state $s'$ due to action $a$. In an MDP, for a given pair $(s, a)$, the state transition process satisfies the Markov property: the process is conditionally independent of all past states and actions. As a special case, if there is only one action in each state (a fixed transition probability in each state) and all rewards are zero, then the MDP reduces to a Markov chain. A policy $\pi$ is a map from states to actions ($\pi: S \to A$). If the environment has finitely many time-steps, we call it a finite time horizon; otherwise, the environment has an infinite time horizon. The core problem in an MDP is to find an optimal policy that maximizes a cumulative function of the random rewards over a finite or infinite time horizon. A typical cumulative function of the random rewards in the finite-horizon case is the sum of rewards: for a specific policy $\pi$ with finite time horizon $H$, it is $\mathbb{E}\big[ \sum_{t=0}^{H} r(s_{t+1}, \pi(s_t)) \big]$, where $s_{t+1} \sim P(\cdot \mid s_t, \pi(s_t))$.
An MDP hence contains two important functions: the reward function $r$ and the state transition function $P$. If we know both functions, that is, if we can predict which reward will be received and which state will follow each transition, then the MDP can be solved by Dynamic Programming (DP). DP requires the full description of the MDP, with known transition probabilities and reward functions.
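To make the DP route concrete, the following minimal sketch runs backward (finite-horizon) value iteration on a small made-up MDP with two states and two actions; the transition tensor, rewards, and horizon are illustrative assumptions, not taken from this paper.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP with fully known dynamics.
# P[s, a, s'] = transition probability, r[s, a] = immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[0.0, 1.0],
              [0.5, 0.2]])
H = 5  # finite time horizon

# Backward dynamic programming: V_h(s) = max_a [ r(s, a) + sum_s' P(s'|s, a) V_{h+1}(s') ].
V = np.zeros(2)                       # V_{H+1} = 0
policy = np.zeros((H, 2), dtype=int)
for h in reversed(range(H)):
    Q = r + P @ V                     # Q[s, a] = r(s, a) + E[V_{h+1}(s')]
    policy[h] = Q.argmax(axis=1)
    V = Q.max(axis=1)

print("optimal values at step 1:", V)
print("optimal action per (step, state):\n", policy)
```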

1.2. Model-Based and Model-Free Reinforcement Learning

However, in most cases, the reward function $r$ and the state transition function $P$ are not precisely known. The purpose of reinforcement learning (RL) is to solve the MDP when the reward and transition functions are unknown. One approach in RL is to learn a model of how the environment works from observations and then find a solution using that model. For example, with a new state-action pair and the observation of the corresponding reward, we can update the estimates of $r$ and $P$. By simulating different actions in many states, we can obtain a learned transition function and a learned reward function with high confidence. If we model the environment adequately, we can use the learned model to find a policy. This approach is called model-based RL. However, model-based RL requires many trials to learn a particular task, independent of whether policy iteration or value iteration is chosen; hence, model-based RL methods are often inapplicable to large-scale systems. Alternatively, we can skip learning a model and directly find a policy. For example, the Q-learning algorithm learns the MDP and solves it for the optimal policy at the same time. Q-learning directly estimates the optimal Q-value for each action in each state, and the policy can be derived by choosing the action with the highest Q-value in the given state. Since we do not learn a model of the environment, we call this model-free RL. The core idea of Q-learning is to balance the exploration-exploitation trade-off: trying random actions to explore the underlying MDP while following the so-far optimal policy to maximize the long-term reward.
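As a minimal sketch of the model-free route just described, the loop below runs tabular Q-learning on a small randomly generated MDP: it never estimates $P$ or $r$, and the policy is read off from the learned Q-table. The toy environment, the epsilon-greedy exploration (which the rest of this paper replaces with a UCB bonus), and all hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, eps, lr = 4, 2, 0.9, 0.1, 0.1

# Toy environment, used only as a black box that returns (reward, next state).
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over next states
R = rng.uniform(0, 1, size=(nS, nA))

Q = np.zeros((nS, nA))
s = 0
for step in range(50_000):
    # Exploration vs. exploitation: epsilon-greedy here; the paper studies a UCB bonus instead.
    a = rng.integers(nA) if rng.random() < eps else int(Q[s].argmax())
    s_next = rng.choice(nS, p=P[s, a])
    # Model-free update toward the sampled Bellman target.
    Q[s, a] += lr * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("learned Q-table:\n", Q.round(2))
print("greedy policy:", Q.argmax(axis=1))
```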

2. Introduction

Reinforcement Learning (RL), as a learning problem of maximizing a numerical reward, learns how to map situations to actions under unknown underlying system dynamics [1]. While model-based RL can produce high-quality and effective decisions, stability issues arise when scaling it up to approximate settings with high-dimensional state and action spaces [2,3]. Moreover, due to its modeling assumptions, model-based RL suffers from model bias, especially when samples are few and informative prior knowledge is limited [4,5,6]. By contrast, model-free RL algorithms directly update the value and policy functions without any assumption on the model of the environment [1]. Q-learning, the most classical model-free RL algorithm, was introduced by Watkins in his Ph.D. dissertation; Watkins constructed the action-replay process, which achieves convergence to the optimal Q function in the finite state and action space setting [7,8]. This approach is considered a simpler way of learning optimality in a controlled Markov domain. From the classical Q-learning algorithm to modern Deep Q-Networks [9], Advantage Actor-Critic [10], and Trust Region Policy Optimization [11], most state-of-the-art RL algorithms are model-free. Model-free RL algorithms require less space and are more expressive, since specifying value functions or policies is more flexible than specifying a model of the environment. The advantages of model-free methods underlie their success in deep RL applications [12].
However, model-based methods are generally more promising than model-free approaches such as Q-learning for efficiently extracting information from the available data [6,11]. The demand for data efficiency in model-free RL is hence increasing. The key to good data efficiency generally lies in the tradeoff between exploration and exploitation. Bandit algorithms, which explore an uncertain environment while maximizing reward, have been adopted to obtain good data efficiency [12,13,14]. Q-learning with the common $\epsilon$-greedy exploration strategy can take exponentially many episodes to learn [12]. The Upper Confidence Bound (UCB) algorithm, a multi-armed bandit algorithm, assigns exploration bonuses in many stochastic settings [15,16,17,18]. The UCB exploration policy is hence considered a tool for improving the exploration bonuses in Q-learning [12]. This approach can improve data efficiency while retaining the advantages of model-free methods. However, the Q-learning algorithm with UCB exploration has so far been restricted to environments with finite state and action spaces [7,12,19]. Due to the natural settings of RL applications, RL over an underlying MDP with a continuous state and action space has long been an intractable and challenging problem [20,21]. For example, most robotic applications of RL require continuous state spaces defined by continuous variables such as position, velocity, and torque. Although Q-learning is commonly applied to problems with discrete states and actions, it has been extended to continuous spaces under different scenarios [22,23]. This motivates us to extend the state-of-the-art Q-learning algorithm equipped with a UCB exploration policy to a continuous state-action space environment. In modern RL, an algorithm that can combine the advantages of model-free methods with data efficiency in a continuous state-action space is in demand.
Our contributions are: (1) the extension of a state-of-the-art Q-learning algorithm with a UCB exploration policy to compact state-action metric spaces; and (2) a tight bound on the cumulative regret of the resulting algorithm using covering dimension techniques. We provide a rigorous analysis that bounds the regret through concentration-of-measure arguments and combinatorial approaches.

3. Notation and Settings

3.1. Q-Learning

We consider a setting of an episodic Markov decision process MDP$(S, A, H, P, r)$, where the state space is a compact metric space $(S, d_S)$ and the action space is also a compact metric space $(A, d_A)$; $S \times A$ is then a product compact metric space with metric $d_{S \times A}$. $H$ is the number of steps in each episode. $P$ is the state transition function: the transition from the current state-action pair in $S \times A$ to the next state satisfies $x' \sim P(\cdot \mid x, a)$. The reward function at step $h \in [H]$ is $r_h: S \times A \to [0, 1]$. Since we discuss finite-time-horizon Q-learning here, no discount factor is included in the value function. Let the policy function be $\pi: S \to A$. Denote by $V_h^\pi: S \to \mathbb{R}$ the value function at step $h$ under policy $\pi$; $V_h^\pi(x)$ stands for the expected sum of the remaining rewards collected from $x_h = x$ to the end of the episode under policy $\pi$. Mathematically,
\[ V_h^\pi(x) := \mathbb{E}\Big[ \sum_{h'=h}^{H} r_{h'}\big(x_{h'}, \pi_{h'}(x_{h'})\big) \,\Big|\, x_h = x \Big]. \]
Accordingly, we define $Q_h^\pi: S \times A \to \mathbb{R}$ as the Q-value function at step $h$ under policy $\pi$; it gives the expected sum of the remaining rewards, starting from $x_h = x$, $a_h = a$, to the end of the episode. Mathematically,
\[ Q_h^\pi(x, a) := r_h(x, a) + \mathbb{E}\Big[ \sum_{h'=h+1}^{H} r_{h'}\big(x_{h'}, \pi_{h'}(x_{h'})\big) \,\Big|\, x_h = x, a_h = a \Big]. \]
For simplicity, we define the expected future value
\[ P_h V_h(x, a) := \mathbb{E}_{x' \sim P(\cdot \mid x, a)}\, V_{h+1}(x'), \]
where $V_{h+1}$ is a proxy of the value function. Its empirical counterpart in episode $k$, with $x_{h+1}^k$ the next observed state from $(x, a)$, is
\[ \hat{P}_h^k V_h(x, a) := V_{h+1}(x_{h+1}^k). \]
The above notations follow [12]. Define the optimal policy $\pi^*$ through the optimal value $V_h^*(x) = \sup_\pi V_h^\pi(x)$. The Bellman equation for any (deterministic) policy $\pi$ is
\[ V_h^\pi(x) = Q_h^\pi\big(x, \pi(x)\big), \qquad Q_h^\pi(x, a) = r_h(x, a) + P_h V_h^\pi(x, a). \]
In addition, $V^*$ and $Q^*$ are the value and Q-value functions associated with the optimal policy:
\[ V_h^*(x) = \sup_{a \in A} Q_h^*(x, a), \qquad Q_h^*(x, a) = r_h(x, a) + P_h V_h^*(x, a). \]

3.2. Lipschitz Assumption

In this work, we assume $V^*$ is Lipschitz-continuous with Lipschitz constant $D$. We also assume that the next state generated from the current state-action pair is Lipschitz-continuous in the following sense: let $x_{h+1}$ be the next state from $(x_h, a_h)$ and $y_{h+1}$ the next state from $(y_h, b_h)$; then $d_S(x_{h+1}, y_{h+1}) \le M \cdot d_{S \times A}\big((x_h, a_h), (y_h, b_h)\big)$ with probability 1 for some constant $M$. This implies that $\hat{P} V^*$ is also Lipschitz-continuous:
\[ \big| \hat{P}_h V_h^*(x_h, a_h) - \hat{P}_h V_h^*(y_h, b_h) \big| = \big| V_{h+1}^*(x_{h+1}) - V_{h+1}^*(y_{h+1}) \big| \le D \cdot d_S(x_{h+1}, y_{h+1}) \le D \cdot M \cdot d_{S \times A}\big((x_h, a_h), (y_h, b_h)\big) = L \cdot d_{S \times A}\big((x_h, a_h), (y_h, b_h)\big) \]
for $L = D \cdot M$. That is, the Lipschitz constant of $\hat{P} V^*$ is $L$.

3.3. Performance Measure

Our main interest is to provide a practical algorithm with a provable bound on the regret. Let $K$ be the total number of episodes. At the beginning of each episode $k$, the adversary picks a starting state $x_1^k$, and the agent chooses a corresponding greedy policy $\pi_k$ before the start of episode $k$. Recall that $V_h^{\pi_k}$ is the value function of $\pi_k$ in episode $k$. The cumulative regret (performance measure) over $K$ episodes is
\[ \mathrm{Regret}(K) := \sum_{k=1}^{K} \big( V_1^*(x_1^k) - V_1^{\pi_k}(x_1^k) \big). \]

3.4. Iterative Updating Rule

Since $S \times A$ is a compact metric space, the algorithm partitions the space into a discrete framework of balls. In the algorithm, for $(x, a) \in S \times A$ at episode $k$, we look at all past state-action pairs that lie in the same ball $B(x, a)$ as $(x, a)$, from the episodes up to $k - 1$. Suppose that at a past episode $k_i \in \{1, \ldots, k-1\}$ and step $h$, the pair $(x_h^{k_i}, a_h^{k_i}) \in B(x, a)$ was selected; we call $(x_h^{k_i}, a_h^{k_i})$ an active point. Denote by $n_h^k(x, a)$ the total number of state-action pairs $(x_h^{k_i}, a_h^{k_i})$ in $B(x, a)$ at step $h$ up to episode $k$; the corresponding observed state at step $h+1$ in episode $k_i$ is $x_{h+1}^{k_i}$. We write $Q_h^k$, $V_h^k$, $n_h^k$ for $Q_h$, $V_h$, and $n_h$ at the beginning of episode $k$. We also use a bandit term $b_t$, whose details are given later.
Definition 1
(Iterative updating rule of $Q_h^k$).
\[ Q_h^{k+1}(x, a) = \begin{cases} (1 - \alpha_t)\, Q_h^k(x, a) + \alpha_t \big[ r_h(x, a) + V_{h+1}^k(x_{h+1}^k) + b_t \big] & \text{if } (x_h^k, a_h^k) \in B(x, a); \\ Q_h^k(x, a) & \text{otherwise.} \end{cases} \]
Here, $t$ counts how many times the algorithm has visited $B(x, a)$ at step $h$, so $t = n_h^k(x, a)$. Accordingly, the iterative updating rule for $V_h^k$ is
\[ V_h^k(x) \leftarrow \min\Big\{ H, \sup_{a' \in A} Q_h^k(x, a') \Big\}, \qquad \forall x \in S. \]
For the learning rate $\alpha_t$ in the updating rule, we choose $\alpha_t := \frac{1}{t}$. We then define the following related quantities:
\[ \alpha_t^0 := \prod_{j=1}^{t} (1 - \alpha_j), \qquad \alpha_t^i := \alpha_i \cdot \prod_{j=i+1}^{t} (1 - \alpha_j). \]
The above definition yields the following properties that are useful in our analysis.
Lemma 1.
The following properties hold for $\alpha_t$ and $\alpha_t^i$:
1. $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$;   2. $\frac{1}{\sqrt{t}} \le \sum_{i=1}^{t} \frac{\alpha_t^i}{\sqrt{i}} \le \frac{2}{\sqrt{t}}$.
Proof. 
The first property is trivial. We prove the second property by induction on $t$.
Let us check the right-hand side. Clearly the statement is true for $t = 1$. Assume the result holds for some $t - 1$, i.e., $\sum_{i=1}^{t-1} \frac{\alpha_{t-1}^i}{\sqrt{i}} \le \frac{2}{\sqrt{t-1}}$. Note that $\alpha_t^i = \alpha_t = \frac{1}{t}$. Using the induction hypothesis, we can write
\[ \sum_{i=1}^{t} \frac{\alpha_t^i}{\sqrt{i}} = \frac{\alpha_t^t}{\sqrt{t}} + (1 - \alpha_t) \sum_{i=1}^{t-1} \frac{\alpha_{t-1}^i}{\sqrt{i}} = \frac{\alpha_t}{\sqrt{t}} + (1 - \alpha_t) \sum_{i=1}^{t-1} \frac{\alpha_{t-1}^i}{\sqrt{i}} \le \frac{\alpha_t}{\sqrt{t}} + (1 - \alpha_t) \frac{2}{\sqrt{t-1}}. \]
It remains to prove that $\frac{\alpha_t}{\sqrt{t}} + (1 - \alpha_t) \frac{2}{\sqrt{t-1}} \le \frac{2}{\sqrt{t}}$, i.e., that $2(\sqrt{t} - \sqrt{t-1}) \le (2\sqrt{t} - \sqrt{t-1})\, \alpha_t$ with $\alpha_t = \frac{1}{t}$.
Clearly, we have
\[ \frac{2\sqrt{t} - \sqrt{t-1}}{2(\sqrt{t} - \sqrt{t-1})} = \frac{1}{2}\big(\sqrt{t} + \sqrt{t-1}\big)\big(2\sqrt{t} - \sqrt{t-1}\big) = \frac{1}{2}\Big(t + 1 + \sqrt{t(t-1)}\Big) \ge \frac{1}{2}\big(t + 1 + t - 1\big) = t. \]
We have hence proven the right-hand side (upper bound). By the same induction argument, the statement for the left-hand side (lower bound) reduces to showing $\frac{\alpha_t}{\sqrt{t}} + (1 - \alpha_t) \frac{1}{\sqrt{t-1}} \ge \frac{1}{\sqrt{t}}$. This is immediate, since $\alpha_t = \frac{1}{t}$.
   □
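Property 2 of Lemma 1 is easy to verify numerically; the short sketch below computes the weights $\alpha_t^i$ for the learning rate $\alpha_t = \frac{1}{t}$ directly from their definition and checks that $\sum_{i=1}^{t} \frac{\alpha_t^i}{\sqrt{i}}$ lies between $\frac{1}{\sqrt{t}}$ and $\frac{2}{\sqrt{t}}$ (the specific values of $t$ are arbitrary).

```python
import numpy as np

def alpha_weights(t: int) -> np.ndarray:
    """alpha_t^i = alpha_i * prod_{j=i+1}^t (1 - alpha_j), with alpha_j = 1/j."""
    alpha = 1.0 / np.arange(1, t + 1)
    return np.array([alpha[i - 1] * np.prod(1.0 - alpha[i:]) for i in range(1, t + 1)])

for t in [1, 2, 5, 50, 1000]:
    w = alpha_weights(t)               # for alpha_j = 1/j these all collapse to 1/t
    s = np.sum(w / np.sqrt(np.arange(1, t + 1)))
    assert 1 / np.sqrt(t) <= s <= 2 / np.sqrt(t) + 1e-12
    print(f"t = {t:5d}   sum = {s:.4f}   bounds = [{1/np.sqrt(t):.4f}, {2/np.sqrt(t):.4f}]")
```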
By the updating rule for $Q$ in Definition 1 and the definition of the learning rate in (7), we have
\[ Q_h^k(x, a) = \alpha_t^0 \cdot H + \sum_{i=1}^{t} \alpha_t^i \big[ r_h(x, a) + V_{h+1}^{k_i}(x_{h+1}^{k_i}) + b_i \big]. \]

4. Performance Measure Algorithms and Theorems

Next, we present the main algorithm and the main theorem.
Theorem 1.
For any $p \in (0, 1)$, if we choose the bandit term $b_t = c\sqrt{\frac{H^3 \iota}{t}}$, where $\iota := \log \frac{K}{p}$ and $c$ is some absolute constant (specified later in the analysis), then under certain conditions (stated later in this paper), the Q-learning with UCB exploration algorithm (Algorithm 1) has performance measure $\mathrm{Regret}(K) = \sum_{k=1}^{K} \big( V_1^*(x_1^k) - V_1^{\pi_k}(x_1^k) \big)$ bounded above by $\tilde{O}\big(K^{\frac{d+1}{d+2}}\big)$ with probability $1 - p$, where $d$ is the covering dimension of the underlying space $S \times A$.
The corresponding main algorithm is the following:
Algorithm 1 Q-learning with UCB bandit in metric space.
1: Initialize a covering $\mathcal{B}$ of $S \times A$ with balls of radius $r = 1$. Initialize $Q_h(x, a) \leftarrow H$ and $n_h(x, a) \leftarrow 0$ for all $(x, a)$, and $V_h(x) \leftarrow H$ for all $x$. Choose $p \in (0, 1)$.
2: for $k = 1, 2, \ldots, K$ do
3:     receive $x_1$ from random sampling.
4:     for $h = 1, 2, \ldots, H$ do
5:         if $r \le \frac{c}{8L}\sqrt{\frac{H^3 \iota}{1 + n_h(x_h, a_h)}}$ is violated then
6:             for the violated ball $B(x_h, a_h)$, set $r \leftarrow \frac{r}{2}$, keeping the same center.
7:             while there exists some region in $S \times A$ not covered by $\mathcal{B}$ do
8:                 $\mathcal{B} \leftarrow \mathcal{B} \cup B_r(x, a)$
9:         update $n_h(x_h, a_h) \leftarrow n_h(x_h, a_h) + 1$.
10:        take action $a_h \leftarrow \arg\max_{a' \in A} Q_h(x_h, a')$ and observe $x_{h+1}$.
11:        iteratively update $Q_h(x, a)$ and $V_h(x)$ for $(x, a) \in B(x_h, a_h)$:
           $Q_h(x_h, a_h) \leftarrow (1 - \alpha_t)\, Q_h(x_h, a_h) + \alpha_t \big[ r_h(x_h, a_h) + V_{h+1}(x_{h+1}) + b_t \big]$.
12:        $V_h(x_h) \leftarrow \min\big\{ H, \sup_{a' \in A} Q_h(x_h, a') \big\}$.
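The following is a minimal, self-contained sketch of Algorithm 1, assuming a one-dimensional state and action space $S = A = [0, 1]$ so that $S \times A$ is the unit square under the max metric. The toy environment dynamics, the choice of constants $c$ and $L$, the random starting states, and the axis-aligned quad-split used in place of a general ball covering are all illustrative simplifications, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(1)
H, K, c, L, p = 3, 2000, 1.0, 1.0, 0.1
iota = np.log(K / p)

def new_ball(x, a, r, Q, n=0):
    # A "ball" in S x A = [0, 1]^2 under the max metric, carrying its own Q estimate.
    return {"x": x, "a": a, "r": r, "n": n, "Q": Q}

# One coarse ball per step h; Q is optimistically initialized to H.
balls = [[new_ball(0.5, 0.5, 0.5, float(H))] for _ in range(H)]

def state_slice(balls_h, x):
    return [B for B in balls_h if abs(B["x"] - x) <= B["r"]]

def ball_of(balls_h, x, a):
    # Smallest ball containing the pair (x, a).
    cands = [B for B in balls_h if max(abs(B["x"] - x), abs(B["a"] - a)) <= B["r"]]
    return min(cands, key=lambda B: B["r"])

def V(balls_h, x):
    # V_h(x) = min(H, sup_a Q_h(x, a)), with the sup taken over ball centers.
    return min(float(H), max(B["Q"] for B in state_slice(balls_h, x)))

def maybe_split(balls_h, B):
    # Refine the ball when the radius condition of Algorithm 1 is violated.
    if B["r"] > (c / (8 * L)) * np.sqrt(H**3 * iota / (1 + B["n"])):
        balls_h.remove(B)
        for dx in (-0.5, 0.5):
            for da in (-0.5, 0.5):
                balls_h.append(new_ball(B["x"] + dx * B["r"], B["a"] + da * B["r"],
                                        B["r"] / 2, B["Q"], B["n"]))

def env_step(x, a):
    # Toy dynamics (an assumption of this sketch): the reward peaks when a matches x.
    reward = 1.0 - abs(x - a)
    x_next = float(np.clip(0.7 * x + 0.3 * a + 0.05 * rng.standard_normal(), 0, 1))
    return reward, x_next

for k in range(1, K + 1):
    x = rng.random()                                             # starting state of episode k
    for h in range(H):
        B = max(state_slice(balls[h], x), key=lambda b: b["Q"])  # greedy action selection
        a = B["a"]
        maybe_split(balls[h], B)
        B = ball_of(balls[h], x, a)
        B["n"] += 1
        t = B["n"]
        alpha, bonus = 1.0 / t, c * np.sqrt(H**3 * iota / t)     # alpha_t and bandit term b_t
        reward, x_next = env_step(x, a)
        v_next = 0.0 if h == H - 1 else V(balls[h + 1], x_next)
        # Update of Definition 1, stored once per ball (every (x, a) in the ball shares it).
        B["Q"] = (1 - alpha) * B["Q"] + alpha * (reward + v_next + bonus)
        x = x_next

print("number of balls per step h after K episodes:", [len(bs) for bs in balls])
```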
The main theorem is a bound for the cumulative regret in (4).
In the following sections, we develop several useful techniques to bound the performance measure. We provide a novel and rigorous analysis that can also be valuable for the analysis of Q-learning problems in general.

5. Main Analysis

5.1. Concentration of Measure

First, we use the following lemmas to bound $Q^k - Q^*$ with high concentration. Recall the chosen bandit term $b_t = c\sqrt{\frac{H^3 \iota}{t}}$. Let $\beta_t := \sum_{i=1}^{t} \alpha_t^i b_i$. The next-step empirical observation for $(x, a)$ at step $h$ in episode $k$ is $\hat{P}_h^k V_h(x, a) := V_{h+1}(x_{h+1}^k)$.
Lemma 2.
For any $(x, a, h, k) \in S \times A \times [H] \times [K]$, let $t = n_h^k(x, a)$ and suppose $B(x, a)$ was selected previously at step $h$ of episodes $\{k_i\}_{i=1}^{t}$. We have
\[ (Q_h^k - Q_h^*)(x, a) = \alpha_t^0 \big( H - Q_h^*(x, a) \big) + \sum_{i=1}^{t} \alpha_t^i \Big[ \big( V_{h+1}^{k_i} - V_{h+1}^* \big)(x_{h+1}^{k_i}) + \big( \hat{P}_h^{k_i} - P_h \big) V_h^*(x, a) + \hat{P}_h^{k_i} V_h^*(x_h^{k_i}, a_h^{k_i}) - \hat{P}_h^{k_i} V_h^*(x, a) + b_i \Big]. \]
Proof. 
From the Bellman optimality Equation (2), $Q_h^*(x, a) = (r_h + P_h V_h^*)(x, a)$. Using the fact that $\sum_{i=0}^{t} \alpha_t^i = 1$, which follows from the definition of the learning rate, $Q_h^*(x, a) = \alpha_t^0 Q_h^*(x, a) + \sum_{i=1}^{t} \alpha_t^i \cdot Q_h^*(x, a) = \alpha_t^0 Q_h^*(x, a) + \sum_{i=1}^{t} \alpha_t^i \big[ r_h(x, a) + P_h V_h^*(x, a) \big]$. From (8),
\[ (Q_h^k - Q_h^*)(x, a) = \alpha_t^0 \big( H - Q_h^*(x, a) \big) + \sum_{i=1}^{t} \alpha_t^i \big[ V_{h+1}^{k_i}(x_{h+1}^{k_i}) - P_h V_h^*(x, a) + b_i \big], \]
and hence
\[ (Q_h^k - Q_h^*)(x, a) = \alpha_t^0 \big( H - Q_h^*(x, a) \big) + \sum_{i=1}^{t} \alpha_t^i \Big[ \big( V_{h+1}^{k_i} - V_{h+1}^* \big)(x_{h+1}^{k_i}) + \big( \hat{P}_h^{k_i} - P_h \big) V_h^*(x, a) + b_i \Big] + \sum_{i=1}^{t} \alpha_t^i \Big[ V_{h+1}^*(x_{h+1}^{k_i}) - \hat{P}_h^{k_i} V_h^*(x, a) \Big]. \]
Note that $x_{h+1}^{k_i}$ is the next state from $(x_h^{k_i}, a_h^{k_i})$, where $(x_h^{k_i}, a_h^{k_i}) \in B(x, a)$ by Definition 1. Additionally, $V_{h+1}^*(x_{h+1}^{k_i}) - \hat{P}_h^{k_i} V_h^*(x, a)$ satisfies
\[ V_{h+1}^*(x_{h+1}^{k_i}) - \hat{P}_h^{k_i} V_h^*(x, a) = V_{h+1}^*(x_{h+1}^{k_i}) - \hat{P}_h^{k_i} V_h^*(x_h^{k_i}, a_h^{k_i}) + \hat{P}_h^{k_i} V_h^*(x_h^{k_i}, a_h^{k_i}) - \hat{P}_h^{k_i} V_h^*(x, a) = \hat{P}_h^{k_i} V_h^*(x_h^{k_i}, a_h^{k_i}) - \hat{P}_h^{k_i} V_h^*(x, a), \]
where $V_{h+1}^*(x_{h+1}^{k_i}) = \hat{P}_h^{k_i} V_h^*(x_h^{k_i}, a_h^{k_i})$ by (1). Combining (10) and (11), we obtain the statement of Lemma 2. □
To complete the concentration of $Q^k - Q^*$, we bound each term in (9).
Lemma 3.
\[ \Big| \sum_{i=1}^{t} \alpha_t^i \Big( \hat{P}_h^{k_i} V_h^*(x, a) - \hat{P}_h^{k_i} V_h^*(x_h^{k_i}, a_h^{k_i}) \Big) \Big| \le \frac{\beta_t}{4}. \]
Proof. 
Note that $\beta_t = \sum_{i=1}^{t} \alpha_t^i b_i = \sum_{i=1}^{t} \alpha_t^i\, c\sqrt{\frac{H^3 \iota}{i}} = c\sqrt{H^3 \iota} \sum_{i=1}^{t} \frac{\alpha_t^i}{\sqrt{i}} \ge c\sqrt{\frac{H^3 \iota}{t}}$ by Lemma 1. Since $(x_h^{k_i}, a_h^{k_i}) \in B(x, a)$ for $i = 1, \ldots, t$, the distance between $(x_h^{k_i}, a_h^{k_i})$ and $(x, a)$ is bounded by the radius of the ball. As Algorithm 1 proceeds, it keeps refining the balls, and hence this bound becomes smaller. By the Lipschitz property of $\hat{P} V^*$ in (3), $\big| \hat{P}_h^{k_i} V_h^*(x, a) - \hat{P}_h^{k_i} V_h^*(x_h^{k_i}, a_h^{k_i}) \big| \le L \cdot d_{S \times A}\big( (x, a), (x_h^{k_i}, a_h^{k_i}) \big)$.
Following Algorithm 1 and Lemma 1,
\[ \Big| \sum_{i=1}^{t} \alpha_t^i \Big( \hat{P}_h^{k_i} V_h^*(x, a) - \hat{P}_h^{k_i} V_h^*(x_h^{k_i}, a_h^{k_i}) \Big) \Big| \le \sum_{i=1}^{t} \alpha_t^i \cdot \frac{c}{8}\sqrt{\frac{H^3 \iota}{1 + n_h^k(x, a)}} \le \frac{\beta_t}{4}. \quad \square \]
Lemma 4.
For any $p \in (0, 1)$, with probability at least $1 - p$, we have
\[ 0 \le \big( Q_h^k - Q_h^* \big)(x, a) \le \alpha_t^0 H + \sum_{i=1}^{t} \alpha_t^i \big( V_{h+1}^{k_i} - V_{h+1}^* \big)(x_{h+1}^{k_i}) + \frac{3}{2}\beta_t. \]
The proof of Lemma 4 is similar to Lemma 4.3 in [12].
Proof. 
Based on the definition in (7),
\[ \beta_t = \sum_{i=1}^{t} \alpha_t^i b_i = \sum_{i=1}^{t} \frac{1}{t}\, c\sqrt{\frac{H^3 \iota}{i}} = \frac{c\sqrt{H^3 \iota}}{t} \sum_{i=1}^{t} \frac{1}{\sqrt{i}} \le \frac{c\sqrt{H^3 \iota}}{t} \cdot 2\sqrt{t} < 4 c\sqrt{\frac{H^3 \iota}{t}}. \]
Recall $t = n_h^k(x, a)$. Let $k_i = \min\big( \{ k' \in [K] : k' > k_{i-1},\ (x_h^{k'}, a_h^{k'}) \in B(x, a) \} \cup \{ K + 1 \} \big)$. That is, $k_i$ is the episode in which $B(x, a)$ was selected at step $h$ for the $i$-th time (or $k_i = K + 1$ if it was selected fewer than $i$ times). The random variable $k_i$ is a stopping time. Let $\mathcal{F}_i$ be the $\sigma$-field generated by all the random variables up to episode $k_i$ at step $h$. Then $\big( \mathbb{1}\{k_i \le K\} \cdot (\hat{P}_h^{k_i} - P_h) V_h^*(x, a) \big)_{i=1}^{t}$ is a martingale difference sequence with respect to the filtration $\{\mathcal{F}_i\}_{i \ge 0}$. The differences are bounded: for any $i$, $\big| \alpha_t^i (\hat{P}_h^{k_i} - P_h) V_h^*(x, a) \big| \le \alpha_t^i \cdot H$. Applying the Azuma-Hoeffding inequality, for any $s > 0$ we have
\[ \mathbb{P}\Big( \Big| \sum_{i=1}^{t} \alpha_t^i \big( \hat{P}_h^{k_i} - P_h \big) V_h^*(x, a) \Big| \le s \Big) \ge 1 - 2\exp\Big( \frac{-s^2}{2 H^2 \sum_{i=1}^{t} (\alpha_t^i)^2} \Big). \]
There exists some absolute constant $c$ such that $\sqrt{2 H^2 \sum_{i=1}^{t} (\alpha_t^i)^2 \ln \frac{2H}{p}} \le \frac{c}{4}\sqrt{\frac{H^3 \iota}{t}}$; choose $s$ between these two quantities. Then (1)
\[ \beta_t = \sum_{i=1}^{t} \alpha_t^i b_i = \sum_{i=1}^{t} \alpha_t^i\, c\sqrt{\frac{H^3 \iota}{i}} = c\sqrt{H^3 \iota} \sum_{i=1}^{t} \frac{\alpha_t^i}{\sqrt{i}} \ge c\sqrt{\frac{H^3 \iota}{t}} \ge 4s, \]
and (2), since $s \ge \sqrt{2 H^2 \sum_{i=1}^{t} (\alpha_t^i)^2 \ln \frac{2H}{p}}$,
\[ 1 - 2\exp\Big( \frac{-s^2}{2 H^2 \sum_{i=1}^{t} (\alpha_t^i)^2} \Big) \ge 1 - \frac{p}{H} > 1 - p. \]
From (12) and (13), for any fixed step $h$, we have
\[ \mathbb{P}\Big( \Big| \sum_{i=1}^{t} \alpha_t^i \big( \hat{P}_h^{k_i} - P_h \big) V_h^*(x, a) \Big| \le \frac{\beta_t}{4} \Big) \ge 1 - \frac{p}{H}. \]
From (15) and Lemmas 2 and 3, we obtain the right-hand side of Lemma 4.
For the left-hand side, we use backward induction on $h = H, H-1, \ldots, 1$. Observe that the initial case satisfies $\big( Q_H^k - Q_H^* \big)(x, a) \ge 0$. Assume $\big( Q_h^k - Q_h^* \big)(x, a) \ge 0$; then $V_h^k(x) = \sup_{a' \in A} Q_h^k(x, a') \ge \sup_{a' \in A} Q_h^*(x, a') = V_h^*(x)$. From (9) and (14), this implies that $\big( Q_{h-1}^k - Q_{h-1}^* \big)(x, a) \ge 0$ with probability at least $1 - \frac{p}{H}$ for one backward step. Taking a union bound over $h = H, H-1, \ldots, 1$ gives the left-hand side. Hence, for all $h$, with probability at least $1 - p$,
\[ 0 \le \big( Q_h^k - Q_h^* \big)(x, a) \le \alpha_t^0 H + \sum_{i=1}^{t} \alpha_t^i \big( V_{h+1}^{k_i} - V_{h+1}^* \big)(x_{h+1}^{k_i}) + \frac{3}{2}\beta_t. \quad \square \]
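The Azuma-Hoeffding step can also be checked empirically; the sketch below simulates bounded, mean-zero martingale differences (here simply i.i.d. uniform variables in $[-H, H]$, a stand-in for the $(\hat{P} - P)V^*$ terms, which is an assumption of this sketch) and compares the empirical tail with the $2\exp(\cdot)$ level used above.

```python
import numpy as np

rng = np.random.default_rng(2)
H, t, p, trials = 5, 200, 0.05, 20_000

alpha = np.full(t, 1.0 / t)          # alpha_t^i = 1/t under the 1/t learning rate
# Stand-in martingale differences: mean zero and bounded by H.
X = rng.uniform(-H, H, size=(trials, t))
S = (alpha * X).sum(axis=1)

# Azuma-Hoeffding: P(|S| > s) <= 2 exp(-s^2 / (2 H^2 sum_i (alpha_t^i)^2)).
s = np.sqrt(2 * H**2 * np.sum(alpha**2) * np.log(2 / p))
print("empirical P(|S| > s) =", np.mean(np.abs(S) > s), " (guaranteed <=", p, ")")
```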

5.2. Assumptions for Concentration

We need two mild and reasonable assumptions, stated below. From Lemma 4,
\[ \sum_{k=1}^{K} \big( Q_h^k - Q_h^* \big)(x, a) \le \sum_{k=1}^{K} \alpha_t^0 H + \sum_{k=1}^{K} \sum_{i=1}^{t} \alpha_t^i \big( V_{h+1}^{k_i} - V_{h+1}^* \big)(x_{h+1}^{k_i}) + \frac{3}{2} \sum_{k=1}^{K} \beta_t, \]
where, for any $i = 1, \ldots, n_h^k$ with $n_h^k = t$, each $k_i$ is the actual episode selected. Here we make the first mild assumption: $V_{h+1}^{k_i} - V_{h+1}^* \le R\sqrt{\frac{H^3 \iota}{i}}$ for some large constant $R$. This assumption is very mild due to the iterative convergence. Then $V_{h+1}^{k_i} - V_{h+1}^* \le \frac{R}{c} \cdot b_i$, and hence
\[ \sum_{i=1}^{t} \alpha_t^i \big( V_{h+1}^{k_i} - V_{h+1}^* \big)(x_{h+1}^{k_i}) \le \frac{R}{c} \sum_{i=1}^{t} \alpha_t^i b_i = \frac{R}{c} \cdot \beta_t. \]
In (16), we have a term $\sum_{k=1}^{K} \alpha_t^0 \cdot H$. Recall from (7) that $\alpha_j = \frac{1}{j}$. Since
\[ \alpha_t^0 = \prod_{j=1}^{n^k} (1 - \alpha_j) = \begin{cases} 1 & \text{if } n^k = 0; \\ 0 & \text{if } n^k \neq 0, \end{cases} \]
we have $\sum_{k=1}^{K} \alpha_t^0 \cdot H = H \sum_{k=1}^{K} \prod_{j=1}^{t} (1 - \alpha_j) = H \sum_{k=1}^{K} \mathbb{1}\{ n^k = 0 \}$. Intuitively, $\mathbb{1}\{ n^k = 0 \}$ indicates a partitioned ball from the algorithm that contains no active state-action pair. Here we make the second mild assumption: $\sum_{k=1}^{K} \mathbb{1}\{ n^k = 0 \} \le \Theta\big( K^{\frac{d+1}{d+2}} \big)$. This assumption is again mild, since we will show in Lemma 7 that the cumulative number of partitions $P_K$ up to episode $K$ is bounded above by $\Theta\big( K^{\frac{d}{d+2}} \big)$. Here, $d$ is the covering dimension of $S \times A$, which will be introduced in the next section.
From the second mild assumption, (16), (17), and Lemma 4, for any $p \in (0, 1)$, with probability at least $1 - p$ we have
\[ \sum_{k=1}^{K} \big( Q_h^k - Q_h^* \big)(x, a) \le \sum_{k=1}^{K} \Big[ \alpha_t^0 \cdot H + \sum_{i=1}^{t} \alpha_t^i \big( V_{h+1}^{k_i} - V_{h+1}^* \big)(x_{h+1}^{k_i}) + \frac{3}{2}\beta_t \Big] \le \Theta\big( K^{\frac{d+1}{d+2}} \big) + \mu \sum_{k=1}^{K} \beta_t, \]
where $\mu = \frac{R}{c} + \frac{3}{2}$. To bound $Q_h^k - Q_h^*$, it remains to bound $\beta_t$.
We also provide, in Appendix A, some additional approaches that can be valuable for bounding the Q-value function. These approaches are applicable when certain special conditions are satisfied.

5.3. Upper Bound of Cumulative Regret

Now we return to our main discussion of $\mathrm{Regret}(K)$. The following lemma shows the relationship between $\mathrm{Regret}(K)$ and $Q_h^k - Q_h^*$. Note that the following approach provides a very general trajectory-analysis technique that can readily be applied to reinforcement learning problems with a continuous state and action space framework.
Lemma 5.
For any $p \in (0, 1)$, with probability at least $1 - p$, the cumulative regret up to episode $K$ has the following upper bound:
\[ \sum_{k=1}^{K} \big( V_1^k(x_1^k) - V_1^{\pi_k}(x_1^k) \big) \le \Theta\big( K^{\frac{d+1}{d+2}} \big) + \sum_{j=0}^{H-1} \int_{\mathrm{traj}(x_h^k, a_h^k) \in \mathrm{TRAJ}(h \to h+j)} \mathbb{P}\big( \mathrm{traj}(x_h^k, a_h^k) \big) \cdot \sum_{k=1}^{K} \big( Q_{h+j}^k - Q_{h+j}^* \big)(s, a_s)\, dP. \]
Here, the sequence of actions $\{a_s\}$ corresponds to the sequence of states $\{s\}$ and is chosen as in the maximum action selection in (21), and $\mathrm{traj}(x_h^k, a_h^k)$ is the trajectory starting from $(x_h^k, a_h^k)$. $\mathrm{TRAJ}(h \to h+j)$ is the trajectory space containing all possible trajectories from step $h$ up to step $h+j$, and $d$ is the covering dimension of $S \times A$.
Proof. 
The proof follows the idea of Ref. [24]. Note that for some action $a_h^k$,
\[ V_h^k(x_h^k) - V_h^{\pi_k}(x_h^k) \le \big( Q_h^k(x_h^k, a_h^k) - Q_h^*(x_h^k, a_h^k) \big) + \big( Q_h^*(x_h^k, a_h^k) - Q_h^{\pi_k}(x_h^k, a_h^k) \big) + \frac{k^{-\frac{1}{d+2}}}{H}. \]
The inequality holds because
\[ V_h^k(x_h^k) \le \sup_{a' \in A} Q_h^k(x_h^k, a') \le Q_h^k(x_h^k, a_h^k) + \frac{k^{-\frac{1}{d+2}}}{H} \]
for some $a_h^k$, by (6). We call this the maximum action selection. Additionally,
\[ Q_h^*(x_h^k, a_h^k) - Q_h^{\pi_k}(x_h^k, a_h^k) = P_h V_h^*(x_h^k, a_h^k) - P_h V_h^{\pi_k}(x_h^k, a_h^k) = \int_{s \in S} \mathbb{P}\big( s \mid x_h^k, a_h^k \big)\, \big( V_{h+1}^* - V_{h+1}^{\pi_k} \big)(s)\, dP(s), \]
where we take all the states at step $h+1$ into account, $s \sim P(\cdot \mid x_h^k, a_h^k)$. Note that from Lemma 4, for any $p$, we have $Q_h^k \ge Q_h^*$ with probability at least $1 - p$ for all steps $h$. Then, with probability at least $1 - p$, $V_h^k = \sup_{a' \in A} Q_h^k \ge \sup_{a' \in A} Q_h^* = V_h^*$. Combining this inequality with (20) and (22), we have
\[ V_h^k(x_h^k) - V_h^{\pi_k}(x_h^k) \le \frac{k^{-\frac{1}{d+2}}}{H} + \big( Q_h^k(x_h^k, a_h^k) - Q_h^*(x_h^k, a_h^k) \big) + \int_{s \in S} \mathbb{P}\big( s \mid x_h^k, a_h^k \big)\, \big( V_{h+1}^k - V_{h+1}^{\pi_k} \big)(s)\, dP(s). \]
Observe that in (23), $\big( V_h^k - V_h^{\pi_k} \big)(x_h^k)$ appears on the left-hand side, while $\big( V_{h+1}^k - V_{h+1}^{\pi_k} \big)(s)$ appears inside the integral on the right-hand side. Recall that the space containing all possible trajectories from step $h$ up to step $h+j$ is $\mathrm{TRAJ}(h \to h+j)$. We have the following recursive relationship:
\[ V_h^k(x_h^k) - V_h^{\pi_k}(x_h^k) \le \frac{2 k^{-\frac{1}{d+2}}}{H} + \big( Q_h^k(x_h^k, a_h^k) - Q_h^*(x_h^k, a_h^k) \big) + \int_{s \in S} \mathbb{P}\big( s \mid x_h^k, a_h^k \big) \Big( \big( Q_{h+1}^k(s, a_s) - Q_{h+1}^*(s, a_s) \big) + \int_{s' \in S} \mathbb{P}\big( s' \mid s, a_s \big)\, \big( V_{h+2}^k - V_{h+2}^{\pi_k} \big)(s')\, dP(s') \Big)\, dP(s) \le \cdots \le k^{-\frac{1}{d+2}} + \sum_{j=0}^{H-h} \int_{\mathrm{traj}(x_h^k, a_h^k) \in \mathrm{TRAJ}(h \to h+j)} \mathbb{P}\big( \mathrm{traj}(x_h^k, a_h^k) \big) \cdot \big( Q_{h+j}^k - Q_{h+j}^* \big)(s, a_s)\, dP, \]
where the sequence $\{a_s\}$ corresponding to each $\{s\}$ is chosen as in our selection in (21), and $\mathrm{traj}(x_h^k, a_h^k)$ is the trajectory starting from $(x_h^k, a_h^k)$.
For simplicity, we write $Q_{h+j}^k - Q_{h+j}^*$ for $\big( Q_{h+j}^k - Q_{h+j}^* \big)(s, a_s)$ and $\mathrm{traj}$ for $\mathrm{traj}(x_h^k, a_h^k)$. Then,
\[ \sum_{k=1}^{K} \big( V_1^k(x_1^k) - V_1^{\pi_k}(x_1^k) \big) \le \sum_{k=1}^{K} k^{-\frac{1}{d+2}} + \sum_{k=1}^{K} \sum_{j=0}^{H-1} \int_{\mathrm{traj} \in \mathrm{TRAJ}(h \to h+j)} \mathbb{P}(\mathrm{traj}) \cdot \big( Q_{1+j}^k - Q_{1+j}^* \big)\, dP. \]
Applying Fubini's theorem [25], (25) is equivalent to
\[ \sum_{k=1}^{K} \big( V_1^k(x_1^k) - V_1^{\pi_k}(x_1^k) \big) \le \Theta\big( K^{\frac{d+1}{d+2}} \big) + \sum_{j=0}^{H-1} \int_{\mathrm{traj} \in \mathrm{TRAJ}(h \to h+j)} \mathbb{P}(\mathrm{traj}) \cdot \sum_{k=1}^{K} \big( Q_{1+j}^k - Q_{1+j}^* \big)\, dP. \quad \square \]

6. Regret Bound in Covering Dimension

We mentioned the covering dimension in an earlier section. For bounding the cumulative regret (performance measure), regret bounds in terms of a covering dimension have been suggested [15,16,26]. The covering dimension is a useful tool when we need to reduce the upper bound of the regret in multi-armed bandit problems. In this section, we present the analysis of a tight upper bound in terms of the covering dimension. Note that this proof technique is very general and can easily be extended to other settings where a covering dimension is applicable. We follow the same definition as in Refs. [15,16,26].
Definition 2.
Consider a metric space $(X, l)$. Denote by $N(r)$ the minimum number of balls of diameter $r$ needed to cover $X$. The covering dimension of $X$ is defined as $\mathrm{COV}(X) = \inf\{ d : \exists c,\ \forall r > 0,\ N(r) \le c \cdot r^{-d} \}$.
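To see the definition in action, the sketch below covers the unit square $[0, 1]^2$ with a grid of balls of diameter $r$ and checks that the count $N(r)$ grows like $r^{-2}$, consistent with a covering dimension of at most 2; the grid construction is an illustrative choice, not part of the paper.

```python
import numpy as np

def covering_number_unit_square(r: float) -> int:
    """Balls of diameter r needed to cover [0, 1]^2 via a square grid.

    A grid cell of side r / sqrt(2) has diagonal r, so each cell fits in one ball.
    """
    cells_per_side = int(np.ceil(np.sqrt(2) / r))
    return cells_per_side ** 2

for r in [0.5, 0.25, 0.125, 0.0625]:
    N = covering_number_unit_square(r)
    print(f"r = {r:6.4f}   N(r) = {N:5d}   N(r) * r^2 = {N * r * r:.2f}")
# N(r) * r^2 stays bounded, i.e. N(r) <= c * r^(-2), so COV([0,1]^2) <= 2.
```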
Lemma 6.
Let $E$ be a bounded space that can be embedded into an $n$-dimensional Euclidean space. Then $\mathrm{COV}(E) \le n$.
Proof. 
Without loss of generality, assume $E \subseteq [0, 1]^n$. Let $r > 0$ be any real number; then we can find $\big( \frac{\sqrt{n}}{r} \big)^n$ many $n$-dimensional hypercubes with side length $\frac{r}{\sqrt{n}}$ covering $[0, 1]^n$. For every such hypercube, there exists a ball of diameter $r$ such that the hypercube is contained inside the ball. Hence, we can cover $E$ with $c \cdot r^{-n}$ balls of diameter $r$, where the constant is $c = (\sqrt{n})^n$ in this case. Since $\mathrm{COV}(E)$ takes the infimum, $\mathrm{COV}(E) \le n$. □
Because of Lemma 6, to obtain a tighter upper bound, we use the covering dimension of $S \times A$ instead of the regular Euclidean dimension. In this section, we use the covering dimension to bound $\sum_{k=1}^{K} \beta_t$. Let $P_K$ denote the number of balls (partitions) created by Algorithm 1 up to episode $K$.
Lemma 7.
We have the tight bound $P_K \le \Theta\big( K^{\frac{d}{d+2}} \big)$.
Proof. 
Let $d$ be the covering dimension of $S \times A$. Then the number of balls of diameter $m$ needed to cover $S \times A$ is bounded by $c \cdot m^{-d}$ for some constant $c$. Note that Algorithm 1 produces many balls during the adaptive partition. The centers of the algorithm's balls of radius $\frac{m}{2}$ (past active points) always lie in different balls of the covering of $S \times A$, because each pair of centers is at distance at least $m$ from the others. For example, say $(s_i, a_i)$ and $(s_j, a_j)$ are two past active centers; considering the covering of $S \times A$, we have $(s_i, a_i) \in B_i$ and $(s_j, a_j) \in B_j$ for $B_i, B_j$ in the covering, and they are disjoint. Thus, the number of balls of diameter $m$ produced by Algorithm 1 is bounded by $c \cdot m^{-d}$, the covering number.
Note that for each ball $B$ of radius $r$ with center inside $S \times A$, we can use at most $c' \cdot 2^n$ balls of radius $\frac{r}{2}$ to cover $B \cap (S \times A)$, where $n$ is the Euclidean dimension of $S \times A$ and $c'$ is some constant. In the algorithm, each ball splits only if the condition $r \le c_0\sqrt{\frac{H^3 \iota}{1 + n^k}}$ is violated (with $c_0 = \frac{c}{8L}$ from Algorithm 1). That is, any ball of radius $\frac{r}{2}$ split from an original ball of radius $r$ (whose condition was violated) must have been filled with at least $\frac{c_0^2 H^3 \iota}{r^2}$ past active points. Since there are $K$ points in total, the number of balls of radius $r$ that are violated is at most $\frac{K}{c_0^2 H^3 \iota / r^2} = \frac{1}{c_0^2 H^3 \iota}\, r^2 K$. Thus, after splitting (violation), the number of balls of radius $\frac{r}{2}$ is at most $c' \cdot 2^n \cdot \frac{1}{c_0^2 H^3 \iota}\, r^2 K = c''\, r^2 K$, where $c'' = \frac{c' \cdot 2^n}{c_0^2 H^3 \iota}$.
As the algorithm proceeds and the radius $r$ becomes $\frac{r}{2}$, the number of partitioned balls of radius $r$ is bounded by the covering number, which is $c \cdot r^{-d}$, and also bounded by $c''\, r^2 K$ for any $r > 0$ from the analysis above. Considering the radii $r = 1, \frac{1}{2}, \frac{1}{4}, \ldots, 2^{-i}, \ldots$, we have
\[ P_K \le \sum_{i=1}^{\infty} \min\big\{ c\, 2^{id},\ c'' (2^{-i})^2 K \big\}. \]
We split this sum into two regimes: (1) for large radii, use the covering bound $c\, 2^{id}$; and (2) for very small radii, use the bound $c'' (2^{-i})^2 K$. The upper bound is maximized at the crossover index $J$ where $c\, 2^{Jd} = c''\, 2^{-2J} K$, which gives $2^{Jd} = \Theta\big( K^{\frac{d}{d+2}} \big)$ (with a constant depending on $c$ and $c''$). This implies
\[ P_K \le \sum_{i=1}^{J} c\, 2^{id} + \sum_{i=J+1}^{\infty} c'' (2^{-i})^2 K \le 2c \cdot 2^{Jd} + 8 c'' \cdot 2^{-2(J+1)} K = \Theta\big( K^{\frac{d}{d+2}} \big). \quad \square \]
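For completeness, the balancing step above can be written out explicitly; the short computation below (in the notation of the proof, with constants $c$ and $c''$) shows where the exponent $\frac{d}{d+2}$ comes from.
\[ c\, 2^{Jd} = c''\, 2^{-2J} K \;\Longrightarrow\; 2^{J(d+2)} = \frac{c''}{c}\, K \;\Longrightarrow\; 2^{J} = \Big( \frac{c''}{c}\, K \Big)^{\frac{1}{d+2}} \;\Longrightarrow\; 2^{Jd} = \Big( \frac{c''}{c}\, K \Big)^{\frac{d}{d+2}} = \Theta\big( K^{\frac{d}{d+2}} \big), \qquad 2^{-2J} K = \Big( \frac{c''}{c} \Big)^{-\frac{2}{d+2}} K^{\frac{d}{d+2}} = \Theta\big( K^{\frac{d}{d+2}} \big). \]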
Lemma 8.
We have the following tight upper bound: $\sum_{k=1}^{K} \beta_t \le \tilde{O}\big( K^{\frac{d+1}{d+2}} \big)$, where $d$ is the covering dimension of $S \times A$.
Proof. 
Recall from the proof of Lemma 4 that $\beta_t = \sum_{i=1}^{t} \alpha_t^i b_i < 4 c\sqrt{\frac{H^3 \iota}{t}}$. Up to episode $K$, we create $P_K$ balls in total. Suppose the $i$-th ball contains $m_i$ past active state-action pairs; clearly, $\sum_{i=1}^{P_K} m_i = K$. Thus,
\[ \sum_{k=1}^{K} \beta_t \le \sum_{k=1}^{K} 4 c\sqrt{\frac{H^3 \iota}{t}} \le \sum_{i=1}^{P_K} \sum_{j=1}^{m_i} 4 c\sqrt{\frac{H^3 \iota}{j}} \le \sum_{i=1}^{P_K} 4 c\sqrt{H^3 \iota} \cdot 2\sqrt{m_i} = \sum_{i=1}^{P_K} 8 c\sqrt{H^3 \iota \cdot m_i} \]
\[ \le 8 c\sqrt{H^3 \iota} \cdot P_K \sqrt{\frac{1}{P_K} \sum_{i=1}^{P_K} m_i} \]
\[ = 8 c\sqrt{H^3 \iota} \cdot P_K \sqrt{\frac{K}{P_K}} \]
\[ = 8 c\sqrt{H^3 \iota} \cdot \sqrt{P_K K}, \]
where (26) follows from Jensen's inequality applied to the concave function $\sqrt{x}$, and (27) follows from $\sum_{i=1}^{P_K} m_i = K$. Combining (28) with Lemma 7, we have $\sum_{k=1}^{K} \beta_t \le \tilde{O}\big( K^{\frac{d+1}{d+2}} \big)$. □
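The Jensen step (26) is easy to verify numerically; in the sketch below the occupancies $m_i$ are made-up random counts summing to $K$.

```python
import numpy as np

rng = np.random.default_rng(3)
K, P_K = 10_000, 50

# Random ball occupancies m_i >= 0 with sum K (illustrative only).
m = rng.multinomial(K, np.ones(P_K) / P_K)

lhs = np.sum(np.sqrt(m))          # sum_i sqrt(m_i)
rhs = np.sqrt(P_K * K)            # P_K * sqrt((1/P_K) * sum_i m_i) = sqrt(P_K * K)
print(lhs, "<=", rhs, ":", bool(lhs <= rhs))   # Jensen: mean of sqrt <= sqrt of mean
```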

7. Conclusions of Main Theorem

7.1. Proof of Main Theorem

We now combine the above results to complete the proof of the main theorem of this paper, Theorem 1. Recall from Lemma 5 that the cumulative regret up to episode $K$ satisfies
\[ \sum_{k=1}^{K} \big( V_1^k(x_1^k) - V_1^{\pi_k}(x_1^k) \big) \le \Theta\big( K^{\frac{d+1}{d+2}} \big) + \sum_{j=0}^{H-1} \int_{\mathrm{traj} \in \mathrm{TRAJ}(h \to h+j)} \mathbb{P}(\mathrm{traj}) \cdot \sum_{k=1}^{K} \big( Q_{1+j}^k - Q_{1+j}^* \big)\, dP. \]
In addition, from (19), $\sum_{k=1}^{K} \big( Q_h^k - Q_h^* \big)(x, a) \le \Theta\big( K^{\frac{d+1}{d+2}} \big) + \mu \sum_{k=1}^{K} \beta_t$. Combined with Lemma 8, this gives $\sum_{k=1}^{K} \big( Q_h^k - Q_h^* \big) \le \tilde{O}\big( K^{\frac{d+1}{d+2}} \big)$, where $d$ is the covering dimension of $S \times A$. Thus, $\max_{h \le H} \sum_{k=1}^{K} \big( Q_h^k - Q_h^* \big) \le \tilde{O}\big( K^{\frac{d+1}{d+2}} \big)$, and inequality (30) becomes
\[ \sum_{k=1}^{K} \big( V_1^k(x_1^k) - V_1^{\pi_k}(x_1^k) \big) \le \Theta\big( K^{\frac{d+1}{d+2}} \big) + \sum_{j=0}^{H-1} \int_{\mathrm{traj} \in \mathrm{TRAJ}(h \to h+j)} \mathbb{P}(\mathrm{traj}) \cdot \max_{j} \sum_{k=1}^{K} \big( Q_{1+j}^k - Q_{1+j}^* \big)\, dP \le \Theta\big( K^{\frac{d+1}{d+2}} \big) + \sum_{j=0}^{H-1} \int_{\mathrm{traj} \in \mathrm{TRAJ}(h \to h+j)} \mathbb{P}(\mathrm{traj}) \cdot \tilde{O}\big( K^{\frac{d+1}{d+2}} \big)\, dP = \tilde{O}\big( K^{\frac{d+1}{d+2}} \big). \]
This completes the proof of Theorem 1, which is the main result of this paper.

7.2. Discussion and Conclusions

Some applications in discrete spaces include games such as poker, bridge, and chess. For example, DeepMind's powerful computer program AlphaGo Zero bases its self-learning algorithm on the discrete space of Go [27]. AlphaGo demonstrated super-human capabilities in navigating an exhaustive discrete action search space and achieved a breakthrough for AI in challenging games. On the other hand, most robotics and automated-vehicle applications of reinforcement learning require continuous state spaces described by continuous variables such as position, velocity, and torque. For example, motion estimation and depth estimation in the perception tasks of autonomous driving live in a continuous environment [28]. If reinforcement learning algorithms for continuous state and action space environments can be implemented efficiently, then applications such as robotics and automated vehicles can potentially be improved.
In this paper, to accommodate the needs of the continuous environment, we proposed a state-of-the-art Q-learning algorithm with a UCB exploration policy in a compact state-action metric space. Our algorithm adaptively partitions the state-action space and flexibly selects future actions to maximize Q-values with bandit exploration bonuses. Under some mild concentration assumptions, we found a tight upper bound, in terms of the covering dimension, for the performance measure in a general state-action metric space. Our work contains rigorous techniques, including novel combinatorial and probabilistic analysis. These approaches are not limited to the analysis of performance measures in Q-learning, but are also valuable for general reinforcement learning analysis. This work suggests that the bandit strategy is an effective way to increase data efficiency for general model-free reinforcement learning problems. Our analysis is readily applicable to reinforcement learning problems with a continuous state and action space framework. As future work, we will apply our algorithms to various practical domains, including robotics and automated vehicles.

Author Contributions

Conceptualization, D.C.; Formal analysis, W.Y.; methodology, W.Y.; writing—original draft, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank Debsoumya Chakraborti, Shlomo Ta’asan and Tianyu Wang for their insight and comments.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Here, we provide some approaches that can be valuable for bounding the Q-value function.

Appendix A.1. Sublinearity

Lemma A1
(Sublinearity). If $Q_h^k$ converges to $Q_h^*$ in $k$ for fixed $h$, then $\sum_{k=1}^{K} |Q_h^k - Q_h^*| = o(K)$.
Proof. 
We prove this by contradiction. Suppose $a_i$ converges to $a$, but there exists $m > 0$ such that $\sum_{i=1}^{K} |a - a_i| \ge mK$ for all large $K$. This is equivalent to
\[ \sum_{i=1}^{K} |a - a_i| - mK = \sum_{i=1}^{K} \big( |a - a_i| - m \big) \ge 0. \]
Since $a_i$ converges to $a$, there exists $N_1$ such that for any $i > N_1$, $|a - a_i| < \frac{m}{2}$, and hence $|a - a_i| - m < -\frac{m}{2}$. Therefore, for $N_2$ large enough, $\big| \sum_{i=N_1+1}^{N_2} \big( |a - a_i| - m \big) \big| > \sum_{i=1}^{N_1} \big( |a - a_i| - m \big)$, with $\sum_{i=N_1+1}^{N_2} \big( |a - a_i| - m \big) < 0$.
This implies that $\sum_{i=1}^{N_2} \big( |a - a_i| - m \big) < 0$, which contradicts our assumption and completes the proof by contradiction. Since $Q_h^k$ converges to $Q_h^*$ in $k$, we have
\[ \sum_{k=1}^{K} |Q_h^k - Q_h^*| = o(K). \quad \square \]
From Lemma 4 and Lemma A1 above, if one can show that $Q_h^k$ converges to $Q_h^*$ in $k$, then one can show a sublinear performance measure.

Appendix A.2. Differential Equation Approach

Here we use Grönwall's inequality, which we first recall. Let $I$ be an interval of the real line of the form $[a, +\infty)$. If
\[ u'(t) \le \beta(t)\, u(t), \]
then, for all $t \in I$, $u$ is bounded by the solution of the corresponding differential equation $v'(t) = \beta(t)\, v(t)$:
\[ u(t) \le u(a) \exp\Big( \int_{a}^{t} \beta(s)\, ds \Big). \]
Lemma A2
(Differential Equation Approach). For some $\tau$ with $0 < \tau < 1$, if
\[ \frac{d}{dk}\theta_k \le (\tau - 1) \cdot \frac{1}{k}\, \theta_k, \]
then $\sum_{k=1}^{K} |Q^k - Q^*| \le O(K^{\tau})$.
Here, $\theta_k = Q_h^k - Q_h^*$ stands for the difference between the Q functions, treated as a function of a continuous index $k$. The result follows from Grönwall's inequality and the analysis of differential equations.
Proof. 
One potential approach is to use Grönwall's inequality [29]. Notice that $e^{\int_1^k (\tau - 1) \cdot \frac{1}{s}\, ds} = k^{\tau - 1}$. By Grönwall's inequality, with the constant $C = \theta_1$, if
\[ \frac{d}{dk}\theta_k \le (\tau - 1) \cdot \frac{1}{k}\, \theta_k, \]
then
\[ \theta_k \le C \cdot k^{\tau - 1}. \]
Thus, $\sum_{k=1}^{K} |Q^k - Q^*| = \sum_{k=1}^{K} \theta_k \le O(K^{\tau})$. □
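To make the rate in Lemma A2 fully explicit, the integration behind the last two displays can be carried out directly (a short worked computation under the lemma's hypothesis, with $C = \theta_1$):
\[ \theta_k \le \theta_1 \exp\Big( \int_1^k (\tau - 1)\, \frac{1}{s}\, ds \Big) = \theta_1\, e^{(\tau - 1)\ln k} = C\, k^{\tau - 1}, \qquad \sum_{k=1}^{K} \theta_k \le C \sum_{k=1}^{K} k^{\tau - 1} \le C \Big( 1 + \int_1^{K} s^{\tau - 1}\, ds \Big) \le C \Big( 1 + \frac{K^{\tau}}{\tau} \Big) = O(K^{\tau}). \]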
From Lemma 4 and Lemma A2 above, if the difference $\theta_k$ satisfies the stated differential inequality, then one can also show a sublinear performance measure with a specific convergence rate.
The above approaches are helpful for bounding the performance measure in Q-learning.

References

  1. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998.
  2. Asadi, K.; Misra, D.; Littman, M. Lipschitz continuity in model-based Reinforcement Learning. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 264–273.
  3. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep Reinforcement Learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38.
  4. Schaal, S. Learning from demonstration. In Advances in Neural Information Processing Systems; Mozer, M.C., Jordan, M., Petsche, T., Eds.; MIT Press: Cambridge, MA, USA, 1997; Volume 9.
  5. Schneider, J.G. Exploiting model uncertainty estimates for safe dynamic control learning. In Advances in Neural Information Processing Systems; Mozer, M.C., Jordan, M., Petsche, T., Eds.; MIT Press: Cambridge, MA, USA, 1997; Volume 9.
  6. Deisenroth, M.P.; Rasmussen, C.E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the International Conference on Machine Learning, Bellevue, WA, USA, 28 June–2 July 2011.
  7. Watkins, C.J. Learning from Delayed Rewards. Ph.D. Thesis, King's College, University of Cambridge, Cambridge, UK, 1989.
  8. Watkins, C.J.; Dayan, P. Technical Note: Q-Learning. Mach. Learn. 1992, 8, 279–292.
  9. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602.
  10. Mnih, V.; Badia, A.P.; Mehdi, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937.
  11. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 1889–1897.
  12. Jin, C.; Zhu, A.Z.; Yan, X.; Bubeck, S.; Jordan, M. Is Q-learning provably efficient? In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montreal, QC, Canada, 3–8 December 2018.
  13. Agrawal, S.; Jia, R. Optimistic posterior sampling for Reinforcement Learning: Worst-case regret bounds. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
  14. Azar, M.G.; Osband, I.; Munos, R. Minimax regret bounds for Reinforcement Learning. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017.
  15. Kleinberg, R.; Slivkins, A.; Upfal, E. Multi-armed bandits in metric spaces. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, Victoria, BC, Canada, 17–20 May 2008; pp. 681–690.
  16. Bubeck, S.; Munos, R.; Stoltz, G.; Szepesvári, C. X-armed bandits. J. Mach. Learn. Res. 2011, 12, 1655–1695.
  17. Auer, P.; Ortner, R. UCB Revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Period. Math. Hung. 2010, 61, 55–65.
  18. Wang, T.; Ye, W.; Geng, D.; Rudin, C. Towards practical Lipschitz bandits. In Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference, San Francisco, CA, USA, 19–20 October 2020; pp. 129–138.
  19. Strehl, A.L.; Li, L.; Wiewiora, E.; Langford, J.; Littman, M.L. PAC model-free reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), Pittsburgh, PA, USA, 25–29 June 2006; pp. 881–888.
  20. Szepesvári, C.; Smart, W.D. Interpolation-based Q-learning. In Proceedings of the International Conference on Machine Learning (ICML), Banff, AB, Canada, 4–8 July 2004.
  21. Tsitsiklis, J.N.; van Roy, B. Feature-based methods for large scale dynamic programming. Mach. Learn. 1996, 22, 59–94.
  22. Gaskett, C.; Wettergreen, D.; Zelinsky, A. Q-Learning in continuous state and action spaces. In Lecture Notes in Computer Science, Proceedings of the Advanced Topics in Artificial Intelligence, Sydney, Australia, 6–10 December 1999; Springer: Berlin/Heidelberg, Germany, 1999.
  23. Irpan, A. Q-Learning in Continuous State Action Spaces; Technical Report; University of California: Berkeley, CA, USA, 2015.
  24. Dong, K.; Wang, Y.; Chen, X.; Wang, L. Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. arXiv 2019, arXiv:1901.09311v1.
  25. Folland, G.B. Real Analysis: Modern Techniques and Their Applications; Wiley: Hoboken, NJ, USA, 1999; Volume 40, pp. 67–68.
  26. Slivkins, A. Contextual bandits with similarity information. J. Mach. Learn. Res. 2014, 15, 2533–2568.
  27. Liu, Y.C.; Tsuruoka, Y. Asymmetric Move Selection Strategies in Monte-Carlo Tree Search: Minimizing the Simple Regret at Max Nodes. arXiv 2016, arXiv:1605.02321.
  28. Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; Al Sallab, A.A.; Yogamani, S.; Pérez, P. Deep reinforcement learning for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst. 2021, early access.
  29. Grönwall, T.H. Note on the derivatives with respect to a parameter of the solutions of a system of differential equations. Ann. Math. 1919, 20, 292–296.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
