On Convergence Rate of MRetrace

Chen, Xingguo; Qin, Wangrong; Gong, Yu; Yang, Shangdong; Wang, Wenhao

doi:10.3390/math12182930

Open AccessArticle

On Convergence Rate of MRetrace

by

Xingguo Chen

¹

,

Wangrong Qin

¹,

Yu Gong

¹,

Shangdong Yang

¹ and

Wenhao Wang

^2,3,*

¹

Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing 210023, China

²

College of Electronic Engineering, National University of Defense Technology, Changsha 410073, China

³

Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha 410073, China

^*

Author to whom correspondence should be addressed.

Mathematics 2024, 12(18), 2930; https://doi.org/10.3390/math12182930

Submission received: 2 August 2024 / Revised: 15 September 2024 / Accepted: 16 September 2024 / Published: 20 September 2024

(This article belongs to the Section E1: Mathematics and Computer Science)

Download

Browse Figures

Versions Notes

Abstract

:

Off-policy is a key setting for reinforcement learning algorithms. In recent years, the stability of off-policy learning for value-based reinforcement learning has been guaranteed even when combined with linear function approximation and bootstrapping. Convergence rate analysis is currently a hot topic. However, the convergence rates of learning algorithms vary, and analyzing the reasons behind this remains an open problem. In this paper, we propose an essentially simplified version of a convergence rate to generate general off-policy temporal difference learning algorithms. We emphasize that the primary determinant influencing convergence rate is the minimum eigenvalue of the key matrix. Furthermore, we conduct a comparative analysis of the influencing factor across various off-policy learning algorithms in diverse numerical scenarios. The experimental findings validate the proposed determinant, which serves as a benchmark for the design of more efficient learning algorithms.

Keywords:

finite sample analysis; off-policy learning; minimum eigenvalues; MRetrace

MSC:

68T05

1. Introduction

Off-policy learning generates experience data by a behavior policy and learns a different target policy. Off-policy TD learning with a linear function approximation may diverge in counterexamples known as “the deadly triad” [1]. The fundamental reason is that the key matrix of off-policy TD is not guaranteed to be positive definite [2]. In the recent 30 years, the main research focused on the convergence guarantee of off-policy algorithms via the construction of positive definite matrices, e.g., Bellman residual (BR) [3], gradient temporal difference (GTD) [4], fast gradient temporal difference (GTD2) and TD with gradient correction (TDC) [5], emphatic TD (ETD) [2], and modified Retrace (MRetrace) [6].

Recently, due to the guarantee of convergence, more research has paid attention to the convergence rate analysis of reinforcement learning algorithms. Dalal et al. [7] proposed convergence rates both in expectation and with a high probability for one-timescale temporal difference learning algorithms. Dalal et al. [8], Gupta et al. [9], Xu et al. [10], and Dalal et al. [11] obtained convergence rates with a high probability for two-timescale temporal difference learning algorithms. Durmus et al. [12] proposed tight high-probability bounds for linear stochastic approximation with a fixed step size. For control settings, Xu and Liang [13] proposed convergence rates for Greedy-GQ. Zhang et al. [14] proposed convergence rates for projected SARSA. Wang et al. [15] proposed convergence rates with a high probability for distributionally robust Q-learning.

However, the above analysis did not answer the following questions: Which of these algorithms is faster? Which one should we choose? The purpose of this paper is to give an intuitive comparison of the convergence rate.

Our contributions: (1) We propose a simplified version of the expected convergence rate theorem. (2) We analyze the core elements of the convergence rates by assuming the same settings for each algorithm, and we find that the main factor affecting convergence rate is the minimum eigenvalue of the key matrix. (3) We calculate core elements of different temporal difference learning algorithms for several environmental examples and validate by experimental studies.

2. Background

This section introduces MDP and reinforcement learning algorithms with their key matrices.

2.1. Markov Decision Process

Consider a discounted Markov decision process

〈 S

,

A

,

R

,

T

,

γ 〉

, where

S

is a state space,

| S | = n

,

A

is an action space,

T : S \times A \times S \to [0, 1]

is a transition function,

R : S \times A \times S \to R

is a reward function, and

γ \in [0, 1)

is a discount factor. The state value function is

V^{π} (s) = E_{π} [\sum_{t = 0}^{\infty} γ^{t} r_{t} | s_{0} = s],

(1)

where

r_{t}

is an immediate reward, and

π

is a policy used to select action a in state s with a probability

π (a | s)

. The value function is approximated with a linear function, as follows:

V (s) \approx V_{θ} (s) = θ^{⊤} ϕ (s) = \sum_{i = 1}^{m} θ_{i} ϕ_{i} (s),

(2)

where

θ

is the weight vector, and

ϕ (s)

is the feature vector of state s. This paper is concerned with off-policy learning, where a different behavior policy

μ

generates experiences

〈 s_{t}

,

a_{t}

,

r_{t + 1}

,

s_{t + 1}

,

a_{t + 1} 〉

.

2.2. Learning Algorithms and Their Key Matrices

Learning algorithms and their key matrices are summarized in Table 1.

2.2.1. Off-Policy TD

The update rule of off-policy TD [2] is as follows:

\begin{matrix} θ_{t + 1} \dot{=} & θ_{t} + ρ_{t} α_{t} (r_{t + 1} + γ θ_{t}^{⊤} ϕ_{t + 1} - θ_{t}^{⊤} ϕ_{t}) ϕ_{t} \\ = & θ_{t} + α_{t} (ρ_{t} r_{t + 1} ϕ_{t} - ρ_{t} ϕ_{t} {(ϕ_{t} - γ ϕ_{t + 1})}^{⊤} θ_{t}) \\ = & θ_{t} + α_{t} (b_{off, t} - A_{off, t} θ_{t}), \end{matrix}

(3)

Its key matrix is

\begin{matrix} A_{off} = lim_{t \to \infty} E [A_{off, t}] = lim_{t \to \infty} E_{μ} [ρ_{t} ϕ_{t} {(ϕ_{t} - γ ϕ_{t + 1})}^{⊤}] \\ = \sum_{s} d_{μ} (s) E_{μ} [ρ_{t} ϕ_{t} {(ϕ_{t} - γ ϕ_{t + 1})}^{⊤} | S_{t} = s] \\ = \sum_{s} d_{μ} (s) \sum_{a} μ (a | s) \sum_{s^{'}} T (s, a, s^{'}) \frac{π (a | s)}{μ (a | s)} ϕ (s) {(ϕ (s) - γ ϕ (s^{'}))}^{⊤} \\ = \sum_{s} d_{μ} (s) ϕ (s) {(ϕ (s) - γ \sum_{a} π (a | s) \sum_{s^{'}} T (s, a, s^{'}) ϕ (s^{'}))}^{⊤} \\ = \sum_{s} d_{μ} (s) ϕ (s) {(ϕ (s) - γ \sum_{s^{'}} {[P_{π}]}_{s s^{'}} ϕ (s^{'}))}^{⊤} \\ = Φ^{⊤} D_{μ} (I - γ P_{π}) Φ, \end{matrix}

(4)

\begin{matrix} b_{off} & = lim_{t \to \infty} E [b_{off, t}] = lim_{t \to \infty} E_{μ} [ρ_{t} r_{t + 1} ϕ_{t}] \\ = \sum_{s} d_{μ} (s) E_{μ} [ρ_{t} r_{t + 1} ϕ_{t} | S_{t} = s] \\ = \sum_{s} d_{μ} (s) \sum_{a} μ (a | s) \frac{π (a | s)}{μ (a | s)} r_{t + 1} ϕ (s) \\ = \sum_{s} d_{μ} (s) ϕ (s) \sum_{a} π (a | s) \sum_{s^{'}} T (s, a, s^{'}) R (s, a, s^{'}) \\ = Φ^{⊤} D_{μ} r_{π}, \end{matrix}

(5)

where

r_{π}

is an expected reward vector under policy

π

with each component being

r_{π} (s) = \sum_{a} \sum_{s^{'}} π (a | s) R (s, a, s^{'}) .

(6)

2.2.2. Retrace(0)

The update rule of Retrace(0) [16] is as follows:

\begin{matrix} θ_{t + 1} \dot{=} & θ_{t} + c_{t} α_{t} (r_{t + 1} + γ θ_{t}^{⊤} E_{π} [ϕ_{t + 1}] - θ_{t}^{⊤} ϕ_{t}) ϕ_{t} \\ = & θ_{t} + α_{t} (c_{t} r_{t + 1} ϕ_{t} - c_{t} ϕ_{t} {(ϕ_{t} - γ E_{π} [ϕ_{t + 1}])}^{⊤} θ_{t}) \\ = & θ_{t} + α_{t} (b_{Retrace (0), t} - A_{Retrace (0), t} θ_{t}), \end{matrix}

(7)

where

c_{t} = min (1, ρ_{t})

,

E_{π} [ϕ_{t + 1}] = \sum_{a} π (a | s_{t + 1}) ϕ (s_{t + 1})

. The key matrix of the expected Retrace’s update (7) is

\begin{matrix} A_{Retrace (0)} = lim_{t \to \infty} E [A_{Retrace (0), t}] = lim_{t \to \infty} E_{μ} [c_{t} ϕ_{t} {(ϕ_{t} - γ E_{π} [ϕ_{t + 1}])}^{⊤}] \\ = \sum_{s} d_{μ} (s) E_{μ} [c_{t} ϕ (s) {(ϕ (s) - γ E_{π} [ϕ (s^{'})])}^{⊤}] \\ = \sum_{s} d_{μ} (s) ϕ (s) \sum_{a} μ (a | s) min (1, \frac{π (a | s)}{μ (a | s)}) {(ϕ (s) - γ E_{π} [ϕ (s^{'})])}^{⊤} \\ = \sum_{s} d_{μ} (s) ϕ (s) \sum_{a} min (μ (a | s), π (a | s)) {(ϕ (s) - γ \sum_{s^{'}} {[P_{π}]}_{s s^{'}} ϕ (s^{'}))}^{⊤} \\ = Φ^{⊤} D_{μ} D_{c} (I - γ P_{π}) Φ, \end{matrix}

(8)

where

D_{c}

is the

n \times n

diagonal matrix with

d_{c}

on its diagonal, and each component of

d_{c}

is

d_{c} (s) = \sum_{a} min (μ (a | s), π (a | s)) .

(9)

\begin{matrix} b_{Retrace (0)} & = lim_{t \to \infty} E [b_{Retrace (0), t}] = lim_{t \to \infty} E_{μ} [c_{t} r_{t + 1} ϕ_{t}] \\ = \sum_{s} d_{μ} (s) E_{μ} [c_{t} r_{t + 1} ϕ_{t} | S_{t} = s] \\ = \sum_{s} d_{μ} (s) \sum_{a} μ (a | s) min (1, \frac{π (a | s)}{μ (a | s)}) r_{t + 1} ϕ (s) \\ = \sum_{s} d_{μ} (s) ϕ (s) \sum_{a} min (μ (a | s), π (a | s)) \sum_{s^{'}} T (s, a, s^{'}) R (s, a, s^{'}) \\ = Φ^{⊤} D_{μ} r_{c}, \end{matrix}

(10)

where

r_{c} = \sum_{a} min (μ (a | s), π (a | s)) \sum_{s^{'}} T (s, a, s^{'}) R (s, a, s^{'})

.

2.2.3. Naive Bellman Residual

The update rule of the naive Bellman residual [3] is as follows:

\begin{matrix} θ_{t + 1} \dot{=} & θ_{t} + ρ_{t} α_{t} (r_{t + 1} + γ θ_{t}^{⊤} ϕ_{t + 1} - θ_{t}^{⊤} ϕ_{t}) (ϕ_{t} - γ E_{π} [ϕ_{t + 1}]) \\ = & θ_{t} + α_{t} (ρ_{t} r_{t + 1} (ϕ_{t} - γ ϕ_{t + 1}) - ρ_{t} (ϕ_{t} - γ ϕ_{t + 1}) {(ϕ_{t} - γ E_{π} [ϕ_{t + 1}])}^{⊤} θ_{t}) \\ = & θ_{t} + α_{t} (b_{BR, t} - A_{BR, t} θ_{t}), \end{matrix}

(11)

Its key matrix is

\begin{matrix} A_{BR} = lim_{t \to \infty} E [A_{BR, t}] = lim_{t \to \infty} E_{μ} [ρ_{t} (ϕ_{t} - γ E_{π} [ϕ_{t + 1}]) {(ϕ_{t} - γ ϕ_{t + 1})}^{⊤}] \\ = \sum_{s} d_{μ} (s) E_{μ} [ρ_{t} (ϕ_{t} - γ E_{π} [ϕ_{t + 1}]) {(ϕ_{t} - γ ϕ_{t + 1})}^{⊤} | S_{t} = s] \\ = \sum_{s} d_{μ} (s) \sum_{a} μ (a | s) \sum_{s^{'}} T (s, a, s^{'}) \frac{π (a | s)}{μ (a | s)} (ϕ (s) - γ E_{π} [ϕ (s^{'})]) {(ϕ (s) - γ ϕ (s^{'}))}^{⊤} \\ = \sum_{s} d_{μ} (s) \sum_{a} \sum_{s^{'}} T (s, a, s^{'}) π (a | s) (ϕ (s) - γ E_{π} [ϕ (s^{'})]) {(ϕ (s) - γ ϕ (s^{'}))}^{⊤} \\ = \sum_{s} d_{μ} (s) (ϕ (s) - γ E_{π} [ϕ (s^{'})]) {(ϕ (s) - γ \sum_{a} π (a | s) \sum_{s^{'}} T (s, a, s^{'}) ϕ (s^{'}))}^{⊤} \\ = \sum_{s} d_{μ} (s) (ϕ (s) - γ E_{π} [ϕ (s^{'})]) {(ϕ (s) - γ \sum_{s^{'}} {[P_{π}]}_{s s^{'}} ϕ (s^{'}))}^{⊤} \\ = Φ^{⊤} {(I - γ P_{π})}^{⊤} D_{μ} (I - γ P_{π}) Φ, \end{matrix}

(12)

\begin{matrix} b_{BR} & = lim_{t \to \infty} E [b_{BR, t}] = lim_{t \to \infty} E_{μ} [ρ_{t} r_{t + 1} (ϕ_{t} - γ E_{π} [ϕ_{t + 1}])] \\ = \sum_{s} d_{μ} (s) E_{μ} [ρ_{t} r_{t + 1} (ϕ_{t} - γ E_{π} [ϕ_{t + 1}]) | S_{t} = s] \\ = \sum_{s} d_{μ} (s) \sum_{a} μ (a | s) \frac{π (a | s)}{μ (a | s)} r_{t + 1} (ϕ (s) - γ E_{π} [ϕ (s^{'})])) \\ = \sum_{s} d_{μ} (s) (ϕ (s) - γ E_{π} [ϕ (s^{'})])) \sum_{a} π (a | s) \sum_{s^{'}} T (s, a, s^{'}) R (s, a, s^{'}) \\ = Φ^{⊤} {(I - γ P_{π})}^{⊤} D_{μ} r_{π}, \end{matrix}

(13)

2.2.4. GTD

The update rule of the GTD [4] algorithm is as follows:

\begin{matrix} ω_{t + 1} & = ω_{t} + β_{t} (δ_{t} ϕ_{t} - ω_{t}), \\ θ_{t + 1} & = θ_{t} + α_{t} (ϕ_{t} - γ ϕ_{t}^{'}) ϕ_{t}^{⊤} ω_{t}, \end{matrix}

(14)

where the TD error

δ_{t} = r_{t} + {(γ ϕ_{t}^{'} - ϕ_{t})}^{⊤} θ_{t}

.

Let

g_{t} = {(ω_{t}^{⊤} / \sqrt{η}, θ_{t}^{⊤})}^{⊤}

; thus, we can obtain

g_{t + 1} = g_{t} + α_{t} \sqrt{η} (b_{G T D, t + 1} - A_{G T D, t + 1} g_{t}),

(15)

where

b_{G T D, t + 1}^{⊤} = (r_{t} ϕ_{t}^{⊤}, 0^{⊤})

and

A_{G T D, t + 1} = (\begin{matrix} \sqrt{η} I & ϕ_{t} {(ϕ_{t} - γ ϕ_{t}^{'})}^{⊤} \\ (γ ϕ_{t}^{'} - ϕ_{t}) ϕ_{t}^{⊤} & 0 \end{matrix}) .

(16)

Its key matrix is

A_{GTD} = lim_{t \to \infty} E [A_{GTD, t}] = (\begin{matrix} \sqrt{η} I & A_{off} \\ - A_{off}^{⊤} & 0 \end{matrix}),

(17)

b_{GTD} = lim_{t \to \infty} E [b_{GTD, t}] = (\begin{matrix} Φ^{⊤} D_{μ} r_{π} \\ 0 \end{matrix}),

(18)

2.2.5. GTD2

The update rule of the GTD2 [5] algorithm is as follows:

\begin{matrix} ω_{t + 1} & = ω_{t} + β_{t} (δ_{t} - ϕ_{t}^{⊤} ω_{t}) ϕ_{t}, \\ θ_{t + 1} & = θ_{t} + α_{t} (ϕ_{t} - γ ϕ_{t}^{'}) ϕ_{t}^{⊤} ω_{t}, \end{matrix}

(19)

where the TD error

δ_{t} = r_{t} + {(γ ϕ_{t}^{'} - ϕ_{t})}^{⊤} θ_{t}

.

Let

g_{t} = {(ω_{t}^{⊤} / \sqrt{η}, θ_{t}^{⊤})}^{⊤}

; thus, we can obtain

g_{t + 1} = g_{t} + α_{t} \sqrt{η} (b_{G T D 2, t + 1} - A_{G T D 2, t + 1} g_{t}),

(20)

where

b_{G T D 2, t + 1}^{⊤} = (r_{t} ϕ_{t}^{⊤}, 0^{⊤})

and

A_{G T D 2, t + 1} = (\begin{matrix} \sqrt{η} ϕ_{t} ϕ_{t}^{⊤} & ϕ_{t} {(ϕ_{t} - γ ϕ_{t}^{'})}^{⊤} \\ (γ ϕ_{t}^{'} - ϕ_{t}) ϕ_{t}^{⊤} & 0 \end{matrix}) .

Its key matrix is

A_{GTD 2} = lim_{t \to \infty} E [A_{GTD 2, t}] = (\begin{matrix} \sqrt{η} C & A_{off} \\ - A_{off}^{⊤} & 0 \end{matrix}),

(21)

b_{GTD 2} = lim_{t \to \infty} E [b_{GTD 2, t}] = (\begin{matrix} Φ^{⊤} D_{μ} r_{π} \\ 0 \end{matrix}),

(22)

where

C = E [ϕ ϕ^{⊤}]

.

2.2.6. TDC

The update rule of the TDC [5] algorithm is as follows:

\begin{matrix} ω_{t + 1} & = ω_{t} + β_{t} (δ_{t} - ϕ_{t}^{⊤} ω_{t}) ϕ_{t}, \\ θ_{t + 1} & = θ_{t} + α_{t} δ_{t} ϕ_{t} - α_{t} γ ϕ_{t}^{'} (ϕ_{t}^{⊤} ω_{t}), \end{matrix}

(23)

where the TD error

δ_{t} = r_{t} + {(γ ϕ_{t}^{'} - ϕ_{t})}^{⊤} θ_{t}

.

Its key matrix is

A_{TDC} = A_{off}^{⊤} C^{- 1} A_{off},

(24)

b_{TDC} = A_{off}^{⊤} C^{- 1} Φ^{⊤} D_{μ} r_{π},

(25)

2.2.7. ETD

The update rule of the ETD [2] algorithm is as follows:

\begin{matrix} θ_{t + 1} & ≐ θ_{t} + α_{t} F_{t} ρ_{t} (r_{t + 1} + γ θ_{t}^{⊤} ϕ_{t + 1} - θ_{t}^{⊤} ϕ_{t}) ϕ_{t} \\ = θ_{t} + α_{t} (F_{t} ρ_{t} r_{t + 1} ϕ_{t} - F_{t} ρ_{t} ϕ_{t} {(ϕ_{t} - γ ϕ_{t + 1})}^{⊤} θ_{t}) \\ = θ_{t} + α_{t} (b_{ETD, t} - A_{ETD, t} θ_{t}), \end{matrix}

(26)

where

F_{0} = 1

and

F_{t} ≐ γ ρ_{t - 1} F_{t - 1} + 1, \forall t > 0

.

The key matrix of ETD is

\begin{matrix} A_{ETD} = lim_{t \to \infty} E [A_{ETD, t}] \\ = lim_{t \to \infty} E_{μ} [F_{t} ρ_{t} ϕ_{t} {(ϕ_{t} - γ ϕ_{t + 1})}^{⊤}] \\ = \sum_{s} d_{μ} (s) lim_{t \to \infty} E_{μ} [F_{t} ρ_{t} ϕ_{t} {(ϕ_{t} - γ ϕ_{t + 1})}^{⊤} ∣ S_{t} = s] \\ = \sum_{s} \underset{f (s)}{\underset{︸}{d_{μ} (s) {lim}_{t \to \infty} E_{μ} [F_{t} ∣ S_{t} = s]}} E_{μ} [ρ_{t} ϕ_{t} {(ϕ_{t} - γ ϕ_{t + 1})}^{⊤} ∣ S_{t} = s] \\ = Φ^{⊤} D_{f} (I - γ P_{π}) Φ, \end{matrix}

(27)

\begin{matrix} b_{ETD} & = lim_{t \to \infty} E [b_{ETD, t}] = lim_{t \to \infty} E_{μ} [F_{t} ρ_{t} r_{t + 1} ϕ_{t}] \\ = Φ^{⊤} D_{f} r_{π}, \end{matrix}

(28)

where

D_{f}

is a diagonal matrix with a diagonal element approximated to

f = {(I - γ P_{π}^{⊤})}^{- 1} d_{μ}

.

2.2.8. MRetrace

MRetrace [6] is a modified version of Retrace [16] with a convergence guarantee. Its update rule is as follows:

\begin{matrix} θ_{t + 1} \dot{=} & θ_{t} + α_{t} ρ_{t} (r_{t + 1} + x_{t} γ θ_{t}^{⊤} ϕ_{t + 1} - θ_{t}^{⊤} ϕ_{t}) ϕ_{t} \\ = & θ_{t} + α_{t} (ρ_{t} r_{t + 1} ϕ_{t} - ρ_{t} ϕ_{t} {(ϕ_{t} - x_{t} γ ϕ_{t + 1})}^{⊤} θ_{t}) \\ = & θ_{t} + α_{t} (b_{MR, t} - A_{MR, t} θ_{t}), \end{matrix}

(29)

where

x_{t} \dot{=} \frac{1}{{max}_{a} ρ_{t}} = min_{a} \{\frac{1}{ρ_{t}}\} = min_{a} \{\frac{μ (a | s_{t})}{π (a | s_{t})}\},

(30)

b_{MR, t} = ρ_{t} r_{t + 1} ϕ_{t}

,

A_{MR, t} = ρ_{t} ϕ_{t} {(ϕ_{t} - x_{t} γ ϕ_{t + 1})}^{⊤}

.

\begin{matrix} b_{MR} & = lim_{t \to \infty} E [b_{t}] = lim_{t \to \infty} E_{μ} [ρ_{t} r_{t + 1} ϕ_{t}] \\ = \sum_{s} d_{μ} (s) E_{μ} [ρ_{t} r_{t + 1} ϕ_{t} | S_{t} = s] \\ = \sum_{s} d_{μ} (s) \sum_{a} μ (a | s) \frac{π (a | s)}{μ (a | s)} r_{t + 1} ϕ (s) \\ = \sum_{s} d_{μ} (s) ϕ (s) \sum_{a} π (a | s) \sum_{s^{'}} T (s, a, s^{'}) R (s, a, s^{'}) \\ = Φ^{⊤} D_{μ} r_{π}, \end{matrix}

(31)

The key matrix of MRetrace is

\begin{matrix} A_{MR} = lim_{t \to \infty} E [A_{t}] = lim_{t \to \infty} E_{μ} [ϕ_{t} {(ϕ_{t} - x_{t} γ E_{π} [ϕ_{t + 1}])}^{⊤}] \\ = \sum_{s} d_{μ} (s) E_{μ} [ϕ (s) {(ϕ (s) - x_{t} γ E_{π} [ϕ (s^{'})])}^{⊤}] \\ = \sum_{s} d_{μ} (s) ϕ (s) {(ϕ (s) - E_{μ} [x_{t} γ E_{π} [ϕ (s^{'})]])}^{⊤} \\ = \sum_{s} d_{μ} (s) ϕ (s) {(ϕ (s) - γ \sum_{a} μ (a | s) min_{b} \{\frac{μ (b | s)}{π (b | s)}\} E_{π} [ϕ (s^{'})])}^{⊤} \\ = \sum_{s} d_{μ} (s) ϕ (s) {(ϕ (s) - γ min_{b} \{\frac{μ (b | s)}{π (b | s)}\} \sum_{s^{'}} {[P_{π}]}_{s s^{'}} ϕ (s^{'}))}^{⊤} \\ = Φ^{⊤} D_{μ} (I - γ D_{x} P_{π}) Φ, \end{matrix}

(32)

where

D_{x}

is the

n \times n

diagonal matrix with

d_{x}

on its diagonal, and each component of

d_{x}

is

d_{x} (s) = {min}_{b} \{\frac{μ (b | s)}{π (b | s)}\}

.

3. Finite Sample Analysis

The measurement criteria of finite sample analysis and convergence rate analysis are equivalent. They are both concerned with the relationship between errors and the number of iteration rounds.

3.1. Convergence Rate of General Temporal Difference Learning Algorithm

Let us start with a finite sample analysis of a general temporal difference learning algorithm. For the i.i.d. sequence

{r_{t}

,

ϕ_{t}

,

ϕ_{t}^{'}}

, where

r_{t}

and

ϕ_{t}

are sampled from the Markov process with behavior policy

μ

,

ϕ_{t}^{'}

is sampled from target policy

π

. Suppose its update rule of parameter

θ

is defined as follows:

\begin{matrix} θ_{t + 1} = θ_{t} + α_{t} (b_{t} - A_{t} θ_{t}) = θ_{t} + α_{t} (h (θ_{t}) + M_{t + 1}), \end{matrix}

(33)

where

M_{t + 1} = (A - A_{t}) θ_{t} + b_{t} - b,

(34)

\begin{matrix} h (θ_{t}) & = b - A θ_{t} = A θ^{*} - A θ_{t} = - A (θ_{t} - θ^{*}), \end{matrix}

(35)

where

A = {lim}_{t \to \infty} E [A_{t}]

,

b = {lim}_{t \to \infty} E [b_{t}]

, and the fixed point is

θ^{*} = A^{- 1} b

.

A

and

b

are based on the i.i.d. sequence

{r_{t}

,

ϕ_{t}

,

ϕ_{t}^{'}}

.

Assumption 1.

The key matrix

A

of the general temporal difference learning algorithm is positive definite.

Assumption 2.

The sequences

{r_{t}

,

ϕ_{t}

,

ϕ_{t}^{'}}

have uniformly bounded second moments. Let

F_{t}

=

σ (θ_{1}, M_{1}, \dots, θ_{t - 1}, M_{t})

; then, fix some constant

C_{s} > 0

, such that the following holds:

E [| | M_{t + 1} {| |}^{2} | F_{t}] \leq C_{s} (1 + | | θ_{t} - θ^{*} {| |}^{2}) .

(36)

This assumption holds for any initial parameter vector

θ_{1}

.

Assumption 3.

Step-size sequence

α_{t}

satisfies

α_{t} \in (0, 1)

,

\sum_{t = 0}^{\infty} α_{t} = \infty,

and

\sum_{t = 0}^{\infty} α_{t}^{2} < \infty .

Let

λ_{min} (X)

and

λ_{max} (X)

denote the minimum and maximum eigenvalues of the matrix

X

.

Theorem 1.

(Convergence Rate in Expectation for General Temporal Difference Learning Algorithm). Assume that Assumptions 1–3 hold. For

t \geq 0,

we have

\begin{matrix} E ∥ θ_{t + 1} - θ^{*} ∥^{2} \leq & e^{λ_{0}^{t}} E {∥ θ_{0} - θ^{*} ∥}^{2} + C_{s} \sum_{i = 0}^{t} [e^{λ_{i + 1}^{t}}] α_{i}^{2}, \end{matrix}

(37)

where

λ_{i}^{t} = \{\begin{matrix} - λ_{min} (A + A^{⊤}) \sum_{k = i}^{t} α_{k} + λ_{max} (A^{⊤} A + C_{s} I) \sum_{k = i}^{t} α_{k}^{2}, & i \leq t, \\ 0, & i > t . \end{matrix}

(38)

Proof.

Note that the proof process is similar to the proof of Theorem 3.1 of [7].

Based on Definitions (33) and (35), we have

\begin{matrix} θ_{t + 1} - θ^{*} & = θ_{t} + α_{t} (h (θ_{t}) + M_{t + 1}) - θ^{*} \\ = θ_{t} - θ^{*} + α_{t} (- A (θ_{t} - θ^{*}) + M_{t + 1}) \\ = (I - α_{t} A) (θ_{t} - θ^{*}) + α_{t} M_{t + 1} . \end{matrix}

(39)

\begin{matrix} ∥ θ_{t} - θ^{*} ∥^{2} = & {(θ_{t + 1} - θ^{*})}^{⊤} (θ_{t + 1} - θ^{*}) \\ = & {[(I - α_{t} A) (θ_{t} - θ^{*}) + α_{t} M_{t + 1}]}^{⊤} [(I - α_{t} A) (θ_{t} - θ^{*}) + α_{t} M_{t + 1}] \\ = & {(θ_{t} - θ^{*})}^{⊤} {(I - α_{t} A)}^{⊤} (I - α_{t} A) (θ_{t} - θ^{*}) \\ + α_{t} {(θ_{t} - θ^{*})}^{⊤} {(I - α_{t} A)}^{⊤} M_{t + 1} + α_{t} M_{t + 1}^{⊤} (I - α_{t} A) (θ_{t} - θ^{*}) \\ + α_{t}^{2} {∥ M_{t + 1} ∥}^{2} . \end{matrix}

(40)

Taking conditional expectations on both sides, and using

E [M_{t + 1} | F_{t}] = 0

, we get

\begin{matrix} E [∥ θ_{t + 1} - θ^{*} ∥^{2} | F_{t}] = α_{t}^{2} E [∥ M_{t + 1} ∥^{2} | F_{t}] + {(θ_{t} - θ^{*})}^{⊤} {(I - α_{t} A)}^{⊤} (I - α_{t} A) (θ_{t} - θ^{*}) . \end{matrix}

(41)

Therefore, using Assumption 2,

E [∥ θ_{t + 1} - θ^{*} ∥^{2} | F_{t}] \leq {(θ_{t} - θ^{*})}^{⊤} Λ_{t} (θ_{t} - θ^{*}) + C_{s} α_{t}^{2},

(42)

where

Λ_{t} = {(I - α_{t} A)}^{⊤} (I - α_{t} A) + C_{s} α_{t}^{2} I .

(43)

Since

Λ_{t}

is a symmetric matrix and the sum of two positive definite matrices, all its eigenvalues are real and positive. Let

λ_{t} : = λ_{max} (Λ_{t})

; thus, we have

λ_{t} > 0

and

E [∥ θ_{t + 1} - θ^{*} ∥^{2} | F_{t}] \leq λ_{t} {∥ θ_{t} - θ^{*} ∥}^{2} + C_{s} α_{t}^{2} .

(44)

Taking the expectations on both sides, we have

E ∥ θ_{t + 1} - θ^{*} ∥^{2} \leq λ_{t} E {∥ θ_{t} - θ^{*} ∥}^{2} + C_{s} α_{t}^{2} .

(45)

Sequentially using the above inequality, we have

\begin{matrix} E ∥ θ_{t + 1} - θ^{*} ∥^{2} & \leq [\prod_{k = 0}^{t} λ_{k}] E {∥ θ_{0} - θ^{*} ∥}^{2} + C_{s} \sum_{i = 0}^{t} [\prod_{k = i + 1}^{t} λ_{k}] α_{t}^{2} . \end{matrix}

(46)

where we let

\prod_{k = t + 1}^{t} λ_{k} = 1

.

Based on Assumption 1,

A

is positive definite, and the matrices

(A + A^{⊤})

and

(A^{⊤} A + C_{s} I)

in (38) are positive definite. Thus, their minimum and maximum eigenvalues are strictly positive. Hence, using Weyl’s inequality, we have

\begin{matrix} λ_{t} & = λ_{max} ({(I - α_{t} A)}^{⊤} (I - α_{t} A) + C_{s} α_{t}^{2} I) \\ \leq λ_{max} (I - α_{t} (A + A^{⊤})) + α_{t}^{2} λ_{max} (A^{⊤} A + C_{s} I) \\ \leq 1 - α_{t} λ_{min} (A + A^{⊤}) + α_{t}^{2} λ_{max} (A^{⊤} A + C_{s} I) \\ \leq e^{[- α_{t} λ_{min} (A + A^{⊤}) + α_{t}^{2} λ_{max} (A^{⊤} A + C_{s} I)]} \end{matrix}

(47)

In the case of

λ_{t} > 0

,

\forall t > 0

, when

0 < i < t

, for the concatenated multiplication of (47) from i to t, we have

\begin{matrix} \prod_{k = i}^{t} λ_{k} & \leq \prod_{k = i}^{t} e^{[- α_{k} λ_{min} (A + A^{⊤}) + α_{k}^{2} λ_{max} (A^{⊤} A + C_{s} I)]} \\ = e^{\sum_{k = i}^{t} [- α_{k} λ_{min} (A + A^{⊤}) + α_{k}^{2} λ_{max} (A^{⊤} A + C_{s} I)]} \\ = e^{λ_{i}^{t}} \end{matrix}

(48)

The claim now follows. □

3.2. Convergence Rate of the MRetrace Algorithm

Assumption 4.

The Markov Chain (

s_{t}

) is aperiodic and irreducible; thus,

{lim}_{t \to \infty} P (s_{t} = s^{'} | s_{0} = s) = d_{μ} (s^{'})

exists and is unique.

This assumption implies that the state distribution vector

d_{μ}

of the behavior policy

μ

is the fixed point of

d_{μ} = P_{μ}^{⊤} d_{μ},

(49)

where the element of matrix

P_{μ}

is as follows:

{[P_{μ}]}_{s s^{'}} = \sum μ (a | s) T (s, a, s^{'}) .

(50)

Assumption 5.

{ϕ_{t}, r_{t}, E_{π} [ϕ_{t + 1}]}

is such that

E_{μ} [| | ϕ_{t} {| |}^{2} | s_{t_{1}}]

,

E_{μ} [r_{t}^{2} | s_{t_{1}}]

,

E_{π} [| | ϕ_{t + 1} {| |}^{2} | s_{t_{1}}]

are uniformly bounded.

Assumption 6.

The feature matrix Φ is column full rank.

Corollary 1.

(Convergence Rate in Expectation for MRetrace algorithm). Assume Assumptions 4–6 hold. Fix some constant

C_{s} > 0

, for

t \geq 0,

we have

\begin{matrix} E ∥ θ_{t + 1} - θ^{*} ∥^{2} \leq & e^{λ_{0}^{t}} E {∥ θ_{0} - θ^{*} ∥}^{2} + C_{s} \sum_{i = 0}^{t} [e^{λ_{i + 1}^{t}}] α_{i}^{2} . \end{matrix}

(51)

where

λ_{i}^{t} = \{\begin{matrix} - λ_{min} (A_{MR} + A_{MR}^{⊤}) \sum_{k = i}^{t} α_{k} + λ_{max} (A_{MR}^{⊤} A_{MR} + C_{s} I) \sum_{k = i}^{t} α_{k}^{2}, & i \leq t, \\ 0, & i > t . \end{matrix}

(52)

Proof.

All we need is to show that the MRetrace algorithm satisfies the assumptions of Theorem 1.

According to the proof of Theorem 1 of [6], given Assumptions 4 and 6, the matrix

A_{MR}

is positive definite. Under Assumption 5, there exists some constant

C_{s} > 0

,

E [| | M_{t + 1} {| |}^{2} | F_{t}] \leq C_{s} (1 + | | θ_{t} - θ^{*} {| |}^{2})

. Then, based on Assumption 3, the claim follows by directly applying Theorem 1. □

4. How to Compare?

Theorem 1 and Corollary 1 are essentially simplified versions of Theorem 3.1 of [7], but their advantage lies in facilitating the analysis of the main factor affecting convergence rates.

To ensure a fair comparison among different learning algorithms, we need the same setting for each algorithm.

Assumption 7.

Assume each algorithm shares the same feature matrix, the same behavior policy, the same target policy, the same initial parameters

θ_{0}

, the same constant

C_{s}

, and the same learning rate sequences

α_{t}

.

Corollary 2.

(The main factor affecting convergence rates). Assume Assumptions 1–3, 6 and 7. From the perspective of the expected convergence rate, the main factor that affects the convergence rate is the minimum eigenvalue

\frac{1}{2} λ_{min} (A + A^{⊤})

of the key matrix

A

. Furthermore, the larger the minimum eigenvalue of the key matrix, the faster the algorithm has a convergence rate. (This corollary is not actually what we discovered first. An anonymous expert reviewer of UAI2023 pointed out the role of the smallest eigenvalue. However, we did not find any relevant evidence or conclusions in the existing literature, so we formally stated this conclusion here).

Proof.

Based on Assumptions 1 and 2, we obtain Theorem 1 on the expected convergence rate. Checking the error bound (37) in Theorem 1, removing the identical settings for different algorithms, one can easily find that each term contains only a key term

e^{λ_{i}^{t}}

, where

i \in [0, t]

. When

i < t

,

\begin{matrix} λ_{i}^{t} = - λ_{min} (A + A^{⊤}) \sum_{k = i}^{t} α_{k} + λ_{max} (A^{⊤} A + C_{s} I) \sum_{k = i}^{t} α_{k}^{2} . \end{matrix}

(53)

Based on Assumption 3, we have

\sum_{k = i}^{t} α_{t} > \sum_{k = i}^{t} α_{t}^{2} .

Furthermore, fix variable i; thus, we have

lim_{t \to \infty} \sum_{k = i}^{t} α_{t} = \infty,

(54)

and

lim_{t \to \infty} \sum_{k = i}^{t} α_{t}^{2} < \infty .

(55)

That is

lim_{t \to \infty} \sum_{k = i}^{t} α_{k} ≫ lim_{t \to \infty} \sum_{k = i}^{t} α_{k}^{2} .

(56)

Therefore, the main factor in (53) is

- λ_{min} (A + A^{⊤}) \sum_{k = i}^{t} α_{k}

. Finally, based on the same learning rate sequence,

\frac{1}{2} λ_{min} (A + A^{⊤})

is the main factor that affects the convergence rate.

In (37), for a given t, a smaller bound indicates a faster convergence rate. In (37), all elements including all

e^{λ_{i}^{t}}

,

E ∥ θ_{0} - θ^{*} ∥^{2}

,

C_{s}

, and all

α_{i}

, are greater than zero, meaning that a smaller

e^{λ_{i}^{t}}

leads to a faster convergence rate. Therefore, the smaller the value of

λ_{i}^{t}

in (53), the faster the convergence rate. Hence, the larger the value of

\frac{1}{2} λ_{min} (A + A^{⊤})

, the faster the convergence rate. □

5. Numerical Analysis

To compare the expected convergence rates of various algorithms, what we are going to do next is to compute and compare the minimum eigenvalues of each algorithm in different environments and different policies based on Corollary 2. The environments include two-state counterexample [2], Baird’s counterexample [3,5], Random Walk with tabular feature [5], Random Walk with inverted feature [5], Random Walk with dependent feature [5], and Boyan Chain [5,17]. Furthermore, in Random Walk, the target policy takes the right action in 60% of the time and the behavior policy selects the right and left action with equal probability [18]. In the Boyan Chain, the target policy and the behavior policy are the same.

5.1. Example Settings

Two-state counterexample: The

θ \to 2 θ

problem has only two states. From each state, there are two actions, left and right, which take the agent to the left or right state. All rewards are zeros. The features

Φ = {(1, 2)}^{⊤}

are assigned to the left and the right state. The schematic of the environment is shown in Figure 1.

The behavior policy takes the equal probability to left or right in both states, i.e.,

P_{μ} = [\begin{matrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{matrix}] .

The target policy only selects action right in both states, i.e.,

P_{π} = [\begin{matrix} 0 & 1 \\ 0 & 1 \end{matrix}] .

The state distribution

d_{μ} = {(0.5, 0.5)}^{⊤}

,

d_{x} = d_{c} = {(0.5, 0.5)}^{⊤}

,

C = 2.5

,

f = {(0.5, 9.5)}^{⊤}

,

r_{π} = 0

,

r_{c} = 0

.

Baird’s counterexample: Baird’s counterexample has seven states, and the schematic of the environment is shown in Figure 2.

The feature matrix is of dimensions

(7 \times 8)

where each state is represented by an 8-dimensional feature, i.e.,

Φ = [\begin{matrix} 2 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 2 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 2 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 2 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 2 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 2 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 2 \end{matrix}] .

The behavior policy takes equal probabilities to each state, i.e.,

P_{μ} = [\begin{matrix} \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} \\ \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} \\ \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} \\ \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} \\ \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} \\ \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} \\ \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} & \frac{1}{7} \end{matrix}] .

The target policy only selects the action to state 7 in both states, i.e.,

P_{π} = [\begin{matrix} 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{matrix}] .

The state distribution

d_{μ} = {(\frac{1}{7}, \frac{1}{7}, \frac{1}{7}, \frac{1}{7}, \frac{1}{7}, \frac{1}{7}, \frac{1}{7})}^{⊤},

d_{x} = d_{c} = {(\frac{1}{7}, \frac{1}{7}, \frac{1}{7}, \frac{1}{7}, \frac{1}{7}, \frac{1}{7}, \frac{1}{7})}^{⊤},

C = [\begin{matrix} 0.571 & 0 & 0 & 0 & 0 & 0 & 0.286 \\ 0 & 0.571 & 0 & 0 & 0 & 0 & 0.286 \\ 0 & 0 & 0.571 & 0 & 0 & 0 & 0.286 \\ 0 & 0 & 0 & 0.571 & 0 & 0 & 0.286 \\ 0 & 0 & 0 & 0 & 0.571 & 0 & 0.286 \\ 0 & 0 & 0 & 0 & 0 & 0.571 & 0.286 \\ 0.286 & 0.286 & 0.286 & 0.286 & 0.286 & 0.286 & 1.429 \end{matrix}] .

f = {(\frac{1}{7}, \frac{1}{7}, \frac{1}{7}, \frac{1}{7}, \frac{1}{7}, \frac{1}{7}, \frac{64}{7})}^{⊤},

r_{π} = {(0, 0, 0, 0, 0, 0, 0)}^{⊤},

r_{c} = {(0, 0, 0, 0, 0, 0, 0)}^{⊤} .

Random Walk: Random Walk is centered around a typical Markov Chain. This chain comprises five consecutive states, with two terminal states positioned at each extremity serving as absorptive endpoints.

The schematic of the environment is shown in Figure 3.

Specifically, its state–state transfer probability of behavior policy is

P_{μ} = [\begin{matrix} 0 & 0.6 & 0.4 & 0 & 0 \\ 0.4 & 0 & 0.6 & 0 & 0 \\ 0 & 0.4 & 0 & 0.6 & 0 \\ 0 & 0 & 0.4 & 0 & 0.6 \\ 0 & 0 & 0.6 & 0.4 & 0 \end{matrix}] .

The state–state transfer probability of the target policy is

P_{π} = [\begin{matrix} 0 & 0.5 & 0.5 & 0 & 0 \\ 0.5 & 0 & 0.5 & 0 & 0 \\ 0 & 0.5 & 0 & 0.5 & 0 \\ 0 & 0 & 0.5 & 0 & 0.5 \\ 0 & 0 & 0.5 & 0.5 & 0 \end{matrix}] .

The state distribution

d_{μ} = {(\frac{1}{9}, \frac{2}{9}, \frac{3}{9}, \frac{2}{9}, \frac{1}{9})}^{⊤}

,

d_{c} = (0.9

,

0.9

,

0.9

,

0.9

,

{0.9)}^{⊤}

,

d_{c} = (\frac{5}{6}

,

\frac{5}{6}

,

\frac{5}{6}

,

\frac{5}{6}

,

\frac{5}{6})^{⊤}

,

f = (0.773

,

1.84

,

3.333

,

2.56

,

{1.493)}^{⊤}

,

r_{π} = (- 0.4

, 0, 0, 0,

{0.6)}^{⊤}

,

r_{c} = (- 0.4

, 0, 0, 0,

{0.5)}^{⊤}

.

The Random Walk environment has three feature representations, which are called tabular features, inverted features, and dependent features. The feature matrix of Random Walk with tabular features is

Φ = [\begin{matrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{matrix}] .

C = [\begin{matrix} 0.111 & 0 & 0 & 0 & 0 \\ 0 & 0.222 & 0 & 0 & 0 \\ 0 & 0 & 0.333 & 0 & 0 \\ 0 & 0 & 0 & 0.222 & 0 \\ 0 & 0 & 0 & 0 & 0.111 \end{matrix}] .

The feature matrix of Random Walk with inverted features is

Φ = [\begin{matrix} 0 & 0.5 & 0.5 & 0.5 & 0.5 \\ 0.5 & 0 & 0.5 & 0.5 & 0.5 \\ 0.5 & 0.5 & 0 & 0.5 & 0.5 \\ 0.5 & 0.5 & 0.5 & 0 & 0.5 \\ 0.5 & 0.5 & 0.5 & 0.5 & 0 \end{matrix}] .

C = [\begin{matrix} 0.222 & 0.167 & 0.139 & 0.167 & 0.194 \\ 0.167 & 0.194 & 0.111 & 0.139 & 0.167 \\ 0.139 & 0.111 & 0.167 & 0.111 & 0.139 \\ 0.167 & 0.139 & 0.111 & 0.194 & 0.167 \\ 0.194 & 0.167 & 0.139 & 0.167 & 0.222 \end{matrix}] .

The feature matrix of Random Walk with dependent features is

Φ = [\begin{matrix} 1 & 0 & 0 \\ \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\ \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} \\ 0 & \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ 0 & 0 & 1 \end{matrix}] .

C = [\begin{matrix} 0.333 & 0.222 & 0.111 \\ 0.222 & 0.333 & 0.222 \\ 0.111 & 0.222 & 0.333 \end{matrix}] .

Boyan Chain: The Boyan Chain consists of 13 states, each of which is represented by 4 state features. The feature matrix of the Boyan Chain with dependent features is

Φ = [\begin{matrix} 1 & 0 & 0 & 0 \\ 0.75 & 0.25 & 0 & 0 \\ 0.5 & 0.5 & 0 & 0 \\ 0.25 & 0.75 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0.75 & 0.25 & 0 \\ 0 & 0.5 & 0.5 & 0 \\ 0 & 0.25 & 0.75 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0.75 & 0.25 \\ 0 & 0 & 0.5 & 0.5 \\ 0 & 0 & 0.25 & 0.75 \\ 0 & 0 & 0 & 1 \end{matrix}] .

The state distribution

d_{μ} = (0.108

,

0.054

,

0.081

,

0.068

,

0.075

,

0.071

,

0.073

,

0.072

,

0.072

,

0.072

,

0.072

,

0.072

,

{0.108)}^{⊤},

d_{x}

= d_{c}

= (1

, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

{1)}^{⊤},

C = [\begin{matrix} 0.163 & 0.043 & 0 & 0 \\ 0.043 & 0.199 & 0.045 & 0 \\ 0 & 0.045 & 0.199 & 0.045 \\ 0 & 0 & 0.045 & 0.172 \end{matrix}] .

f = {(1.084, 0.542, 0.813, 0.678, 0.745, 0.712, 0.729, 0.72, 0.724, 0.722, 0.723, 0.723, 1.084)}^{⊤}

,

r_{π} = r_{c} = {(- 3, - 3, - 3, - 3, - 3, - 3, - 3, - 3, - 3, - 3, - 3, - 2, 0)}^{⊤} .

Then, the state–state transfer probability of behavior policy is the same as the state–state transfer probability of target policy, i.e.,

P_{μ} = P_{π}

, where the element of matrix

P_{μ}

is as follows, for

i, j \in [0, 12]

:

P_{μ} [i, j] = \{\begin{matrix} 0.5 & if i \leq 10 and j = i + 1 \\ 0.5 & if i \leq 10 and j = i + 2 \\ 1 & if i = 11 and j = 12 \\ 1 & if i = 12 and j = 0 \\ 0 & otherwise . \end{matrix}

The schematic of the environment is shown in Figure 4.

5.2. Results and Analysis

Based on Table 1, first, we need to set

Φ

,

P_{μ}

,

P_{π}

,

D_{c}

,

r_{c}

,

r_{π}

, C,

D_{f}

, and

D_{x}

for each setting. Then, based on the property

d_{μ} = P_{μ}^{⊤} d_{μ}

, we compute the eigenvector of the matrix

P_{μ}^{⊤}

with eigenvalue 1.0, and we unitize it to obtain

d_{μ}

and

D_{μ}

. (It is important to note that when there are absorbing states in a Markov Chain, the probability distribution is 1.0 only at the absorbing states and 0 at all other states. Therefore, we adopt a restart approach, jumping directly back to the initial states, thus forming a probability transition matrix

P_{μ}

without absorbing states.) After all, we compute the key matrix for each algorithm and its minimum eigenvalue. Note that the step-size ratio

η

of the auxiliary parameter to the learning parameter is usually set to

η \geq 1.0

.

The minimum eigenvalues in several examples for each algorithm are summarized in Table 2. We can find the following: (1) In Baird’s counterexample and a two-state counterexample, the minimum eigenvalues of key matrices in off-policy TD and Retrace are both less than 0, indicating that they will diverge in these two counterexamples, which is consistent with the existing research [2]. Additionally, compared with off-policy TD, Retrace does have some mitigation towards divergence but cannot avoid it.

(2) In Baird’s counterexample, the minimum eigenvalues of GTD2, TDC, ETD, and MRetrace are all less than zero, indicating that their key matrices are not positive definite. This seems inconsistent with our understanding. The main reason is that in Baird’s counterexample, the feature matrix is

7 \times 8

, and is not column full rank, which is inconsistent with the assumption in their theorems. Note that the absolute value of the minimum eigenvalue is very small.

(3) Except for Baird’s counterexample, the minimum eigenvalues of GTD and GTD2 are all zeros. The reason is that matrices

\frac{1}{2} λ_{min} (A + A^{⊤})

of GTD and GTD2 are positive semi-definite. In the context of this paper, we are unable to distinguish which one is faster between GTD and GTD2 in numerical analysis.

(4) The minimum eigenvalue of TDC is higher than GTD and GTD2, which is consistent with the literature.

(5) Except for Baird’s counterexample, the minimum eigenvalue of ETD is the largest.

(6) Except for Baird’s counterexample, MRetrace has the second-largest minimum eigenvalues.

(7) All the minimum eigenvalues of BR are greater than 0, making it the only one to remain positive definite in all examples. However, except for Baird’s counterexample, its minimum eigenvalue is not large.

(8) The Boyan Chain is an on-policy setting; off-policy TD is implemented in on-policy; and off-policy TD, Retrace, and MRetrace have the same minimum eigenvalue of 0.024. BR and TDC have the same minimum eigenvalue of 0.002, indicating that BR and TDC are not suitable for on-policy learning.

(9) It is surprising that in an on-policy setting, the minimum eigenvalue of ETD is larger than that of on-policy TD. This implies that in terms of the expected convergence rate, ETD is the most recommended option for on-policy learning.

To the best of our knowledge, this is the first time that various temporal difference learning algorithms have been compared for their convergence rates in a very intuitive numerical analysis manner.

In summary, in the expected sense, ETD has the fastest convergence rate, followed by MRetrace.

6. Experimental Studies

In Section 3, we proposed Corollary 2, and in Section 4, we compared the sizes of the minimum eigenvalues of different algorithms for various environment settings. This means that we carried out theoretical analysis combined with numerical analysis, but whether the analytical results reflect the actual situation needs to be experimentally verified. Therefore, we adopted the same environment settings as in Section 4.

Each algorithm runs independently 20 times, with 1000 steps per run, and calculates the mean and standard deviation.

To compare convergence rates, we need to observe the trend of

| θ_{t} - θ^{*} |

over time step t. Note that according to Table 1, different algorithms have different optimal solutions

θ^{*} = A^{- 1} b

, so we need to calculate their optimal solutions separately for each algorithm and then measure the errors.

The learning rate is set to satisfy Assumption 3,

α_{t} = α_{0} \times \frac{1}{t + 1}

, where

α_{0}

is an initial learning rate. We set

η = 4

for GTD and GTD2 in all environments. In two-state counterexamples,

α_{0} = 0.1

, and

γ = 0.9

. In Baird’s counterexamples,

α_{0} = 0.05

, and

γ = 0.99

. In Random Walk,

α_{0} = 0.25

, and

γ = 0.9

. In a Boyan Chain,

α_{0} = 0.25

, and

γ = 0.9

.

The learning curves of different algorithms in different environments are shown in Figure 5, where each curve displays the mean and standard deviation of the errors. We can find the following: (1) In Random Walk with dependent features, the convergence rate analysis is consistent with the learning curves of each algorithm. They have the same order: ETD > MRetrace > off-policy TD > Retrace > TDC ≥ BR > GTD2 ≥ GTD.

(2) Off-policy TD and Retrace diverge in both counterexamples, but Retrace diverges more slowly compared with off-policy TD. This is consistent with the numerical analysis in Section 4.

(3) In Baird’s counterexample, GTD descends faster than GTD2 and TDC. This may be related to the numerical analysis showing that the minimum eigenvalues of GTD2 and TDC are less than 0. Therefore, this is consistent with the analysis.

(4) In Baird’s counterexample, ETD diverges. On the one hand, this is related to high variances of ETD reported in the literature [6,19]; on the other hand, it is also related to the numerical analysis, showing that the minimum eigenvalue is less than 0.

(5) In a two-state counterexample, compared to BR, ETD, and MRetrace, the algorithms GTD, GTD2, and TDC converge very slowly. Additionally, BR converges slower than MRetrace and ETD. These observations are consistent with our numerical analysis. However, ETD is slower than MRetrace, which is attributed to the high variance of ETD.

(6) In Random Walk and the Boyan Chain, TDC converges much faster than GTD and GTD2, which is consistent with our numerical analysis. This observation also aligns with existing research.

(7) In Random Walk and the Boyan Chain, the convergence rate of ETD is remarkably fast, which aligns with our numerical analysis. ETD is reported to be suitable for off-policy learning, but its exceptional performance in on-policy learning is somewhat incredible.

(8) In the Boyan Chain, off-policy TD is implemented as on-policy TD. The learning curves of TD, Retrace, and MRetrace overlap. This is consistent with our numerical analysis.

In conclusion, the convergence rates of the algorithms in the experiments align with the numerical analysis. Moreover, any inconsistencies have interpretable underlying reasons.

7. Conclusions and Future Work

Based on the proposed convergence rate for constructing general off-policy temporal difference learning algorithms, this paper proved that the primary determinant influencing convergence rate is the minimum eigenvalue of the key matrix. Focusing on this factor will be more conducive to the development of faster converging off-policy learning algorithms.

The limitations of this paper include the following aspects:

(1): This paper assumes that the learning rates of all algorithms are the same; however, in reality, different algorithms have different ranges of applicable learning rates.
(2): This paper does not consider the scenario of a fixed learning rate.
(3): This paper focuses on learning prediction and does not address learning control.

Future works need to address the above limitations and explore how to design faster algorithms based on the conclusions of this paper.

In the end, we discovered a contradiction. Sutton et al. [4] proved that

A_{GTD} = (\begin{matrix} \sqrt{η} I & A_{off} \\ - A_{off}^{⊤} & 0 \end{matrix})

is positive definite. Note that in the paper,

G = (\begin{matrix} - \sqrt{η} I & - A_{off} \\ A_{off}^{⊤} & 0 \end{matrix})

is proved to be negative definite [4]. According to Theorem A.3 of [20], the positive definiteness of square matrix

A

is equivalent to the positive definiteness of

(A + A^{⊤})

; thus,

(A_{GTD} + A_{GTD}^{⊤})

is positive definite. However, our calculations in this paper showed that

(A_{GTD} + A_{GTD}^{⊤}) = (\begin{matrix} 2 \sqrt{η} I & 0 \\ 0 & 0 \end{matrix})

is positive semi-definite, not positive definite. Therefore, the conclusion that

A_{GTD}

is positive definite is questionable.

Author Contributions

Conceptualization, X.C.; methodology, X.C.; software, W.Q.; formal analysis, W.Q.; investigation, Y.G.; discussion, S.Y.; writing—review and editing, X.C.; supervision, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Natural Science Foundation, China (Nos. 62276142, 62206133, and 62202240).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of this study; in the collection, analyses, or interpretation of data; in the writing of this manuscript; or in the decision to publish the results.

References

Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
Sutton, R.S.; Mahmood, A.R.; White, M. An emphatic approach to the problem of off-policy temporal-difference learning. J. Mach. Learn. Res. 2016, 17, 2603–2631. [Google Scholar]
Baird, L. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning, Tahoe City, CA, USA, 9–12 July 1995; pp. 30–37. [Google Scholar]
Sutton, R.S.; Maei, H.R.; Szepesvári, C. A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation. Adv. Neural Inf. Process. Syst. 2008, 21, 1609–1616. [Google Scholar]
Sutton, R.; Maei, H.; Precup, D.; Bhatnagar, S.; Silver, D.; Szepesvári, C.; Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 993–1000. [Google Scholar]
Chen, X.; Ma, X.; Li, Y.; Yang, G.; Yang, S.; Gao, Y. Modified retrace for off-policy temporal difference learning. In Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, Pittsburgh, PA, USA, 31 July–4 August 2023; pp. 303–312. [Google Scholar]
Dalal, G.; Szörényi, B.; Thoppe, G.; Mannor, S. Finite sample analyses for TD (0) with function approximation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LO, USA, 2–7 February 2018; Volume 32. [Google Scholar]
Dalal, G.; Thoppe, G.; Szörényi, B.; Mannor, S. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. In Proceedings of the Conference On Learning Theory, PMLR, Stockholm, Sweden, on 5–9 July 2018; pp. 1199–1233. [Google Scholar]
Gupta, H.; Srikant, R.; Ying, L. Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 4704–4713. [Google Scholar]
Xu, T.; Zou, S.; Liang, Y. Two time-scale off-policy TD learning: Non-asymptotic analysis over Markovian samples. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 10634–10644. [Google Scholar]
Dalal, G.; Szorenyi, B.; Thoppe, G. A tale of two-timescale reinforcement learning with the tightest finite-time bound. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–8 February 2020; Volume 34, pp. 3701–3708. [Google Scholar]
Durmus, A.; Moulines, E.; Naumov, A.; Samsonov, S.; Scaman, K.; Wai, H.T. Tight high probability bounds for linear stochastic approximation with fixed stepsize. Adv. Neural Inf. Process. Syst. 2021, 34, 30063–30074. [Google Scholar]
Xu, T.; Liang, Y. Sample complexity bounds for two timescale value-based reinforcement learning algorithms. In Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, Virtual, 13–15 April 2021; pp. 811–819. [Google Scholar]
Zhang, S.; Des Combes, R.T.; Laroche, R. On the convergence of SARSA with linear function approximation. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 41613–41646. [Google Scholar]
Wang, S.; Si, N.; Blanchet, J.; Zhou, Z. A finite sample complexity bound for distributionally robust q-learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 3370–3398. [Google Scholar]
Munos, R.; Stepleton, T.; Harutyunyan, A.; Bellemare, M. Safe and efficient off-policy reinforcement learning. Adv. Neural Inf. Process. Syst. 2016, 29, 1054–1062. [Google Scholar]
Boyan, J.A. Technical update: Least-squares temporal difference learning. Mach. Learn. 2002, 49, 233–246. [Google Scholar] [CrossRef]
Ghiassian, S.; Patterson, A.; Garg, S.; Gupta, D.; White, A.; White, M. Gradient temporal-difference learning with regularized corrections. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 3524–3534. [Google Scholar]
Zhang, S.; Whiteson, S. Truncated emphatic temporal difference methods for prediction and control. J. Mach. Learn. Res. 2022, 23, 6859–6917. [Google Scholar]
Sutton, R.S. Learning to predict by the methods of temporal differences. Mach. Learn. 1988, 3, 9–44. [Google Scholar] [CrossRef]

Figure 1. Two-state counterexample, where the blue solid arrows represent the behavior policy

μ

, and the red dashed arrows represent the target policy

π

.

Figure 1. Two-state counterexample, where the blue solid arrows represent the behavior policy

μ

, and the red dashed arrows represent the target policy

π

.

Figure 2. Baird’s counterexample, where the probability of the solid action and the dashed action in behavior policy

μ

and target policy

π

are

μ (d a s h e d ∣ \cdot) = \frac{6}{7}

,

μ (s o l i d ∣ \cdot) = \frac{1}{7}

and

π (s o l i d ∣ \cdot) = 1

.

Figure 2. Baird’s counterexample, where the probability of the solid action and the dashed action in behavior policy

μ

and target policy

π

are

μ (d a s h e d ∣ \cdot) = \frac{6}{7}

,

μ (s o l i d ∣ \cdot) = \frac{1}{7}

and

π (s o l i d ∣ \cdot) = 1

.

Figure 3. Random Walk. All walks begin in state 3. Under the behavior policy, take either a left or right action with a probability of 0.5 in each state. Under the target policy, take either a left or right action with a probability of 0.4 or 0.6 in each state.

Figure 4. Boyan Chain. In state 0–10, each solid action is taken with probability 0.5. In states 11 and 12, the probability of a solid action is 1.

Figure 5. Comparisons of learning curves in different environments. (a) Two-state counterexamples; (b) Baird’s counterexamples; (c) Random Walk with tabular features; (d) Random Walk with inverted features; (e) Random Walk with dependent features; (f) Boyan Chain.

Table 1. The general solution expressions for each algorithm (

θ = A^{- 1} b

).

Table 1. The general solution expressions for each algorithm (

θ = A^{- 1} b

).

Algorithm	Key Matrix A	Positive Definite	b
Off-policy TD	$A_{off} = Φ^{⊤} D_{μ} (I - γ P_{π}) Φ$	×	$b_{off} = Φ^{⊤} D_{μ} r_{π}$
Retrace	$Φ^{⊤} D_{μ} D_{c} (I - γ P_{π}) Φ$	×	$Φ^{⊤} D_{μ} r_{c}$
BR	$Φ^{⊤} {(I - γ P_{π})}^{⊤} D_{μ} (I - γ P_{π}) Φ$	✓	$Φ^{⊤} {(I - γ P_{π})}^{⊤} D_{μ} r_{π}$
GTD	$\begin{matrix} (\frac{\begin{matrix} \sqrt{η} I & A_{off} \end{matrix}}{\begin{matrix} - A_{off}^{⊤} & 0 \end{matrix}}) \end{matrix}$	✓	$(\frac{\begin{matrix} b_{off} \end{matrix}}{0})$
GTD2	$(\frac{\begin{matrix} \sqrt{η} C & A_{off} \end{matrix}}{\begin{matrix} - A_{off}^{⊤} & 0 \end{matrix}})$	✓	$(\frac{\begin{matrix} b_{off} \end{matrix}}{0})$
TDC	$A_{off}^{⊤} C^{- 1} A_{off}$	✓	$A_{off}^{⊤} C^{- 1} b_{off}$
ETD	$Φ^{⊤} D_{f} (I - γ P_{π}) Φ$	✓	$Φ^{⊤} D_{f} r_{π}$
MRetrace	$Φ^{⊤} D_{μ} (I - γ D_{x} P_{π}) Φ$	✓	$b_{off}$

Table 2. Minimum eigenvalues

\frac{1}{2} λ_{min} (A + A^{⊤})

of various algorithms on several examples.

Table 2. Minimum eigenvalues

\frac{1}{2} λ_{min} (A + A^{⊤})

of various algorithms on several examples.

Algorithm	Two-State	Baird’s	Random Walk			Boyan Chain
Algorithm	Two-State	Baird’s	Tabular	Inverted	Dependent	Boyan Chain
Off-policy TD	$- 0.2$	$- 0.791$	$0.018$	$0.017$	$0.07$	$0.024$
Retrace	$- 0.1$	$- 0.113$	$0.017$	$0.015$	$0.063$	$0.024$
BR	$0.34$	9.673 × 10⁻¹⁷	$0.002$	$0.007$	$0.033$	$0.002$
GTD	0	0	0	0	0	0
GTD2	0	−1.077 × 10⁻¹⁷	0	0	0	0
TDC	$0.016$	$- 0.002$	$0.002$	$0.007$	$0.011$	$0.002$
ETD	$3.4$	−2.82 × 10⁻¹⁶	$0.195$	$0.165$	$0.76$	$0.245$
MRetrace	$1.15$	−2.141 × 10⁻¹⁷	$0.046$	$0.02$	$0.094$	$0.024$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chen, X.; Qin, W.; Gong, Y.; Yang, S.; Wang, W. On Convergence Rate of MRetrace. Mathematics 2024, 12, 2930. https://doi.org/10.3390/math12182930

AMA Style

Chen X, Qin W, Gong Y, Yang S, Wang W. On Convergence Rate of MRetrace. Mathematics. 2024; 12(18):2930. https://doi.org/10.3390/math12182930

Chicago/Turabian Style

Chen, Xingguo, Wangrong Qin, Yu Gong, Shangdong Yang, and Wenhao Wang. 2024. "On Convergence Rate of MRetrace" Mathematics 12, no. 18: 2930. https://doi.org/10.3390/math12182930

APA Style

Chen, X., Qin, W., Gong, Y., Yang, S., & Wang, W. (2024). On Convergence Rate of MRetrace. Mathematics, 12(18), 2930. https://doi.org/10.3390/math12182930

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

On Convergence Rate of MRetrace

Abstract

1. Introduction

2. Background

2.1. Markov Decision Process

2.2. Learning Algorithms and Their Key Matrices

2.2.1. Off-Policy TD

2.2.2. Retrace(0)

2.2.3. Naive Bellman Residual

2.2.4. GTD

2.2.5. GTD2

2.2.6. TDC

2.2.7. ETD

2.2.8. MRetrace

3. Finite Sample Analysis

3.1. Convergence Rate of General Temporal Difference Learning Algorithm

3.2. Convergence Rate of the MRetrace Algorithm

4. How to Compare?

5. Numerical Analysis

5.1. Example Settings

5.2. Results and Analysis

6. Experimental Studies

7. Conclusions and Future Work

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI