
The Adaptive Optimal Output Feedback Tracking Control of Unknown Discrete-Time Linear Systems Using a Multistep Q-Learning Approach

1 School of Automation Science and Engineering, South China University of Technology, Guangzhou 510641, China
2 Intelligent Mobile Robot Research Institute (Zhongshan), Zhongshan 528478, China
3 School of Automation Science and Engineering, Key Laboratory of Autonomous Systems and Networked Control, Ministry of Education, Guangdong Engineering Technology Research Center of Unmanned Aerial Vehicle System, South China University of Technology, Guangzhou 510641, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(4), 509; https://doi.org/10.3390/math12040509
Submission received: 2 January 2024 / Revised: 23 January 2024 / Accepted: 31 January 2024 / Published: 6 February 2024
(This article belongs to the Section Computational and Applied Mathematics)

Abstract

This paper investigates the output feedback (OPFB) tracking control problem for discrete-time linear (DTL) systems with unknown dynamics. To solve this problem, we use an augmented system approach, which first transforms the tracking control problem into a regulation problem with a discounted performance function. The solution to this problem is derived using a Bellman equation based on the Q-function. To overcome the challenge of unmeasurable system state variables, we employ a multistep Q-learning algorithm that combines the advantages of the policy iteration (PI) and value iteration (VI) techniques with state reconstruction methods for output feedback control. As such, the requirement for an initial stabilizing control policy for the PI method is removed and the convergence speed of the learning algorithm is improved. Finally, we demonstrate the effectiveness of the proposed scheme using a simulation example.

1. Introduction

The optimization of performance costs has always been a crucial concern in controller design problems as it can lead to energy savings and, subsequently, have a positive impact on the environment. The development of practical requirements has significantly contributed to the advancement of optimal control [1,2,3,4]. The key challenge in optimal control lies in solving the Riccati equation for linear systems. In the case of linear systems, computationally efficient iterative algorithms [5,6] can be employed to obtain the solution to the Riccati equation. However, this method is only applicable when a comprehensive understanding of the system dynamics is available. In control engineering, online learning controllers have commonly been designed without complete knowledge of the system dynamics [7,8,9,10,11]. Notably, a data-based approach was proposed in [12] for analyzing the controllability and observability of discrete-time linear (DTL) systems without the precise knowledge of system parameters.
Reinforcement learning (RL) is a powerful method for optimizing rewards via interactions with the environment [13]. Utilizing RL techniques, controller performance can be enhanced based on reward signals [14] and controller parameters can be updated to achieve optimal design criteria for adaptive control. Consequently, RL has provided valuable insights into the field of control systems [14,15], augmented by the introduction of the adaptive dynamic programming (ADP) approach, which aims to achieve optimal performance indices for (partially) model-free scenarios [15,16,17,18,19]. Extensive research has been conducted on developing optimal control schemes based on the ADP concept, particularly for applications in linear quadratic regulator (LQR) and linear quadratic tracking (LQT) problems, which was outlined comprehensively in [2,20,21,22,23] and other related references. It is worth noting that learning schemes in reinforcement learning generally involve two iterative steps: policy evaluation and policy update (with the latter focusing on policy improvement). However, it is essential to acknowledge that reinforcement learning based on value function approximation (VFA) introduces deliberate exploration noise to fully investigate systems, thereby undermining the algorithm’s convergence [23,24,25]. Furthermore, the policy iteration (PI) scheme within the adaptive dynamic programming (ADP) framework necessitates an initially admissible policy, which demands a priori knowledge of unknown systems to design robust controllers [22,26]. To overcome this requirement, recent studies have adopted value iteration (VI) methods [23,27,28] within value function approximation (VFA) schemes. Recently, event-triggered control approaches have also been applied to solve the adaptive optimal output regulation problem using PI and VI methods with one-step learning [29].
Most current studies in the field of control engineering have relied on the ability to measure the complete state information of systems [23,30], which is often challenging to achieve in practical engineering applications [31]. As such, the development of output feedback learning controllers has become essential. In the literature, dynamic output feedback controllers have been investigated [32], which rely on the Q-learning algorithm to solve the LQR control problem for discrete-time linear systems. Additionally, a state parameterization method for reconstructing system states based on filtered input and output signals has been proposed. In contrast, static output feedback designs are popular due to their simplicity and have been used to solve the LQR problem for continuous-time linear systems [33]. However, obtaining static output feedback controllers requires not only the complete state variable information of systems during the learning phase but also model-free state estimation techniques based on neural networks [34,35]. An alternative approach, first proposed in [24], is to use the measurements of past inputs, outputs, and reference trajectories in a system as substitutes for the unmeasurable system state to learn the output feedback LQR controller. This approach has also been extended to solve the output feedback LQT problem by employing the VFA technique [25]. Furthermore, model-free state reconstruction techniques have recently been applied to solve output feedback Q-learning PI schemes for $H_\infty$ control problems [36,37].
In this paper, we propose a tracking control approach that utilizes a static output feedback multistep Q-learning algorithm in conjunction with state reconstruction techniques. A separate adaptation mechanism was introduced in [38] to estimate unknown feedforward tracking terms. In contrast, the static OPFB design adopted in this work is attractive because of its structural simplicity.
The key contributions of this work can be summarized as follows:
  • Compared to the results reported in [23,39], the proposed approach does not rely on an actor–critic structure, which depends on actor and critic neural networks to approximate the control policy or value function. Moreover, the proposed model-free learning approach removes the requirement for the measurability of system state variables by collecting past input, output, and reference trajectory data. This is particularly advantageous in practical scenarios in which obtaining full state information may be challenging or costly;
  • VFA-based learning [23,39] can ruin algorithm convergence due to the exploration noise that is intentionally added to evaluated policies to sufficiently excite systems. However, we apply the Q-learning scheme [40], which creates no biases in the estimated parameters of Q-function Bellman equations;
  • Using the proposed multistep Q-learning technique [41], which combines the advantages of the PI and VI methods, we are able to remove the requirement for an initial stabilizing control strategy. Moreover, this combination improves the convergence speed of the algorithm, leading to more efficient control performance.
The rest of this paper is organized as follows: Section 2 formulates the problem statement, Section 3 presents the proposed methodology, Section 4 displays the simulation results, and finally, Section 5 concludes the paper with some discussion and future research directions.

2. Problem Statement

This section will first review the problem of infinite-horizon LQT for DTL systems. Then, we will present some fundamental results for solving a discrete-time Bellman equation.
Consider a time-invariant DTL system described by the following state and output equations:
$x_{k+1} = A x_k + B u_k, \qquad y_k = C x_k$    (1)
where $x_k \in \mathbb{R}^n$, $u_k \in \mathbb{R}^m$, and $y_k \in \mathbb{R}^p$ represent the state, input, and output, respectively. The matrices A, B, and C are constant, and the pairs $(A, B)$ and $(A, C)$ are controllable and observable, respectively.
The reference trajectory is generated by the exogenous system:
$r_{k+1} = F r_k$    (2)
where $r_k \in \mathbb{R}^p$ and F is a constant matrix.
The tracking error is defined as follows:
$e_k = y_k - r_k$    (3)
The goal is to create an optimal control policy, $u_k$, that allows the output, $y_k$, to track the reference trajectory, $r_k$, in an optimal way. This is achieved by minimizing the following discounted performance index:
$J(x_k, r_k) = \frac{1}{2}\sum_{i=k}^{\infty}\gamma^{i-k}\left(e_i^T Q e_i + u_i^T R u_i\right)$    (4)
where Q and R are positive definite weighting matrices, and $0 < \gamma \le 1$ represents the discount factor.
Remark 1.
As stated in [40], the discount factor γ in (4) allows for a more general solution to the LQT problem compared to the standard setting. Importantly, the matrix F need not be stable, thus permitting a broader class of reference signals for the tracking control problem with the quadratic performance index. Additionally, this framework allows for the simultaneous optimization of both the feedback and feedforward components of the control input, leading to a causal solution to the infinite-horizon LQT problem. It is worth noting that the use of the discount factor γ does not sacrifice generality, as one can set γ = 1 when F is Schur stable, in which case the LQT problem reduces to an LQR problem with the specified output trajectory decaying exponentially to zero.

2.1. Offline Solution for LQT

By denoting $X_k = [x_k^T \; r_k^T]^T$, we obtain the following augmented system:
$X_{k+1} = T X_k + B_1 u_k, \qquad e_k = C_1 X_k$    (5)
where $T = \begin{bmatrix} A & 0 \\ 0 & F \end{bmatrix}$, $B_1 = \begin{bmatrix} B \\ 0 \end{bmatrix}$, and $C_1 = \begin{bmatrix} C & -I \end{bmatrix}$.
It can be shown by Lemma 1 of [40] that, with the choice of $u_k = -K X_k$, where $K = [K_x \; K_r]$, the discounted performance index (4) can be expressed in a quadratic form as follows:
$V(x_k, r_k) = V(X_k) = \frac{1}{2} X_k^T P X_k$    (6)
where $P = P^T > 0$.
Using Formula (4), the cost function can be expressed as follows:
$J(x_k, r_k) = \frac{1}{2}\left(e_k^T Q e_k + u_k^T R u_k\right) + \frac{\gamma}{2}\sum_{i=k+1}^{\infty}\gamma^{i-(k+1)}\left(e_i^T Q e_i + u_i^T R u_i\right)$    (7)
Using Equation (6), the cost function $J(x_k, r_k)$ can be rewritten as the value function $V(x_k, r_k)$, which satisfies
$V(x_k, r_k) = \frac{1}{2} e_k^T Q e_k + \frac{1}{2} u_k^T R u_k + \gamma V(x_{k+1}, r_{k+1})$    (8)
Substituting Equation (6) into Equation (8) yields the LQT Bellman equation for P:
$X_k^T P X_k = X_k^T \Pi X_k + u_k^T R u_k + \gamma X_{k+1}^T P X_{k+1}$    (9)
where $\Pi = \begin{bmatrix} C^T Q C & -C^T Q \\ -Q C & Q \end{bmatrix}$.
Define the LQT Hamiltonian as
$\frac{1}{2} H(X_k, u_k) = \frac{1}{2} X_k^T \Pi X_k + \frac{1}{2} u_k^T R u_k + \gamma V(X_{k+1}) - V(X_k)$    (10)
By solving the stationary condition [40,42], i.e.,
$\frac{\partial H(X_k, u_k)}{\partial u_k} = 0$    (11)
we can find the optimal control input
$u_k = -K X_k = -K_x x_k - K_r r_k$    (12)
where $K = (R + \gamma B_1^T P B_1)^{-1}\gamma B_1^T P T$ and P satisfies the augmented algebraic Riccati equation (ARE):
$\Pi - P + \gamma T^T P T - \gamma^2 T^T P B_1\left(R + \gamma B_1^T P B_1\right)^{-1} B_1^T P T = 0$    (13)
Remark 2.
The augmented ARE (13) has a unique positive definite solution P if the pair $(A, \sqrt{Q}\,C)$ is observable and $\gamma^{1/2} F$ is stable [25]. Additionally, a lower bound has been established for the discount factor to ensure the stability of the augmented system [43].
A direct solution to (13) is challenging due to the nonlinear dependence on the unknown matrix P. Instead, we substitute (12) into (9) to obtain the augmented LQT Lyapunov equation:
$\Pi - P + K^T R K + \gamma (T - B_1 K)^T P (T - B_1 K) = 0$    (14)
To address this issue, an offline PI algorithm [5] has been proposed as an iterative approach to compute the solution to (14). However, it requires complete knowledge of the augmented system dynamics. To overcome this limitation, a Q-learning scheme [40] was developed to solve the model-free LQT problem.
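For readers who want a numerical baseline when the model is available, note that the discounted ARE (13) coincides with a standard discrete-time ARE for the scaled pair $(\sqrt{\gamma}\,T, \sqrt{\gamma}\,B_1)$ with weights $(\Pi, R)$, so it can be solved with off-the-shelf tools. The following Python sketch illustrates this equivalence; it is an illustrative aid with our own function and variable names, not part of the model-free design presented later.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqt_offline_gain(A, B, C, F, Q, R, gamma):
    """Offline baseline: solve the discounted augmented ARE (13) and return (P, K).

    Uses the fact that (13) is a standard DARE for the scaled pair
    (sqrt(gamma)*T, sqrt(gamma)*B1) with weights (Pi, R).
    """
    n, m = B.shape
    p = C.shape[0]
    # Augmented system (5): X_k = [x_k; r_k]
    T = np.block([[A, np.zeros((n, p))],
                  [np.zeros((p, n)), F]])
    B1 = np.vstack([B, np.zeros((p, m))])
    C1 = np.hstack([C, -np.eye(p)])
    Pi = C1.T @ Q @ C1                      # error weight in augmented coordinates
    sg = np.sqrt(gamma)
    P = solve_discrete_are(sg * T, sg * B1, Pi, R)
    # Optimal gain of (12): u_k = -K X_k
    K = np.linalg.solve(R + gamma * B1.T @ P @ B1, gamma * B1.T @ P @ T)
    return P, K
```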

2.2. Q-Function Bellman Equation

Let $Z_k = [X_k^T \; u_k^T]^T$; then, the discrete-time Q-function can be defined as follows:
$Q(Z_k) = \frac{1}{2} X_k^T \Pi X_k + \frac{1}{2} u_k^T R u_k + \gamma V(X_{k+1})$    (15)
By substituting the augmented system dynamics (5) into (15), we obtain:
$Q(Z_k) = \frac{1}{2} Z_k^T \tilde{H} Z_k$    (16)
where
$\tilde{H} = \begin{bmatrix} \Pi + \gamma T^T P T & \gamma T^T P B_1 \\ \gamma B_1^T P T & R + \gamma B_1^T P B_1 \end{bmatrix} \triangleq \begin{bmatrix} \tilde{H}_{XX} & \tilde{H}_{Xu} \\ \tilde{H}_{uX} & \tilde{H}_{uu} \end{bmatrix}$    (17)
and $\tilde{H} = \tilde{H}^T$ is a kernel matrix.
By applying $\partial Q(Z_k)/\partial u_k = 0$, we can solve for $u_k$ as follows:
$u_k = -(\tilde{H}_{uu})^{-1}\tilde{H}_{uX} X_k$    (18)
Furthermore, noticing that $Q(Z_k) = V(X_k)$ leads to the Q-function Bellman equation:
$Z_k^T \tilde{H} Z_k = X_k^T \Pi X_k + u_k^T R u_k + \gamma Z_{k+1}^T \tilde{H} Z_{k+1}$    (19)
This equation expresses the connection between the Q-function and the kernel matrix H ˜ .
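The usefulness of (17)–(19) is that, once $\tilde{H}$ has been identified from data, the gain in (18) is read directly off its blocks, with no need for T, $B_1$, or P. The small sketch below makes this block bookkeeping explicit (the helper names are ours, and the last m rows/columns are assumed to correspond to $u_k$); the model-based constructor is included only for checking against a known model.

```python
import numpy as np

def kernel_from_model(T, B1, Pi, R, P, gamma):
    """Assemble the Q-function kernel of (17) from model data (for verification only)."""
    HXX = Pi + gamma * T.T @ P @ T
    HXu = gamma * T.T @ P @ B1
    Huu = R + gamma * B1.T @ P @ B1
    return np.block([[HXX, HXu], [HXu.T, Huu]])

def gain_from_kernel(H_tilde, m):
    """Read the feedback gain of (18) from the identified kernel: u_k = -K X_k."""
    Huu = H_tilde[-m:, -m:]
    HuX = H_tilde[-m:, :-m]
    return np.linalg.solve(Huu, HuX)
```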

2.3. PI-Based Q-Learning for LQT

Based on the Q-function Bellman equation (19), the PI-based Q-learning solution for the LQT problem can be implemented using Algorithm 1, without relying on the system dynamics [40].
Algorithm 1 PI Q-learning Algorithm for LQT.
Initialization:
Start with an admissible control policy $u_k^0$ and a corresponding initial kernel $\tilde{H}^0$.
Procedure:
1: Policy Evaluation: For $j = 0, 1, \ldots$, collect samples under $u_k^j$ to solve for $\tilde{H}^{j+1}$
using the Q-function Bellman equation:
$Z_k^T \tilde{H}^{j+1} Z_k = X_k^T \Pi X_k + (u_k^j)^T R u_k^j + \gamma Z_{k+1}^T \tilde{H}^{j+1} Z_{k+1}$
2: Policy Improvement: Compute the improved control policy as follows:
$u_k^{j+1} = -(\tilde{H}_{uu}^{j+1})^{-1}\tilde{H}_{uX}^{j+1} X_k$
3: Stopping Criterion: Stop the iteration if $\|\tilde{H}^{j+1} - \tilde{H}^j\| < \varepsilon$ for some specified
small positive number $\varepsilon$. Otherwise, let $j = j + 1$ and return to Step 1.
End Procedure
Algorithm 1 performs repeated iterations between policy evaluation and policy improvement until convergence. In contrast to the offline algorithm [40], Algorithm 1 conducts the policy improvement step using the learned kernel matrix H ˜ j + 1 . This allows finding the optimal policy even under completely unknown dynamic conditions.
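Because the unknown kernel enters the Bellman equation of Step 1 linearly, each policy-evaluation step can be carried out by ordinary least squares over a batch of measured transitions. The sketch below is a schematic rendering of that step (the `zbar` regressor and the data-handling conventions are our own assumptions); in the output feedback setting of Section 3, $Z_k$ is simply replaced by the data vector $z_k$ built from past inputs, outputs, and reference samples.

```python
import numpy as np

def zbar(Z):
    """Independent quadratic terms Z_i * Z_j (i <= j) of the outer product Z Z^T."""
    l = Z.size
    return np.array([Z[i] * Z[j] for i in range(l) for j in range(i, l)])

def pi_policy_evaluation(Z_list, Z_next_list, utilities, gamma):
    """One policy-evaluation step of Algorithm 1 posed as least squares.

    Since H^{j+1} appears on both sides of the Bellman equation, the regressor is
    the difference zbar(Z_k) - gamma * zbar(Z_{k+1}); utilities[k] holds the measured
    stage cost X_k' Pi X_k + (u_k^j)' R u_k^j along the trajectory.
    """
    Phi = np.stack([zbar(zk) - gamma * zbar(zk1)
                    for zk, zk1 in zip(Z_list, Z_next_list)])
    theta, *_ = np.linalg.lstsq(Phi, np.asarray(utilities), rcond=None)
    return theta   # stacked upper-triangular entries of H^{j+1} (off-diagonals scaled by 2)
```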

3. Methods

3.1. Multistep Q-Learning

Lemma 1.
[24] When the pair (A, C) of the DTL system (1) is observable, the state $X_k$ of the augmented system can be reconstructed from the past input, output, and reference signal trajectories:
$X_k = \begin{bmatrix} M_u & M_y & M_r \end{bmatrix}\begin{bmatrix} \bar{u}_{k-1,k-N} \\ \bar{y}_{k-1,k-N} \\ r_{k-N} \end{bmatrix}$    (20)
where $\bar{u}_{k-1,k-N} = [u_{k-1}^T, u_{k-2}^T, \ldots, u_{k-N}^T]^T$ and $\bar{y}_{k-1,k-N} = [y_{k-1}^T, y_{k-2}^T, \ldots, y_{k-N}^T]^T$, with $N \ge n$, are the sequences of input and output signals over the time interval $[k-N, k-1]$, respectively, and
$M_u = \begin{bmatrix} U_N - A^N W_N^+ D_N \\ 0 \end{bmatrix}, \quad M_y = \begin{bmatrix} A^N W_N^+ \\ 0 \end{bmatrix}, \quad M_r = \begin{bmatrix} 0 \\ F^N \end{bmatrix}$
$U_N = \begin{bmatrix} B & AB & A^2 B & \cdots & A^{N-1} B \end{bmatrix}$
$W_N = \begin{bmatrix} (C A^{N-1})^T & (C A^{N-2})^T & \cdots & (C A)^T & C^T \end{bmatrix}^T$
$D_N = \begin{bmatrix} 0 & CB & CAB & \cdots & C A^{N-2} B \\ 0 & 0 & CB & \cdots & C A^{N-3} B \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & CB \\ 0 & 0 & 0 & \cdots & 0 \end{bmatrix}$
$W_N^+ = (W_N^T W_N)^{-1} W_N^T$
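To make Lemma 1 concrete, the sketch below assembles $M_u$, $M_y$, and $M_r$ directly from $(A, B, C, F)$. This model-based construction is useful only for verifying the reconstruction offline, since the learning algorithm itself never forms these matrices explicitly; the helper name and layout are our own, assuming $N \ge n$ and an observable pair $(A, C)$.

```python
import numpy as np

def reconstruction_matrices(A, B, C, F, N):
    """Assemble M_u, M_y, M_r of Lemma 1 (model-based, for offline verification)."""
    n, m = B.shape
    p = C.shape[0]
    # U_N = [B, AB, ..., A^{N-1}B]
    U = np.hstack([np.linalg.matrix_power(A, i) @ B for i in range(N)])
    # W_N = [(CA^{N-1})^T, (CA^{N-2})^T, ..., C^T]^T
    W = np.vstack([C @ np.linalg.matrix_power(A, N - 1 - i) for i in range(N)])
    # Strictly upper block-Toeplitz matrix D_N of Markov parameters
    D = np.zeros((N * p, N * m))
    for i in range(N):
        for j in range(i + 1, N):
            D[i * p:(i + 1) * p, j * m:(j + 1) * m] = \
                C @ np.linalg.matrix_power(A, j - i - 1) @ B
    Wp = np.linalg.pinv(W)                      # left inverse W_N^+ = (W'W)^{-1} W'
    An = np.linalg.matrix_power(A, N)
    Mu = np.vstack([U - An @ Wp @ D, np.zeros((p, N * m))])
    My = np.vstack([An @ Wp, np.zeros((p, N * p))])
    Mr = np.vstack([np.zeros((n, F.shape[0])), np.linalg.matrix_power(F, N)])
    return Mu, My, Mr
```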
Lemma 1 states that the Q-function Bellman equation (19) can be transformed by using the past input, output, and reference trajectory sequences. By substituting Equation (20) into Equation (16), we obtain
$Q(Z_k) = \frac{1}{2} Z_k^T \tilde{H} Z_k = \frac{1}{2}\begin{bmatrix} \bar{u}_{k-1,k-N} \\ \bar{y}_{k-1,k-N} \\ r_{k-N} \\ u_k \end{bmatrix}^T H \begin{bmatrix} \bar{u}_{k-1,k-N} \\ \bar{y}_{k-1,k-N} \\ r_{k-N} \\ u_k \end{bmatrix} \triangleq \frac{1}{2} z_k^T H z_k$    (21)
where
$z_k = \begin{bmatrix} \bar{u}_{k-1,k-N}^T & \bar{y}_{k-1,k-N}^T & r_{k-N}^T & u_k^T \end{bmatrix}^T$
$H = H^T = \begin{bmatrix} H_{\bar{u}\bar{u}} & H_{\bar{u}\bar{y}} & H_{\bar{u}r} & H_{\bar{u}u} \\ H_{\bar{y}\bar{u}} & H_{\bar{y}\bar{y}} & H_{\bar{y}r} & H_{\bar{y}u} \\ H_{r\bar{u}} & H_{r\bar{y}} & H_{rr} & H_{ru} \\ H_{u\bar{u}} & H_{u\bar{y}} & H_{ur} & H_{uu} \end{bmatrix}$
$H_{\bar{u}\bar{u}} = M_u^T(\Pi + \gamma T^T P T)M_u = M_u^T \tilde{H}_{XX} M_u, \quad H_{\bar{u}\bar{y}} = M_u^T(\Pi + \gamma T^T P T)M_y = M_u^T \tilde{H}_{XX} M_y,$
$H_{\bar{u}r} = M_u^T(\Pi + \gamma T^T P T)M_r = M_u^T \tilde{H}_{XX} M_r, \quad H_{\bar{u}u} = \gamma M_u^T T^T P B_1 = M_u^T \tilde{H}_{Xu},$
$H_{\bar{y}\bar{y}} = M_y^T(\Pi + \gamma T^T P T)M_y = M_y^T \tilde{H}_{XX} M_y, \quad H_{\bar{y}r} = M_y^T(\Pi + \gamma T^T P T)M_r = M_y^T \tilde{H}_{XX} M_r,$
$H_{\bar{y}u} = \gamma M_y^T T^T P B_1 = M_y^T \tilde{H}_{Xu}, \quad H_{rr} = M_r^T(\Pi + \gamma T^T P T)M_r = M_r^T \tilde{H}_{XX} M_r,$
$H_{ru} = \gamma M_r^T T^T P B_1 = M_r^T \tilde{H}_{Xu}, \quad H_{uu} = R + \gamma B_1^T P B_1 = \tilde{H}_{uu}$
According to the principle of optimality, the optimal control policy should satisfy $\partial Q(z_k)/\partial u_k = 0$. Solving for $u_k$ from Equation (21) yields the optimal control policy $u_k^*$ as
$u_k^* = -H_{uu}^{-1}\left(H_{u\bar{u}}\bar{u}_{k-1,k-N} + H_{u\bar{y}}\bar{y}_{k-1,k-N} + H_{ur} r_{k-N}\right) = -H_{uu}^{-1}\begin{bmatrix} H_{u\bar{u}} & H_{u\bar{y}} & H_{ur}\end{bmatrix}\begin{bmatrix} \bar{u}_{k-1,k-N} \\ \bar{y}_{k-1,k-N} \\ r_{k-N}\end{bmatrix} = -K^*\begin{bmatrix} \bar{u}_{k-1,k-N} \\ \bar{y}_{k-1,k-N} \\ r_{k-N}\end{bmatrix}$    (22)
where $K^* = H_{uu}^{-1}\begin{bmatrix} H_{u\bar{u}} & H_{u\bar{y}} & H_{ur}\end{bmatrix}$.
By substituting Equation (21) into Equation (19), with the utility function $r(\tau_k, u_k) = \tau_k^T \Gamma \tau_k + u_k^T R u_k$, we obtain the Q-function Bellman equation incorporating the input, output, and reference trajectory sequences, which is expressed as follows:
$z_k^T H z_k = r(\tau_k, u_k) + \gamma z_{k+1}^T H z_{k+1}$    (23)
where $\tau_k = [y_k^T \; r_k^T]^T$ and $\Gamma = \begin{bmatrix} Q & -Q \\ -Q & Q \end{bmatrix}$.
Define the optimal value function $V^*(z_k) \triangleq V_{u^*}(z_k)$. According to optimal control theory [1], $V^*(z_k)$ satisfies the following Bellman equation:
$V^*(z_k) = \min_{u(z)}\left\{ r(\tau_k, u_k) + \gamma V^*(z_{k+1}) \right\}$    (24)
and the optimal control is
$u_k^* = \arg\min_{u(z)}\left\{ r(\tau_k, u_k) + \gamma V^*(z_{k+1}) \right\}$    (25)
It is known that the policy evaluation step in the VI scheme is expressed as follows [23]:
$z_k^T H^{j+1} z_k = r(\tau_k, u_k^j) + \gamma z_{k+1}^T H^j z_{k+1}$    (26)
Transforming Equation (26) yields
$z_k^T H^{j+1} z_k = r(\tau_k, u_k^j) + \gamma z_{k+1}^T H^j z_{k+1}$
$\quad = r(\tau_k, u_k^j) + \gamma r(\tau_{k+1}, u_{k+1}^j) + \gamma^2 z_{k+2}^T H^j z_{k+2}$
$\quad = r(\tau_k, u_k^j) + \gamma r(\tau_{k+1}, u_{k+1}^j) + \gamma^2 r(\tau_{k+2}, u_{k+2}^j) + \gamma^3 z_{k+3}^T H^j z_{k+3}$
$\quad = \cdots = \sum_{i=k}^{k+N_j-1}\gamma^{i-k} r(\tau_i, u_i^j) + \gamma^{N_j} z_{k+N_j}^T H^j z_{k+N_j}$    (27)
Thus, the convergence of the VI method [23] can be accelerated by introducing a multistep utility function in the policy evaluation. The resulting multistep Q-learning VI algorithm based on output feedback is described as follows:
  • Step 1. Initialization: Set j = 0 and start from an arbitrary initial control policy $u_k^0$, which does not need to be stabilizing, together with an initial $H^0$.
  • Step 2. Multistep policy evaluation: Use the Q-function Bellman equation to solve for $H^{j+1}$, where
    $z_k^T H^{j+1} z_k = \sum_{i=k}^{k+N_j-1}\gamma^{i-k} r(\tau_i, u_i^j) + \gamma^{N_j} z_{k+N_j}^T H^j z_{k+N_j}$    (28)
  • Step 3. Policy improvement: Update the control policy $u_k^{j+1}$ as follows:
    $u_k^{j+1} = -\left(H_{uu}^{j+1}\right)^{-1}\left(H_{u\bar{u}}^{j+1}\bar{u}_{k-1,k-N} + H_{u\bar{y}}^{j+1}\bar{y}_{k-1,k-N} + H_{ur}^{j+1} r_{k-N}\right)$    (29)
  • Step 4. Termination condition: Check whether $\|H^{j+1} - H^j\| \le l$, where l is a small threshold set according to the required algorithmic accuracy. If this condition is satisfied, terminate the iteration and output the control policy $u_k^{j+1}$. Otherwise, set j = j + 1 and return to Step 2.
Remark 3.
As indicated in Equation (28), the value function obtained in each policy evaluation step is the sum of a finite, multistep accumulation of discounted utilities and the previous value function. When $N_j = 1$, Equation (28) simplifies to the one-step policy evaluation used in the VI framework [23,28]. This difference in the policy evaluation is what distinguishes the proposed scheme from traditional value iteration: instead of the one-step backup of value iteration, the proposed multistep Q-learning VI algorithm takes advantage of a finite-sum utility function. Consequently, the proposed algorithm improves the learning convergence speed, as demonstrated in the simulation example.
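A compact, schematic rendering of Steps 1–4 is given below for concreteness. It is a sketch under our own simplifying assumptions, not the authors' implementation: the plant interaction is hidden behind a user-supplied `collect` routine that runs the current controller (plus probing noise) and returns the data vectors $z_k$ and the measured stage costs, and `zbar`/`unpack` handle the parameterization used later in Section 3.3.

```python
import numpy as np

def zbar(z):
    """Quadratic regressor: independent products z_i * z_j, i <= j (cf. Section 3.3)."""
    l = z.size
    return np.array([z[i] * z[j] for i in range(l) for j in range(i, l)])

def unpack(theta, l):
    """Rebuild the symmetric kernel H from the parameter vector of (31);
    off-diagonal entries of the parameter vector carry a factor of 2."""
    H = np.zeros((l, l))
    idx = 0
    for i in range(l):
        for j in range(i, l):
            H[i, j] = H[j, i] = theta[idx] if i == j else theta[idx] / 2.0
            idx += 1
    return H

def multistep_q_learning(collect, l, m, gamma, beta, L, max_iters=30, tol=1e-6):
    """Schematic output feedback multistep Q-learning loop (Steps 1-4).

    collect(K, horizon) is a user-supplied routine that runs the plant under
    u_k = -K [u_bar; y_bar; r_{k-N}] plus probing noise and returns
    (z, r): the data vectors z_k and the measured stage costs r(tau_k, u_k).
    """
    H = np.zeros((l, l))                       # Step 1: no stabilizing initial policy needed
    K = np.zeros((m, l - m))
    for j in range(max_iters):
        Nj = 1 + int(np.floor(beta * j))       # step-size rule (30)
        z, r = collect(K, L + Nj)              # data collected under the current policy
        Phi = np.stack([zbar(z[k]) for k in range(L)])
        rhs = np.array([sum(gamma ** (i - k) * r[i] for i in range(k, k + Nj))
                        + gamma ** Nj * z[k + Nj] @ H @ z[k + Nj]
                        for k in range(L)])
        theta, *_ = np.linalg.lstsq(Phi, rhs, rcond=None)      # Step 2: multistep evaluation (28)
        H_new = unpack(theta, l)
        K = np.linalg.solve(H_new[-m:, -m:], H_new[-m:, :-m])  # Step 3: policy improvement (29)
        if np.linalg.norm(H_new - H) < tol:                    # Step 4: termination check
            return H_new, K
        H = H_new
    return H, K
```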

3.2. Adjustment Rules for Step Size N j

During each iteration, the step size of the multistep policy evaluation (28) is adjusted. The value iteration algorithm [23] employs a one-step policy evaluation to eliminate the requirement for an initial stabilizing control policy, whereas the policy iteration algorithm uses an infinite-step policy evaluation to speed up convergence [39]. The convergence speed of the multistep Q-learning algorithm, however, depends on the chosen step size. Initially, a small step size is used to avoid the need for an initial stabilizing control policy; the step size is then gradually increased to accelerate convergence. To adaptively adjust the step length, we use the following rule [41]:
$N_j = 1 + \lfloor \beta j \rfloor$    (30)
where $\beta \ge 0$ and $\lfloor \cdot \rfloor$ denotes rounding down. When $\beta = 0$, we have $N_j = 1$ and the scheme is equivalent to the VI method with one-step policy evaluation [23].

3.3. Implementation

By using the least squares method, the linear parametric expression of $z_k^T H^{j+1} z_k$ is given as follows:
$z_k^T H^{j+1} z_k = (\bar{H}^{j+1})^T \bar{z}(k)$    (31)
where
$\bar{H}^{j+1} = \mathrm{vec}(H^{j+1}) = \left[H_{11}^{j+1},\ 2H_{12}^{j+1},\ \ldots,\ 2H_{1l}^{j+1},\ H_{22}^{j+1},\ \ldots,\ 2H_{2l}^{j+1},\ \ldots,\ H_{ll}^{j+1}\right]^T \in \mathbb{R}^{l(l+1)/2}$
Here, $H_{ik}^{j+1}$ represents the element in the i-th row and k-th column of the matrix $H^{j+1}$, where $i, k = 1, 2, \ldots, l$ and $l = mN + pN + m$. The Kronecker product $\bar{z}(k) = z_k \otimes z_k$, with redundant terms removed, is defined as $\left[z_1^2,\ z_1 z_2,\ \ldots,\ z_1 z_l,\ z_2^2,\ z_2 z_3,\ \ldots,\ z_2 z_l,\ \ldots,\ z_l^2\right]^T \in \mathbb{R}^{l(l+1)/2}$.
Combining Equation (31) with Equation (28), the multistep policy evaluation can be simplified as
$(\bar{H}^{j+1})^T \bar{z}(k) = \sum_{i=k}^{k+N_j-1}\gamma^{i-k} r(\tau_i, u_i^j) + \gamma^{N_j} (\bar{H}^j)^T \bar{z}(k+N_j)$    (32)
The symmetric matrix $H^{j+1}$ has dimensions $l \times l$, resulting in a total of $l(l+1)/2$ independent elements. Consequently, solving Equation (32) requires collecting at least $L \ge l(l+1)/2$ data vectors $\bar{z}(k)$.
The least squares expression for Equation (28) is
$\bar{H}^{j+1} = \left((\Phi^j)^T \Phi^j\right)^{-1}(\Phi^j)^T\left(\Upsilon^j + \gamma^{N_j}\Psi^j \bar{H}^j\right)$    (33)
where
$\Phi^j = \begin{bmatrix} \bar{z}(k)^T \\ \bar{z}(k+1)^T \\ \vdots \\ \bar{z}(k+L-1)^T \end{bmatrix} \in \mathbb{R}^{L \times l(l+1)/2}$
$\Upsilon^j = \begin{bmatrix} \sum_{i=k}^{k+N_j-1}\gamma^{i-k} r(\tau_i, u_i^j) \\ \sum_{i=k+1}^{k+N_j}\gamma^{i-(k+1)} r(\tau_i, u_i^j) \\ \vdots \\ \sum_{i=k+L-1}^{k+N_j+L-2}\gamma^{i-(k+L-1)} r(\tau_i, u_i^j) \end{bmatrix} \in \mathbb{R}^{L \times 1}$
$\Psi^j = \begin{bmatrix} \bar{z}(k+N_j)^T \\ \bar{z}(k+N_j+1)^T \\ \vdots \\ \bar{z}(k+N_j+L-1)^T \end{bmatrix} \in \mathbb{R}^{L \times l(l+1)/2}$
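The construction of (33) can be written almost verbatim in code. The sketch below is a minimal batch update under the same conventions as the loop shown after Remark 3 (plain products $z_i z_j$ in $\bar{z}(k)$, with the factors of 2 carried by $\bar{H}$); the variable names are ours.

```python
import numpy as np

def zbar(z):
    """Regression vector z_bar(k): independent products z_i z_j, i <= j."""
    l = z.size
    return np.array([z[i] * z[j] for i in range(l) for j in range(i, l)])

def multistep_ls_update(z, r, H_bar_j, gamma, Nj, L):
    """Batch least-squares update (33) for H_bar^{j+1}.

    z : sequence of data vectors z_k (length >= L + Nj)
    r : sequence of measured stage costs r(tau_k, u_k^j) along the same trajectory
    H_bar_j : previous parameter vector (the vectorized H^j)
    """
    Phi = np.stack([zbar(z[k]) for k in range(L)])                    # Phi^j
    Ups = np.array([sum(gamma ** (i - k) * r[i] for i in range(k, k + Nj))
                    for k in range(L)])                               # Upsilon^j
    Psi = np.stack([zbar(z[k + Nj]) for k in range(L)])               # Psi^j
    rhs = Ups + gamma ** Nj * (Psi @ H_bar_j)
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ rhs)                  # normal equations of (33)
```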
Remark 4.
When $k < N$, the input–output data $\bar{u}_{k-1,k-N}$ and $\bar{y}_{k-1,k-N}$ are unavailable. To address this issue, the internal model principle can be utilized to collect the missing data. Additionally, the internal model principle allows for asymptotic tracking control in the presence of small variations in system parameters, resulting in data that contain more intrinsic information for learning the optimal control solution.
Remark 5.
In Equation (33), the vector $\bar{H}^{j+1}$ is the estimate of $H^{j+1}$ obtained under the current control policy. From the components of $\bar{H}^{j+1}$, we can recover the corresponding entries of the matrix $H^{j+1}$ and use them, together with the policy update step (29), to compute the next control policy. This updated policy is then used to gather a new set of data for the next iteration, until an optimal control policy is obtained. To ensure the uniqueness of the solution of Equation (33), a persistent excitation condition is imposed, as in the literature [20,21]: probing noise $w_k$ is added to the control input so that $\Phi^j$ has full column rank and $(\Phi^j)^T \Phi^j$ is invertible. However, using the VFA method may introduce bias into the optimal solution [23,25]. In contrast, the Q-learning approach does not produce bias during parameter estimation and, hence, does not bias the resulting optimal solution.

3.4. Convergence Analysis

In reference [41], a multistep Q-learning algorithm based on state feedback is proposed for solving the optimal output regulation problem of DTL systems, and the convergence of the proposed algorithm is derived. This paper investigates a multistep Q-learning algorithm based on output feedback to solve the optimal output tracking control problem of discrete-time linear systems. Unlike the optimal output regulation problem studied in reference [41], this paper introduces a discount factor γ into the performance index function of the optimal output tracking control problem, resulting in changes to the corresponding Bellman equation. The system state is reconstructed using input, output, and reference signals. Therefore, it is necessary to verify the convergence of the output feedback multistep Q-learning algorithm for solving the optimal output tracking control problem of DTL systems.
The convergence of the algorithm presented in Section 3.1 can be proven by first noticing that Equation (28) can be rewritten as follows:
$Q^{j+1}(z_k) = \frac{1}{2}\sum_{i=k}^{k+N_j-1}\gamma^{i-k} r(\tau_i, u_i^j) + \gamma^{N_j} Q^j(z_{k+N_j})$    (34)
We are now ready to state the following theorem, which indicates that the kernel matrix $H^{j+1}$ converges to the optimal value $H^*$.
Theorem 1.
Let $\{Q^j(z_k)\}$, where $Q^j(z_k) = \frac{1}{2} z_k^T H^j z_k$, be the sequence generated by the multistep Q-learning algorithm. If $N_j \ge 1$ and
$Q^0(z_k) \ge \min_{u(z)}\left\{\frac{1}{2} r(\tau_k, u_k) + \gamma Q^0(z_{k+1})\right\}$    (35)
holds, then
(i)
For any j,
$Q^{j+1}(z_k) \le \min_{u(z)}\left\{\frac{1}{2} r(\tau_k, u_k) + \gamma Q^j(z_{k+1})\right\} \le Q^j(z_k)$    (36)
holds;
(ii)
$\lim_{j\to\infty} Q^j(z_k) = Q^*(z_k)$, where $Q^*(z_k)$ is the optimal solution to the Q-function Bellman equation.
Proof .
(i) We will use mathematical induction to prove the result (36). From Equations (34) and (35), we have
$Q^1(z_k) = \frac{1}{2}\sum_{i=k}^{k+N_0-1}\gamma^{i-k} r(\tau_i, u_i^0) + \gamma^{N_0} Q^0(z_{k+N_0})$
$\quad = \frac{1}{2}\sum_{i=k}^{k+N_0-2}\gamma^{i-k} r(\tau_i, u_i^0) + \frac{1}{2}\gamma^{N_0-1} r(\tau_{k+N_0-1}, u_{k+N_0-1}^0) + \gamma^{N_0} Q^0(z_{k+N_0})$
$\quad = \frac{1}{2}\sum_{i=k}^{k+N_0-2}\gamma^{i-k} r(\tau_i, u_i^0) + \gamma^{N_0-1}\min_{u(z)}\left\{\frac{1}{2} r(\tau_{k+N_0-1}, u_{k+N_0-1}) + \gamma Q^0(z_{k+N_0})\right\}$
$\quad \le \frac{1}{2}\sum_{i=k}^{k+N_0-2}\gamma^{i-k} r(\tau_i, u_i^0) + \gamma^{N_0-1} Q^0(z_{k+N_0-1})$
$\quad \le \cdots \le \frac{1}{2} r(\tau_k, u_k^0) + \gamma Q^0(z_{k+1}) = \min_{u(z)}\left\{\frac{1}{2} r(\tau_k, u_k) + \gamma Q^0(z_{k+1})\right\} \le Q^0(z_k)$
which means that Equation (36) holds for j = 0 .
Next, assume that Equation (36) is satisfied for $j - 1$, i.e.,
$Q^j(z_k) \le \min_{u(z)}\left\{\frac{1}{2} r(\tau_k, u_k) + \gamma Q^{j-1}(z_{k+1})\right\} \le Q^{j-1}(z_k)$    (37)
Then,
$Q^j(z_k) = \frac{1}{2}\sum_{i=k}^{k+N_{j-1}-1}\gamma^{i-k} r(\tau_i, u_i^{j-1}) + \gamma^{N_{j-1}} Q^{j-1}(z_{k+N_{j-1}})$
$\quad \ge \frac{1}{2}\sum_{i=k}^{k+N_{j-1}-1}\gamma^{i-k} r(\tau_i, u_i^{j-1}) + \gamma^{N_{j-1}}\min_{u(z)}\left\{\frac{1}{2} r(\tau_{k+N_{j-1}}, u_{k+N_{j-1}}) + \gamma Q^{j-1}(z_{k+N_{j-1}+1})\right\}$
$\quad = \frac{1}{2} r(\tau_k, u_k^{j-1}) + \frac{1}{2}\sum_{i=k+1}^{k+N_{j-1}}\gamma^{i-k} r(\tau_i, u_i^{j-1}) + \gamma^{N_{j-1}+1} Q^{j-1}(z_{k+N_{j-1}+1})$
$\quad = \frac{1}{2} r(\tau_k, u_k^{j-1}) + \gamma\left[\frac{1}{2}\sum_{i=k+1}^{k+N_{j-1}}\gamma^{i-k-1} r(\tau_i, u_i^{j-1}) + \gamma^{N_{j-1}} Q^{j-1}(z_{k+N_{j-1}+1})\right]$
$\quad = \frac{1}{2} r(\tau_k, u_k^{j-1}) + \gamma Q^j(z_{k+1}) \ge \min_{u(z)}\left\{\frac{1}{2} r(\tau_k, u_k) + \gamma Q^j(z_{k+1})\right\}$    (38)
Using Equations (34) and (38), we have
$Q^{j+1}(z_k) = \frac{1}{2}\sum_{i=k}^{k+N_j-1}\gamma^{i-k} r(\tau_i, u_i^j) + \gamma^{N_j} Q^j(z_{k+N_j})$
$\quad = \frac{1}{2}\sum_{i=k}^{k+N_j-2}\gamma^{i-k} r(\tau_i, u_i^j) + \gamma^{N_j-1}\min_{u(z)}\left\{\frac{1}{2} r(\tau_{k+N_j-1}, u_{k+N_j-1}) + \gamma Q^j(z_{k+N_j})\right\}$
$\quad \le \frac{1}{2}\sum_{i=k}^{k+N_j-2}\gamma^{i-k} r(\tau_i, u_i^j) + \gamma^{N_j-1} Q^j(z_{k+N_j-1})$
$\quad \le \cdots \le \frac{1}{2} r(\tau_k, u_k^j) + \gamma Q^j(z_{k+1}) = \min_{u(z)}\left\{\frac{1}{2} r(\tau_k, u_k) + \gamma Q^j(z_{k+1})\right\}$    (39)
Thus, Equation (36) holds for all j.
(ii)
According to the conclusion (36), $\{Q^j(z_k)\}$ is a monotonically non-increasing sequence with the lower bound $Q^j(z_k) \ge 0$. A bounded monotone sequence always has a limit, denoted by $Q^{\infty}(z_k) \triangleq \lim_{j\to\infty} Q^j(z_k)$. Taking the limit of Equation (36) gives
$Q^{\infty}(z_k) \le \min_{u(z)}\left\{\frac{1}{2} r(\tau_k, u_k) + \gamma Q^{\infty}(z_{k+1})\right\} \le Q^{\infty}(z_k)$    (40)
Hence, we have
$Q^{\infty}(z_k) = \min_{u(z)}\left\{\frac{1}{2} r(\tau_k, u_k) + \gamma Q^{\infty}(z_{k+1})\right\}$    (41)
Notice that Equation (41) is exactly the Q-function Bellman equation (23), whose solution is the optimal Q-function. As a result, $Q^{\infty}(z_k) = Q^*(z_k)$.
 □
Remark 6.
It is worth noting the choice of the initial value function Q 0 in multistep Q-learning algorithms. According to Theorem 1, Q 0 must satisfy Condition (35). However, (35) is only a sufficient condition, not a necessary one. Therefore, in practical systems, Q 0 can be a positive definite function over a large range and can be chosen through trial and error.

4. Simulation Experiment

4.1. Controlled Object

To validate the proposed algorithm, a simulation experiment was conducted on a single-phase voltage source uninterruptible power supply (UPS) inverter, an essential component of the smart grid. With the development of new energy technologies, it is important for the control engineer to design a controller that makes the UPS provide efficient and stable sinusoidal output voltages with optimal performance, even in the presence of unknown loads. The circuit diagram of the single-phase voltage source UPS inverter is illustrated in Figure 1.
The dynamic equations of the inverter can be expressed as follows:
$C_f \frac{d v_o}{dt} = i_L - i_o, \qquad L_f \frac{d i_L}{dt} + r i_L = u V_s - v_o$    (42)
where $L_f$ is the filter inductance; $C_f$ is the filter capacitance; r is the inductance resistance; $i_L$ is the filter inductance current; $v_o$ is the output voltage of the inverter; $u V_s$ is the output voltage of the pulse width modulation (PWM) inverter bridge; $i_o = v_o / R_o$ is the output current; and $R_o$ is the load resistance.
Choosing $v_o$ and $i_L$ as the system state variables, $u V_s$ as the system input, and $v_o$ as the system output, we can obtain the state space representation of the single-phase voltage source UPS inverter as follows:
$\dot{x} = \bar{A} x + \bar{B} u = \begin{bmatrix} -\frac{1}{R_o C_f} & \frac{1}{C_f} \\ -\frac{1}{L_f} & -\frac{r}{L_f} \end{bmatrix} x + \begin{bmatrix} 0 \\ \frac{1}{L_f} \end{bmatrix} u, \qquad y = \bar{C} x = \begin{bmatrix} 1 & 0 \end{bmatrix} x$    (43)
In the above equations, the inverter model parameters are as follows: $L_f = 3.56\ \mathrm{mH}$, $C_f = 9.92\ \mu\mathrm{F}$, $r = 0.4\ \Omega$, and $R_o = 50\ \Omega$. The initial values for the capacitor voltage and inductor current are set as $x_0 = [0, 0]^T$.
By discretizing Equation (43), the state space representation of the discrete-time system can be obtained as follows:
$x_{k+1} = A x_k + B u_k, \qquad y_k = C x_k$    (44)
where $A = e^{\bar{A} T}$, $B = \int_0^T e^{\bar{A}\tau}\,d\tau\,\bar{B}$, and $C = \bar{C}$. The sampling interval is $T = 10^{-4}\ \mathrm{s}$. Substituting the inverter model parameters into the equations, we have
$A = \begin{bmatrix} 0.6969 & 8.6545 \\ -0.0241 & 0.8603 \end{bmatrix}, \quad B = \begin{bmatrix} 0.1290 \\ 0.0267 \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 0 \end{bmatrix}$    (45)
A sinusoidal signal is a typical type of reference signal in power electronics control. The state space representation of a continuous-time system generating a sinusoidal signal of magnitude $220\sqrt{2}\ \mathrm{V}$ and frequency $f = 50\ \mathrm{Hz}$ is given by
$\dot{x}_d = \begin{bmatrix} 0 & 2\pi f \\ -2\pi f & 0 \end{bmatrix} x_d = \begin{bmatrix} 0 & 100\pi \\ -100\pi & 0 \end{bmatrix} x_d, \qquad r_d = \begin{bmatrix} 1 & 0 \end{bmatrix} x_d$    (46)
where the initial state is $x_d(0) = [0, 1]^T$. The state space expression for the DTL system corresponding to Equation (46) can be described in the form shown in Equation (2), where
$F = \begin{bmatrix} 0.9995 & 0.0314 \\ -0.0314 & 0.9995 \end{bmatrix}$    (47)
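The discretization in (44)–(47) can be reproduced with standard tools. The sketch below uses SciPy and the parameter values stated above; the sign convention for the continuous-time exosystem matrix `S` is our assumption (chosen so that the first state generates a sine wave from the given initial condition), and the resulting matrices should agree with (45) and (47) up to rounding.

```python
import numpy as np
from scipy.linalg import expm
from scipy.signal import cont2discrete

# Inverter parameters as given in the text
Lf, Cf, r, Ro, T = 3.56e-3, 9.92e-6, 0.4, 50.0, 1e-4

A_bar = np.array([[-1.0 / (Ro * Cf), 1.0 / Cf],
                  [-1.0 / Lf,        -r / Lf]])
B_bar = np.array([[0.0], [1.0 / Lf]])
C_bar = np.array([[1.0, 0.0]])

# Zero-order-hold discretization: A = e^{A_bar T}, B = integral_0^T e^{A_bar tau} dtau B_bar
A, B, C, D, _ = cont2discrete((A_bar, B_bar, C_bar, np.zeros((1, 1))), T, method='zoh')

# Reference generator: F = e^{S T}, with S the 50 Hz harmonic-oscillator matrix
f = 50.0
S = np.array([[0.0, 2 * np.pi * f],
              [-2 * np.pi * f, 0.0]])
F = expm(S * T)
```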

4.2. Experiment

We select $Q = 0.1$, $R = 0.001$, and $\gamma = 0.009$ as the parameter values, and the step-size parameter in (30) is set to $\beta = 4$. The probing noise $w_k$ is defined as follows:
$w_k = 0.001\left(7\sin(k) + 5\cos(2k) + 9\sin(8k) + 2\cos(6k)\right)$    (48)
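For completeness, the excitation signal above is injected only during the learning phase. A one-line helper (our own naming, assuming, as in the reconstruction of (48) above, that the factor 0.001 scales the whole sum) reads:

```python
import numpy as np

def probing_noise(k):
    """Probing noise w_k added to the control input while data are being collected."""
    return 1e-3 * (7 * np.sin(k) + 5 * np.cos(2 * k)
                   + 9 * np.sin(8 * k) + 2 * np.cos(6 * k))
```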
The controlled system used in the simulation experiment has 28 independent variables in $\bar{H}^{j+1}$, as shown in Equation (33). Therefore, a minimum of 28 sets of data are required for each iteration. In this simulation, we collected 30 sets of data. The following are the simulation results obtained using MATLAB.
The tracking curve of the output feedback multistep Q-learning algorithm is depicted in Figure 2. In this figure, $r_d$ represents the sinusoidal reference signal and $y_o$ represents the actual output of the controlled system. After a certain number of iterations, the system output $y_o$ successfully tracks the reference signal $r_d$. The corresponding tracking error curve is illustrated in Figure 3. From this figure, it can be observed that the tracking error reaches zero at 0.012 s, achieving the tracking goal. Figure 4 illustrates the norm of the difference between the Q-function matrices after two consecutive iterations of the output feedback multistep Q-learning algorithm. The figure shows that $\|H^{j+1} - H^j\| < 10^{-6}$ at the third iteration, indicating that the desired control accuracy has been achieved and the algorithm has converged. The simulation for the one-step learning case ($N_j = 1$) has also been conducted for comparison. It can be observed from Figure 5 that $\|H^{j+1} - H^j\| \approx 4.12 \times 10^{-5}$ at the third iteration, which is larger than the value of $9.01 \times 10^{-7}$ in Figure 4. Therefore, the proposed multistep Q-learning scheme improves the learning convergence speed.
Figure 6 shows the variation in each component of the control gain K during each iteration of the output feedback multistep Q-learning algorithm. Through the iterations, it converges to
$K^3 = \begin{bmatrix} 0.03769 & 0.02188 & 0.1879 & 0.1472 & 0.1163 & 0.01107 \end{bmatrix}$
with
$H^3 = \begin{bmatrix} 0.001758 & 0.001603 & 0.0056 & 0.01079 & 0.01316 & 0.0008375 & 3.563\times 10^{-5} \\ 0.01603 & 0.001473 & 0.01897 & 0.009919 & 0.01215 & 0.00077 & 2.243\times 10^{-5} \\ 0.02056 & 0.01897 & 0.2449 & 0.1277 & 0.1569 & 0.009916 & 0.0001926 \\ 0.01079 & 0.00919 & 0.1277 & 0.06677 & 0.08181 & 0.005183 & 0.0001508 \\ 0.01316 & 0.01215 & 0.1569 & 0.8181 & 0.1005 & 0.006352 & 0.001192 \\ 0.008375 & 0.0077 & 0.009916 & 0.05183 & 0.006352 & 0.004024 & 1.135\times 10^{-5} \\ 3.863\times 10^{-5} & 2.243\times 10^{-5} & 0.001926 & 0.001508 & 0.001192 & 1.135\times 10^{-5} & 0.001025 \end{bmatrix}$
By solving the algebraic Riccati equation (13), the theoretical optimal values for the control gain K and the matrix H can be obtained as follows:
$K^* = \begin{bmatrix} 0.0377 & 0.0219 & 0.1879 & 0.1472 & 0.1163 & 0.0111 \end{bmatrix}$
and
$H^* = \begin{bmatrix} 0.001758 & 0.001603 & 0.02056 & 0.01079 & 0.01316 & 0.0008375 & 3.828\times 10^{-5} \\ 0.001603 & 0.001473 & 0.01897 & 0.009919 & 0.01215 & 0.00077 & 2.221\times 10^{-5} \\ 0.02056 & 0.01897 & 0.2449 & 0.1277 & 0.1569 & 0.009916 & 0.0001909 \\ 0.01079 & 0.009919 & 0.1277 & 0.06677 & 0.08181 & 0.005183 & 0.0001495 \\ 0.01316 & 0.01215 & 0.1569 & 0.08181 & 0.1005 & 0.006352 & 0.0001181 \\ 0.0008375 & 0.00077 & 0.009916 & 0.005183 & 0.006352 & 0.0004024 & 1.125\times 10^{-5} \\ 3.828\times 10^{-5} & 2.221\times 10^{-5} & 0.0001909 & 0.0001495 & 0.0001181 & 1.125\times 10^{-5} & 0.001016 \end{bmatrix}$
The Q-function kernel matrix $H^3$ obtained using the proposed learning algorithm is observed to be nearly equal to the theoretical optimal value $H^*$, indicating the effectiveness of the proposed output feedback multistep Q-learning algorithm for model-free tracking control.
Figure 7 shows the input signal trajectory of the actual tracking control system, which is a sinusoidal signal after the first four iterations. Figure 8 depicts the waveform of the excitation noise signal, which becomes zero at 0.0091 s, indicating the end of the algorithm learning phase without further noise excitation input. Figure 9 illustrates the tracking error of the system as the load resistance varies within the range $40\ \Omega \le R_o \le 60\ \Omega$. With the change in resistance values, the system maintains its ability to achieve asymptotic tracking, demonstrating the adaptive characteristic of the algorithm. Figure 10 shows the variation in the step size $N_j$ in the multistep Q-learning algorithm. It can be observed that $N_j$ gradually increases as the iterations proceed.

5. Conclusions

In this paper, we investigate a value iteration (VI)-based multistep Q-learning algorithm for the model-free optimal tracking controller design of unknown discrete-time linear (DTL) systems. By utilizing the augmented system approach, we transform this problem into a regulation problem with a discounted performance function, whose solution is characterized by a Q-function Bellman equation. To solve the Bellman equation, we employ the VI learning mechanism and develop a multistep Q-learning algorithm that eliminates the need for an initial admissible policy and only requires measurements of past input, output, and reference trajectory data. As a result, our proposed approach offers a solution that does not require state measurements and has an improved convergence speed. To validate the effectiveness of the proposed design, we demonstrate its application through a simulation example. Future work will involve extending the proposed multistep Q-learning scheme to unknown discrete-time systems with time delays and/or sampling errors. Additionally, it would be interesting to explore how to balance the computational demands of the algorithm with the available arithmetic power in practical experimental platforms.

Author Contributions

Conceptualization, W.S. and X.D.; methodology, W.S.; software, W.S. and X.S.; validation, X.D.; formal analysis, X.D.; investigation, X.D., Y.L. and X.W.; writing—original draft preparation, X.D.; writing—review and editing, W.S.; visualization, X.D., Y.L. and X.W.; supervision, W.S.; project administration, W.S.; funding acquisition, X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant numbers 62003141 and U20A20224), the Natural Science Foundation of Guangdong Province, China (grant number 2021A1515011598), and the Fundamental Research Funds for the Central Universities (grant number 2022ZYGXZR023).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lewis, F.L.; Vrabie, D.; Syrmos, V.L. Optimal Control, 3rd ed.; Wiley: Hoboken, NJ, USA, 2012.
2. Luo, R.; Peng, Z.; Hu, J. On model identification based optimal control and its applications to multi-agent learning and control. Mathematics 2023, 11, 906.
3. Chen, Y.H.; Chen, Y.Y. Trajectory tracking design for a swarm of autonomous mobile robots: A nonlinear adaptive optimal approach. Mathematics 2022, 10, 3901.
4. Banholzer, S.; Herty, M.; Pfenninger, S.; Zügner, S. Multiobjective model predictive control of a parabolic advection-diffusion-reaction equation. Mathematics 2020, 8, 777.
5. Hewer, G. An Iterative Technique for the Computation of the Steady State Gains for the Discrete Optimal Regulator. IEEE Trans. Autom. Control 1971, 16, 382–384.
6. Lancaster, P.; Rodman, L. Algebraic Riccati Equations; Oxford University Press: Oxford, UK, 1995.
7. Dai, S.; Wang, C.; Wang, M. Dynamic Learning From Adaptive Neural Network Control of a Class of Nonaffine Nonlinear Systems. IEEE Trans. Neural Netw. Learn. Syst. 2014, 25, 111–123.
8. He, W.; Dong, Y.; Sun, C. Adaptive Neural Impedance Control of a Robotic Manipulator With Input Saturation. IEEE Trans. Syst. Man Cybern. Syst. 2016, 46, 334–344.
9. Luy, N.T. Robust adaptive dynamic programming based online tracking control algorithm for real wheeled mobile robot with omni-directional vision system. Trans. Inst. Meas. Control 2017, 39, 832–847.
10. He, W.; Meng, T.; He, X.; Ge, S.S. Unified iterative learning control for flexible structures with input constraints. Automatica 2018, 96, 326–336.
11. Radac, M.B.; Precup, R.E. Data-driven model-free tracking reinforcement learning control with VRFT-based adaptive actor-critic. Appl. Sci. 2019, 9, 1807.
12. Wang, Z.; Liu, D. Data-Based Controllability and Observability Analysis of Linear Discrete-Time Systems. IEEE Trans. Neural Netw. 2011, 22, 2388–2392.
13. Sutton, R.S.; Barto, A.G. Reinforcement Learning; MIT Press: Cambridge, MA, USA, 1998.
14. Sutton, R.S.; Barto, A.G.; Williams, R.J. Reinforcement learning is direct adaptive optimal control. IEEE Control Syst. Mag. 1992, 12, 19–22.
15. Lewis, F.L.; Vrabie, D. Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits Syst. Mag. 2009, 9, 32–50.
16. Wang, F.; Zhang, H.; Liu, D. Adaptive Dynamic Programming: An Introduction. IEEE Comput. Intell. Mag. 2009, 4, 39–47.
17. Jiang, Z.P.; Jiang, Y. Robust adaptive dynamic programming for linear and nonlinear systems: An overview. Eur. J. Control 2013, 19, 417–425.
18. Zhang, K.; Zhang, H.; Cai, Y.; Su, R. Parallel Optimal Tracking Control Schemes for Mode-Dependent Control of Coupled Markov Jump Systems via Integral RL Method. IEEE Trans. Autom. Sci. Eng. 2020, 17, 1332–1342.
19. Zhang, K.; Zhang, H.; Mu, Y.; Liu, C. Decentralized Tracking Optimization Control for Partially Unknown Fuzzy Interconnected Systems via Reinforcement Learning Method. IEEE Trans. Fuzzy Syst. 2020, 29, 917–926.
20. Vrabie, D.; Pastravanu, O.; Abou-Khalaf, M.; Lewis, F.L. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 2009, 45, 477–484.
21. Jiang, Y.; Jiang, Z.P. Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica 2012, 48, 2699–2704.
22. Modares, H.; Lewis, F.L. Linear Quadratic Tracking Control of Partially-Unknown Continuous-Time Systems Using Reinforcement Learning. IEEE Trans. Autom. Control 2014, 59, 3051–3056.
23. Li, X.; Xue, L.; Sun, C. Linear quadratic tracking control of unknown discrete-time systems using value iteration algorithm. Neurocomputing 2018, 314, 86–93.
24. Lewis, F.L.; Vamvoudakis, K.G. Reinforcement Learning for Partially Observable Dynamic Processes: Adaptive Dynamic Programming Using Measured Output Data. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2011, 41, 14–25.
25. Kiumarsi, B.; Lewis, F.L.; Naghibi-Sistani, M.B.; Karimpour, A. Optimal Tracking Control of Unknown Discrete-Time Linear Systems Using Input-Output Measured Data. IEEE Trans. Cybern. 2015, 45, 2770–2779.
26. Gao, W.; Huang, M.; Jiang, Z.; Chai, T. Sampled-data-based adaptive optimal output-feedback control of a 2-degree-of-freedom helicopter. IET Control Theory Appl. 2016, 10, 1440–1447.
27. Xiao, G.; Zhang, H.; Zhang, K.; Wen, Y. Value iteration based integral reinforcement learning approach for H∞ controller design of continuous-time nonlinear systems. Neurocomputing 2018, 285, 51–59.
28. Chen, C.; Sun, W.; Zhao, G.; Peng, Y. Reinforcement Q-Learning Incorporated With Internal Model Method for Output Feedback Tracking Control of Unknown Linear Systems. IEEE Access 2020, 8, 134456–134467.
29. Zhao, F.; Gao, W.; Liu, T.; Jiang, Z.P. Adaptive optimal output regulation of linear discrete-time systems based on event-triggered output-feedback. Automatica 2022, 137, 110103.
30. Radac, M.B.; Lala, T. Learning Output Reference Model Tracking for Higher-Order Nonlinear Systems with Unknown Dynamics. Algorithms 2019, 12, 121.
31. Shi, P.; Shen, Q.K. Observer-based leader-following consensus of uncertain nonlinear multi-agent systems. Int. J. Robust Nonlinear Control 2017, 27, 3794–3811.
32. Rizvi, S.A.A.; Lin, Z. Output Feedback Q-Learning Control for the Discrete-Time Linear Quadratic Regulator Problem. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 1523–1536.
33. Zhu, L.M.; Modares, H.; Peen, G.O.; Lewis, F.L.; Yue, B. Adaptive suboptimal output-feedback control for linear systems using integral reinforcement learning. IEEE Trans. Control Syst. Technol. 2015, 23, 264–273.
34. Moghadam, R.; Lewis, F.L. Output-feedback H∞ quadratic tracking control of linear systems using reinforcement learning. Int. J. Adapt. Control Signal Process. 2019, 33, 300–314.
35. Valadbeigi, A.P.; Sedigh, A.K.; Lewis, F.L. H∞ Static Output-Feedback Control Design for Discrete-Time Systems Using Reinforcement Learning. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 396–406.
36. Rizvi, S.A.A.; Lin, Z. Output feedback Q-learning for discrete-time linear zero-sum games with application to the H-infinity control. Automatica 2018, 95, 213–221.
37. Peng, Y.; Chen, Q.; Sun, W. Reinforcement Q-Learning Algorithm for H∞ Tracking Control of Unknown Discrete-Time Linear Systems. IEEE Trans. Syst. Man Cybern. Syst. 2019, 50, 4109–4122.
38. Rizvi, S.A.A.; Lin, Z. Experience replay-based output feedback Q-learning scheme for optimal output tracking control of discrete-time linear systems. Int. J. Adapt. Control Signal Process. 2019, 33, 1825–1842.
39. Luo, B.; Liu, D.; Huang, T.; Liu, J. Output Tracking Control Based on Adaptive Dynamic Programming with Multistep Policy Evaluation. IEEE Trans. Syst. Man Cybern. Syst. 2019, 49, 2155–2165.
40. Kiumarsi, B.; Lewis, F.L.; Modares, H.; Karimpour, A.; Naghibi-Sistani, M.B. Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica 2014, 50, 1167–1175.
41. Luo, B.; Wu, H.N.; Huang, T. Optimal output regulation for model-free Quanser helicopter with multistep Q-learning. IEEE Trans. Ind. Electron. 2017, 65, 4953–4961.
42. Lewis, F.L.; Vrabie, D.; Vamvoudakis, K.G. Reinforcement Learning and Feedback Control: Using Natural Decision Methods to Design Optimal Adaptive Controllers. IEEE Control Syst. Mag. 2012, 32, 76–105.
43. Kiumarsi, B.; Lewis, F.L. Output synchronization of heterogeneous discrete-time systems: A model-free optimal approach. Automatica 2017, 84, 86–94.
Figure 1. Circuit diagram of a single-phase voltage source UPS inverter.
Figure 2. Effects of reference trajectory tracking with the multistep Q-learning algorithm.
Figure 3. Tracking error curve.
Figure 4. The norm of the difference between the Q-function matrices using multistep learning.
Figure 5. The norm of the difference between the Q-function matrices using one-step learning.
Figure 6. Convergence of control gain K with number of iterations.
Figure 7. Input signal of the actual tracking control systems.
Figure 8. Excitation noise signal.
Figure 9. Tracking error under multistep Q-learning with different resistance values $R_o$.
Figure 10. Variation of $N_j$ with number of iterations.