Quadratic Tracking Control of Linear Stochastic Systems with Unknown Dynamics Using Average Off-Policy Q-Learning Method
Abstract
1. Introduction
- References [25,28] studied the LQT problem for linear systems but did not consider stochastic disturbances. Since stochastic disturbances can introduce errors into the state and output information, the standard Q-learning algorithm cannot be applied directly. We propose the average off-policy Q-learning (AOPQ) algorithm to overcome this problem.
- Reference [29] investigated the set-point tracking problem for linear systems with disturbances when the system model is unknown. However, the tracking signal there is limited to constant values, and the proposed algorithm assumes that an initial stabilizing control is available. The tracking signal studied in this paper is not restricted in this way, and we provide a data-driven method for constructing an initial stabilizing control.
- Reference [30] applies off-policy Q-learning to the LQR problem for unknown discrete-time systems and provides a data-based method for designing an initial stabilizing controller, obtained by a pole-placement strategy applied to a system coefficient matrix constructed from data. However, that work does not consider the case in which the system output must track an external reference. In contrast, this paper solves the LQT problem for linear discrete-time systems with external stochastic disturbances using an average off-policy Q-learning algorithm; a sketch of the data-driven initialization idea is given after this list.
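As a rough illustration of the data-driven initialization mentioned above, the sketch below fits the system matrices (A, B) from input/state data by least squares and then applies pole placement to the identified model to obtain an initial stabilizing gain. All matrices, noise levels, and pole locations are placeholder assumptions for illustration, not values taken from the paper.

```python
import numpy as np
from scipy.signal import place_poles

# Placeholder system used only for illustration (not the paper's example).
rng = np.random.default_rng(0)
n, m, N = 2, 1, 200                          # state dim, input dim, number of samples
A_true = np.array([[0.9, 0.2], [0.0, 0.8]])
B_true = np.array([[0.0], [1.0]])

# Collect data under a persistently exciting input with a small disturbance.
X = np.zeros((n, N + 1))
U = rng.normal(size=(m, N))
for k in range(N):
    w = 0.01 * rng.normal(size=n)
    X[:, k + 1] = A_true @ X[:, k] + B_true @ U[:, k] + w

# Least-squares identification: X_next ≈ [A B] [X; U].
Z = np.vstack([X[:, :N], U])
AB_hat = X[:, 1:] @ np.linalg.pinv(Z)
A_hat, B_hat = AB_hat[:, :n], AB_hat[:, n:]

# Pole placement on the identified model yields an initial stabilizing gain.
K0 = place_poles(A_hat, B_hat, [0.5, 0.6]).gain_matrix
print("closed-loop poles:", np.linalg.eigvals(A_hat - B_hat @ K0))
```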
2. LQT Problem with Stochastic Disturbance
Problem Description
Algorithm 1: PI Algorithm with System Model [2]
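Although the steps of Algorithm 1 are not reproduced here, a model-based policy iteration of this kind has the familiar Hewer-type structure [2]: evaluate the current gain by solving a discrete Lyapunov equation, then improve it greedily. The following is a minimal sketch of that structure for a plain LQR cost with placeholder matrices; the paper's version operates on the augmented tracking system, which is omitted here.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def hewer_policy_iteration(A, B, Q, R, K0, iters=20):
    """Model-based PI for the discrete-time LQR (Hewer-type iteration [2]).

    Starting from a stabilizing gain K0, alternately evaluate the policy by
    solving a Lyapunov equation and improve it with a greedy gain update.
    This is a generic sketch, not a transcription of Algorithm 1 in the paper.
    """
    K = K0
    for _ in range(iters):
        Acl = A - B @ K
        # Policy evaluation: P = Acl' P Acl + Q + K' R K
        P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
        # Policy improvement: K = (R + B' P B)^{-1} B' P A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P

# Example usage with placeholder matrices (not from the paper).
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K0 = np.zeros((1, 2))            # A itself is stable here, so K0 = 0 is stabilizing
K_opt, P_opt = hewer_policy_iteration(A, B, Q, R, K0)
print("converged gain:", K_opt)
```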
3. Solving Stochastic LQT Problems with Unknown System
3.1. Model-Free Average Off-Policy Q-Learning Algorithm
3.2. Data-Driven Average Off-Policy Q-Learning Algorithm
Algorithm 2: Average Off-Policy Q-Learning Algorithm
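For orientation, the sketch below shows one generic evaluation/improvement step of an average-cost, off-policy Q-learning scheme for linear-quadratic control: the Q-function is quadratic in z = [x; u], and the unknowns vec(H) and the average cost λ are obtained by least squares from the off-policy Bellman relation, after which the gain is updated greedily. This is an assumption-laden illustration of the general technique (in the spirit of average-cost RL for LQ problems), not a transcription of the paper's Algorithm 2; all system matrices and tuning values are placeholders.

```python
import numpy as np

def q_features(z):
    """Quadratic features: z' H z = q_features(z) @ vec(H)."""
    return np.kron(z, z)

def aopq_step(data, K, n, m, Qw, Rw):
    """One least-squares evaluation + greedy improvement step.

    Solves, in the least-squares sense, the average-cost off-policy relation
        z_k' H z_k + lam = c_k + z_next' H z_next,  with u_next = -K x_next,
    for vec(H) and lam, then returns the improved gain.
    """
    Phi, c = [], []
    for (x, u, x_next) in data:
        z = np.concatenate([x, u])
        z_next = np.concatenate([x_next, -K @ x_next])   # target-policy action
        Phi.append(np.concatenate([q_features(z) - q_features(z_next), [1.0]]))
        c.append(x @ Qw @ x + u @ Rw @ u)
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(c), rcond=None)
    H = theta[:-1].reshape(n + m, n + m)
    H = 0.5 * (H + H.T)                                   # enforce symmetry
    K_new = np.linalg.solve(H[n:, n:], H[n:, :n])         # greedy update, u = -K_new x
    return K_new, H, theta[-1]

# Hypothetical usage on simulated data (placeholder system, not the paper's).
rng = np.random.default_rng(1)
A = np.array([[0.9, 0.2], [0.0, 0.8]]); B = np.array([[0.0], [1.0]])
Qw, Rw, n, m = np.eye(2), np.eye(1), 2, 1
K, x, data = np.zeros((m, n)), np.zeros(n), []
for k in range(400):
    u = -K @ x + 0.5 * rng.normal(size=m)                 # exploratory behavior policy
    x_next = A @ x + B @ u + 0.05 * rng.normal(size=n)    # process noise
    data.append((x, u, x_next))
    x = x_next
for _ in range(10):                                       # off-policy PI on the same batch
    K, H, avg_cost = aopq_step(data, K, n, m, Qw, Rw)
print("learned gain:", K, "\naverage-cost estimate:", avg_cost)
```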
3.3. Convergence Analysis of Algorithm 2
4. Simulation Experiment
4.1. Example 1
4.2. Example 2
4.3. Comparison Simulation Experiment
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Rizvi, S.A.A.; Lin, Z. Output feedback Q-learning control for the discrete-time linear quadratic regulator problem. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 1523–1536.
2. Hewer, G. An iterative technique for the computation of the steady state gains for the discrete optimal regulator. IEEE Trans. Autom. Control 1971, 16, 382–384.
3. Li, X.; Xue, L.; Sun, C. Linear quadratic tracking control of unknown discrete-time systems using value iteration algorithm. Neurocomputing 2018, 314, 86–93.
4. Jiang, Y.; Jiang, Z.P. Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica 2012, 48, 2699–2704.
5. Modares, H.; Lewis, F.L.; Jiang, Z.P. Optimal output-feedback control of unknown continuous-time linear systems using off-policy reinforcement learning. IEEE Trans. Cybern. 2016, 46, 2401–2410.
6. Luo, B.; Wu, H.N.; Huang, T.; Liu, D. Data-based approximate policy iteration for affine nonlinear continuous-time optimal control design. Automatica 2014, 50, 3281–3290.
7. Lee, J.Y.; Park, J.B.; Choi, Y.H. On integral generalized policy iteration for continuous-time linear quadratic regulations. Automatica 2014, 50, 475–489.
8. Vrabie, D.; Pastravanu, O.; Abu-Khalaf, M.; Lewis, F. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 2009, 45, 477–484.
9. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction; The MIT Press: London, UK, 1998.
10. Song, R.; Lewis, F.L.; Wei, Q.; Zhang, H. Off-policy actor-critic structure for optimal control of unknown systems with disturbances. IEEE Trans. Cybern. 2016, 46, 1041–1050.
11. Lewis, F.L.; Liu, D. Robust adaptive dynamic programming. In Reinforcement Learning and Approximate Dynamic Programming for Feedback Control; John Wiley & Sons: Hoboken, NJ, USA, 2013; pp. 281–302.
12. Wonham, W.M. Optimal stationary control of a linear system with state-dependent noise. SIAM J. Control 1967, 5, 486–500.
13. Jiang, Y.; Jiang, Z.P. Approximate dynamic programming for optimal stationary control with control-dependent noise. IEEE Trans. Neural Netw. 2011, 22, 2392–2398.
14. Bian, T.; Jiang, Y.; Jiang, Z.P. Adaptive dynamic programming for stochastic systems with state and control dependent noise. IEEE Trans. Autom. Control 2016, 61, 4170–4175.
15. Pang, B.; Jiang, Z.P. Reinforcement learning for adaptive optimal stationary control of linear stochastic systems. IEEE Trans. Autom. Control 2023, 68, 2383–2390.
16. Tsitsiklis, J.N.; Van Roy, B. Average cost temporal-difference learning. Automatica 1999, 35, 1799–1808.
17. Adib Yaghmaie, F.; Gunnarsson, S.; Lewis, F.L. Output regulation of unknown linear systems using average cost reinforcement learning. Automatica 2019, 110, 108549.
18. Yaghmaie, F.A.; Gustafsson, F.; Ljung, L. Linear quadratic control using model-free reinforcement learning. IEEE Trans. Autom. Control 2023, 68, 737–752.
19. Yaghmaie, F.A.; Gustafsson, F. Using reinforcement learning for model-free linear quadratic control with process and measurement noises. In Proceedings of the 2019 IEEE 58th Conference on Decision and Control (CDC), Nice, France, 11–13 December 2019.
20. Rami, M.A.; Chen, X.; Zhou, X.Y. Discrete-time indefinite LQ control with state and control dependent noises. J. Glob. Optim. 2002, 23, 245–265.
21. Ni, Y.H.; Elliott, R.; Li, X. Discrete-time mean-field stochastic linear-quadratic optimal control problems, II: Infinite horizon case. Automatica 2015, 57, 65–77.
22. Chen, S.; Yong, J. Stochastic linear quadratic optimal control problems. Appl. Math. Optim. 2001, 43, 21–45.
23. Rami, M.; Zhou, X.Y. Linear matrix inequalities, Riccati equations, and indefinite stochastic linear quadratic controls. IEEE Trans. Autom. Control 2000, 45, 1131–1143.
24. Liu, X.; Li, Y.; Zhang, W. Stochastic linear quadratic optimal control with constraint for discrete-time systems. Appl. Math. Comput. 2014, 228, 264–270.
25. Kiumarsi, B.; Lewis, F.L.; Modares, H.; Karimpour, A.; Naghibi-Sistani, M.B. Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica 2014, 50, 1167–1175.
26. Sharma, S.K.; Jha, S.K.; Dhawan, A.; Tiwari, M. Q-learning based adaptive optimal control for linear quadratic tracking problem. Int. J. Control Autom. Syst. 2023, 21, 2718–2725.
27. Liu, X.; Zhang, L.; Peng, Y. Off-policy Q-learning-based tracking control for stochastic linear discrete-time systems. In Proceedings of the 2022 4th International Conference on Control and Robotics (ICCR 2022), Guangzhou, China, 2–4 December 2022; pp. 252–256.
28. Modares, H.; Lewis, F.L. Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning. IEEE Trans. Autom. Control 2014, 59, 3051–3056.
29. Zhao, J.; Yang, C.; Gao, W.; Zhou, L. Reinforcement learning and optimal setpoint tracking control of linear systems with external disturbances. IEEE Trans. Ind. Inform. 2022, 18, 7770–7779.
30. Lopez, V.G.; Alsalti, M.; Müller, M.A. Efficient off-policy Q-learning for data-based discrete-time LQR problems. IEEE Trans. Autom. Control 2023, 68, 2922–2933.
31. Zhang, W.; Chen, B.S. On stabilizability and exact observability of stochastic systems with their applications. Automatica 2004, 40, 87–94.
32. Thompson, M.; Freedman, H.I. Deterministic mathematical models in population ecology. Am. Math. Mon. 1982, 89, 798.
33. Koning, W.L.D. Optimal estimation of linear discrete-time systems with stochastic parameters. Automatica 1984, 20, 113–115.
34. Gao, J. Machine learning applications for data center optimization. Google White Paper 2014, 21, 1–13.
35. Yu, H.; Bertsekas, D.P. Convergence results for some temporal difference methods based on least squares. IEEE Trans. Autom. Control 2009, 54, 1515–1531.
36. Lamperti, J. Stochastic Processes: A Survey of the Mathematical Theory. J. Am. Stat. Assoc. 1979, 74, 970–974.
37. Al-Tamimi, A.; Lewis, F.L.; Abu-Khalaf, M. Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control. Automatica 2007, 43, 473–481.
38. Willems, J.C.; Rapisarda, P.; Markovsky, I.; De Moor, B.L. A note on persistency of excitation. Syst. Control Lett. 2005, 54, 325–329.
39. Luenberger, D. Canonical forms for linear multivariable systems. IEEE Trans. Autom. Control 1967, 12, 290–293.
40. Jiang, Y.; Fan, J.; Chai, T.; Lewis, F.L.; Li, J. Tracking control for linear discrete-time networked control systems with unknown dynamics and dropout. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 4607–4620.
41. Prashanth, L.A.; Korda, N.; Munos, R. Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. In Proceedings of the Machine Learning and Knowledge Discovery in Databases: European Conference (ECML PKDD 2014), Nancy, France, 15–19 September 2014; Springer: Berlin/Heidelberg, Germany, 2014.
| 50 ≤ k ≤ 200 | IAE | MSE | Iteration Time |
|---|---|---|---|
| Algorithm 2 | 0.63 | 0.74 | 30 |
| Compared approach | 0.81 | 0.78 | 20 |
| | The Number of Parameters | Complexity |
|---|---|---|
| Algorithm 2 | | |
| Compared approach | | |