Article

Convergence Analysis for an Online Data-Driven Feedback Control Algorithm

1 Department of Mathematics, Florida State University, Tallahassee, FL 32304, USA
2 Citigroup Inc., Wilmington, DE 19801, USA
3 Division of Computational Science and Mathematics, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(16), 2584; https://doi.org/10.3390/math12162584
Submission received: 10 July 2024 / Revised: 9 August 2024 / Accepted: 19 August 2024 / Published: 21 August 2024
(This article belongs to the Special Issue Machine Learning and Statistical Learning with Applications)

Abstract: This paper presents the convergence analysis of a novel data-driven feedback control algorithm designed for generating online controls based on partial noisy observational data. The algorithm comprises a particle filter-enabled state estimation component, which estimates the controlled system's state via indirect observations, alongside an efficient stochastic maximum principle-type optimal control solver. By integrating weak convergence techniques for the particle filter with convergence analysis for the stochastic maximum principle control solver, we derive a weak convergence result for the optimization procedure in search of the optimal data-driven feedback control. Numerical experiments are performed to validate the theoretical findings.

1. Introduction

In this paper, we carry out a numerical analysis demonstrating the convergence of a data-driven feedback control algorithm designed for generating online controls based on partial noisy observational data.
Our focus is on the stochastic feedback control problem, which aims to determine optimal control actions that guide a controlled state dynamical system towards meeting certain optimality conditions, leveraging feedback from the system's current state. There are two practical challenges in solving the feedback control problem. First, when the dimension of the control problem is high, the computational cost of searching for the optimal control escalates exponentially; this is known as the "curse of dimensionality". Second, in numerous scenarios, the state of the controlled system is not directly observable and must be inferred through detectors or observation facilities. These sensors are typically subject to noise originating from the device itself or from the surrounding environment; for instance, radar receives noisy data and processes them through the arctangent function. Therefore, state estimation techniques become necessary to estimate the current state for designing the optimal control, with observations gathered to aid in estimating the hidden state.
To address the aforementioned challenges, a novel online data-driven feedback control algorithm has been developed [1]. This algorithm introduces a stochastic gradient descent optimal control solver within the stochastic maximum principle framework to combat the high dimensionality of optimal control problems. Traditionally, stochastic optimal control problems are solved using dynamic programming or the stochastic maximum principle, both of which require numerical simulations of large differential systems [2,3,4]. However, the stochastic maximum principle stands out for its capability to handle random coefficients in the state model and finite-dimensional terminal state constraints [5]. In the stochastic maximum principle approach, a system of backward stochastic differential equations (BSDEs) is derived as the adjoint equation of the controlled state process. The solution of the adjoint BSDE is then used to formulate the gradient of the cost functional with respect to the control process [6,7]. However, solving BSDEs numerically entails significant computational costs, especially in high-dimensional problems, which demand a large number of random samples [8,9]. To improve efficiency, a sample-wise optimal control solver has been devised [10], in which the solution of the adjoint BSDE is represented using only one realization or a small batch of samples. This approach justifies the application of stochastic approximation in the optimization procedure [11,12], and it shifts the computational cost from solving BSDEs to searching for the optimal control, thereby enhancing overall efficiency [13].
In data-driven feedback control, optimal filtering methods also play a pivotal role in dynamically estimating the state of the controlled system. Two prominent approaches for nonlinear optimal filtering are the Zakai filter and the particle filter. While the Zakai filter aims to compute the conditional probability density function (pdf) of the target dynamical system via a parabolic-type stochastic partial differential equation known as the Zakai equation [14], the particle filter, also known as a sequential Monte Carlo method, approximates the desired conditional pdf by the empirical distribution of a set of random samples (particles) [15]. Although the Zakai filter theoretically offers more accurate approximations of conditional distributions, the particle filter is favored in practical applications due to the high efficiency of the Monte Carlo method in approximating high-dimensional distributions [16].
The aim of this study is to examine the convergence of the data-driven feedback control algorithm proposed in [1], providing mathematical validation for its performance. While convergence in particle filter methods has been well studied [17,18,19], this work adopts the analysis technique outlined in [18] to establish weak convergence results for the particle filter regarding the number of particles. Analysis techniques for BSDEs alongside classical convergence results for stochastic gradient descent [13,20] are crucial for achieving convergence in the stochastic gradient descent optimal control solver. The theoretical framework of this analysis merges the examination of particle filters with the analysis of optimal control, and the overarching objective of this paper is to derive a comprehensive weak convergence result for the optimal data-driven feedback control.
In this paper, we present two numerical examples to demonstrate the baseline performance and convergence trend of our algorithm. The first example involves a classic linear quadratic optimal control problem, comparing the analytical control with the estimated control. The second example addresses a nonlinear scenario, specifically a Dubins vehicle maneuvering problem, where both the system and observations exhibit significant nonlinearity.
The rest of this paper is organized as follows. In Section 2, we introduce the data-driven feedback control algorithm. Convergence analysis will be presented in Section 3, and in Section 4, we conduct two numerical experiments to validate our theoretical findings.

2. An Efficient Algorithm for Data-Driven Feedback Control

We first briefly introduce the data-driven feedback control problem that we consider in this work. Then, we describe our efficient algorithm for solving the data-driven feedback control problem by using a stochastic gradient descent-type optimization procedure for the optimal control.

2.1. Problem Setting for the Data-Driven Optimal Control Problem

In a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, we consider the following augmented system on the time interval $[0, T]$:
$$ d\begin{pmatrix} X_t \\ M_t \end{pmatrix} = \begin{pmatrix} b(t, X_t, u_t) \\ g(X_t) \end{pmatrix} dt + \begin{pmatrix} \sigma(t, X_t, u_t) & 0 \\ 0 & I \end{pmatrix} d\begin{pmatrix} W_t \\ B_t \end{pmatrix}, \qquad X_0 = \xi, \quad M_0 = 0, \tag{1} $$
where $X := \{X_t\}_{t=0}^{T}$ is the $\mathbb{R}^d$-valued controlled state process with drift $b: [0,T] \times \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$; $\sigma: [0,T] \times \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^{d \times q}$ is the diffusion coefficient for the $q$-dimensional Brownian motion $W$ that perturbs the state $X$; and $u$ is an $m$-dimensional control process, valued in some set $U$, that controls the state process $X$. Since the state $X$ is not directly observable, we have an observation process $M$ that collects partial noisy observations of $X$ through the observation function $g: \mathbb{R}^d \to \mathbb{R}^p$, where $B$ is a $p$-dimensional Brownian motion independent of $W$.
Let $\mathcal{F}^B = \{\mathcal{F}_t^B\}_{t \ge 0}$ be the filtration of $B$ augmented by all the $\mathbb{P}$-null sets in $\mathcal{F}$, and let $\mathcal{F}^{W,B} := \{\mathcal{F}_t^{W,B}\}_{t \ge 0}$ be the filtration generated by $W$ and $B$ (augmented by the $\mathbb{P}$-null sets in $\mathcal{F}$). Under mild conditions, for any square-integrable random variable $\xi$ independent of $W$ and $B$, and any $\mathcal{F}^{W,B}$-progressively measurable process $u$ (valued in $U$), Equation (1) admits a unique solution $(X, M)$, which is $\mathcal{F}^{W,B}$-adapted. Next, we let $\mathcal{F}^M = \{\mathcal{F}_t^M\}_{t \ge 0}$ be the filtration generated by $M$ (augmented by all the $\mathbb{P}$-null sets in $\mathcal{F}$). Clearly, $\mathcal{F}^M \subset \mathcal{F}^{W,B}$, while in general $\mathcal{F}^M \not\subset \mathcal{F}^W$ and $\mathcal{F}^M \not\subset \mathcal{F}^B$. The $\mathcal{F}^M$-progressively measurable control processes, denoted by $u^M$, are control actions driven by the information contained in the observational data.
We introduce the set of data-driven admissible controls as
$$ \mathcal{U}_{ad}[0, T] = \big\{ u^M: [0, T] \times \Omega \to U \subset \mathbb{R}^m \,\big|\, u^M \text{ is } \mathcal{F}^M\text{-progressively measurable} \big\}, \tag{2} $$
and the cost functional that measures the performance of a data-driven control $u^M$ is defined as
$$ J(u^M) = \mathbb{E}\Big[ \int_0^T f(t, X_t, u_t^M)\, dt + h(X_T) \Big], \tag{3} $$
where $f$ is the running cost and $h$ is the terminal cost.
The goal of the data-driven feedback control problem is to find the optimal data-driven control $u^* \in \mathcal{U}_{ad}[0, T]$ such that
$$ J(u^*) = \inf_{u^M \in \mathcal{U}_{ad}[0, T]} J(u^M). $$
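To make the setting concrete, the following minimal Python sketch simulates one path of the augmented system (1) with an Euler-Maruyama discretization. The coefficient choices (mean-reverting drift, constant diffusion, arctangent observation function, zero control) are illustrative assumptions only, not the problems studied later in this paper.

```python
import numpy as np

def simulate_augmented_system(b, sigma, g, u, x0, T=1.0, N=50, d=1, p=1, seed=0):
    """Euler-Maruyama simulation of the augmented system (1): the controlled
    state X is driven by W, and the observation M accumulates g(X) dt + dB."""
    rng = np.random.default_rng(seed)
    dt = T / N
    X = np.zeros((N + 1, d))
    M = np.zeros((N + 1, p))
    X[0] = x0
    for n in range(N):
        t = n * dt
        dW = rng.normal(0.0, np.sqrt(dt), size=d)
        dB = rng.normal(0.0, np.sqrt(dt), size=p)
        un = u(t, M[: n + 1])                      # control may use past observations
        X[n + 1] = X[n] + b(t, X[n], un) * dt + sigma(t, X[n], un) * dW
        M[n + 1] = M[n] + g(X[n]) * dt + dB        # partial noisy observation of X
    return X, M

# Example with placeholder coefficients: mean-reverting drift, arctan observation.
X, M = simulate_augmented_system(
    b=lambda t, x, u: -x + u, sigma=lambda t, x, u: 0.2,
    g=np.arctan, u=lambda t, hist: 0.0, x0=np.zeros(1))
```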

2.2. The Algorithm for Solving the Data-Driven Optimal Control Problem

To solve the data-driven feedback control problem, we will use the algorithm from [1], which is derived from the stochastic maximum principle.

2.2.1. The Optimization Procedure for Optimal Control

When the optimal control $u^*$ is in the interior of $\mathcal{U}_{ad}$, the gradient process of the cost functional $J$ with respect to the control process on the time interval $t \in [0, T]$ can be derived, using the Gâteaux derivative at $u^*$ and the stochastic maximum principle, in the following form:
$$ (J^*)_u(u_t^*) = \mathbb{E}\big[ b_u(t, X_t^*, u_t^*)\, Y_t + \sigma_u(t, X_t^*, u_t^*)\, Z_t + f_u(t, X_t^*, u_t^*) \,\big|\, \mathcal{F}_t^M \big], \tag{4} $$
where the stochastic processes $Y$ and $Z$ are solutions of the following forward-backward stochastic differential equations (FBSDEs) system:
$$
\begin{aligned}
dX_t^* &= b(t, X_t^*, u_t^*)\, dt + \sigma(t, X_t^*, u_t^*)\, dW_t, && X_0^* = \xi,\\
dM_t^* &= g(X_t^*)\, dt + dB_t, && M_0^* = 0,\\
dY_t &= -\big( b_x(t, X_t^*, u_t^*)\, Y_t + \sigma_x(t, X_t^*, u_t^*)\, Z_t + f_x(t, X_t^*, u_t^*) \big)\, dt + Z_t\, dW_t + \zeta_t\, dB_t, && Y_T = h_x(X_T^*),
\end{aligned}\tag{5}
$$
where Z is the martingale representation of Y with respect to W and ζ is the martingale representation of Y with respect to B.
To solve the data-driven feedback optimal control problem, we also use gradient descent-type optimization with the gradient process $(J^*)_u$ defined in (4). Then, we can use the following gradient descent iteration to find the optimal control $u_t^*$ at any time instant $t \in [0, T]$:
$$ u_t^{l+1,M} = u_t^{l,M} - r\, (J^*)_u(u_t^{l,M}), \qquad l = 0, 1, 2, \ldots, \tag{6} $$
where $r$ is the step size for the gradient. The observational information $\mathcal{F}_t^M$ grows as more and more data are collected over time. Therefore, at a given time instant $t$, we target finding the optimal control $u_t^*$ with the accessible information $\mathcal{F}_t^M$. Since evaluating $(J^*)_u(u_t^{l,M})$ requires the trajectories $(Y_s, Z_s)_{t \le s \le T}$, as $Y_t$ and $Z_t$ are solved backwards from $T$ to $t$, we take the conditional expectation $\mathbb{E}[\,\cdot\,|\,\mathcal{F}_t^M]$ of the gradient process $\{(J^*)_u(u_s^{l,M})\}_{t \le s \le T}$, i.e.,
$$ \mathbb{E}\big[ (J^*)_u(u_s^{l,M}) \,\big|\, \mathcal{F}_t^M \big] = \mathbb{E}\big[ b_u(s, X_s, u_s^{l,M})\, Y_s + \sigma_u(s, X_s, u_s^{l,M})\, Z_s + f_u(s, X_s, u_s^{l,M}) \,\big|\, \mathcal{F}_t^M \big], \qquad s \in [t, T], \tag{7} $$
where $X_s$, $Y_s$, and $Z_s$ correspond to the estimated control $u_s^{l,M}$. For the gradient descent iteration (6) on the time interval $[t, T]$, taking the conditional expectation $\mathbb{E}[\,\cdot\,|\,\mathcal{F}_t^M]$, we obtain
$$ \mathbb{E}\big[ u_s^{l+1,M} \,\big|\, \mathcal{F}_t^M \big] = \mathbb{E}\big[ u_s^{l,M} \,\big|\, \mathcal{F}_t^M \big] - r\, \mathbb{E}\big[ (J^*)_u(u_s^{l,M}) \,\big|\, \mathcal{F}_t^M \big], \qquad l = 0, 1, 2, \ldots, \quad s \in [t, T]. $$
When $s > t$, the observational information $\{\mathcal{F}_s^M\}_{t \le s \le T}$ is not yet available at time $t$. We use the conditional expectation $\mathbb{E}[u_s^{l,M} | \mathcal{F}_t^M]$ to replace $u_s^{l,M}$, since it provides the best approximation of $u_s^{l,M}$ given the current observational information $\mathcal{F}_t^M$. We denote
$$ u_s^{l,M}|_t := \mathbb{E}\big[ u_s^{l,M} \,\big|\, \mathcal{F}_t^M \big], $$
and then the gradient descent iteration becomes
$$ u_s^{l+1,M}|_t = u_s^{l,M}|_t - r\, \mathbb{E}\big[ (J^*)_u(u_s^{l,M}|_t) \,\big|\, \mathcal{F}_t^M \big], \qquad l = 0, 1, 2, \ldots, \quad s \in [t, T], \tag{9} $$
where $\mathbb{E}[(J^*)_u(u_s^{l,M}|_t) \,|\, \mathcal{F}_t^M]$ can be obtained by solving the following FBSDEs:
$$
\begin{aligned}
dX_s &= b(s, X_s, u_s^{l,M}|_t)\, ds + \sigma(s, X_s, u_s^{l,M}|_t)\, dW_s, \qquad s \in [t, T],\\
dY_s &= -\big( b_x(s, X_s, u_s^{l,M}|_t)\, Y_s + \sigma_x(s, X_s, u_s^{l,M}|_t)\, Z_s + f_x(s, X_s, u_s^{l,M}|_t) \big)\, ds + Z_s\, dW_s + \zeta_s\, dB_s, \qquad Y_T = h_x(X_T),
\end{aligned}\tag{10}
$$
and evaluated effectively using the numerical algorithm, which will be introduced later.
When the controlled dynamics and the observation function $g$ are nonlinear, we will use optimal filtering techniques to obtain the conditional expectation. Before applying the particle filter method, which is one of the most important particle-based optimal filtering methods, we define
$$ \Psi(s, X_s, u_s^{l,M}|_t) := b_u(s, X_s, u_s^{l,M}|_t)\, Y_s + \sigma_u(s, X_s, u_s^{l,M}|_t)\, Z_s + f_u(s, X_s, u_s^{l,M}|_t) \tag{11} $$
for $s \in [t, T]$. With the conditional probability density function (pdf) $p(X_t | \mathcal{F}_t^M)$ that we obtain through optimal filtering methods, and using the fact that $\Psi(s, X_s, u_s^{l,M}|_t)$ is a stochastic process depending on the state $X_t$, the conditional gradient process $\mathbb{E}[(J^*)_u(u_s^{l,M}|_t) \,|\, \mathcal{F}_t^M]$ in (9) can be obtained by the following integral:
$$ \mathbb{E}\big[ (J^*)_u(u_s^{l,M}|_t) \,\big|\, \mathcal{F}_t^M \big] = \int_{\mathbb{R}^d} \mathbb{E}\big[ \Psi(s, X_s, u_s^{l,M}|_t) \,\big|\, X_t = x \big]\, p(x | \mathcal{F}_t^M)\, dx, \qquad s \in [t, T]. \tag{12} $$
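In the discrete setting, the integral in (12) is naturally evaluated as an average over a particle cloud representing $p(x | \mathcal{F}_t^M)$. A schematic helper, assuming a user-supplied estimator psi_given_x of $\mathbb{E}[\Psi(s, X_s, u_s^{l,M}|_t) \mid X_t = x]$ (a hypothetical function, e.g., built from the FBSDE schemes introduced below), could look as follows:

```python
import numpy as np

def conditional_gradient(psi_given_x, particles):
    """Approximate the integral in (12) by an average over the particle cloud
    representing p(x | F_t^M); psi_given_x is an assumed user-supplied estimator
    of E[Psi(s, X_s, u) | X_t = x]."""
    return np.mean([psi_given_x(x) for x in particles], axis=0)
```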

2.2.2. Numerical Approach for Data-Driven Feedback Control by PF-SGD

For the numerical framework, we need the temporal partition $\Pi_{N_T}$,
$$ \Pi_{N_T} = \{ t_n: 0 = t_0 < t_1 < \cdots < t_{N_T} = T \}, $$
and we use the control sequence $\{u_{t_n}^*\}_{n=1}^{N_T}$ to represent the control process $u^*$ over the time interval $[0, T]$.
  • Numerical Schemes for FBSDEs
For the FBSDE system, we adopt the following schemes:
$$
\begin{aligned}
X_{i+1} &= X_i + b(t_i, X_i, u_{t_i}^{l,M}|_{t_n})\, \Delta t_i + \sigma(t_i, X_i, u_{t_i}^{l,M}|_{t_n})\, \Delta W_{t_i},\\
Y_i &= \mathbb{E}_i[Y_{i+1}] + \mathbb{E}_i\big[ b_x(t_{i+1}, X_{i+1}, u_{t_{i+1}}^{l,M}|_{t_n})\, Y_{i+1} + \sigma_x(t_{i+1}, X_{i+1}, u_{t_{i+1}}^{l,M}|_{t_n})\, Z_{i+1} + f_x(t_{i+1}, X_{i+1}, u_{t_{i+1}}^{l,M}|_{t_n}) \big]\, \Delta t_i,\\
Z_i &= \frac{1}{\Delta t_i}\, \mathbb{E}_i\big[ Y_{i+1}\, \Delta W_{t_i} \big],
\end{aligned}
$$
where $\Delta t_i = t_{i+1} - t_i$, $\Delta W_{t_i} = W_{t_{i+1}} - W_{t_i}$, $\mathbb{E}_i[\cdot]$ denotes the conditional expectation given $X_i$, and $X_{i+1}$, $Y_i$, and $Z_i$ are numerical approximations of $X_{t_{i+1}}$, $Y_{t_i}$, and $Z_{t_i}$, respectively.
Then, the standard Monte Carlo method can approximate the above expectations with $K$ random samples:
$$
\begin{aligned}
X_{i+1}^k &= X_i + b(t_i, X_i, u_{t_i}^{l,M}|_{t_n})\, \Delta t_i + \sigma(t_i, X_i, u_{t_i}^{l,M}|_{t_n})\, \sqrt{\Delta t_i}\, \omega_i^k, \qquad k = 1, 2, \ldots, K,\\
Y_i &= \sum_{k=1}^{K} \frac{Y_{i+1}^k}{K} + \frac{\Delta t_i}{K} \sum_{k=1}^{K} \big[ b_x(t_{i+1}, X_{i+1}^k, u_{t_{i+1}}^{l,M}|_{t_n})\, Y_{i+1}^k + \sigma_x(t_{i+1}, X_{i+1}^k, u_{t_{i+1}}^{l,M}|_{t_n})\, Z_{i+1}^k + f_x(t_{i+1}, X_{i+1}^k, u_{t_{i+1}}^{l,M}|_{t_n}) \big],\\
Z_i &= \frac{1}{\Delta t_i} \sum_{k=1}^{K} \frac{Y_{i+1}^k\, \sqrt{\Delta t_i}\, \omega_i^k}{K},
\end{aligned}\tag{14}
$$
where $\{\omega_i^k\}_{k=1}^K$ is a set of i.i.d. standard Gaussian samples used to describe the randomness of $\Delta W_{t_i}$.
The above schemes solve the FBSDE system (5) as a recursive algorithm, and the convergence of these schemes is well studied; cf. [20,21].
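As an illustration, the following sketch performs one backward sweep of the Monte Carlo scheme (14) for a scalar problem; the coefficient functions and terminal condition in the usage example are placeholder choices, not those of a specific control problem in this paper.

```python
import numpy as np

def backward_sweep(b_x, sig_x, f_x, h_x, X, dW, dt):
    """One backward sweep of the Monte Carlo scheme (14) for a scalar problem.
    X: (N+1, K) forward Euler sample paths; dW: (N, K) Brownian increments."""
    N, K = dW.shape
    Y = np.zeros(N + 1)
    Z = np.zeros(N + 1)
    YK = h_x(X[N])                         # terminal values Y_T^k = h_x(X_T^k)
    for i in range(N - 1, -1, -1):
        # sample means over the K paths approximate the expectations E_i[.]
        Z[i] = np.mean(YK * dW[i]) / dt
        drift = b_x(X[i + 1]) * YK + sig_x(X[i + 1]) * Z[i + 1] + f_x(X[i + 1])
        Y[i] = np.mean(YK) + np.mean(drift) * dt
        YK = np.full(K, Y[i])              # Y_i is deterministic given X_i
    return Y, Z

# Placeholder example: b_x = 0, sig_x = 0, f_x(x) = x, terminal gradient h_x(x) = x.
rng = np.random.default_rng(0)
N, K, dt = 50, 200, 0.02
dW = rng.normal(0.0, np.sqrt(dt), (N, K))
X = np.cumsum(np.vstack([np.zeros((1, K)), dW]), axis=0)   # driftless sample paths
Y, Z = backward_sweep(lambda x: 0.0, lambda x: 0.0, lambda x: x, lambda x: x, X, dW, dt)
```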
  • Particle Filter Method for Conditional Distribution
To apply the particle filter method, we consider the controlled process on the time interval $[t_{n-1}, t_n]$:
$$ X_{t_n} = X_{t_{n-1}} + \int_{t_{n-1}}^{t_n} b(s, X_s, u_s)\, ds + \int_{t_{n-1}}^{t_n} \sigma(s, X_s, u_s)\, dW_s. \tag{15} $$
Assume that at time instant $t_{n-1}$ we have $S$ particles, denoted by $\{x_{n-1}^{(s)}\}_{s=1}^S$, that form the empirical distribution $\pi(X_{t_{n-1}} | \mathcal{F}_{t_{n-1}}^M) := \frac{1}{S} \sum_{s=1}^S \delta_{x_{n-1}^{(s)}}(X_{t_{n-1}})$ as an approximation of $p(X_{t_{n-1}} | \mathcal{F}_{t_{n-1}}^M)$. The prior pdf that we want to find in the prediction stage is approximated as
$$ \tilde{\pi}(X_{t_n} | \mathcal{F}_{t_{n-1}}^M) := \frac{1}{S} \sum_{s=1}^S \delta_{\tilde{x}_n^{(s)}}(X_{t_n}), \tag{16} $$
where $\tilde{x}_n^{(s)}$ is sampled from $\pi(X_{t_{n-1}} | \mathcal{F}_{t_{n-1}}^M)\, p(X_{t_n} | X_{t_{n-1}})$, and $p(X_{t_n} | X_{t_{n-1}})$ is the transition probability derived from the state dynamics (15). As a result, the sample cloud $\{\tilde{x}_n^{(s)}\}_{s=1}^S$ provides an approximate distribution for the prior $p(X_{t_n} | \mathcal{F}_{t_{n-1}}^M)$. Then, in the update stage, we have
$$ \tilde{\pi}(X_{t_n} | \mathcal{F}_{t_n}^M) := \sum_{s=1}^S \delta_{\tilde{x}_n^{(s)}}(X_{t_n})\, \frac{p(M_{t_n} | \tilde{x}_n^{(s)})}{\sum_{s'=1}^S p(M_{t_n} | \tilde{x}_n^{(s')})} = \sum_{s=1}^S w_n^{(s)}\, \delta_{\tilde{x}_n^{(s)}}(X_{t_n}). \tag{17} $$
In this way, we obtain a weighted empirical distribution $\tilde{\pi}(X_{t_n} | \mathcal{F}_{t_n}^M)$ that approximates the posterior pdf $p(X_{t_n} | \mathcal{F}_{t_n}^M)$ with importance weights $w_n^{(s)} \propto p(M_{t_n} | \tilde{x}_n^{(s)})$. Then, to avoid the degeneracy problem, we need a resampling step: drawing $S$ equally weighted particles $\{x_n^{(s)}\}_{s=1}^S$ from $\tilde{\pi}(X_{t_n} | \mathcal{F}_{t_n}^M)$, we obtain
$$ \pi(X_{t_n} | \mathcal{F}_{t_n}^M) = \frac{1}{S} \sum_{s=1}^S \delta_{x_n^{(s)}}(X_{t_n}). \tag{18} $$
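A minimal bootstrap particle filter sketch of the prediction-update-resampling cycle (16)-(18) is given below, assuming a one-step Euler transition for (15) and a Gaussian observation likelihood matching the additive Brownian observation noise; the coefficient functions are supplied by the caller.

```python
import numpy as np

def particle_filter_step(particles, obs_increment, b, sigma, g, u, dt, rng):
    """One predict-update-resample cycle, Equations (16)-(18), for scalar states.
    obs_increment is the observed increment of M over [t_{n-1}, t_n]."""
    S = len(particles)
    # Prediction (16): propagate particles through the Euler transition of (15).
    dW = rng.normal(0.0, np.sqrt(dt), size=S)
    pred = particles + b(particles, u) * dt + sigma(particles, u) * dW
    # Update (17): importance weights w ~ p(M_{t_n} | x); the Gaussian likelihood
    # here is an assumption matching additive Brownian observation noise.
    resid = obs_increment - g(pred) * dt
    w = np.exp(-0.5 * resid**2 / dt)
    w = w / w.sum()
    # Resampling (18): draw S equally weighted particles from the posterior.
    idx = rng.choice(S, size=S, p=w)
    return pred[idx]
```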
  • Stochastic Optimization for Control Process
In this subsection, we combine the numerical schemes for the adjoint FBSDEs system (10) and the particle filter algorithm to formulate an efficient stochastic optimization algorithm to solve the optimal control process u * .
At a time instant $t_n \in \Pi_{N_T}$, we have
$$ \mathbb{E}\big[ (J^*)_u(u_{t_i}^{l,M}|_{t_n}) \,\big|\, \mathcal{F}_{t_n}^M \big] = \int_{\mathbb{R}^d} \mathbb{E}\big[ \Psi(t_i, X_{t_i}, u_{t_i}^{l,M}|_{t_n}) \,\big|\, X_{t_n} = x \big]\, p(x | \mathcal{F}_{t_n}^M)\, dx, \tag{19} $$
where $t_i \ge t_n$ is a time instant after $t_n$.
Then, we use the approximate solutions $(Y_i, Z_i)$ of the FBSDEs from schemes (14) to replace $(Y_{t_i}, Z_{t_i})$, and the conditional distribution $p(X_{t_n} | \mathcal{F}_{t_n}^M)$ is approximated by the empirical distribution $\pi(X_{t_n} | \mathcal{F}_{t_n}^M)$ obtained from the particle filter algorithm (16)-(18). Then, we can solve for the optimal control $u_{t_n}^*$ through the following gradient descent optimization iteration:
$$ u_{t_i}^{l+1,M}|_{t_n} = u_{t_i}^{l,M}|_{t_n} - r\, \frac{1}{S} \sum_{s=1}^S \mathbb{E}\big[ b_u(t_i, X_{t_i}, u_{t_i}^{l,M}|_{t_n})\, Y_i + \sigma_u(t_i, X_{t_i}, u_{t_i}^{l,M}|_{t_n})\, Z_i + f_u(t_i, X_{t_i}, u_{t_i}^{l,M}|_{t_n}) \,\big|\, X_{t_n} = x_n^{(s)} \big]. \tag{20} $$
Then, the standard Monte Carlo method can approximate each expectation $\mathbb{E}[\,\cdot\, | X_{t_n} = x_n^{(s)}]$ with $\Lambda$ samples:
$$ u_{t_i}^{l+1,M}|_{t_n} \approx u_{t_i}^{l,M}|_{t_n} - r\, \frac{1}{S}\, \frac{1}{\Lambda} \sum_{s=1}^S \sum_{\lambda=1}^\Lambda \big[ b_u(t_i, X_{t_i}^{(\lambda, s)}, u_{t_i}^{l,M}|_{t_n})\, Y_i + \sigma_u(t_i, X_{t_i}^{(\lambda, s)}, u_{t_i}^{l,M}|_{t_n})\, Z_i + f_u(t_i, X_{t_i}^{(\lambda, s)}, u_{t_i}^{l,M}|_{t_n}) \big], \tag{21} $$
where $X_{t_i}^{(\lambda, s)}$ denotes the $\lambda$-th sample path started from $X_{t_n} = x_n^{(s)}$.
We can see from the above Monte Carlo approximation that approximating the expectation in one gradient descent iteration step requires generating $S \times \Lambda$ samples, which becomes even more computationally expensive when the controlled system is a high-dimensional process.
Thus, we apply the idea of stochastic gradient descent (SGD) to improve the efficiency of classic gradient descent optimization and combine it with the particle filter method. Instead of a fully calculated Monte Carlo simulation to approximate the conditional expectation, we use only one realization of $X_{t_i}$ to represent the expectation, and we use the particles to describe the conditional distribution of the controlled process. Thus, we have
$$ \mathbb{E}\big[ (J^*)_u(u_{t_i}^{l,M}|_{t_n}) \,\big|\, \mathcal{F}_{t_n}^M \big] \approx b_u(t_i, X_{t_i}^{(\hat{l}, \hat{s})}, u_{t_i}^{l,M}|_{t_n})\, Y_i^{(\hat{l}, \hat{s})} + \sigma_u(t_i, X_{t_i}^{(\hat{l}, \hat{s})}, u_{t_i}^{l,M}|_{t_n})\, Z_i^{(\hat{l}, \hat{s})} + f_u(t_i, X_{t_i}^{(\hat{l}, \hat{s})}, u_{t_i}^{l,M}|_{t_n}), \tag{22} $$
where $l$ is the iteration index, and the index $\hat{l}$ indicates that the random realization of the controlled process varies between the gradient descent iteration steps; $X_{t_i}^{(\hat{l}, \hat{s})}$ denotes a randomly generated realization of the controlled process with a randomly selected initial state $X_{t_n}^{(\hat{l}, \hat{s})} = x_n^{(\hat{s})}$ from the particle cloud $\{x_n^{(s)}\}_{s=1}^S$.
Then, we have the following SGD scheme:
$$ u_{t_i}^{l+1,M}|_{t_n} = u_{t_i}^{l,M}|_{t_n} - r\big( b_u(t_i, X_{t_i}^{(\hat{l}, \hat{s})}, u_{t_i}^{l,M}|_{t_n})\, Y_i^{(\hat{l}, \hat{s})} + \sigma_u(t_i, X_{t_i}^{(\hat{l}, \hat{s})}, u_{t_i}^{l,M}|_{t_n})\, Z_i^{(\hat{l}, \hat{s})} + f_u(t_i, X_{t_i}^{(\hat{l}, \hat{s})}, u_{t_i}^{l,M}|_{t_n}) \big), \tag{23} $$
where $Y_i^{(\hat{l}, \hat{s})}$ is the approximate solution $Y_i$ corresponding to the random sample $X_{t_i}^{(\hat{l}, \hat{s})}$, and the path of $X^{(\hat{l}, \hat{s})}$ is generated as follows:
$$ X_{t_{i+1}}^{(\hat{l}, \hat{s})} = X_{t_i}^{(\hat{l}, \hat{s})} + b(t_i, X_{t_i}^{(\hat{l}, \hat{s})}, u_{t_i}^{l,M}|_{t_n})\, \Delta t_i + \sigma(t_i, X_{t_i}^{(\hat{l}, \hat{s})}, u_{t_i}^{l,M}|_{t_n})\, \sqrt{\Delta t_i}\, \omega_i^{(\hat{l}, \hat{s})}, \tag{24} $$
where $\omega_i^{(\hat{l}, \hat{s})} \sim N(0, 1)$. Then, an estimate for our desired data-driven optimal control at time instant $t_n$ is
$$ \hat{u}_{t_n} := u_{t_n}^{L,M}|_{t_n}. $$
The corresponding scheme for the FBSDEs is
$$
\begin{aligned}
Y_i^{(\hat{l}, \hat{s})} &= Y_{i+1}^{(\hat{l}, \hat{s})} + \big[ b_x(t_{i+1}, X_{t_{i+1}}^{(\hat{l}, \hat{s})}, u_{t_i}^{l,M}|_{t_n})\, Y_{i+1}^{(\hat{l}, \hat{s})} + \sigma_x(t_{i+1}, X_{t_{i+1}}^{(\hat{l}, \hat{s})}, u_{t_i}^{l,M}|_{t_n})\, Z_{i+1}^{(\hat{l}, \hat{s})} + f_x(t_{i+1}, X_{t_{i+1}}^{(\hat{l}, \hat{s})}, u_{t_i}^{l,M}|_{t_n}) \big]\, \Delta t_i,\\
Z_i^{(\hat{l}, \hat{s})} &= \frac{Y_{i+1}^{(\hat{l}, \hat{s})}\, \omega_i^{(\hat{l}, \hat{s})}}{\sqrt{\Delta t_i}}.
\end{aligned}\tag{25}
$$
Then, we have the following Algorithm 1:
Algorithm 1 PF-SGD algorithm for the data-driven feedback control problem.
  • Initialize the particle cloud $\{x_0^{(s)}\}_{s=1}^S \sim \xi$ and the number of iterations $L \in \mathbb{N}$
  • while $n = 0, 1, 2, \ldots, N_T$ do
  •   Initialize an estimated control process $\{u_{t_i}^{0,M}|_{t_n}\}_{i=n}^{N_T}$ and a step size $\rho$
  •   for SGD iteration steps $l = 0, 1, 2, \ldots, L$ do
  •     Simulate one realization of the controlled process $\{X_{t_{i+1}}^{(\hat{l}, \hat{s})}|_{t_n}\}_{i=n}^{N_T - 1}$ through scheme (24), with $X_{t_n}^{(\hat{l}, \hat{s})} = x_n^{(\hat{s})} \in \{x_n^{(s)}\}_{s=1}^S$;
  •     Calculate the solution $\{Y_{t_i}^{(\hat{l}, \hat{s})}|_{t_n}\}_{i=N_T}^{n}$ of the FBSDE system (14) corresponding to $\{X_{t_{i+1}}^{(\hat{l}, \hat{s})}|_{t_n}\}_{i=n}^{N_T - 1}$ through schemes (25);
  •     Update the control process to obtain $\{u_{t_i}^{l+1,M}|_{t_n}\}_{i=n}^{N_T}$ through scheme (23);
  •   end for
  •   The estimated optimal control is given by $\hat{u}_{t_n}^* = u_{t_n}^{L,M}|_{t_n}$;
  •   Propagate the particles through the particle filter algorithm (16)-(18), using the estimated optimal control $\hat{u}_{t_n}^*$, to obtain $\{x_{n+1}^{(s)}\}_{s=1}^S$;
  • end while
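For concreteness, here is a condensed Python sketch of Algorithm 1 for a deliberately simplified scalar model (our own illustrative assumption, not the paper's examples): $dX = u\,dt + \sigma\,dW$, $dM = \arctan(X)\,dt + dB$, running cost $u^2/2$, and terminal cost $x^2/2$, so that $b_u = 1$, $f_u = u$, $h_x(x) = x$, and the $\sigma_u$, $b_x$, $f_x$ terms vanish. It reuses particle_filter_step from the earlier sketch, and the observation increments are placeholders.

```python
import numpy as np

sig, T, NT, S, L, r = 0.2, 1.0, 50, 128, 200, 0.5
dt = T / NT
rng = np.random.default_rng(0)

def sgd_step(u, x0, n):
    """One SGD update over [t_n, T]: forward path (24), backward values (25),
    and control update (23), all for the simplified model above."""
    X = np.zeros(NT + 1)
    X[n] = x0
    w = rng.normal(size=NT)
    for i in range(n, NT):
        X[i + 1] = X[i] + u[i] * dt + sig * np.sqrt(dt) * w[i]
    Y = X[NT]                             # with b_x = f_x = 0, Y_i = h_x(X_T) backwards
    unew = u.copy()
    for i in range(NT - 1, n - 1, -1):    # scheme (23): gradient b_u*Y + f_u = Y + u
        unew[i] -= r * (Y + u[i])
    return unew

u = np.zeros(NT + 1)
particles = rng.normal(0.0, 0.1, S)       # initial cloud {x_0^(s)} ~ xi
obs = np.zeros(NT)                        # placeholder observed M-increments
controls = []
for n in range(NT):                       # outer loop of Algorithm 1
    for l in range(L):
        u = sgd_step(u, particles[rng.integers(S)], n)
    controls.append(u[n])
    particles = particle_filter_step(     # (16)-(18), from the earlier sketch
        particles, obs[n], lambda x, uu: uu * np.ones_like(x),
        lambda x, uu: sig, np.arctan, u[n], dt, rng)
```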

3. Convergence Analysis

Our convergence analysis aims to show the convergence of the estimated distribution of the state to the distribution of the "true state" under a fixed temporal model discretization $N$. We also show the convergence of the estimated control to the "true control" in expectation restricted to a compact set. To proceed, we first introduce our notation and the assumptions required in the proofs in Section 3.1. Then, in Section 3.2, we provide the main convergence theorems.

3.1. Notations and Assumptions

  • Notations
  • We use $U_n: \{t_n, \ldots, T\} \to \mathbb{R}^d$ to denote a control process that starts at time $t_n$ and ends at time $T$. We use
    $$ \mathcal{U}_n := \big\{ U_n \,\big|\, U_n: \{t_n, \ldots, T\} \to \mathbb{R}^d,\ U_n \text{ is } \mathcal{F}_{t_n}^M\text{-adapted} \big\} $$
    to denote the collection of admissible controls starting at time $t_n$.
  • We define the control at time $t_n$ to be $u_n := U_n|_{t_n}$.
  • We define $\mu_n^N := \pi_{t_n | t_n}^N$, the conditional distribution coming from the particle filter algorithm, where the superscript indicates that the measure is obtained through the particle filter method, and so it is random.
  • We use $S^N$ to denote the sampling operator $\pi_{t_n | t_n}^N \mapsto \frac{1}{N} \sum_{i=1}^N \delta_{x_t^{(i)}}$, and $L_n$ to denote the updating step in the particle filter. We use $P_n^N$ to denote the transition operator (the prediction step) under the SGD-particle filter framework, and $P_n$ the deterministic transition operator for the exact case (the control in SGD is exact). We say "deterministic" here to distinguish it from the case where the control $u_n$ may be random due to the SGD optimization algorithm.
  • We use $\langle \cdot, \cdot \rangle$ to denote the deterministic $L^2$ inner product, i.e., if $f, g \in L^2([0, T]; \mathbb{R}^d)$, then
    $$ \langle f, g \rangle := \int_0^T f \cdot g\, dt. $$
  • We define $J_N^x(U_n) := \mathbb{E}[J_N(U_n) \,|\, X_n = x]$. We then have $\mathbb{E}[J_N^{X_n}(U_n)] := \int \mathbb{E}[J_N(U_n) \,|\, X_n = s]\, d\mu_n^N(s)$. We remark that $U_n$ is a process that starts at time $t_n$, and so $X_n$ is essentially the initial condition of the diffusion process.
  • We define the distance between two random measures as
    $$ d(\mu, \nu) := \sup_{\|f\| \le 1} \sqrt{\mathbb{E}_\omega\big[ |\mu^\omega f - \nu^\omega f|^2 \big]}, \tag{27} $$
    where the expectation is taken over the randomness of the measures.
  • We use the total variation distance between two deterministic probability measures $\mu, \nu$:
    $$ d_{TV}(\mu, \nu) := \sup_{\|f\| \le 1} |\mu f - \nu f|. $$
  • We use K n to denote the total number of iterations taken in the SGD algorithm at time t n ; we use N to denote the total number of particles in the system. We use C to denote a generic constant which may vary from line to line.
  • Abusing notation, we will write $J_N^x(U_n)$ in the following way, where the argument $U_n$ can be a vector of any length, $1 \le n \le N$:
    $$ J_N^x(U_n)|_{t_i} := \mathbb{E}_{X_{t_n} = x}\big[ b_u(X_{t_i}, U_n|_{t_i})\, Y_{t_i} + \sigma_u(X_{t_i}, U_n|_{t_i})\, Z_{t_i} + f_u(X_{t_i}, U_n|_{t_i}) \big]. $$
  • Assumptions
  • We assume that $J_N$ satisfies the following strong condition: for any $x \in \mathcal{X}$, there exists a constant $\lambda > 0$ such that for all $U, V \in \mathcal{U}_0$:
    $$ \lambda \|U - V\|^2 \le \big\langle J_N^x(U) - J_N^x(V),\ U - V \big\rangle. \tag{30} $$
    Notice that (30) implies that the same inequality holds for any $U_n, V_n \in \mathcal{U}_n$, as can be seen by simply fixing all the $U_n|_{t_i}, V_n|_{t_i}$, $0 \le i \le n-1$, to be 0.
    This is a very strong assumption, and one should consider relaxing it to
    $$ \lambda \|U - V\|^2 \le \mathbb{E}_\omega\Big[ \mathbb{E}_{\mu_n^{N,\cdot}}\big[ \big\langle J_N^x(U) - J_N^x(V),\ U - V \big\rangle \big] \Big], $$
    that is, requiring the relation to hold in expectation instead of point-wise.
  • Both $b$ and $\sigma$ are deterministic and belong to $C_b^{2,2}(\mathbb{R}^d \times \mathbb{R}^m; \mathbb{R}^d)$ in the space variable $x$ and the control $u$.
  • $b, b_x, b_u, \sigma, \sigma_x, f_x, f_u$ are all uniformly Lipschitz in $x, u$ and uniformly bounded.
  • $\sigma$ satisfies the uniform ellipticity condition.
  • The initial condition $X_0 := \xi \in L^2(\mathcal{F}_0)$.
  • The terminal (loss) function $\Phi$ is $C^1$ and positive, and $\Phi_x$ has at most linear growth at infinity.
  • We assume that the function $g_n$ (related to the Bayesian update step) satisfies the following bound: there exists $0 < \kappa < 1$ such that
    $$ \kappa \le g_n \le \kappa^{-1}. $$

3.2. The Convergence Theorem for the Data-Driven Feedback Control Algorithm

Our algorithm combines the particle filter method and the stochastic gradient descent method. Lemma 1 (combining Lemmas 4.7-4.9 from the book [22]) provides the convergence result for the particle filter method alone. It shows that each prediction and updating step is convergent.
Recall that $S^N$ is the sampling operator with which we sample $N$ particles, $P_n^N$ denotes the transition operator (the prediction step) under the SGD-particle filter framework, $P_n$ denotes the deterministic transition operator assuming that SGD gives the exact control, and $L_n$ denotes the updating step in the particle filter method.
Lemma 1. 
Assume that there exists $\kappa \in (0, 1]$ with $\kappa \le g_n \le \kappa^{-1}$. Then the following hold:
$$ \sup_{\mu \in \mathcal{P}(\mathbb{R}^d)} d(S^N \mu, \mu) \le \frac{1}{\sqrt{N}}, $$
$$ d(P_n^N \mu, P_n^N \nu) \le d(\mu, \nu), \qquad d(P_n \mu, P_n \nu) \le d(\mu, \nu), $$
$$ d(L_n \nu, L_n \mu) \le \frac{2}{\kappa^2}\, d(\nu, \mu). $$
Given Lemma 1, Theorem 4.5 in [22] tells us that the particle filter framework is convergent. Then, following Lemma 1, we can bound the distance between the true distribution of the state and the distribution estimated through the SGD-particle filter framework:
$$
\begin{aligned}
d(\mu_{n+1}^{N,\cdot}, \mu_{n+1}) &\le d(L_n S^N P_n^N \mu_n^{N,\cdot},\ L_n P_n \mu_n)\\
&\le d(L_n S^N P_n^N \mu_n^{N,\cdot},\ L_n S^N P_n \mu_n) + d(L_n S^N P_n \mu_n,\ L_n P_n \mu_n)\\
&\le \frac{2}{\kappa^2}\Big( \frac{2}{\sqrt{N}} + d(P_n^N \mu_n^{N,\cdot},\ P_n \mu_n^{N,\cdot}) + d(S^N P_n \mu_n^{N,\cdot},\ P_n \mu_n) \Big)\\
&\le \frac{2}{\kappa^2}\Big( \frac{3}{\sqrt{N}} + d(P_n^N \mu_n^{N,\cdot},\ P_n \mu_n^{N,\cdot}) + d(\mu_n^{N,\cdot},\ \mu_n) \Big), \tag{34}
\end{aligned}
$$
where in the above inequalities, we have used triangle inequalities and Lemma 1.
Hence, if we can show that an inequality of the following form holds,
$$ d(P_n^N \mu_n^{N,\cdot},\ P_n \mu_n^{N,\cdot}) \le C_n\, d(\mu_n^{N,\cdot}, \mu_n) + \epsilon_n, \tag{35} $$
for some constant $C_n$ and some $\epsilon_n$ that we can tune, then, by recursion with (34), the convergence holds.
Remark. 
We point out that the difficulty lies in showing (35). Recall that the distance between two random measures defined in (27) involves testing over all measurable functions bounded by 1. However, we will see later that it is more desirable to test against Lipschitz functions. Hence, since the underlying measure is a finite Borel probability measure, we first identify the test function with a continuous function on a large compact set (Lusin's theorem). Then, we approximate this continuous function uniformly by a Lipschitz function, which is possible since the domain is now compact. In this way, we can show that a form close to (35) is true.
Remark. 
Notice that the first measure in $d(P_n^N \mu_n^{N,\cdot},\ P_n \mu_n^{N,\cdot})$ carries two sources of randomness: the randomness in $P_n^N$, which comes from the SGD algorithm used to find the control, and the randomness in the measure $\mu_n^{N,\cdot}$. However, we do not distinguish the two when we take the expectation.
To prove the convergence, we need a subspace such that all particles $X_i$ (obtained from the particle filter method) at any time $n$ stay within this bounded subspace; or, in a relaxed form, the probability of any particle $X_i$ escaping from a very large region is very small. Lemma 2 shows that we can restrict the particles to a compact subspace of radius $M$, starting from any particles and any admissible control $U_0$.
Lemma 2. 
There exist $M$ and a constant $C$ such that, under any admissible control $U_0$,
$$ \mathbb{P}\Big( \sup_{i \in \{1, \ldots, N\},\ U_0 \in \mathcal{U}} |X_i| \ge M \Big) \le \frac{C}{M^2}, \qquad X_i \sim \pi_{t_i | t_{i-1}}^N \ \text{or}\ X_i \sim \pi_{t_i | t_i}^N. \tag{36} $$
Proof. 
See Appendix A.1. □
Remark. 
Lemma 2 tells us that, starting from any random selection of particles and any admissible control $U_0$, at any time $t$, all particles are restricted to a compact set $\mathcal{M}$ with $diam(\mathcal{M}) \le M$, in the sense that
$$ \mathbb{P}\Big( \sup_{i \in \{1, \ldots, N\},\ U_0 \in \mathcal{U}} |X_i| \ge diam(\mathcal{M}) \Big) \le \frac{C}{M^2}, \qquad X_i \sim \pi_{t_i | t_{i-1}}^N \ \text{or}\ \pi_{t_i | t_i}^N. $$
We will use the following consequence extensively later:
$$ \mathbb{E}\big[ \mathbb{1}_{\{|X_n| \ge M\}} \big] \le \frac{C}{M^2}, \qquad 1 \le n \le N. $$
The following Lemma 3 bounds the difference between the estimated optimal control $u_n$ and the true control $u_n^*$. Let $\mathcal{G}_k := \{\Delta W_n^i, x^i\}_{i=0}^{k-1}$. Knowing $\mathcal{G}_k$ essentially means knowing the control $u_k$ in the SGD framework at time $t_n$, since, according to our scheme, the control is $\mathcal{G}_k$-measurable.
Lemma 3. 
Under a fixed temporal discretization number $N$, with the particle cloud $\mu^{N,\omega}$, a deterministic $u_n^*$, and a compact domain $K_n$ (such that $\mathbb{E}_\omega \mathbb{E}_{\mu_n^{N,\omega}}[\mathbb{1}_{K_n^c}] \le \frac{C}{M_n^2}$ and $diam(K_n) \le M_n$), we have, for any iteration number $K$, that the following holds:
$$ \mathbb{E}_\omega \mathbb{E}_{\mu_n^{N,\omega}}\big[ \mathbb{1}_{K_n} |u_n^\omega - u_n^*|^2 \,\big|\, X_n = x \big] \le C M_n^2 \sup_{\|q\| \le 1} \mathbb{E}_\omega\big[ |\mu_n^{N,\omega} q - \mu_n q|^2 \big] + \frac{C}{M_n} + \frac{C M_n^2}{K}. \tag{39} $$
Remark. 
The value of $\sup_{\|q\| \le 1} \mathbb{E}_\omega[ |\mu_n^{N,\omega} q - \mu_n q|^2 ]$ depends on $\mu_n^{N,\omega}$, which is obtained from the previous step, and it does not depend on the current $M_n$. As a result, as long as
$$ \sup_{\|q\| \le 1} \mathbb{E}_\omega\big[ |\mu_n^{N,\omega} q - \mu_n q|^2 \big] \to 0, $$
the bound (39) can be made arbitrarily small on any compact domain $K_n$, and this indicates the point-wise convergence of $u$ at any time $t_n$.
Proof. 
For simplicity, in the proof we denote the control by $U_n^{K+1}$, where $K$ is the SGD iteration index and $n$ indicates the current time $t_n$. Let $j_n^{x_k}(U_n^K)$ denote the sampled gradient process in SGD using the estimated control $U_n^K$, and let $J_N^x(U_n^*)$ denote the gradient process using the true control $U_n^*$. The two update rules read
$$ U_n^{K+1} = U_n^K - \eta_k\, j_n^{x_k}(U_n^K), \tag{40} $$
$$ U_n^* = U_n^* - \eta_k\, \mathbb{E}_{\mu_n}\big[ J_N^x(U_n^*) \big], \tag{41} $$
where $x_k$ is drawn from the current distribution $\mu_n^{N,\omega}$ and $\mathbb{E}_{\mu_n}[J_N^x(U_n^*)] = 0$ by the optimality condition. Take the difference between (40) and (41), square both sides, and take the conditional expectation $\mathbb{E}[\,\cdot\,|\,\mathcal{G}_K]$, which is taken with respect to the following sources of randomness:
  • the randomness from the selection of the initial point $x_n^k$;
  • the randomness from the pathwise approximated Brownian motion used for the FBSDEs;
  • the randomness accumulated from the past particle sampling.
We can write $\mathbb{E}[j^x(U^K) \,|\, \mathcal{G}_K] = \mathbb{E}_{\mu_n^{N,\omega}}[J_N^x(U^K) \,|\, \mathcal{G}_K]$, which can be seen from the following:
$$ \mathbb{E}[j^x(U^K) \,|\, \mathcal{G}_K] = \mathbb{E}_{\mu_n^{N,\omega}}\big[ \mathbb{E}_{X_{t_n} = x}[ j^x(U^K) \,|\, \mathcal{G}_K ] \big] = \mathbb{E}_{\mu_n^{N,\omega}}\big[ J_N^x(U^K) \,|\, \mathcal{G}_K \big]. \tag{42} $$
Then, take the squared norm on both sides, multiply by the indicator function $\mathbb{1}_{K_n}$, and take the conditional expectation $\mathbb{E}[\,\cdot\,|\,\mathcal{G}_K]$, noticing that $U_n^K$ is $\mathcal{G}_K$-measurable and $U^*$ is deterministic. In this case, we obtain the following:
$$
\begin{aligned}
&\mathbb{E}\big[ \mathbb{1}_{K_n} \|U_n^{K+1} - U_n^*\|^2 \,\big|\, \mathcal{G}_K \big]\\
&\quad= \mathbb{E}\big[ \mathbb{1}_{K_n} \|U_n^{K} - U_n^*\|^2 \,\big|\, \mathcal{G}_K \big] - 2\eta_k \Big\langle \mathbb{E}_{\mu_n^{N,\omega}}\big[ \mathbb{1}_{K_n} J_N^x(U_n^K) \,\big|\, \mathcal{G}_K \big] - \mathbb{E}_{\mu_n}\big[ \mathbb{1}_{K_n} J_N^x(U_n^*) \big],\ U_n^K - U_n^* \Big\rangle + \eta_K^2\, \mathbb{E}\big[ \mathbb{1}_{K_n} \|j^x(U_n^K) - \mathbb{E}_{\mu_n}[J_N^x(U_n^*)]\|^2 \,\big|\, \mathcal{G}_K \big]\\
&\quad= \mathbb{E}\big[ \mathbb{1}_{K_n} \|U_n^{K} - U_n^*\|^2 \,\big|\, \mathcal{G}_K \big] - 2\eta_k\, \mathbb{E}_{\mu_n^{N,\omega}}\big[ \mathbb{1}_{K_n} \big\langle J_N^x(U_n^K) - J_N^x(U_n^*),\ U_n^K - U_n^* \big\rangle \,\big|\, \mathcal{G}_K \big]\\
&\qquad - 2\eta_k \Big\langle \mathbb{E}_{\mu_n^{N,\omega}}\big[ \mathbb{1}_{K_n} J_N^x(U_n^*) \big] - \mathbb{E}_{\mu_n^{N,\omega}}\big[ \mathbb{1}_{K_n} \mathbb{E}_{\mu_n}[J_N^x(U_n^*)] \big],\ \mathbb{1}_{K_n}(U_n^K - U_n^*) \Big\rangle + \eta_k^2\, \mathbb{E}\big[ \mathbb{1}_{K_n} \|j^x(U_n^K) - \mathbb{E}_{\mu_n}[J_N^x(U_n^*)]\|^2 \,\big|\, \mathcal{G}_K \big]\\
&\quad\le (1 - \lambda \eta_k)\, \mathbb{E}_{\mu_n^{N,\omega}}\big[ \mathbb{1}_{K_n} \|U_n^K - U_n^*\|^2 \,\big|\, \mathcal{G}_K \big] + \frac{\eta_k}{\lambda} \underbrace{\big\| \mathbb{E}_{\mu_n^{N,\omega}}\big[ \mathbb{1}_{K_n} J_N^x(U_n^*) \big] - \mathbb{1}_{K_n} \mathbb{E}_{\mu_n}\big[ J_N^x(U_n^*) \big] \big\|^2}_{**} + \eta_k^2\, \mathbb{E}_{\mu_n^{N,\omega}}\big[ \mathbb{1}_{K_n} \big( C |x^i|^2 + C \big) \big], \tag{43}
\end{aligned}
$$
where in the last line we used the following lemma from [13], which states that there exists $C$ such that
$$ \mathbb{E}\big[ \|j^x(U_n)\|^2 \big] \le C|x|^2 + C. \tag{44} $$
Recalling that $\mathbb{E}_{\mu_n}[J_N^x(U^*)] = 0$, we then have
$$
\begin{aligned}
** &= \big\| \mathbb{E}_{\mu_n^{N,\omega}}[\mathbb{1}_{K_n} J_N^x(U^*)] - \mathbb{E}_{\mu_n^{N,\omega}} \mathbb{E}_{\mu_n}[J_N^x(U^*)] + \mathbb{E}_{\mu_n^{N,\omega}} \mathbb{E}_{\mu_n}[J_N^x(U^*)] - \mathbb{E}_{\mu_n^{N,\omega}}\big[ \mathbb{1}_{K_n} \mathbb{E}_{\mu_n}[J_N^x(U^*)] \big] \big\|^2\\
&\le \big\| \mathbb{E}_{\mu_n^{N,\omega}}[\mathbb{1}_{K_n} J_N^x(U^*)] - \mathbb{E}_{\mu_n}[\mathbb{1}_{K_n} J_N^x(U^*)] + \mathbb{E}_{\mu_n}[\mathbb{1}_{K_n} J_N^x(U^*)] - \mathbb{E}_{\mu_n}[J_N^x(U^*)] \big\|^2\\
&\le (1 + \epsilon)\, \big\| \mathbb{E}_{\mu_n^{N,\omega}}[\mathbb{1}_{K_n} J_N^x(U^*)] - \mathbb{E}_{\mu_n}[\mathbb{1}_{K_n} J_N^x(U^*)] \big\|^2 + \Big( 1 + \frac{1}{\epsilon} \Big)\, \big\| \mathbb{E}_{\mu_n}[\mathbb{1}_{K_n} J_N^x(U^*)] - \mathbb{E}_{\mu_n}[J_N^x(U^*)] \big\|^2\\
&\le C\, \big\| \mathbb{E}_{\mu_n^{N,\omega}}[\mathbb{1}_{K_n} J_N^x(U^*)] - \mathbb{E}_{\mu_n}[\mathbb{1}_{K_n} J_N^x(U^*)] \big\|^2 + \frac{C}{M_n}. \tag{45}
\end{aligned}
$$
Then, we take the expectation on both sides over all the randomness, and we have
$$
\begin{aligned}
\mathbb{E}\big[ \mathbb{1}_{K_n} \|U^{K+1} - U^*\|^2 \big] &\le (1 - \lambda \eta_k)\, \mathbb{E}_{\mu_n^N}\big[ \|U^K - U^*\|^2 \big] + \frac{\eta_k}{\lambda}\, \mathbb{E}_\omega\Big[ C \big\| \mathbb{E}_{\mu_n^{N,\omega}}[\mathbb{1}_{K_n} J_N^x(U^*)] - \mathbb{E}_{\mu_n}[\mathbb{1}_{K_n} J_N^x(U^*)] \big\|^2 + \frac{C}{M_n} \Big] + \eta_k^2\, C M_n^2\\
&\le \frac{\|U^0 - U^*\|^2}{K} + C\, \mathbb{E}_\omega\Big[ \big\| \mathbb{E}_{\mu_n^{N,\omega}}[\mathbb{1}_{K_n} J_N^x(U^*)] - \mathbb{E}_{\mu_n}[\mathbb{1}_{K_n} J_N^x(U^*)] \big\|_2^2 \Big] + \frac{C}{M_n} + \frac{C M_n^2}{K}. \tag{46}
\end{aligned}
$$
Notice that, for the control $U^*$ and a fixed $x$, $J_N^x(U^*)$ is uniformly bounded:
$$
\begin{aligned}
\big\| \mathbb{E}_{\mu_n^{N,\omega}}[\mathbb{1}_{K_n} J_N^x(U^*)] - \mathbb{E}_{\mu_n}[\mathbb{1}_{K_n} J_N^x(U^*)] \big\|_2^2 &= \sum_{i=n}^{N} \Delta t\, \big| \mathbb{E}_{\mu_n^{N,\omega}}[\mathbb{1}_{K_n} J_N^x(U^*)|_i] - \mathbb{E}_{\mu_n}[\mathbb{1}_{K_n} J_N^x(U^*)|_i] \big|^2\\
&\le \sum_{i=n}^{N} \Delta t \sup_{j \in \{n, \ldots, N\}} \big| \mathbb{E}_{\mu_n^{N,\omega}}[\mathbb{1}_{K_n} J_N^x(U^*)|_j] - \mathbb{E}_{\mu_n}[\mathbb{1}_{K_n} J_N^x(U^*)|_j] \big|^2\\
&\le \sup_{j \in \{n, \ldots, N\}} \big| \mathbb{E}_{\mu_n^{N,\omega}}[\mathbb{1}_{K_n} J_N^x(U^*)|_j] - \mathbb{E}_{\mu_n}[\mathbb{1}_{K_n} J_N^x(U^*)|_j] \big|^2.
\end{aligned}
$$
However, since from the result in (44)
$$ \sup_{j \in \{n, \ldots, N\}} \big| J_N^x(U^*)|_j \big|^2 \le C|x|^2 + C, $$
we have that
$$ \mathbb{1}_{K_n} \sup_{j \in \{n, \ldots, N\}} \big| J_N^x(U^*)|_j \big|^2 \le C M_n^2\, \mathbb{1}_{K_n} |q(x)| $$
for some $q(x)$ with $\|q(x)\| \le 1$. As a result, we see that
$$ (46) \le C M_n^2\, \mathbb{E}_\omega \big| \mathbb{E}_{\mu_n^{N,\omega}}[q(x)] - \mathbb{E}_{\mu_n}[q(x)] \big|^2 + \frac{C}{M_n} + \frac{C M_n^2}{K} \le C M_n^2 \sup_{\|q\| \le 1} \mathbb{E}_\omega\big[ |\mu_n^{N,\omega} q - \mu_n q|^2 \big] + \frac{C}{M_n} + \frac{C M_n^2}{K}. $$
Thus, we have
$$ \mathbb{E}_\omega \mathbb{E}_{\mu_n^{N,\omega}}\big[ \mathbb{1}_{K_n} \sup_n |u^{K+1} - u^*|^2 \big] \le C M_n^2 \sup_{\|q\| \le 1} \mathbb{E}_\omega\big[ |\mu_n^{N,\omega} q - \mu_n q|^2 \big] + \frac{C}{M_n} + \frac{C M_n^2}{K}, \tag{50} $$
where we have absorbed the constant term $N$ into $C$. □
Lemma 3 shows that, when the empirical distribution $\mu_n^N$ is close enough to the true distribution $\mu_n$, the difference between $u_n$ and $u_n^*$, in expectation restricted to a compact set, is bounded by the distance between the true distribution of the state and the estimated distribution. Thus, if we can show the convergence of the distribution, we obtain the convergence of the estimated control to the "true control" in expectation restricted to a compact set.
Next, we want to show, in Lemma 4, that after moving forward one step, the distance between the true distribution of the state and the distribution estimated through the SGD-particle filter framework is bounded by the distance at the previous step, up to some constants.
Lemma 4. 
For each $n = 0, 1, \ldots, N-1$, there exist $M_n, L_n, \delta_n, K_n$ such that the following inequality holds:
$$ d(\mu_{n+1}^{N,\cdot}, \mu_{n+1}) \le \frac{2}{\kappa^2}\Big( (1 + C \Delta t\, L_n M_n)\, d(\mu_n^{N,\cdot}, \mu_n) + \frac{C}{M_n} + \frac{C M_n}{\sqrt{K_n}} + 2\delta_n + \frac{3}{\sqrt{N}} \Big). \tag{51} $$
Proof. 
The key step is to estimate the quantity $d^2(P_n^N \mu_n^{N,\cdot},\ P_n \mu_n^{N,\cdot})$ in (34). Without loss of generality, we assume that the supremum is realized by a function $f$ with $\|f\| \le 1$; then, we have
$$ d^2(P_n^N \mu_n^{N,\cdot},\ P_n \mu_n^{N,\cdot}) = \mathbb{E}_\omega\big[ |P_n^N \mu_n^{N,\cdot} f - P_n \mu_n^{N,\cdot} f|^2 \big]. $$
Notice that $P_n^N$ is the prediction operator that uses the control $u_n$, which carries the randomness from SGD, while $P_n$ uses the control $u_n^*$. Hence $P_n^N \mu_n^{N,\omega}$ is a random measure, and we comment that both $u_n^*$ and $\mu_n$ are deterministic.
Without loss of generality, we use $u_n^\omega$ and $\mu_n^{N,\omega}$ to denote the random control and the random measure. (Even though their randomness may differ, we can concatenate $(\omega_1, \omega_2) := \omega$ and treat them as driven by one $\omega$.)
For fixed randomness $\omega$, we have, by Fubini's theorem,
$$
\begin{aligned}
|P_n^N \mu_n^{N,\omega} f - P_n \mu_n^{N,\omega} f|^2
&= \Big| \mathbb{E}_{\mu_n^{N,\omega}} \underbrace{\mathbb{E}\big[ f\big( X_n + b(X_n, u_n^\omega)\Delta t + \sigma(X_n)\Delta W_n \big) \,\big|\, X_n = x \big]}_{f_1^\omega} - \mathbb{E}_{\mu_n^{N,\omega}} \underbrace{\mathbb{E}\big[ f\big( X_n + b(X_n, u_n^*)\Delta t + \sigma(X_n)\Delta W_n \big) \,\big|\, X_n = x \big]}_{f_2} \Big|^2\\
&= \big| \mathbb{E}_{\mu_n^{N,\omega}} \mathbb{E}\big[ f_1^\omega - f_2 \,\big|\, X_n = x \big] \big|^2\\
&\le 2\underbrace{\big| \mathbb{E}_{\mu_n^{N,\omega}} \mathbb{1}_{\mathcal{M}_n}\, \mathbb{E}\big[ f_1^\omega - f_2 \,\big|\, X_n = x \big] \big|^2}_{A_1} + 2\underbrace{\big| \mathbb{E}_{\mu_n^{N,\omega}} \mathbb{1}_{\mathcal{M}_n^c}\, \mathbb{E}\big[ f_1^\omega - f_2 \,\big|\, X_n = x \big] \big|^2}_{A_2},
\end{aligned}
$$
where the inner conditional expectation is taken with respect to $\Delta W_n$.
Now, since we can pick $\mathcal{M}_n$ to be a large compact set containing the origin, we have
$$ \mathbb{P}\Big( \sup_{n, U_0} |X_n| \ge diam(\mathcal{M}_n) \Big) \le \frac{C}{M_n^2}. $$
To deal with $A_1$ and $A_2$, we see that it would be desirable for the function $f$ to have the Lipschitz property. However, $f$ is only measurable in general. The strategy to overcome this difficulty is to first use Lusin's theorem to find a continuous identification $\tilde{f}$ of $f$ on a large compact set; then, on this compact set, we can approximate $\tilde{f}$ uniformly by a Lipschitz function.
We see that
$$ A_1 \le \mathbb{E}_{\mu_n^{N,\omega}} \mathbb{1}_{\mathcal{M}_n}\, \mathbb{E}\big[ |f_1^\omega - f_2|^2 \,\big|\, X_n = x \big]. $$
Then, by taking the expectation on both sides over all the randomness in this quantity, we have
$$ \mathbb{E}_\omega[A_1] \le \mathbb{E}_\omega \mathbb{E}_{\mu_n^{N,\omega}} \mathbb{1}_{\mathcal{M}_n}\, \mathbb{E}\big[ |f_1^\omega - f_2|^2 \big]. $$
We know that there exists a large compact set $K_n$ (hence a large $M_n$) containing the origin such that
$$ \mathbb{P}\Big( \sup_{n, U_0} |X_n| \ge diam(K_n) \Big) \le \frac{C}{M_n^2}, $$
and a continuous $\tilde{f}_n$ with $\tilde{f}_n|_{K_n} = f|_{K_n}$, by Lusin's theorem.
Thus, we know that $\tilde{f}_n|_{K_n \cap \mathcal{M}_n} = f|_{K_n \cap \mathcal{M}_n}$, and we also have the following inequality:
$$ \mathbb{E}_\omega \mathbb{E}_{\mu_n^{N,\omega}} \mathbb{1}_{\mathcal{M}_n}\, \mathbb{E}\big[ |f_1^\omega - f_2|^2 \big] = \mathbb{E}_\omega \mathbb{E}_{\mu_n^{N,\omega}} \big( \mathbb{1}_{\mathcal{M}_n \cap K_n} + \mathbb{1}_{\mathcal{M}_n \cap K_n^c} \big)\, \mathbb{E}\big[ |f_1^\omega - f_2|^2 \big] \le \mathbb{E}_\omega \mathbb{E}_{\mu_n^{N,\omega}} \big( \mathbb{1}_{\mathcal{M}_n \cap K_n} + \mathbb{1}_{K_n^c} \big)\, \mathbb{E}\big[ |f_1^\omega - f_2|^2 \big]. \tag{59} $$
Moreover, since both $K_n$ and $\mathcal{M}_n$ are compact, $K_n := K_n \cap \mathcal{M}_n$ is also compact with $diam(K_n) \le M_n$. From Lemma 2, we know that there exists some constant $C$ such that, for any $\pi_{t_n | t_{n-1}}^N, \pi_{t_n | t_n}^N$ obtained from our particle filter-SGD algorithm, and $X \sim \pi_{t_n | t_{n-1}}^N$ or $\pi_{t_n | t_n}^N$:
$$ \mathbb{E}_\omega \mathbb{E}\big[ \mathbb{1}_{\{X \in K_n^c\}} \big] \le \frac{C}{M_n^2}. $$
Hence, we have that
$$ (59) \le \mathbb{E}_\omega \mathbb{E}_{\mu_n^{N,\omega}} \mathbb{1}_{K_n}\, \mathbb{E}\big[ |f_1^\omega - f_2|^2 \big] + \frac{C}{M_n^2}. $$
To deal with $A_2$, notice that $|f_1^\omega - f_2| \le 2$ by the choice of $f$; we have
$$ A_2 \le \mathbb{E}_{\mu_n^{N,\omega}} \mathbb{1}_{\mathcal{M}_n^c}\, \mathbb{E}\big[ |f_1^\omega - f_2|^2 \,\big|\, X_n = x \big], \qquad \mathbb{E}_\omega[A_2] \le 4\, \mathbb{E}_\omega\big[ \mathbb{E}_{\mu_n^{N,\omega}}[\mathbb{1}_{K_n^c}] \big] \le \frac{C}{M_n^2}, $$
by Lemma 2.
Returning to $A_1$: by the density of Lipschitz functions, there exists $f_n$ with $\|f_n - \tilde{f}_n\|_{\infty, K_n} \le \delta_n$ and Lipschitz constant $L_n$. We point out that $L_n$ may depend on $K_n$, $\delta_n$, and the function $\tilde{f}|_{K_n}$. Now, by taking the expectation on both sides and using the Lipschitz property, we have
$$ \mathbb{E}_\omega[A_1] \le (C \Delta t\, L_n)^2\, \underbrace{\mathbb{E}_\omega \mathbb{E}_{\mu_n^{N,\omega}}\big[ \mathbb{1}_{K_n} |u_n^\omega - u_n^*|^2 \,\big|\, X_n = x \big]}_{*} + \frac{C}{M_n^2} + 4\delta_n^2. $$
We recognize $*$ as the SGD optimization part of the algorithm in expectation, and we note that we have dropped the inner expectation. The expectation $\mathbb{E}_{\mu_n^{N,\omega}}[\cdot]$ means that, given the initial condition $X_n = x$ with $x \in K_n$ and $X_n \sim \mu_n^{N,\omega}$, we measure the expected difference between $u_n$ and $u_n^*$. The outer expectation $\mathbb{E}_\omega[\cdot]$ averages over all the randomness in both the measure and the SGD.
Now, by using (50) in Lemma 3, we obtain the following:
$$ \mathbb{E}_\omega[A_1] \le (C \Delta t\, L_n)^2 N M_n^2 \sup_{\|q\| \le 1} \mathbb{E}_\omega\big[ |\mu_n^{N,\omega} q - \mu_n q|^2 \big] + \frac{C M_n^2}{K_n} + \frac{C}{M_n^2} + 4\delta_n^2. $$
By the definition of the distance between two random measures, we have that
$$
\begin{aligned}
\mathbb{E}[A_1] &\le (C \Delta t\, L_n)^2 N M_n^2\, d^2(\mu_n^{N,\cdot}, \mu_n) + \frac{C M_n^2}{K_n} + \frac{C}{M_n^2} + 4\delta_n^2,\\
\sqrt{\mathbb{E}[A_1]} &\le C \Delta t\, L_n M_n\, d(\mu_n^{N,\cdot}, \mu_n) + \frac{C M_n}{\sqrt{K_n}} + \frac{C}{M_n} + 2\delta_n.
\end{aligned}
$$
Since $\sqrt{\mathbb{E}[A_2]} \le \frac{C}{M_n}$, we have that
$$
\begin{aligned}
(34) &\le \frac{2}{\kappa^2}\Big( \frac{3}{\sqrt{N}} + C \Delta t\, L_n M_n\, d(\mu_n^{N,\cdot}, \mu_n) + \frac{C M_n}{\sqrt{K_n}} + \frac{C}{M_n} + 2\delta_n + \frac{2}{M_n} + d(\mu_n^{N,\cdot}, \mu_n) \Big),\\
d(\mu_{n+1}^{N,\cdot}, \mu_{n+1}) &\le \frac{2}{\kappa^2}\Big( (1 + C \Delta t\, L_n M_n)\, d(\mu_n^{N,\cdot}, \mu_n) + \frac{C}{M_n} + \frac{C M_n}{\sqrt{K_n}} + 2\delta_n + \frac{3}{\sqrt{N}} \Big), \tag{66}
\end{aligned}
$$
where in (66) we have merged $N$ into $C$. □
Remark. 
Lusin's theorem requires the underlying measure to be finite Borel regular, and in this case we are looking at the measure $\tilde{\mu}$ defined as follows: for $A \subset \mathbb{R}^n$, $\tilde{\mu}(A) = \mathbb{P}(\{\omega \mid \text{there exist } n, U_0 \text{ such that } X_n(\omega) \in A\})$. $\tilde{\mu}$ is clearly a probability measure induced on the Polish space $\mathbb{R}^n$, and so it is tight by the converse implication of Prokhorov's theorem (or by the fact that all finite Borel measures defined on a complete metric space are tight). Thus, it is inner regular; since $\tilde{\mu}$ is also clearly locally finite, outer regularity follows as well.
Finally, we can use Lemma 4 repeatedly to show the convergence result:
Theorem 5. 
By taking $\mu_0^N = \mu_0$, there exist $\{M_n \mid M_n \in \mathbb{R},\ n = 0, 1, \ldots, N-1\}$, $\{L_n \mid L_n \in \mathbb{R},\ n = 0, 1, \ldots, N-1\}$, and $\{\delta_n \mid \delta_n \in \mathbb{R},\ n = 0, 1, \ldots, N-1\}$ such that
$$ d(\mu_N^{N,\cdot}, \mu_N) \le \sum_{i=0}^{N-1} \Big( \frac{2}{\kappa^2} \Big)^i \prod_{j=0}^{i-1} C_{N-j}\, \Big( \frac{C}{M_{N-i}} + 2\delta_{N-i} + \frac{C M_{N-i}}{\sqrt{K_{N-i}}} + \frac{3}{\sqrt{N}} \Big), $$
where $C_j := 1 + C \Delta t\, L_j M_j$. Then, for any $M > 0$, by picking $\{M_n\}$, $\{K_n\}$, and $N$ large enough and $\{\delta_n\}$ small enough, the following holds:
$$ d(\mu_N^{N,\cdot}, \mu_N) \le \frac{C}{M} $$
for some fixed constant $C$ which depends only on $\kappa$.
Proof. 
With $C_n$ defined as
$$ C_n := 1 + C \Delta t\, L_n M_n, $$
and by using (66) repeatedly, we obtain the following result:
$$
\begin{aligned}
d(\mu_N^{N,\cdot}, \mu_N) &\le \sum_{i=0}^{N-1} \Big( \frac{2}{\kappa^2} \Big)^i \prod_{j=0}^{i-1} C_{N-j}\, \Big( \frac{C}{M_{N-i}} + \frac{C M_{N-i}}{\sqrt{K_{N-i}}} + 2\delta_{N-i} + \frac{3}{\sqrt{N}} \Big) + \Big( \frac{2}{\kappa^2} \Big)^N \prod_{j=0}^{N-1} C_{N-j}\; d(\mu_0^N, \mu_0)\\
&\le \sum_{i=0}^{N-1} \Big( \frac{2}{\kappa^2} \Big)^i \prod_{j=0}^{i-1} C_{N-j}\, \Big( \frac{C}{M_{N-i}} + \frac{C M_{N-i}}{\sqrt{K_{N-i}}} + 2\delta_{N-i} + \frac{3}{\sqrt{N}} \Big). \tag{70}
\end{aligned}
$$
Since we know that $d(\mu_0^N, \mu_0) = 0$, we now just need to show that (70) vanishes as $K_l$ and $N$ get large and the $\delta_i$ get small, $i \in \{0, 1, \ldots, N\}$. Notice that $M_l$ comes from the domain truncation at each time step, and $\delta_l$ comes from the uniform approximation, which is free to choose. The choice of $\delta_l$ will in turn determine the value of $L_l$.
We fix $M_N := NM$ and $\delta_N := \frac{1}{NM}$, where $N$ is the number of time discretization steps and $M$ is a potentially large number.
Then, we define $\delta_l, M_l$ through the following:
$$ \Big( \frac{2}{\kappa^2} \Big)^{i+1} \prod_{j=0}^{i} C_{N-j}\; 2\delta_{N-i-1} = \Big( \frac{2}{\kappa^2} \Big)^{i} \prod_{j=0}^{i-1} C_{N-j}\; 2\delta_{N-i}, \tag{71} $$
$$ \Big( \frac{2}{\kappa^2} \Big)^{i+1} \prod_{j=0}^{i} C_{N-j}\; \frac{C}{M_{N-i-1}} = \Big( \frac{2}{\kappa^2} \Big)^{i} \prod_{j=0}^{i-1} C_{N-j}\; \frac{C}{M_{N-i}}. \tag{72} $$
Here, we define $C_{N+1} \equiv 1$.
Notice that one should apply (71) and (72) iteratively, since defining $\delta_i$ determines the Lipschitz constant $L_i$ at stage $i$, which is needed for the definition of $C_i$.
Then, we have
$$ \sum_{i=0}^{N-1} \Big( \frac{2}{\kappa^2} \Big)^i \prod_{j=0}^{i-1} C_{N-j}\; \frac{C}{M_{N-i}} \le N\, \frac{C}{NM} \le \frac{C}{M}, $$
and we also have
$$ \sum_{i=0}^{N-1} \Big( \frac{2}{\kappa^2} \Big)^i \prod_{j=0}^{i-1} C_{N-j}\; 2\delta_{N-i} \le N\, \frac{1}{NM} \le \frac{1}{M}. $$
By picking the $K_{N-i}$ large enough, we can then have
$$ \sum_{i=0}^{N-1} \Big( \frac{2}{\kappa^2} \Big)^i \prod_{j=0}^{i-1} C_{N-j}\; \frac{C M_{N-i}}{\sqrt{K_{N-i}}} \le N\, \frac{1}{NM} \le \frac{1}{M}. $$
Last but not least, by taking $N$ so large that
$$ \Big( \sum_{i=0}^{N-1} \Big( \frac{2}{\kappa^2} \Big)^i \prod_{j=0}^{i-1} C_{N-j} \Big)\, \frac{3}{\sqrt{N}} \le \frac{1}{M}, $$
we can see that (70) converges to 0 by taking $M$ to be very large. □
Remark. 
Notice that in Theorem 5 it is natural to have terms that depend on $\frac{1}{K_n}$ and $\frac{1}{\sqrt{N}}$. The presence of $M_n$ and $\delta_n$ is due to technical difficulties: $M_n$ essentially controls the growth of the particles in the worst-case scenario (we want our domain to be compact), while $L_n$ and $\delta_n$ come from the Lipschitz approximation of the test function $f$.

4. Numerical Example

In this section, we carry out two numerical examples. In the first example, we consider a classic linear quadratic optimal control problem, in which the optimal control can be derived analytically. We use this example as a benchmark example to show the baseline performance and the convergence trend of our algorithm. In the second example, we solve a more practical Dubins vehicle maneuvering problem, and we design control actions based on bearing angles to let the target vehicle follow a pre-designed path.

4.1. Example 1. Linear Quadratic Control Problem with Nonlinear Observations

Assume $B$, $K$ are symmetric and positive definite. The forward process $Y$ and the observation process $M$ are given by
$$
\begin{aligned}
dY(t) &= A(u(t) - r(t))\, dt + \sigma B u(t)\, dW_t,\\
dM(t) &= \sin(Y(t))\, dt + dB_t.
\end{aligned}
$$
The cost functional is given by
$$ J[u] = \mathbb{E}\Big[ \frac{1}{2} \int_0^T \langle R(Y_t - Y_t^*),\ Y_t - Y_t^* \rangle\, dt + \frac{1}{2} \int_0^T \langle K u_t,\ u_t \rangle\, dt + \frac{1}{2} \langle Q Y_T,\ Y_T \rangle \Big], \tag{78} $$
and we want to find $J(u^*) = \inf_{u \in \mathcal{U}_{ad}[0, T]} J(u)$.

4.1.1. Experimental Design

An interesting feature of this example is that one can construct a time-deterministic exact solution which depends only on $x_0$.
By simplifying (78), we have
$$ J[u] = \frac{1}{2} \int_0^T \big( \mathbb{E}[Y_t^\top R\, Y_t] - 2\, {Y_t^*}^\top R\, \mathbb{E}[Y_t] + {Y_t^*}^\top R\, Y_t^* + \langle K u_t,\ u_t \rangle \big)\, dt + \frac{1}{2}\, \mathbb{E}\big[ \langle Q Y_T,\ Y_T \rangle \big]. \tag{79} $$
Then, we define
$$ X_t = \mathbb{E}[Y_t] = \mathbb{E}\Big[ Y_0 + \int_0^t A(u(s) - r(s))\, ds + \int_0^t \sigma B u(s)\, dW_s \Big] = \mathbb{E}\Big[ Y_0 + \int_0^t A(u(s) - r(s))\, ds \Big]. $$
Hence, we see that
$$
\begin{aligned}
\mathbb{E}[Y_t^\top Y_t] &= \mathbb{E}\Big[ \Big( Y_0 + \int_0^t A(u(s) - r(s))\, ds + \int_0^t \sigma B u(s)\, dW_s \Big)^2 \Big]\\
&= \mathbb{E}\Big[ Y_0^\top Y_0 + \int_0^t (u(s) - r(s))^\top A^\top A (u(s) - r(s))\, ds + Y_0^\top \int_0^t A(u(s) - r(s))\, ds + \int_0^t (u(s) - r(s))^\top A^\top Y_0\, ds \Big] + \mathbb{E}\Big[ \int_0^t \sigma^2 u(s)^\top B^\top B u(s)\, ds \Big]\\
&= X_t^\top X_t + \sigma^2 \int_0^t u(s)^\top B^\top B u(s)\, ds, \tag{81}
\end{aligned}
$$
and (81) holds because all the terms are deterministic in time given $x_0$. Moreover, we observe that
$$ \mathbb{E}[Y_T^\top Y_T] = \mathbb{E}\Big[ \Big( Y_0 + \int_0^T A(u(s) - r(s))\, ds + \int_0^T \sigma B u(s)\, dW_s \Big)^2 \Big] = X_T^\top X_T + \sigma^2 \int_0^T u(s)^\top B^\top B u(s)\, ds. $$
As a result, we see that (79) now takes the form:
J [ u ] = 1 2 0 T ( X s T R X s 2 X s T R X s * + X s * T R X s * + u s T ( σ 2 B T Q B + K ) u s ) d s + 1 2 σ 2 0 T 0 t u s T B T R B u s d s d t + 1 2 X T T Q X T
and by performing a simple integration by part, we have
J [ u ] = 1 2 0 T ( X s T R X s 2 X s T R X s * + X s * T R X s * + u s T ( σ 2 B T Q B + K ) u s ) d s + 1 2 σ 2 0 T ( T t ) u s T B T R B u s d s + 1 2 X T T Q X T
As a result, we have the following standard deterministic control problem:
J [ u ] = 1 2 0 T ( X s T R X s 2 X s T R X s * + X s * T R X s * + u s T ( σ 2 B T Q B + K ) u s + σ 2 ( T t ) u s T B T R B u s ) 2 f d s + 1 2 X T T Q X T
d X t d t = A ( u ( t ) r ( t ) ) b , X t 0 = X 0
Then, one can form the following Hamiltonian:
$$ H(x, p, u) = b \cdot p + f, $$
where $p = v_x$ and $v$ is the value function.
Then, to find the optimal control, we solve
$$ \nabla_u H = 0, $$
which is
$$ A^\top p + \big( \sigma^2 B^\top R B\, (T - t) + K + \sigma^2 B^\top Q B \big)\, u = 0. $$
Thus, we obtain
$$ u_t = -\big( \sigma^2 B^\top R B\, (T - t) + K + \sigma^2 B^\top Q B \big)^{-1} A^\top p(t). \tag{90} $$
Additionally, notice that
$$ \frac{d}{dt} p(t) = -R(X_t - X_t^*), \qquad p(T) = Q X_T, $$
and then
$$ p(t) = Q X_T + \int_t^T R(X_s - X_s^*)\, ds. \tag{92} $$
Combining (86), (90) and (92) together, we can solve the control of the system.
Then, set
A = 1 0.2 0.2 0.2 0.2 1 0.2 0.2 0.2 0.2 1 0.2 0.2 0.2 0.2 1
Set B, R, K and Q as identity matrices. With X 0 = 0 , we have the following solution according to this setup.
To solve (92), let
X t 1 X t * 1 : = t X t 2 X t * 2 : = c o s ( t ) X t 3 X t * 3 : = t 2 X t 4 X t * 4 : = 2 π s i n ( 2 π t )
Then, we have
p ( t ) = X t 1 X t 2 X t 3 X t 4 + T 2 2 t 2 2 s i n ( T ) s i n ( t ) T 3 3 t 3 3 c o s ( 2 π T ) c o s ( 2 π t )
Let $\hat{r}(t)$ be given by
$$ \hat{r}_t^1 = \frac{t^2/2}{\beta_t}, \qquad \hat{r}_t^2 = \frac{\sin(t)}{\beta_t}, \qquad \hat{r}_t^3 = \frac{t^3/3}{\beta_t}, \qquad \hat{r}_t^4 = \frac{\cos(2\pi t)}{\beta_t}, $$
where $\beta_t = (1 + \sigma^2) + \sigma^2 (T - t)$. Then,
$$ r(t) = A\, \hat{r}(t). $$
Then, plug (94) into (90), and we solve (86)
d X t d t = A ( u ( t ) r ( t ) ) = A ( A p ( t ) β t A r ^ ( t ) ) )
and obtain
X t = X t 1 X t 2 X t 3 X t 4 = α t A 2 σ 2 ( T 2 2 s i n ( T ) T 3 3 c o s ( 2 π T ) X T 1 X T 2 X T 3 X T 4 )
Thus, using $X_t^* = X_t - (X_t - X_t^*)$, we obtain
$$ X_t^* = \begin{pmatrix} X_t^{1,*} \\ X_t^{2,*} \\ X_t^{3,*} \\ X_t^{4,*} \end{pmatrix} = -\begin{pmatrix} t \\ \cos(t) \\ t^2 \\ 2\pi \sin(2\pi t) \end{pmatrix} + \alpha_t\, \frac{A^2}{\sigma^2} \left( \begin{pmatrix} \frac{T^2}{2} \\ \sin(T) \\ \frac{T^3}{3} \\ \cos(2\pi T) \end{pmatrix} - \begin{pmatrix} X_T^1 \\ X_T^2 \\ X_T^3 \\ X_T^4 \end{pmatrix} \right), $$
where the $X_T^i$ can be obtained from the system (96) by letting $t = T$, and
$$ \alpha_t = \ln \frac{1 + \sigma^2 + \sigma^2 T}{(1 + \sigma^2) + \sigma^2 (T - t)}. $$
Then, to find the exact solution following the trajectory of $y_t$ in this setup, one has to solve the following coupled forward-backward ODE system:
$$ \frac{dX_t}{dt} = A(u_t - r_t), \qquad x_{t_n} = y_{t_n}, $$
$$ \frac{dp(t)}{dt} = -(X_t - X_t^*), \qquad p_T = x_T^{t_n, y_{t_n}}, $$
with $u_t = -A\, p_t / \big( \sigma^2 (T - t) + (1 + \sigma^2) \big)$. As a result, we have
$$ \frac{dX_t}{dt} = A\Big( -\frac{A\, p_t}{\sigma^2 (T - t) + (1 + \sigma^2)} - r_t \Big), \qquad x_{t_n} = y_{t_n}, $$
$$ \frac{dp(t)}{dt} = -(X_t - X_t^*), \qquad p_T = x_T^{t_n, y_{t_n}}. $$
That is, we need to solve the above coupled FBODE. Observing that $p_t = x_T^{t_n, y_{t_n}} + \int_t^T (X_s - X_s^*)\, ds$, and writing $a_t := 1 / \big( \sigma^2 (T - t) + (1 + \sigma^2) \big)$, we have
$$ \frac{dX_t}{dt} = -a_t A^2\, X_T - a_t A^2 \int_t^T (X_s - X_s^*)\, ds - A r_t, \qquad x_{t_n} = y_{t_n}. \tag{102} $$
To solve (102) numerically, we conduct a numerical discretization:
x t n + 1 x t n = a t n A 2 X T Δ t a t n ( Δ t ) 2 A 2 i = n N 1 ( X t i X t i * ) A r t A r t a t n ( Δ t ) 2 A 2 i = n N 1 X t i * = X t n X t n + 1 a t n ( Δ t ) 2 A 2 i = n N 1 X t i a t n A 2 X T , x t n = y t n
We can put (103) into a large linear system and solve it numerically.

4.1.2. Performance Experiment

We set the total number of discretization steps to $N = 50$, the number of iterations to $L = 10^4$, $\sigma = 0.1$, the number of particles in each dimension to 128, $T = 1$, and $X_0 = 0$.
In Figure 1, we present the estimated data-driven control and the true optimal control.
In Figure 2, we show the estimated state trajectories with respect to true state trajectories in each dimension.
We can see from these figures that our data-driven feedback control algorithm works very well for this 4-D linear quadratic control problem, despite the nonlinear observations.

4.1.3. Convergence Experiment

In this experiment, we demonstrate the convergence behavior of our algorithm by studying the error decay in the $L^2$ norm with respect to the number of particles used. Each result is the average of $\|u_{est} - u^*\|_2$ over 50 independent tests.
Specifically, we set $L = 10^4$ and increase the number of particles through S = {2, 8, 32, 128, 512, 2048, 4096, 8192, 16,384, 32,768}; the results are shown in Figure 3.
Setting the number of particles to S = {2, 8, 32, 128, 512, 1024, 2048, 4096} with $L = S^2$, we obtained the results in Figure 4.
From the results above, we can see that the error will decrease and converge as we increase the number of particles and the number of iterations.

4.2. Example 2. Two-Dimensional Dubins Vehicle Maneuvering Problem

In this example, we solve a Dubins vehicle maneuvering problem. The controlled process is described by the following nonlinear controlled dynamics:
$$ dS_t = \begin{pmatrix} dX_t \\ dY_t \end{pmatrix} = \begin{pmatrix} \sin(\theta_t)\, dt \\ \cos(\theta_t)\, dt \end{pmatrix} + \sigma\, dW_t, \tag{104} $$
$$ d\theta_t = u_t\, dt + \sigma^2\, dW_t, \tag{105} $$
$$ M_t = \Big[ \arctan\Big( \frac{X_t + 1}{Y_t - 1} \Big),\ \arctan\Big( \frac{X_t - 2}{Y_t - 1} \Big) \Big]^\top + \eta_t, \tag{106} $$
where the pair $(X, Y)$ gives the position of a car-like robot moving in the 2D plane, $\theta$ is the steering angle that controls the moving direction of the robot, which is governed by the control action $u_t$, and $\sigma$ is the noise that perturbs the motion and the control actions. Assume that we do not have direct observations of the robot. Instead, we use two detectors located on different observation platforms at $(-1, 1)$ and $(2, 1)$ to collect bearing angles of the target robot as indirect observations. Thus, we have the observation process $M_t$. Given the expected path $S^*$, the car should follow it and arrive at the terminal position on time. The performance cost functional, based on observational data, that we aim to minimize is defined as
$$ J[u] = \mathbb{E}\Big[ \frac{1}{2} \int_0^T \langle R(S_t - S_t^*),\ S_t - S_t^* \rangle\, dt + \frac{1}{2} \int_0^T \langle K u_t,\ u_t \rangle\, dt + \langle Q(S_T - S_T^*),\ S_T - S_T^* \rangle \Big]. \tag{107} $$
In our numerical experiments, we let the car travel from $(X_0, Y_0) = (0, 0)$ to $(X_T, Y_T) = (1, 1)$. The expected path $S_t^*$ is the circular arc $X_t^2 + (Y_t - 1)^2 = 1$. The other settings are $T = 1$, $\Delta t = 0.02$ (i.e., $N_T = 50$), $\sigma = 0.1$, $\eta_t \sim N(0, 0.1)$, $L = 1000$, $K = 1$, and the initial heading direction is $\pi/2$. To emphasize the importance of following the expected path and arriving at the target location at the terminal time, we set $R = Q = 20$.
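To fix ideas, here is a minimal sketch of the Dubins dynamics (104)-(105) and the bearing-angle observations (106) under a given control sequence; the zero control used in the usage line is a placeholder, and the reading of the angle-noise coefficient as $\sigma^2$ follows our reconstruction of (105).

```python
import numpy as np

def simulate_dubins(u_seq, T=1.0, NT=50, sigma=0.1, seed=0):
    """Euler simulation of the Dubins dynamics (104)-(105) with bearing
    observations (106) from platforms at (-1, 1) and (2, 1)."""
    rng = np.random.default_rng(seed)
    dt = T / NT
    XY = np.zeros((NT + 1, 2))
    theta = np.zeros(NT + 1)
    obs = np.zeros((NT + 1, 2))
    theta[0] = np.pi / 2                          # initial heading direction
    for n in range(NT):
        dW = rng.normal(0.0, np.sqrt(dt), size=3)
        XY[n + 1] = XY[n] + np.array([np.sin(theta[n]), np.cos(theta[n])]) * dt \
                    + sigma * dW[:2]
        theta[n + 1] = theta[n] + u_seq[n] * dt + sigma**2 * dW[2]
        x, y = XY[n + 1]
        eta = rng.normal(0.0, np.sqrt(0.1), size=2)   # observation noise ~ N(0, 0.1)
        obs[n + 1] = np.arctan2([x + 1.0, x - 2.0], [y - 1.0, y - 1.0]) + eta
    return XY, theta, obs

XY, theta, obs = simulate_dubins(u_seq=np.zeros(50))  # placeholder zero control
```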
In Figure 5, we plot our algorithm’s designed trajectory and the estimated trajectory. We can see from this figure that the car moves towards the target along the designed path and is “on target” at the final time with a very small error.
We set $L = 10^3$ and increase the number of particles through S = {2, 8, 32, 128, 512, 1024, 2048, 4096, 8192, 16,384, 32,768}. To assess the convergence of our algorithm on this Dubins vehicle maneuvering problem, we repeated the above experiment 50 times and computed the error $\frac{1}{M_{rept.}} \sum_{m=1}^{M_{rept.}} \langle S_t - S_t^*,\ S_t - S_t^* \rangle$, shown in Figure 6, where $M_{rept.} = 50$.
Setting the number of particles to S = {8, 16, 32, 64, 128, 256, 512, 1024} with $L = S^2$, we obtained the error shown in Figure 7, where the error is the average of $\|S_t - S_t^*\|_2$.
From the results above, we can see that the error will decrease and converge as we increase the number of particles and the number of iterations.

5. Conclusions

In this paper, we present the weak convergence of the data-driven feedback control algorithm proposed in [1]. We do not discuss the convergence rate due to the challenge of determining the radius M of the compact subspace that bounds all particles X i . However, in practice, given a terminal time T, one can use Monte Carlo simulations to find an M that satisfies a certain probability in Lemma 2. Our numerical experiments indicate that both the estimated control and estimated distribution converge at a rate related to the number of particles and iterations.
Future work can focus on analyzing the convergence rate and error bounds for a given state system. This will provide clarity on the number of particles and iterations required to achieve the desired estimation accuracy when applying the algorithm from [1].

Author Contributions

Conceptualization, S.L., H.S. and F.B.; methodology, F.B. and R.A.; software, S.L.; validation, S.L. and F.B.; formal analysis, S.L. and H.S.; investigation, S.L.; resources, S.L.; data curation, S.L.; writing—original draft preparation, S.L. and H.S.; writing—review and editing, F.B. and R.A.; visualization, S.L.; supervision, F.B.; project administration, F.B.; funding acquisition, F.B. and R.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the U.S. Department of Energy through the FASTMath Institute and the Office of Science, Advanced Scientific Computing Research program, under grant DE-SC0022297. FB would also like to acknowledge support from the U.S. National Science Foundation through project DMS-2142672.

Data Availability Statement

All code written as part of this study will be made available on GitHub upon completing the peer review process for this article.

Conflicts of Interest

Author Hui Sun was employed by the company Citigroup Inc. The research work performed by the author does not represent any corporate opinion. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BSDEs: Backward stochastic differential equations
FBSDEs: Forward–backward stochastic differential equations
PF: Particle filter
SGD: Stochastic gradient descent

Appendix A

Appendix A.1

Proof of Lemma 2.
Proof. 
We start at time $t_0$.
Step 1. Starting from $X_0 \sim \xi$ with $\mathbb{E}[\xi^2] \le C_0$, and fixing an arbitrary control $u_0$, we have for the prediction step:
$$\mathbb{E}[|X_1|^2] = \mathbb{E}\big[ |X_0 + b(X_0, u_0)\Delta t + \sigma(X_0)\Delta W_0|^2 \big] \le \mathbb{E}\Big[ (1+\Delta t)X_0^2 + \Big(1 + \frac{1}{\Delta t}\Big) b^2 (\Delta t)^2 \Big] + C_\sigma^2 \Delta t \le (1+\Delta t)C_0 + \big( C_b^2(\Delta t + 1) + C_\sigma^2 \big)\Delta t =: C_0'. \quad (A1)$$
Step 2. We denote the distribution of $X_1$ by $\mathcal{L}(X_1) = \pi_{t_1|t_0}$; the particle method then performs a random resampling from this distribution and obtains the random distribution
$$\frac{1}{N}\sum_{i=1}^{N} \delta_{x_i(\omega)} =: \pi^N_{t_1|t_0}. \quad (A2)$$
Hence, for $X \sim \pi^N_{t_1|t_0}$, taking the expectation over all randomness in the measure, we have
$$\mathbb{E}[X^2] = \mathbb{E}\big[ \mathbb{E}[X^2 \,|\, \mathcal{G}_1] \big] = \frac{N}{N}\, \mathbb{E}\big[ \mathbb{E}[x_1^2 \,|\, \mathcal{G}_1] \big] = \mathbb{E}[\tilde{X}^2], \quad (A3)$$
where the $x_i \sim \pi_{t_1|t_0}$ are i.i.d. random samples, $\mathcal{G}_1$ contains the sampling randomness, and $\tilde{X} \sim \pi_{t_1|t_0}$; the factor $N/N$ arises from averaging the $N$ identically distributed conditional samples in the empirical measure. The conditional expectation makes explicit that the particles are conditionally independent, which matters because additional randomness accumulates in the history when this argument is applied recursively. Thus, by (A1),
$$\mathbb{E}[\tilde{X}^2] \le C_0'. \quad (A4)$$
Step 3. We now have the random measure $\pi^N_{t_1|t_0}$ and proceed to the analysis step. By definition,
$$X_1 \sim \frac{g(x)\, d\pi^N_{t_1|t_0}(x)}{\int g(x)\, d\pi^N_{t_1|t_0}(x)} =: \tilde{\pi}^N_{t_1|t_1}, \quad (A5)$$
where $\pi^N_{t_1|t_0}$ is the distribution of the state $X_1$ obtained in the previous step. Recalling Assumption 7, we estimate $\mathbb{E}[|X_1|^2]$:
$$\mathbb{E}[|X_1|^2] \le \Big(\frac{1}{\kappa}\Big)^2\, \mathbb{E}\Big[ \int x^2\, d\pi^N_{t_1|t_0}(x) \Big] \le \Big(\frac{1}{\kappa}\Big)^2 C_0' =: C_1. \quad (A6)$$
Step 4. Now, we again apply the random sampling step
$$\frac{1}{N}\sum_{i=1}^{N} \delta_{x_i(\omega)} =: \pi^N_{t_1|t_1}, \quad (A7)$$
where $x_i(\omega) \sim \tilde{\pi}^N_{t_1|t_1}$. Then, for $X \sim \pi^N_{t_1|t_1}$, we have
$$\mathbb{E}[X^2] = \mathbb{E}\big[ \mathbb{E}[X^2 \,|\, \mathcal{G}_1'] \big] = \frac{N}{N}\, \mathbb{E}\big[ \mathbb{E}[x_i^2 \,|\, \mathcal{G}_1'] \big] = \mathbb{E}[\tilde{X}^2], \quad (A8)$$
where $\tilde{X} \sim \tilde{\pi}^N_{t_1|t_1}$ and $\mathcal{G}_1'$ is the filtration generated by $\mathcal{G}_1$ and the randomness of the current sampling. Then, by (A6), we have
$$\mathbb{E}[X^2] \le C_1, \qquad X \sim \pi^N_{t_1|t_1}, \quad (A9)$$
and this completes all the estimates for the first time step. Hence, after one time step, we have
$$C_1 = \kappa^{-2}(1+\Delta t)\, C_0 + C\, \Delta t, \quad (A10)$$
which means that, by applying the same argument, we obtain the following recursion in general:
$$C_{n+1} = \kappa^{-2}(1+\Delta t)\, C_n + C\, \Delta t. \quad (A11)$$
As a result, picking arbitrary controls $u_n$ and repeating the same argument up to step $N$, we have for all $n = 1, \dots, N$:
$$C_n = \big( \kappa^{-2}(1+\Delta t) \big)^n C_0 + \sum_{i=0}^{n-1} \big( \kappa^{-2}(1+\Delta t) \big)^i\, C\, \Delta t, \quad (A12)$$
and we notice that $C_n$ is increasing in $n$, as the closed form given after (A13) makes explicit. As a result, for any $X_n \sim \pi^N_{t_n|t_{n-1}}$ or $X_n \sim \pi^N_{t_n|t_n}$, we have
$$\mathbb{E}[|X_n|^2] \le C_N. \quad (A13)$$
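For clarity, the monotonicity of $C_n$ can be checked directly. Writing $\rho := \kappa^{-2}(1+\Delta t)$ for the growth factor, the geometric sum in (A12) evaluates, whenever $\rho > 1$ (which holds for $\kappa \le 1$), to

$$C_n = \rho^n C_0 + \frac{\rho^n - 1}{\rho - 1}\, C\, \Delta t,$$

and both terms grow with $n$, so $C_n \le C_N$ for all $n \le N$, which is exactly the bound used in (A13).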
Hence, by Chebyshev’s inequality, we have
$$\mathbb{P}\big( |X_n| \ge M \big) \le \frac{C_N}{M^2}, \qquad n \in \{1, 2, \dots, N\}, \quad (A14)$$
and then, by a union bound over $n$, we have
$$\mathbb{P}\Big( \sup_{n} |X_n| \ge M \Big) \le \frac{N C_N}{M^2}. \quad (A15)$$
Since the control values were picked arbitrarily, we conclude that
$$\mathbb{P}\Big( \sup_{n,\, u} |X_n| \ge M \Big) \le \frac{C}{M^2}, \qquad X_n \sim \pi^N_{t_n|t_{n-1}} \ \text{or} \ \pi^N_{t_n|t_n}, \quad (A16)$$
where the supremum is also taken over all admissible control values, which completes the proof. □
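For reference, the one-step particle filter cycle analyzed in Steps 1–4 above can be sketched in code as follows. The drift `b`, diffusion `sigma`, and likelihood `g` are hypothetical placeholders for the model and observation specification of the paper, and both resamplings are done by multinomial sampling from the (weighted) empirical measure; this is a minimal sketch, not the paper's implementation.

```python
import numpy as np

def pf_one_step(particles, u, obs, b, sigma, g, dt, rng):
    """One prediction-resampling-analysis-resampling cycle (Steps 1-4).

    b, sigma, g are placeholder callables: drift b(x, u), diffusion
    sigma(x), and observation likelihood g(obs, x), assumed vectorized.
    """
    N = len(particles)
    # Step 1 (prediction): propagate each particle through the controlled SDE.
    dW = np.sqrt(dt) * rng.standard_normal(N)
    pred = particles + b(particles, u) * dt + sigma(particles) * dW
    # Step 2 (resampling): draw N samples from the empirical measure pi^N_{t1|t0}.
    pred = rng.choice(pred, size=N, replace=True)
    # Step 3 (analysis): Bayes update, reweighting by the likelihood g(obs | x).
    w = g(obs, pred)
    w = w / w.sum()
    # Step 4 (resampling): draw from the weighted measure to obtain pi^N_{t1|t1}.
    return rng.choice(pred, size=N, replace=True, p=w)
```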

References

1. Archibald, R.; Bao, F.; Yong, J.; Zhou, T. An efficient numerical algorithm for solving data driven feedback control problems. J. Sci. Comput. 2020, 85, 51.
2. Bellman, R. Dynamic programming. Science 1966, 153, 34–37.
3. Feng, X.; Glowinski, R.; Neilan, M. Recent developments in numerical methods for fully nonlinear second order partial differential equations. SIAM Rev. 2013, 55, 205–267.
4. Peng, S. A general stochastic maximum principle for optimal control problems. SIAM J. Control Optim. 1990, 28, 966–979.
5. Yong, J.; Zhou, X.Y. Stochastic Controls: Hamiltonian Systems and HJB Equations; Springer Science & Business Media: Cham, Switzerland, 2012.
6. Gong, B.; Liu, W.; Tang, T.; Zhao, W.; Zhou, T. An efficient gradient projection method for stochastic optimal control problems. SIAM J. Numer. Anal. 2017, 55, 2982–3005.
7. Tang, S. The maximum principle for partially observed optimal control of stochastic differential equations. SIAM J. Control Optim. 1998, 36, 1596–1617.
8. Zhang, J. A numerical scheme for BSDEs. Ann. Appl. Probab. 2004, 14, 459–488.
9. Zhao, W.; Fu, Y.; Zhou, T. New kinds of high-order multistep schemes for coupled forward backward stochastic differential equations. SIAM J. Sci. Comput. 2014, 36, A1731–A1751.
10. Archibald, R.; Bao, F.; Yong, J. A stochastic gradient descent approach for stochastic optimal control. East Asian J. Appl. Math. 2020, 10, 635–658.
11. Sato, I.; Nakagawa, H. Approximation analysis of stochastic gradient Langevin dynamics by using Fokker–Planck equation and Ito process. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; Volume 32, pp. 982–990.
12. Shapiro, A.; Wardi, Y. Convergence analysis of gradient descent stochastic algorithms. J. Optim. Theory Appl. 1996, 91, 439–454.
13. Archibald, R.; Bao, F.; Cao, Y.; Sun, H. Numerical analysis for convergence of a sample-wise backpropagation method for training stochastic neural networks. SIAM J. Numer. Anal. 2024, 62, 593–621.
14. Zakai, M. On the optimal filtering of diffusion processes. Z. Wahrscheinlichkeitstheorie Verw. Gebiete 1969, 11, 230–243.
15. Gordon, N.J.; Salmond, D.J.; Smith, A.F. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F (Radar Signal Process.) 1993, 140, 107–113.
16. Morzfeld, M.; Tu, X.; Atkins, E.; Chorin, A.J. A random map implementation of implicit filters. J. Comput. Phys. 2012, 231, 2049–2066.
17. Andrieu, C.; Doucet, A.; Holenstein, R. Particle Markov chain Monte Carlo methods. J. R. Statist. Soc. B 2010, 72, 269–342.
18. Crisan, D.; Doucet, A. A survey of convergence results on particle filtering methods for practitioners. IEEE Trans. Signal Process. 2002, 50, 736–746.
19. Künsch, H.R. Particle filters. Bernoulli 2013, 19, 1391–1403.
20. Bao, F.; Cao, Y.; Meir, A.; Zhao, W. A first order scheme for backward doubly stochastic differential equations. SIAM/ASA J. Uncertain. Quantif. 2016, 4, 413–445.
21. Zhao, W.; Zhou, T.; Kong, T. High order numerical schemes for second-order FBSDEs with applications to stochastic optimal control. Commun. Comput. Phys. 2017, 21, 808–834.
22. Law, K.; Stuart, A.; Zygalakis, K. Data Assimilation; Springer: Cham, Switzerland, 2015.
Figure 1. Estimated control vs. true optimal control.
Figure 2. Estimated state vs. true state.
Figure 3. Error vs. number of particles.
Figure 4. Error vs. number of steps.
Figure 5. Controlled trajectory from (0,0) to (1,1).
Figure 6. Error vs. number of particles.
Figure 7. Error vs. number of steps.