Identification of Nonlinear State-Space Systems via Sparse Bayesian and Stein Approximation Approach

Zhang, Limin; Li, Junpeng; Zhang, Wenting; Yang, Junzi

doi:10.3390/math10193667


by Limin Zhang ¹,²,*, Junpeng Li ³, Wenting Zhang ² and Junzi Yang ²

¹ College of Intelligence and Computing, Tianjin University, Tianjin 300072, China
² Department of Mathematics and Computer Science, Hengshui University, Hengshui 053000, China
³ Institute of Electrical Engineering, Yanshan University, Qinhuangdao 066004, China
* Author to whom correspondence should be addressed.

Mathematics 2022, 10(19), 3667; https://doi.org/10.3390/math10193667

Submission received: 20 August 2022 / Revised: 18 September 2022 / Accepted: 30 September 2022 / Published: 6 October 2022

(This article belongs to the Special Issue Mathematics-Based Methods in Graph Machine Learning)


Abstract

This paper is concerned with the parameter estimation of non-linear discrete-time systems in state-space form from noisy state measurements. A novel sparse Bayesian convex optimisation algorithm is proposed for parameter estimation and prediction. The method takes full account of the approximation method and of the parameter prior and posterior, and adds Bayesian sparse learning and optimization for explicit modeling. In contrast to previous identification methods, the approach addresses two main identification challenges: first, a new objective function is obtained from our improved Stein approximation method in the convex optimization problem, so as to capture more information about particle approximation and convergence; second, another objective function is developed with $L_{1}$-regularization, a sparse method based on recursive least squares estimation. Compared with previous studies, the new objective function contains more information and can more easily extract important information from the raw data. Three simulation examples are given to demonstrate the effectiveness of the proposed algorithm. Furthermore, the performance of these approaches is analyzed in terms of the root mean squared error (RMSE) of the parameter estimates, parameter sparsity, and the prediction of states and outputs.

Keywords: sparse Bayesian identification; state-space; convex optimisation; Stein approximation method

MSC: 37M

1. Introduction

In the real world, non-linear systems are commonplace, for example in social networks, industrial systems, biological systems, finance, and chemical engineering. The identification of non-linear systems is widely acknowledged to be both important and difficult [1,2]; examples include fractional order systems [3,4,5], neural networks [6], non-linear ARMAX (NARMAX) [7], and Hammerstein–Wiener [8] models. The non-linear state-space model is a general representation for all of these non-linear systems. A common method for identifying non-linear state-space models is to look for a concise description, based on data, that is consistent with some non-linear terms (kernel functions) [9,10]. Classic functional decomposition methods, such as the Volterra expansion, Taylor polynomial expansion, or Fourier series [9,11], provide a few options for kernel functions. These methods are founded on the idea that there is a finite set of fundamental kernel functions whose linear combination can be used to describe the dynamics of a non-linear state-space system. However, the efficiency of this kind of method decreases rapidly as the number of kernel functions grows. A promising way to identify non-linear state-space systems is the probabilistic approach [12,13,14,15], which has received a lot of attention over the past few years. Earlier probabilistic identification methods for non-linear state-space systems, such as regression, maximum likelihood (ML) [16], and expectation-maximization (EM) [17], mainly rely on gradient descent while ignoring information from the parameter identification and approximation process.

To further improve the efficiency of parameter identification, many identification techniques that combine gradient descent with Bayesian approximation have recently been presented. In particular, those based on variational inference have attracted more attention due to their superior performance, such as variational inference in the Gaussian process (GP) [12], the Gaussian-process state-space model (GP-SSM) [13], deep variational Bayes filters (DVBF) [14], and optimistic inference control (OIC) [15]. Various techniques based on approximate expectation-maximization (EM) have also been studied. In [18], the authors apply the EM method to estimate the parameters of a non-linear state-space model of disease dynamics. In [19,20], both a Bayesian and an ML estimation strategy are employed in addition to a competitive GP approximation method. For learning, a Monte Carlo technique and the EM method are used in [21], which also includes a variational method using the same GP approximation.

Notably, none of these methods takes into account the resilience of the prediction, and all of them presuppose that the structure of the non-linear state-space model is unknown. In reality, however, many parameters lack precision for reasons such as a slow convergence rate or convergence to a local optimum. In this article, we concentrate on the optimum problem since it is more difficult and has numerous real-world implications. For example, in a pilot plant pH process, a Hammerstein–Wiener model is used to predict the pH value of the neutral liquid [22]. In addition, the lack of precision in non-linear state-space systems is related to the complexity of the parameters. Sparse representation can be used to find sparse solutions of linear regression equations, which can effectively reduce the complexity of parametric solutions. Only a few papers deal with the sparse representation identification problem and multiple constraints on the system’s parameters [23,24]. In [25], a convex constraint is added to the sparse representation of the parameters in the non-linear identification algorithm. These works only consider the choice of sparse method and do not consider conditional restrictions in the process of parameter approximation.

Unlike previous work, our suggested framework allows us to include more constraints on the corresponding model parameters, e.g., inequality constraints, prior information, and Stein discrepancy constraints. In this paper, the dynamical system discovery problem is recast from the perspectives of sparse regression [26,27,28], compressed sensing [29,30,31], Stein approximation theory [32], and convex optimization [23]. The use of sparsity approaches in dynamical systems is relatively new [33,34,35]. It is well known that most non-linear state-space systems have only a few important dynamic terms, resulting in a sparse parameter set in a high-dimensional non-linear function space.

Although these efforts focus on non-linear system identification using state-space models, they have some significant flaws. First, these methods carry out system identification in two stages: they compute the posterior objective function with parameters, and then learn the system parameters with noise. Computing the posterior objective function and learning the system parameters are two separate processes in these two-stage approaches, and their parameters cannot be adjusted jointly. As a result, system identification performs worse. Ideally, the two processes would be complementary: learning the system parameters should contribute to computing the posterior objective function, and the updated posterior objective function should contribute to learning the system parameters with noise. For non-linear systems in state-space form, ref. [36] addresses the recursive joint inference and learning problem, using a reduced-rank formulation to model the system as a Gaussian-process state-space model (GP-SSM). In [37], a two-stage Bayesian optimization framework is introduced, which consists of representing the objective function in a low-dimensional parameter space and selecting a surrogate model in the reduced space. In these studies, only the posterior objective function is considered, which cannot achieve effective interactive learning and may also compromise the optimization performance for the system parameters.

To address the problems of existing identification methods for non-linear state-space systems, we propose a non-linear state-space identification algorithm with a sparse Bayesian and Stein approach (NSSI-SBSA). It is an optimization approach that improves the accuracy of system parameter identification and posterior distribution computation simultaneously in an integrated structure, as opposed to the conventional two-stage method. In our new method, we select the least absolute shrinkage and selection operator (LASSO) [26] as the parameter sparsity algorithm. The sparse parameters are incorporated into the posterior distribution, which reduces its complexity. Compared with other sparse methods, such as least angle regression (LARS) [38], sequentially thresholded least squares (STLS) [39], and basis pursuit denoising (BPDN) [40], LASSO is more suitable for high-dimensional data. In our article, the sparse model identification results strike a natural balance between model sparsity and precision and prevent the model from overfitting the data.

From a statistical perspective, we discuss how Bayesian techniques, optimization methods, and a Stein approximation strategy can mitigate the difficulties caused by large correlations in the state matrix. The main contributions of this paper are as follows:

(1): The NSSI-SBSA algorithm is proposed. In the algorithm, a sparse method is used for parameter estimation and prediction. The parameter prior, Bayesian sparse learning, and optimisation are combined in an integrated computing framework instead of the classical two-stage method. The sparse model identification results strike a natural balance between model sparsity and precision, preventing the model from overfitting the data.
(2): A non-convex optimisation problem is constructed for the identification of non-linear state-space systems with additive noise. Compared with other related methods, we not only take evidence maximisation as an objective function but also consider the Stein discrepancy of the parameters as another objective function in the non-convex optimization problem. The two functions are integrated into one objective function that contains more information. It can capture more important information from the raw data and reduce the complexity of parametric solutions.

The rest of this paper is organized as follows. Section 2 describes the problem statement and background. Section 3 constructs the model in a Bayesian framework. Section 4 introduces non-convex optimisation with the Stein constraint for identification. Three numerical illustrations, namely the Narendra-Li model, a NARX model, and kernel state-space models (KSSM), are presented in Section 5. Finally, we give some closing remarks in Section 6.

2. Problem Statement and Background

We consider the following non-linear state-space model [25]:

$$\begin{aligned} π(x_{ik}) &= f_{i}(x_{k}, u_{k}) + ν_{ik} = \sum_{r=1}^{N} θ_{ir} f_{ir}(x_{k}, u_{k}) + ν_{ik}, \\ o_{ik} &= g_{i}(x_{k}, u_{k}) + e_{ik}, \end{aligned} \tag{1}$$

where $x_{k} = [x_{1k}, \dots, x_{ik}, \dots, x_{n_{x},k}]$ is the state variable at time step $k$, and $u_{k}$ is the external control input. When the system is continuous in time, $π(x_{ik}) = \dot{x}_{ik}$; when the system is discrete in time, $π(x_{ik}) = x_{ik}$ or $x_{ik} - x_{i,k-1}$. The noise $ν_{ik} \sim N(0, σ_{ik}^{2})$ (with $σ_{ik} = 0$ when $i \neq k$) is assumed to be i.i.d. Gaussian. The functions $f_{ir}(x_{k}, u_{k}) : R^{n_{x}+n_{u}} \to R$ and $g_{i}(x_{k}, u_{k}) : R^{n_{x}+n_{u}} \to R$ are Lipschitz continuous, and $n_{x}$ and $n_{u}$ are the dimensions of $x_{k}$ and $u_{k}$, respectively. $θ_{ir}$ is the weight of the basis function $f_{ir}$. Together, $θ_{ir}$ and $f_{ir}$ determine the dynamics. It is worth emphasizing that we make no assumptions about the non-linear functions on the right-hand side of (1).

If the system provides $M$ data samples that satisfy (1), the system in (1) can be represented as

$$Y_{i} = F_{i} θ_{i} + ν_{i}, \quad i = 1, \dots, n_{x}, \tag{2}$$

where $Y_{i} = [π(x_{i1}), \dots, π(x_{iM})]^{⊤} \in R^{M \times 1}$, $θ_{i} ≜ [θ_{i1}, \dots, θ_{iN_{i}}]^{⊤} \in R^{N_{i} \times 1}$, $ν_{i} ≜ [ν_{i0}, \dots, ν_{i,M-1}]^{⊤} \in R^{M \times 1}$, and $F_{i} \in R^{M \times N_{i}}$ is called the dictionary matrix. The $j$-th column of $F_{i}$ is $[f_{ij}(x_{0}, u_{0}), \dots, f_{ij}(x_{M-1}, u_{M-1})]^{⊤}$.

In this article, the identification task is to estimate $θ_{i}$ from the measured data $Y_{i}$. This leads to a linear regression problem, which can be solved by the least squares (LS) method if the non-linear part of the model is known, i.e., $F_{i}$ is known. We only discuss the identification problem for the state equation in (1); the output function $g_{i}(x_{k}, u_{k})$ can be treated in the same way. Because $F_{i}$ may contain non-relevant or independent columns, the solution $θ_{i}$ in (2) is often sparse, and only a few frequently used non-linear dynamical models need to be considered. For convenience, we rewrite the linear regression problem in (2) in the following form:

$$Y = F θ + ν. \tag{3}$$
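As an illustration of how the regression form (2)–(3) is assembled from data, the following minimal sketch builds a dictionary matrix and computes the plain least-squares estimate. The scalar system, the candidate basis functions, and the helper name `build_dictionary` are hypothetical choices made for this example only; they are not taken from the paper.

```python
import numpy as np

def build_dictionary(X, U, basis_functions):
    """Stack f_j(x_k, u_k) column by column into the M x N dictionary matrix F."""
    return np.column_stack([
        np.array([f(x, u) for x, u in zip(X, U)]) for f in basis_functions
    ])

# Hypothetical scalar system: x_{k+1} = 0.8 x_k - 0.5 x_k^3 + u_k + noise
rng = np.random.default_rng(0)
M = 200
U = rng.uniform(-1.0, 1.0, size=M)
X = np.zeros(M + 1)
for k in range(M):
    X[k + 1] = 0.8 * X[k] - 0.5 * X[k] ** 3 + U[k] + 0.01 * rng.standard_normal()

# Candidate basis functions (an assumed dictionary; only three are active)
basis = [
    lambda x, u: 1.0,
    lambda x, u: x,
    lambda x, u: x ** 2,
    lambda x, u: x ** 3,
    lambda x, u: u,
    lambda x, u: x * u,
]

F = build_dictionary(X[:-1], U, basis)   # regressors evaluated at (x_k, u_k)
Y = X[1:]                                # targets pi(x_k) = x_{k+1}

# Ordinary least squares: dense estimate, small but non-zero spurious weights
theta_ls = np.linalg.lstsq(F, Y, rcond=None)[0]
print(np.round(theta_ls, 3))
```

The least-squares solution is generally dense: weights of inactive basis functions are small but not exactly zero, which is the motivation for the sparse formulation developed in the following sections.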

3. Constructing the Model in Bayesian Framework

In Bayesian modeling, all unknowns are treated as random variables with specific distributions [41]. For $Y = F θ + ν$ in (3), the noise $ν$ is independent and identically distributed (i.i.d.) Gaussian, i.e., $ν \sim N(0, β I)$ with $β = σ^{2}$ and identity matrix $I$. The likelihood of the data is then $P(Y ∣ θ) = N(Y ∣ F θ, β I) \propto \exp\{-\frac{1}{2β} ∥Y - F θ∥_{2}^{2}\}$. $P(θ)$ is a prior distribution defined as $P(θ) \propto \exp\{-\frac{1}{2α} θ^{T} θ\} = \prod_{j} \exp\{-\frac{1}{2α} θ_{j}^{T} θ_{j}\} = \prod_{j} P(θ_{j})$, where $α$ is a hyperparameter. For convenience in calculating the sparse parameters $θ$, $-\frac{1}{2α} θ^{T} θ$ is selected as a concave non-decreasing function of $|θ_{j}|$. The priors of $θ$ include the Gaussian and t-distributions (see [42] for details).

The posterior $P(θ ∣ Y)$ is highly coupled and non-Gaussian, so computing the posterior mean $E(θ ∣ Y)$ is often difficult. To solve this problem, $P(θ ∣ Y)$ is approximated by a Gaussian distribution. Effective posterior computation algorithms for this purpose are used in [41].

Another method is to use super-Gaussian priors, in which the priors $P(θ_{j})$ are computed by variational EM algorithms [42]. We define the hyperparameters $α ≜ [α_{1}, \dots, α_{N}]^{⊤} \in R_{+}^{N}$. The prior of $θ$ can be written as $P(θ) = \prod_{j=1}^{n} P(θ_{j})$, $P(θ_{j}) = N(θ_{j} ∣ 0, α_{j}) φ(α_{j})$, where $φ(α_{j})$ is a probability density function and $φ(α_{j}) \geq 0$. Then $P(θ, α) = \prod_{j} N(θ_{j} ∣ 0, α_{j}) φ(α_{j}) = P(θ ∣ α) P(α) \leq P(θ)$, where $P(θ ∣ α) ≜ \prod_{j} N(θ_{j} ∣ 0, α_{j})$ and $P(α) ≜ \prod_{j} φ(α_{j})$. Given the data $Y$, the posterior probability of $θ$ can be represented as

$$P(θ ∣ Y, α) = \frac{P(Y ∣ θ) P(θ; α)}{\int P(Y ∣ θ) P(θ; α)\, dθ} = N(m_{θ}, Σ_{θ}).$$

From [43], the posterior mean $m_{θ}$ and covariance $Σ_{θ}$ are given by

$$\begin{aligned} m_{θ} &= Λ F^{⊤} (λ I + F Λ F^{⊤})^{-1} Y, \\ Σ_{θ} &= Λ - Λ F^{⊤} (λ I + F Λ F^{⊤})^{-1} F Λ, \end{aligned} \tag{4}$$

where $Λ = diag[α]$ is a diagonal matrix and $λ$ is the noise variance (denoted $β$ above; cf. the likelihood $N(Y ∣ F θ, λ I)$ used in (17)). To maximize $P(θ, α ∣ Y)$, the most important question is how to select the best $\hat{α}$. $P(Y ∣ θ)$ and $P(θ; α)$ are taken as prior information, so we need only consider $\int P(Y ∣ θ) P(θ; α)\, dθ$. Using type-II ML [43], this marginal likelihood can be maximised, and the selected $\hat{α}$ is written as

$$\hat{α} = \underset{α \geq 0}{\arg\max} \int P(Y ∣ θ) \prod_{j=1}^{n} N(θ_{j} ∣ 0, α_{j}) φ(α_{j})\, dθ. \tag{5}$$

After $\hat{α}$ is computed from (5), the estimate of $θ$ is obtained as $\hat{θ} = E(θ ∣ Y; \hat{α}) = \hat{Λ} F^{⊤} (λ I + F \hat{Λ} F^{⊤})^{-1} Y$, with $\hat{Λ} ≜ diag[\hat{α}]$. In other words, the selected hyperparameters $\hat{α}$ are those most capable of explaining the data $Y$.
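For a fixed hyperparameter vector, the posterior moments can be evaluated numerically as in the sketch below. It is a minimal illustration under stated assumptions: the test data are synthetic, the function name `posterior_moments` is hypothetical, and the covariance is computed in the symmetric form that also appears in (16). The second half cross-checks the result against the equivalent parameter-space expressions obtained from the Woodbury identity.

```python
import numpy as np

def posterior_moments(F, Y, alpha, lam):
    """Posterior mean and covariance of theta in the data-space form of (4)."""
    Lam = np.diag(alpha)
    S = lam * np.eye(F.shape[0]) + F @ Lam @ F.T   # lam*I + F Lam F^T  (M x M)
    G = Lam @ F.T                                  # Lam F^T            (N x M)
    m = G @ np.linalg.solve(S, Y)
    Sigma = Lam - G @ np.linalg.solve(S, G.T)      # G.T equals F Lam
    return m, Sigma

# Synthetic data (assumed values, for illustration only)
rng = np.random.default_rng(1)
M, N, lam = 30, 5, 0.1
F = rng.standard_normal((M, N))
alpha = rng.uniform(0.5, 2.0, size=N)
Y = F @ rng.standard_normal(N) + np.sqrt(lam) * rng.standard_normal(M)

m, Sigma = posterior_moments(F, Y, alpha, lam)

# Equivalent parameter-space form via the Woodbury identity
Sigma_w = np.linalg.inv(np.diag(1.0 / alpha) + F.T @ F / lam)
m_w = Sigma_w @ F.T @ Y / lam
print(np.allclose(m, m_w), np.allclose(Sigma, Sigma_w))
```

The data-space form inverts an M x M matrix and the parameter-space form an N x N matrix, so in practice one would pick whichever dimension is smaller.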

4. Non-Convex Optimisation with Stein Method for Identification

4.1. Stein Operator Selection and Stein Constraint Design

The approach can be sketched as follows for a target distribution $P$ with support $Z$. Find a suitable operator $B := B_{P}$ (referred to as the Stein operator) and a large class of functions $F_{B} = F(B_{P})$ (referred to as the Stein class) such that $Z$ has distribution $P$, denoted $Z \sim P$, if and only if $E[B f(Z)] = 0$ for all functions $f \in F_{B}$.

A Stein operator can be designed in a variety of ways [32]. In our framework, Stein’s identity and the kernelized Stein discrepancy are crucial.

$P(θ)$ is a probability density function that is continuous and differentiable on $θ \subseteq R^{d}$. According to Stein’s theory, suitably smooth and differentiable functions $ϕ(θ) = [ϕ_{1}(θ), \dots, ϕ_{d}(θ)]^{⊤}$ and $Q(θ) = [Q_{1}(θ), \dots, Q_{d}(θ)]^{⊤}$ are selected such that

$$E_{θ \sim P}[B_{P} ϕ(θ) Q(θ)] = 0, \tag{6}$$

where $B_{P} ϕ(θ) Q(θ) = Q(θ)^{⊤} ϕ(θ)^{⊤} \nabla_{θ} \log P(θ) + \nabla_{θ} ϕ(θ) Q(θ) + \nabla_{θ} Q(θ) ϕ(θ)$. The Stein operator $B_{P}$ acts on the function $ϕ(θ) Q(θ)$ and produces a zero-mean function $B_{P} ϕ(θ) Q(θ)$ under $θ \sim P$.

Assume mild zero boundary conditions on $ϕ(θ) Q(θ)$: for all $θ_{i}$, when $θ \subseteq R^{d}$ is compact, $P(θ) ϕ(θ) Q(θ) \approx 0$. The expectation of $B_{P} ϕ(θ) Q(θ)$ under $θ \sim Q$ is then no longer equal to zero. The magnitude of $E_{θ \sim Q}[B_{P} ϕ(θ) Q(θ)]$ is related to the difference between $P$ and $Q$. The probability distance $S(Q, P)$ between $P(θ)$ and $Q(θ)$ over some proper function set $F_{B}$ is defined as

$$S(Q, P) = \max_{ϕ \in F_{B}} \{[E_{θ \sim Q}\, trace(B_{P} ϕ(θ) Q(θ))]^{2}\}. \tag{7}$$

The discriminative power and computational tractability of the Stein discrepancy are determined by the set $F_{B}$.

$F_{B}$ includes sets of functions with bounded Lipschitz norms, for which the maximisation in (7) is a difficult and intractable functional optimization problem requiring special considerations. To overcome this computational difficulty, $Q(θ)$ and $ϕ(θ)$ are selected in the unit sphere of a reproducing kernel Hilbert space (RKHS) [32]. The kernelized Stein discrepancy (KSD) between $P(θ)$ and $Q(θ)$ is described as

$$S(Q, P) = \max_{ϕ \in H^{d}} [E_{θ \sim Q}(trace(B_{P} ϕ(θ) Q(θ)))]^{2}, \quad s.t.\ ∥ϕ(θ) Q(θ)∥_{H^{d}} \leq 1. \tag{8}$$

The optimal solution of (8) is $ψ(θ) = [ϕ_{Q,P}^{*}(θ) Q(θ)] / ∥ϕ_{Q,P}^{*} Q(θ)∥_{H^{d}}$, where

$$ψ_{Q,P}^{*}(\cdot) = E_{θ \sim Q}[B_{P} k(θ, \cdot) Q(θ)]. \tag{9}$$

A direct calculation shows that

$$S(Q, P) = ∥ψ_{Q,P}^{*}∥_{H^{d}}^{2}. \tag{10}$$
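To give a feel for how a kernelized Stein discrepancy behaves numerically, the sketch below estimates the standard (unmodified) squared KSD from samples with a V-statistic and an RBF kernel. This is the textbook estimator for a one-dimensional target, not the modified operator with the extra factor $Q(θ)$ introduced in this paper; the function name `ksd_vstat`, the sample sizes, and the Gaussian target are assumptions made for the illustration.

```python
import numpy as np

def ksd_vstat(samples, score, h=1.0):
    """V-statistic estimate of the squared KSD for 1-D samples.

    samples : draws from Q, shape (n,)
    score   : function returning d/dtheta log P(theta)
    h       : RBF bandwidth, k(a, b) = exp(-(a - b)^2 / h)
    """
    x = np.asarray(samples, dtype=float)
    d = x[:, None] - x[None, :]                  # pairwise differences a - b
    k = np.exp(-d ** 2 / h)
    s = score(x)
    dk_da = -2.0 * d / h * k                     # derivative of k in its first argument
    dk_db = 2.0 * d / h * k                      # derivative of k in its second argument
    d2k = (2.0 / h - 4.0 * d ** 2 / h ** 2) * k  # mixed second derivative
    u = (s[:, None] * s[None, :] * k
         + s[:, None] * dk_db
         + s[None, :] * dk_da
         + d2k)
    return u.mean()

# Target P = N(0, alpha); its score function is -theta / alpha
alpha = 1.0
score = lambda t: -t / alpha

rng = np.random.default_rng(0)
ksd_match = ksd_vstat(rng.normal(0.0, 1.0, 500), score)   # Q equal to P
ksd_shift = ksd_vstat(rng.normal(2.0, 1.0, 500), score)   # Q shifted away from P
print(ksd_match, ksd_shift)
```

The estimate is close to zero when the samples come from the target itself and grows when they come from a shifted distribution, which is the property exploited when the discrepancy is used as an objective.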

For any fixed $θ'$, the kernel function $k(θ, \cdot)$ belongs to the RKHS. $S(Q, P) = 0$, that is, $ψ_{Q,P}^{*}(θ) \equiv 0$, only if $P(θ) = Q(θ)$. The radial basis function (RBF) kernel $k(θ, θ') = \exp(-\frac{1}{h} ∥θ - θ'∥_{2}^{2})$ is strictly positive definite. When $θ'$ approaches $θ$, the RBF kernel converges to 1. $S(Q, P)$ therefore contains information about the parameter approximation, which is an important factor affecting the accuracy of the parameters $θ$. For convenience, we define $T_{j} = ψ_{Q,P}^{*}(\cdot)$ and $K(θ_{j}, \cdot) = k(θ_{j}, \cdot) Q(θ_{j})$. $Q(θ_{j})$ and $P(θ_{j})$ are Gaussian distributions with different hyperparameters, and $Q(θ_{j})$ is defined as

$$Q(θ_{j} | τ_{j}) \propto \exp[-\frac{1}{2τ_{j}} θ_{j}^{T} θ_{j}].$$

Substituting $Q(θ_{j} | τ_{j})$ and $k(θ_{j}, \cdot)$ into $K(θ_{j}, \cdot)$, we have

$$K(θ_{j}, \cdot) \propto \exp[-\frac{1}{h} (θ_{j} - θ_{0})^{T} (θ_{j} - θ_{0})].$$

Based on $K(θ_{j}, \cdot)$, we have

$$\nabla_{θ} K(θ_{j}, \cdot) \propto -\frac{2}{h} (θ_{j} - θ_{0}) \exp[-\frac{1}{h} (θ_{j} - θ_{0})^{T} (θ_{j} - θ_{0})].$$

According to these results, $T_{j}$ is written as

$$\begin{aligned} T_{j} &= K(θ_{j}, \cdot) \cdot \nabla_{θ_{j}} \log P(θ_{j} | α_{j}) + \nabla_{θ} K(θ_{j}, \cdot) \\ &= -\frac{1}{α_{j}} θ_{j} \exp[-\frac{1}{h} (θ_{j} - θ_{0})^{T} (θ_{j} - θ_{0})] - \frac{2}{h} (θ_{j} - θ_{0}) \exp[-\frac{1}{h} (θ_{j} - θ_{0})^{T} (θ_{j} - θ_{0})] \\ &\leq -\frac{1}{α_{j}} θ_{j} \exp[-\frac{1}{h} θ_{j}^{T} θ_{j}] - \frac{2}{h} θ_{j} \exp[-\frac{1}{h} θ_{j}^{T} θ_{j}] \\ &= -(\frac{1}{α_{j}} + \frac{2}{h}) θ_{j} \exp[-\frac{1}{h} θ_{j}^{T} θ_{j}]. \end{aligned}$$

It is easy to derive the expectation of $T_{j}$:

$$\begin{aligned} E[T_{j}] &= \int -(\frac{1}{α_{j}} + \frac{2}{h}) θ_{j} \exp[-\frac{1}{h} θ_{j}^{T} θ_{j}] \exp[-\frac{1}{2τ_{j}} θ_{j}^{T} θ_{j}]\, dθ_{j} \\ &= \int -(\frac{1}{α_{j}} + \frac{2}{h}) θ_{j} \exp[-(\frac{1}{h} + \frac{1}{2τ_{j}}) θ_{j}^{T} θ_{j}]\, dθ_{j} \\ &= \frac{\frac{1}{α_{j}} + \frac{2}{h}}{\frac{1}{τ_{j}} + \frac{2}{h}} \exp[-(\frac{1}{h} + \frac{1}{2τ_{j}}) θ_{j}^{T} θ_{j}]. \end{aligned}$$

For convenience, let $ξ(α_{j}) = \frac{\frac{1}{α_{j}} + \frac{2}{h}}{\frac{1}{τ_{j}} + \frac{2}{h}}$; then

$$E[T_{j}] = ξ(α_{j}) \exp[-(\frac{1}{h} + \frac{1}{2τ_{j}}) θ_{j}^{T} θ_{j}]. \tag{11}$$

Putting every $E[T_{j}]$ together, we obtain the following result:

$$E[T] = \prod_{j} E[T_{j}].$$

Based on (11), we see that $E[T]$ is a non-convex objective function in the $α$-space. The optimisation problem is described as

$$\hat{α} = \underset{α \geq 0}{\arg\min}\, E[T]. \tag{12}$$

Remark 1. The Stein method is improved: the kernel function $k(θ, θ')$ is still taken from the Stein class $F_{B}$, but the dynamic characteristics of the proposed function $Q(θ)$ are taken into account in the design of the $B_{P}$ operator for the approximation of $P(θ)$ in the unit sphere of the RKHS. The new $B_{P}$ operator can increase the chance of escaping a local optimum of the non-convex problem. In the optimization problem, $E[T]$ is a new objective function, which can speed up the approach of $Q(θ)$ to $P(θ)$.

4.2. Parameter Sparse Identification of Constraints from Data

In (12), $E[T]$ is another objective function, which makes the parameter $α$ less sensitive to noisy data and helps it converge to the true value. In this section, the problem of system identification with convex constraints is given a sparse Bayesian formulation, which is then handled as a non-convex optimisation problem. To obtain a better parameter $α$, the new objective function is constructed as

$$\hat{α} = \underset{α \geq 0}{\arg\max}\, (1 / E[T]) \int P(Y ∣ θ) \prod_{j=1}^{n} N(θ_{j} ∣ 0, α_{j}) φ(α_{j})\, dθ. \tag{13}$$

4.2.1. Objective Function in Parameter Identification

Theorem 1. Denote by $J_{α}(α)$ the objective function

$$\begin{aligned} J_{α}(α) ={}& \log |λ I + F Λ F^{⊤}| + Y^{⊤} (λ I + F Λ F^{⊤})^{-1} Y + \sum_{j=1}^{N} p(α_{j}) \\ & - 2 \sum_{j=1}^{N} (\frac{1}{h} + \frac{1}{2τ_{j}}) θ_{j}^{T} θ_{j} - 2 \sum_{j=1}^{N} \log ξ(α_{j}). \end{aligned} \tag{14}$$

By minimising $J_{α}(α)$, the optimal hyperparameters $\hat{α}$ in (13) are obtained, where $p(α_{j}) = -2 \log φ(α_{j})$. The mean of $θ$ is then given by $\hat{θ} = \hat{Λ} F^{⊤} (λ I + F \hat{Λ} F^{⊤})^{-1} Y$.

Proof. Using the Woodbury inversion identity, we re-express $m_{θ}$ and $Σ_{θ}$ in (4):

$$m_{θ} = Λ F^{⊤} (λ I + F Λ F^{⊤})^{-1} Y = \frac{1}{λ} Σ_{θ} F^{⊤} Y, \tag{15}$$

$$Σ_{θ} = Λ - Λ F^{⊤} (λ I + F Λ F^{⊤})^{-1} F Λ = (Λ^{-1} + \frac{1}{λ} F^{⊤} F)^{-1}. \tag{16}$$

Since the data likelihood $P(Y ∣ θ)$ is Gaussian, we can express the integral in (13) as follows:

$$\begin{aligned} & (1 / E[T]) \int N(Y ∣ F θ, λ I) \prod_{j=1}^{N} N(θ_{j} ∣ 0, α_{j}) φ(α_{j})\, dθ \\ &\quad = (1 / E[T]) (\frac{1}{2πλ})^{M/2} (\frac{1}{2π})^{N/2} \int \exp\{-E(θ)\}\, dθ \prod_{j=1}^{N} \frac{φ(α_{j})}{\sqrt{α_{j}}}, \end{aligned} \tag{17}$$

where

$$\begin{aligned} E(θ) &= \frac{1}{2λ} ∥Y - F θ∥^{2} + \frac{1}{2} θ^{⊤} Λ^{-1} θ \\ &= \frac{1}{2} (θ - m_{θ})^{⊤} Σ_{θ}^{-1} (θ - m_{θ}) + E(Y). \end{aligned} \tag{18}$$

We obtain $E(Y)$ using the Woodbury inversion identity:

$$\begin{aligned} E(Y) &= \frac{1}{2} (\frac{1}{λ} Y^{⊤} Y - \frac{1}{λ} Y^{⊤} F Σ_{θ} Σ_{θ}^{-1} Σ_{θ} F^{⊤} Y \frac{1}{λ}) \\ &= \frac{1}{2} Y^{⊤} (λ I + F Λ F^{⊤})^{-1} Y. \end{aligned} \tag{19}$$

For ease of calculation, we evaluate the integral of $\exp\{-E(θ)\}$ as follows:

$$\int \exp\{-E(θ)\}\, dθ = \exp\{-E(Y)\} (2π)^{N/2} |Σ_{θ}|^{1/2}.$$

Then, applying a $-2 \log(\cdot)$ transformation to (17), we have

$$\begin{aligned} & -2 \log \int P(Y ∣ θ) \prod_{j=1}^{n} N(θ_{j} ∣ 0, α_{j}) φ(α_{j})\, dθ - 2 \log E[T] \\ ={}& -\log |Σ_{θ}| + M \log 2πλ + \log |Λ| + Y^{⊤} (λ I + F Λ F^{⊤})^{-1} Y \\ & + \sum_{j=1}^{N} p(α_{j}) - 2 \sum_{j=1}^{N} (\frac{1}{h} + \frac{1}{2τ_{j}}) θ_{j}^{T} θ_{j} - 2 \sum_{j=1}^{N} \log ξ(α_{j}) \\ ={}& \log |λ I + F Λ F^{⊤}| + M \log 2π + Y^{⊤} (λ I + F Λ F^{⊤})^{-1} Y \\ & + \sum_{j=1}^{N} p(α_{j}) - 2 \sum_{j=1}^{N} (\frac{1}{h} + \frac{1}{2τ_{j}}) θ_{j}^{T} θ_{j} - 2 \sum_{j=1}^{N} \log ξ(α_{j}), \end{aligned}$$

where we have used $|Λ^{-1}| |λ I + F Λ F^{⊤}| = |λ I| |Λ^{-1} + \frac{1}{λ} F^{⊤} F|$, that is, $\log |λ I + F Λ F^{⊤}| = -\log |Σ_{θ}| + M \log λ + \log |Λ|$.

From (13), we then obtain

$$\begin{aligned} \hat{α} = \arg\min_{α \geq 0} \Big( & \log |λ I + F Λ F^{⊤}| + Y^{⊤} (λ I + F Λ F^{⊤})^{-1} Y \\ & + \sum_{j=1}^{N} p(α_{j}) - 2 \sum_{j=1}^{N} (\frac{1}{h} + \frac{1}{2τ_{j}}) θ_{j}^{T} θ_{j} - 2 \sum_{j=1}^{N} \log ξ(α_{j}) \Big). \end{aligned} \tag{20}$$

To acquire an approximation of $θ$, we compute the posterior mean of $θ$: $\hat{θ} = E(θ ∣ Y; \hat{α}) = \hat{Λ} F^{⊤} (λ I + F \hat{Λ} F^{⊤})^{-1} Y$.
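The determinant identity used in the proof, $\log |λ I + F Λ F^{⊤}| = -\log |Σ_{θ}| + M \log λ + \log |Λ|$, can be checked numerically. The following sketch does so on randomly generated matrices; the dimensions and values are arbitrary assumptions chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, lam = 12, 4, 0.3
F = rng.standard_normal((M, N))
alpha = rng.uniform(0.5, 2.0, size=N)
Lam = np.diag(alpha)

# Parameter-space covariance, as in (16)
Sigma = np.linalg.inv(np.diag(1.0 / alpha) + F.T @ F / lam)

# slogdet returns (sign, log|det|); the matrices involved are positive definite
lhs = np.linalg.slogdet(lam * np.eye(M) + F @ Lam @ F.T)[1]
rhs = -np.linalg.slogdet(Sigma)[1] + M * np.log(lam) + np.sum(np.log(alpha))
print(lhs, rhs)
```

Using `slogdet` rather than `log(det(...))` avoids overflow or underflow of the determinant when M is large.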

Remark 2.

In (13), the objective function of recursive least squares estimation with L1-regularization is developed and integrated into the objective function of the Stein approximation. The new objective function contains more information and can capture more important information from the raw data, so that a relatively good parameter probability distribution can be obtained.

Lemma 1. $J_{α}(α)$ in (14) is a non-convex function.

Proof of Lemma 1. Consider the data-dependent term $Y^{⊤} (λ I + F Λ F^{⊤})^{-1} Y$ in (14). By (15) and (16), it can be transformed as

$$\begin{aligned} Y^{⊤} (λ I + F Λ F^{⊤})^{-1} Y &= \frac{1}{λ} Y^{⊤} Y - \frac{1}{λ} Y^{⊤} F Σ_{θ} F^{⊤} \frac{1}{λ} Y \\ &= \frac{1}{λ} ∥Y - F m_{θ}∥_{2}^{2} + m_{θ}^{⊤} Λ^{-1} m_{θ} \\ &= \min_{x} \{\frac{1}{λ} ∥Y - F x∥_{2}^{2} + x^{⊤} Λ^{-1} x\}. \end{aligned} \tag{21}$$

The minimisation problem is easily shown to be convex in $θ$ and $α$. Define $ρ(α) ≜ \log |λ I + F Λ F^{⊤}| + \sum_{j=1}^{N} p(α_{j}) - 2 \sum_{j=1}^{N} \log ξ(α_{j})$. $\log |x|$ is a concave function. Furthermore, $λ I + F Λ F^{⊤}$ is an affine function of $α$ and is positive semi-definite when $α \geq 0$. This means that $\log |λ I + F Λ F^{⊤}|$ is a concave non-decreasing function of $α$. We can see that $ρ(α)$ is a concave function with respect to $α$.
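The variational identity in (21), which rewrites the data-dependent term as the minimum of a regularised least-squares problem attained at the posterior mean, can be verified numerically. The sketch below uses random test data (assumed values) and compares the closed-form term with the quadratic objective evaluated at the minimiser and at a few perturbed points.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, lam = 15, 4, 0.2
F = rng.standard_normal((M, N))
Y = rng.standard_normal(M)
alpha = rng.uniform(0.5, 2.0, size=N)
Lam_inv = np.diag(1.0 / alpha)

def objective(x):
    """Quadratic objective inside the minimisation in (21)."""
    r = Y - F @ x
    return r @ r / lam + x @ Lam_inv @ x

# Left-hand side of (21)
data_term = Y @ np.linalg.solve(lam * np.eye(M) + F @ np.diag(alpha) @ F.T, Y)

# Minimiser of the quadratic objective, i.e. the posterior mean from (15)
m_theta = np.linalg.solve(Lam_inv + F.T @ F / lam, F.T @ Y / lam)

# Any perturbation of the minimiser should increase the objective
perturbed = [objective(m_theta + 0.1 * rng.standard_normal(N)) for _ in range(5)]
print(data_term, objective(m_theta), min(perturbed))
```

Because the objective is strictly convex in x for positive hyperparameters, every perturbed value exceeds the value at the posterior mean.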

4.2.2. Modified Objective Function in $θ$ Estimation

We use the modified objective function of $θ$ with a penalty function. By analyzing the objective function (14) in the $α$-space, the analogous objective function in $θ$ is subsequently shown to be non-convex as well.

Theorem 2. The estimate of $θ$ under the given restrictions is obtained by solving the optimisation problem

$$\min_{θ} ∥Y - F θ∥_{2}^{2} + λ r(θ),$$

where the penalty function is $r(θ) = \min_{α \geq 0} θ^{⊤} Λ^{-1} θ + \log |λ I + F Λ F^{⊤}| + \sum_{j=1}^{N} p(α_{j}) - 2 \sum_{j=1}^{N} (\frac{1}{h} + \frac{1}{2τ_{j}}) θ_{j}^{T} θ_{j} - 2 \sum_{j=1}^{N} \log ξ(α_{j})$.

Proof of Theorem 2. Using the data-dependent term in (21) and $J_{α}(α)$ in (14), a strict upper-bounding auxiliary function on $J_{α}(α)$ can be created as

$$\begin{aligned} J_{α,θ}(α, θ) ={}& \frac{1}{λ} ∥Y - F θ∥_{2}^{2} + θ^{⊤} Λ^{-1} θ + \log |λ I + F Λ F^{⊤}| \\ & + \sum_{j=1}^{N} p(α_{j}) - 2 \sum_{j=1}^{N} (\frac{1}{h} + \frac{1}{2τ_{j}}) θ_{j}^{T} θ_{j} - 2 \sum_{j=1}^{N} \log ξ(α_{j}). \end{aligned}$$

Minimising $J_{α,θ}(α, θ)$ over $α$, we obtain

$$\begin{aligned} J_{θ}(θ) ≜{}& \min_{α \geq 0} J_{α,θ}(α, θ) \\ ={}& \frac{1}{λ} ∥Y - F θ∥_{2}^{2} + \min_{α \geq 0} \Big( θ^{⊤} Λ^{-1} θ + \log |λ I + F Λ F^{⊤}| \\ & + \sum_{j=1}^{N} p(α_{j}) - 2 \sum_{j=1}^{N} (\frac{1}{h} + \frac{1}{2τ_{j}}) θ_{j}^{T} θ_{j} - 2 \sum_{j=1}^{N} \log ξ(α_{j}) \Big). \end{aligned} \tag{22}$$

From the derivations in (21), we can see that the posterior mean $m_{θ}$ is the estimate of the parameter $θ$.

Lemma 2. In Theorem 2, the penalty function $r(θ)$ is a non-decreasing concave function of $θ$ and therefore promotes sparsity in the weights.

Proof of Lemma 2. It is obvious that $ρ(α)$ in Lemma 1 is concave. Using the duality lemma (see Section 4.2 in [35]), $ρ(α)$ can be expressed as $\min_{α^{*} \geq 0} 〈α^{*}, α〉 - ρ^{*}(α^{*})$, where $ρ^{*}(α^{*})$ is the concave conjugate of $ρ(α)$, defined as $ρ^{*}(α^{*}) = \min_{α \geq 0} 〈α^{*}, α〉 - ρ(α)$. By (21), the function $J_{α,θ}(α, θ)$ can be rewritten as

$$\begin{aligned} J_{α,θ}(α, θ) &≜ 〈α^{*}, α〉 - ρ^{*}(α^{*}) + Y^{⊤} (λ I + F Λ F^{⊤})^{-1} Y \\ &= \frac{1}{λ} ∥Y - F θ∥_{2}^{2} + \sum_{j} (\frac{θ_{j}^{2}}{α_{j} - \frac{2}{h} - \frac{1}{τ_{j}}} + α_{j}^{*} α_{j}) - ρ^{*}(α^{*}). \end{aligned} \tag{23}$$

$r(θ)$ is re-expressed as

$$r(θ) = \min_{α, α^{*} \geq 0} \{\sum_{j} (\frac{θ_{j}^{2}}{α_{j} - \frac{2}{h} - \frac{1}{τ_{j}}} + α_{j}^{*} α_{j}) - ρ^{*}(α^{*})\}. \tag{24}$$

It is easy to see that $r(θ)$ reaches its minimum over $α$ when $α_{j} = \frac{1}{2} \sqrt{(\frac{2}{h} + \frac{1}{τ_{j}})^{2} + \frac{4 θ_{j}^{2}}{α_{j}^{*}}} + (\frac{1}{h} + \frac{2}{τ_{j}})$. Substituting this $α_{j}$ into $r(θ)$ gives

$$r(θ) = \min_{α^{*} \geq 0} \{\sum_{j} 2 \sqrt{\frac{α_{j}}{α_{j} - \frac{2}{h} - \frac{1}{τ_{j}}} α_{j}^{*}}\, |θ_{j}| - ρ^{*}(α^{*})\}. \tag{25}$$

When $r(θ)$ is at its minimum, $θ$ is much sparser. From (25), $r(θ)$ is a non-decreasing concave function of $θ$.

4.2.3. Parameter Estimation with Sparse Method

From (23), we can see that $ρ^{*}(α^{*})$ does not affect the estimation of the parameters $α$, so $J_{α,θ}(α, θ)$ is redefined as

$$J_{α^{*}}(α, θ) ≜ \frac{1}{λ} ∥Y - F θ∥_{2}^{2} + \sum_{j} (\frac{θ_{j}^{2}}{α_{j} - \frac{2}{h} - \frac{1}{τ_{j}}} + α_{j}^{*} α_{j}). \tag{26}$$

For a fixed $α^{*}$, we notice that $J_{α^{*}}(α, θ)$ is jointly convex in $θ$ and $α$. In (26), we have $\frac{θ_{j}^{2}}{α_{j} - \frac{2}{h} - \frac{1}{τ_{j}}} + α_{j}^{*} α_{j} \geq 2 \sqrt{\frac{α_{j}}{α_{j} - \frac{2}{h} - \frac{1}{τ_{j}}} α_{j}^{*}}\, |θ_{j}|$. When $α_{j} = \frac{1}{2} \sqrt{(\frac{2}{h} + \frac{1}{τ_{j}})^{2} + \frac{4 θ_{j}^{2}}{α_{j}^{*}}} + (\frac{1}{h} + \frac{2}{τ_{j}})$, $J_{α^{*}}(α, θ)$ is minimized for any $θ$. Substituting this $α_{j}$ into $J_{α^{*}}(α, θ)$, $\hat{θ}$ is obtained as follows:

$$\hat{θ} = \underset{θ}{\arg\min} \{∥Y - F θ∥_{2}^{2} + 2λ \sum_{j=1}^{N} \sqrt{\frac{α_{j}}{α_{j} - \frac{2}{h} - \frac{1}{τ_{j}}} α_{j}^{*}}\, |θ_{j}|\}. \tag{27}$$

Owing to the concavity of $r(θ)$, the objective function in Theorem 2 can be optimised using a re-weighted $L_{1}$-minimisation, in a way similar to (27).

In order to obtain more stable and accurate parameter

\hat{θ}

, the re-estimated method is put forward (Algorithm 1). At k-th iteration, the modified weight is then supplied by:

u_{j}^{(k)} ≜ {\left. \frac{\partial r (θ)}{2 \partial | θ_{j} |} \right|}_{θ = θ^{(k)}} = η_{j} \cdot \sqrt{α_{j}^{*}},

(28)

where

η_{j} = \sqrt{\frac{α_{j}}{α_{j} - \frac{2}{h} - \frac{1}{τ_{j}}}} .

Based on the above, we can now describe how the parameters are updated. To begin, we set the iteration count k to zero and

u_{j}^{(0)} = 1

; the factor is then updated at each iteration as

η_{j}^{(k + 1)} = \sqrt{\frac{α_{j}^{(k)}}{α_{j}^{(k)} - \frac{2}{h} - \frac{1}{τ_{j}^{(k)}}}} .

(29)

Algorithm 1: Non-linear state-space identification algorithm with sparse Bayesian and Stein approach (NSSI-SBSA)

Input:

1: Generate time series data from the discrete-time dynamical system characterized by (1).

2: Choose the dictionary functions used to build the dictionary matrix mentioned in Section 2; initialise $u_{j}^{(0)} = 1$, $α_{j}^{(0)}$, $h$, $τ_{j}^{(0)}$ and the stopping threshold $ε$.

3: for $k = 0, 1, \dots$ do

4: Solve the $L_{1}$-regularised minimisation problem for $θ$: $\hat{θ} = \underset{θ}{argmin} \{{∥ Y - F θ ∥}_{2}^{2} + 2 λ \sum_{j = 1}^{N} \sqrt{\frac{α_{j}}{α_{j} - \frac{2}{h} - \frac{1}{τ_{j}}} α_{j}^{*}} | θ_{j} |\}$

5: Update $η_{j}^{(k + 1)}$ and $u_{j}^{(k + 1)}$ according to (28) and (29): $u_{j}^{(k + 1)} ≜ η_{j}^{(k + 1)} \cdot \sqrt{α_{j}^{* (k + 1)}}$

6: Update $α_{j}^{(k + 1)} = \frac{1}{2} \sqrt{{(\frac{2}{h} + \frac{1}{τ_{j}^{(k)}})}^{2} + \frac{4 {(θ_{j}^{(k)})}^{2}}{α_{j}^{* (k)}}} + (\frac{1}{h} + \frac{2}{τ_{j}^{(k)}})$

7: if $| θ - \hat{θ} | < ε$ then

8: break

9: end if

10: end for

Output:

The sparse weight set of $\hat{θ}$.

We obtain

u_{j}^{(k)} = η_{j}^{(k)} \cdot \sqrt{α_{j}^{* (k)}}

.

J_{α, θ} (α, θ)

is considered again. For any fixed

α

and

θ

, the tightest bound is obtained by minimising over

α^{*}

. The minimising

α^{*}

equals the gradient of the function

ρ (α)

in Lemma 1, so the estimate of

α^{*}

is computed as

\begin{matrix} {\hat{α}}^{*} & = \nabla_{α} ρ (α) \\ & = diag [F^{⊤} {(λ I + F Λ F^{⊤})}^{- 1} F] + p^{'} (α) - 2 ζ (α), \end{matrix}

(30)

where

p^{'} (α) =

{[p^{'} (α_{1}), \dots, p^{'} (α_{N})]}^{⊤}

,

ζ (α) =

{[\frac{ξ^{'} (α_{1})}{ξ^{'} (α_{1})}, \dots, \frac{ξ^{'} (α_{N})}{ξ^{'} (α_{N})}]}^{⊤}

. The optimal

α^{* (k + 1)}

is then given by

{\hat{α}}^{* (k + 1)} = diag [F^{⊤} {(λ I + F Λ^{(k)} F^{⊤})}^{- 1} F] + p^{'} (α^{(k)}) - 2 ζ (α^{(k)}) .

After computing the estimate

α_{j} = \frac{1}{2} \sqrt{{(\frac{2}{h} + \frac{1}{τ_{j}})}^{2} + \frac{4 θ_{j}^{2}}{α_{j}^{*}}} + (\frac{1}{h} + \frac{2}{τ_{j}}),

(31)

we can compute

α_{j}^{* (k + 1)}

, which gives

α_{j}^{* (k + 1)} = F_{j}^{⊤} {(λ I + F_{j} Λ^{(k)} F_{j}^{⊤})}^{- 1} F_{j} + p^{'} (α_{j}^{(k)}) - 2 ζ (α_{j}^{(k)}) .

Then

α_{j}^{(k + 1)}

is defined as

α_{j}^{(k + 1)} = \frac{1}{2} \sqrt{{(\frac{2}{h} + \frac{1}{τ_{j}^{(k)}})}^{2} + \frac{4 {(θ_{j}^{(k)})}^{2}}{α_{j}^{* (k)}}} + (\frac{1}{h} + \frac{2}{τ_{j}^{(k)}}) .

Substitute

α_{j}^{(k + 1)}

and

α_{j}^{* (k + 1)}

into (27). The weights

η_{j}

are re-estimated at each iteration k until

| θ - \hat{θ} | < ε

, where

ε

is the stopping threshold. Algorithm 1 summarizes the above procedure.
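The procedure above follows the general pattern of re-weighted L1 minimisation: solve a weighted problem of the form (27), update the weights, and repeat until the estimate stops changing. The Python sketch below illustrates that pattern only. It is not the authors' MATLAB/CVX implementation: the helper names `weighted_lasso_cd` and `reweighted_l1` and all default values are hypothetical, and the weight update of (28) to (30), which needs the prior-dependent terms p'(α) and ζ(α), is replaced by the simple stand-in u_j = 1 / (|θ_j| + δ) that is common in re-weighted L1 schemes.

```python
import numpy as np


def weighted_lasso_cd(F, Y, weights, lam, n_sweeps=200):
    """Coordinate descent for
    min_theta ||Y - F theta||_2^2 + 2 * lam * sum_j weights[j] * |theta_j|.
    """
    n_features = F.shape[1]
    theta = np.zeros(n_features)
    col_sq = np.sum(F ** 2, axis=0)      # squared norm of each dictionary column
    residual = Y - F @ theta
    for _ in range(n_sweeps):
        for j in range(n_features):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes theta_j.
            rho = F[:, j] @ residual + col_sq[j] * theta[j]
            # Soft-thresholding solves the one-dimensional subproblem exactly.
            new_value = np.sign(rho) * max(abs(rho) - lam * weights[j], 0.0) / col_sq[j]
            residual += F[:, j] * (theta[j] - new_value)
            theta[j] = new_value
    return theta


def reweighted_l1(F, Y, lam=0.1, delta=1e-3, eps=1e-6, max_iter=20):
    """Re-weighted L1 loop with a stand-in weight update (an assumption, not (28))."""
    u = np.ones(F.shape[1])              # u_j^(0) = 1, as in the text
    theta_old = np.zeros(F.shape[1])
    theta = theta_old
    for _ in range(max_iter):
        theta = weighted_lasso_cd(F, Y, u, lam)
        if np.linalg.norm(theta - theta_old) < eps:   # stopping threshold
            break
        u = 1.0 / (np.abs(theta) + delta)             # stand-in for the update (28)
        theta_old = theta
    return theta
```

In this sketch, coefficients that shrink towards zero receive large weights in the next pass and are driven to exactly zero, which is the mechanism that produces sparsity in any re-weighted scheme of this kind.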

5. Numerical Example

All examples are run on a computer with an Intel Core i7-6500U CPU @ 2.50 GHz and 16 GB of RAM. The CVX package is used to solve the convex programs on the MATLAB 2016 platform. We give three numerical examples: the Narendra-Li model [20], a NARX model [25], and kernel state-space models [44]. The utility and performance of Algorithm 1 are demonstrated on these three simulation cases, each of which involves a well-studied and challenging non-linear system. The root mean squared error (RMSE) criterion is used to assess the performance of the proposed identification approach under noise perturbation. The estimate of the parameter

θ

obtained in the ith Monte Carlo experiment is denoted by

{\hat{θ}}_{i}

. The RMSE over the experiments is defined as

R M S E = \sqrt{\frac{\sum_{i = 1}^{n} {({\hat{θ}}_{i} - θ_{i}^{*})}^{2}}{n - 1}}

where

θ_{i}^{*}

represents the true system parameter vector, and n is the number of trials. To validate the theoretical results, the identification of the structured state-space model is simulated for each case in this section.
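The RMSE definition above can be computed as in the following Python sketch. The function name `rmse_over_trials` is hypothetical, and the sketch assumes that each trial yields one estimate (scalar or vector) and that squared errors are summed over all components before dividing by n - 1, as in the formula.

```python
import numpy as np


def rmse_over_trials(estimates, theta_true):
    """RMSE over n Monte Carlo trials with the n - 1 denominator used in the text.

    estimates  : array-like of shape (n,) or (n, p), one estimate per trial
    theta_true : true parameter value(s), broadcastable against one estimate
    """
    estimates = np.asarray(estimates, dtype=float)
    n_trials = estimates.shape[0]
    if n_trials < 2:
        raise ValueError("at least two trials are needed because of the n - 1 denominator")
    squared_error = np.sum((estimates - np.asarray(theta_true, dtype=float)) ** 2)
    return float(np.sqrt(squared_error / (n_trials - 1)))
```

For instance, two trials with estimates 1.0 and 3.0 of a true value 2.0 give a summed squared error of 2 and therefore an RMSE of the square root of 2.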

5.1. Example 1: Parameters Identification General Narendra-Li Model

Consider the state space representation of a non-linear system:

\begin{matrix} x_{t + 1}^{1} = & (\frac{α_{1} x_{t}^{1}}{1 + {(x_{t}^{1})}^{2}} + α_{2}) sin (x_{t}^{2}) + ξ_{1} (t) \end{matrix}

(32)

\begin{matrix} x_{t + 1}^{2} = & β_{1} x_{t}^{2} cos (x_{t}^{2}) + β_{2} x_{t}^{1} exp (- \frac{{(x_{t}^{1})}^{2} + {(x_{t}^{2})}^{2}}{8}) \\ + \frac{β_{3} {(u_{t})}^{3}}{1 + {(u_{t})}^{2} + 0.5 cos (x_{t}^{1} + x_{t}^{2})} + ξ_{2} (t) \\ y_{t} = & \frac{x_{t}^{1}}{1 + 0.5 sin (x_{t}^{2})} + \frac{x_{t}^{2}}{1 + 0.5 sin (x_{t}^{1})} \end{matrix}

(33)

where the state variable

x_{t} = {[x_{t}^{1}, x_{t}^{2}]}^{⊤}

.

ξ_{i} (t)

is Gaussian white noise. To generate the estimation data, the system is excited with a uniformly distributed random input signal

u (t) \in [- 2.5, 2.5]

with

1 \leq t \leq 1000

. The validation dataset is generated with the input

u (t) = sin \frac{2 π t}{10} + sin \frac{2 π t}{25}, t = 1, \dots, 1000 .

Let

Φ_{1}^{t} = \frac{x_{t}^{1}}{1 + {(x_{t}^{1})}^{2}} sin (x_{t}^{2}), Φ_{2}^{t} = sin (x_{t}^{2}), Φ_{3}^{t} = x_{t}^{2} cos (x_{t}^{2}), Φ_{4}^{t} = x_{t}^{1} exp (- \frac{{(x_{t}^{1})}^{2} + {(x_{t}^{2})}^{2}}{8}),

Φ_{5}^{t} = \frac{{(u_{t})}^{3}}{1 + {(u_{t})}^{2} + 0.5 cos (x_{t}^{1} + x_{t}^{2})} .

Because there are two state variables, the dictionary matrix

Φ

can be built as follows:

Φ = [\begin{matrix} Φ_{1}^{t} & Φ_{2}^{t} & Φ_{3}^{t} & Φ_{4}^{t} & Φ_{5}^{t} \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ Φ_{1}^{t + M - 1} & Φ_{2}^{t + M - 1} & Φ_{3}^{t + M - 1} & Φ_{4}^{t + M - 1} & Φ_{5}^{t + M - 1} \end{matrix}]

(34)

Then, the state set can be defined as

x_{i} ≜ {[x_{t + 1}^{i}, \dots, x_{t + M}^{i}]}^{⊤} \in R^{M \times 1}, i = 1, 2 .

Using the dictionary matrix

Φ

in (34), the true value of parameter

θ

for the model in (32), (33) should be as follows:

θ_{t r u e} = [θ^{(1)}, θ^{(2)}]

= [\begin{matrix} α_{1} (= 4) & 0 \\ α_{2} (= 3) & 0 \\ 0 & β_{1} (= 1.4) \\ 0 & β_{2} (= 1.5) \\ 0 & β_{3} (= 1.6) \end{matrix}]

(35)

The values in brackets are the true parameter values. In our study, we use

T = 1000

samples for learning and add white Gaussian measurement noise with level

ξ_{i} (t) = 0.1

to the training data. Algorithm 1 is used to identify the parameters in (32) and (33). The coefficients

θ^{(j)}

are learned from the data; their true values are given in (35). The RMSE of

θ

computed in the simulation is

0.039

. When the noise level

ξ_{i} (t)

is 0.2, 0.3, 0.4, or 0.5, the RMSE of

θ

changes little, as shown in Figure 1. Although [20] uses 2000 data points, our method attains a lower RMSE with 1000 data points (Table 1). Table 1 also compares our method with previous results reported in the literature [45,46,47]; our method performs best. In this experiment, we also examine how the method performs for the noise levels

ξ_{i} (t) = 0.1, 0.3

. The last 60 samples of the generated data sequence are selected for testing, and the predicted result is compared with the true value. When Algorithm 1 is executed 8 times, the average RMSE of the output

y_{t}

is 0.06 and does not increase rapidly, as shown in Figure 2.

Overall, the proposed model describes the system behavior well.
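The data generation and dictionary construction of this example can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' experiment: the zero initial state, the random seed, the placement of the noise inside the state update, and the names `dictionary_row` and `simulate` are all hypothetical. The fifth dictionary function is written without the factor β_3, so that β_3 appears only as a coefficient in the parameter matrix of (35). Ordinary least squares is used at the end purely as a sanity check of the dictionary; it stands in for Algorithm 1 and does not produce a sparse estimate.

```python
import numpy as np

# True parameters arranged column-wise as in (35): one column per state equation.
THETA_TRUE = np.array([
    [4.0, 0.0],   # alpha_1
    [3.0, 0.0],   # alpha_2
    [0.0, 1.4],   # beta_1
    [0.0, 1.5],   # beta_2
    [0.0, 1.6],   # beta_3
])


def dictionary_row(x1, x2, u):
    """Evaluate the five dictionary functions at a single time step."""
    return np.array([
        x1 / (1.0 + x1 ** 2) * np.sin(x2),
        np.sin(x2),
        x2 * np.cos(x2),
        x1 * np.exp(-(x1 ** 2 + x2 ** 2) / 8.0),
        u ** 3 / (1.0 + u ** 2 + 0.5 * np.cos(x1 + x2)),
    ])


def simulate(n_samples=1000, noise_std=0.1, seed=0):
    """Simulate (32)-(33) with a uniform random input on [-2.5, 2.5]."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(-2.5, 2.5, size=n_samples)
    x = np.zeros((n_samples + 1, 2))      # x[0] = 0 is an assumed initial state
    Phi = np.zeros((n_samples, 5))
    for t in range(n_samples):
        Phi[t] = dictionary_row(x[t, 0], x[t, 1], u[t])
        x[t + 1] = Phi[t] @ THETA_TRUE + noise_std * rng.standard_normal(2)
    return Phi, x[1:]


Phi, X_next = simulate()
# Least-squares stand-in for Algorithm 1: both columns of theta are fitted at once.
theta_ls, *_ = np.linalg.lstsq(Phi, X_next, rcond=None)
```

Each row of `Phi` corresponds to one row of the dictionary matrix in (34), and each column of `X_next` to one of the state vectors x_1 and x_2.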

5.2. Example 2: Application to a NARX Model

In this example, we analyze the following polynomial model of a single-input single-output (SISO) non-linear autoregressive system with exogenous input (NARX) [25].

y (t_{k + 1}) = 0.7 y (t_{k - 1}) - 0.5 y (t_{k - 2}) + 0.6 u^{2} (t_{k - 2}) - 0.7 y (t_{k - 2}) u (t_{k - 1}) + ξ (t_{k})

(36)

with

y, u, ξ \in R

. In expanded form, we may write (36) as:

\begin{matrix} y (t_{k + 1}) & = w_{1} + w_{2} y (t_{k}) + \dots + w_{m_{y} + 2} y (t_{k - m_{y}}) + \dots + w_{N} y^{d_{y}} (t_{k - m_{y}}) u^{d_{u}} (t_{k - m_{u}}) + ξ (t_{k}) \\ & = w^{⊤} f (y (t_{k}), \dots, y (t_{k - m_{y}}), u (t_{k}), \dots, u (t_{k - m_{u}})) + ξ (t_{k}) \end{matrix}

(37)

Model (37) is the general form of (36).

d_{y}

and

d_{u}

are the degrees of the output and input;

m_{y}

and

m_{u}

are the given memory orders of the output and input;

w^{⊤} = [w_{1}, \dots, w_{N}] \in R^{N}

is the weight vector; and

f (y (t_{k}), \dots, y (t_{k - m_{y}}), u (t_{k}), \dots, u (t_{k - m_{u}})) =

{[f_{1} (\cdot), \dots, f_{N} (\cdot)]}^{⊤} \in R^{N}

is the vector of dictionary functions. Taking the NARX model (36) as an example, we set

d_{y} = 1

,

d_{u} = 2

,

m_{y} = 2

,

m_{u} = 2

. This yields

f (\cdot) \in R^{28}

and, thus,

w \in R^{28}

. Since (36) contains only four terms, only 4 of the 28 associated weights

w_{i}

are non-zero. In our study, we use k = 1000 samples for learning with white Gaussian noise. The last 60 samples of the generated data sequence are selected for testing, and the predicted result is compared with the true value. The estimated parameter w agrees with the true value, as shown in Figure 3. The prediction performance of Algorithm 1 is shown in Figure 4: the predicted and exact trajectories match well for the noise levels

ξ_{i} (t) = 0.1, 0.3

. When Algorithm 1 is executed 8 times, the average RMSE values of the output are 0.021 and 0.074, which are tolerable in applications.
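The dictionary size of 28 can be reproduced with a short Python sketch. It rests on an assumption that is not stated explicitly in the text: the dictionary is taken to be all monomials of total degree at most 2 in the six lagged variables y(t_k), y(t_{k-1}), y(t_{k-2}), u(t_k), u(t_{k-1}), u(t_{k-2}), which happens to give exactly 28 terms. The names `monomial_exponents`, `LAGGED_VARIABLES` and `ACTIVE_TERMS` are hypothetical.

```python
from itertools import combinations_with_replacement

LAGGED_VARIABLES = ("y(t_k)", "y(t_k-1)", "y(t_k-2)", "u(t_k)", "u(t_k-1)", "u(t_k-2)")


def monomial_exponents(n_vars, max_degree):
    """Exponent tuples of all monomials with total degree <= max_degree."""
    exponents = []
    for degree in range(max_degree + 1):
        for combo in combinations_with_replacement(range(n_vars), degree):
            powers = [0] * n_vars
            for index in combo:
                powers[index] += 1
            exponents.append(tuple(powers))
    return exponents


EXPONENTS = monomial_exponents(len(LAGGED_VARIABLES), max_degree=2)

# Non-zero weights of model (36), keyed by the exponent tuple of each monomial.
ACTIVE_TERMS = {
    (0, 1, 0, 0, 0, 0): 0.7,    # y(t_k-1)
    (0, 0, 1, 0, 0, 0): -0.5,   # y(t_k-2)
    (0, 0, 0, 0, 0, 2): 0.6,    # u(t_k-2)^2
    (0, 0, 1, 0, 1, 0): -0.7,   # y(t_k-2) * u(t_k-1)
}

# Sparse true weight vector: 4 non-zero entries out of 28.
w_true = [ACTIVE_TERMS.get(powers, 0.0) for powers in EXPONENTS]
print(len(EXPONENTS), sum(w != 0.0 for w in w_true))  # prints "28 4"
```

Under this assumption the count is 1 constant term, 6 linear terms and 21 quadratic terms, and the four terms of (36) all belong to the dictionary.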

5.3. Example 3: Kernel State-Space Models (KSSM) for Autoregressive Modeling

A kernel state-space model (KSSM) is an autoregressive model that satisfies a

τ

-order difference equation. As shown below, the model may be described as a first-order multivariate process.

{\bar{x}}_{t + 1} = F_{t} ({\bar{x}}_{t}) + V_{t}

(38)

where

{\bar{x}}_{t} = {[x_{t}, \dots, x_{t - τ + 1}]}^{T}

,

F_{t} ({\bar{x}}_{t}) = {[f_{t} (x_{t}, \dots, x_{t - τ + 1}), x_{t}, \dots, x_{t - τ + 2}]}^{T}

, and

V_{t} =

{[ξ_{t}, 0, \dots, 0]}^{T}

.

The hidden state of an SSM can then be viewed as the process

{\bar{x}}_{t}

, producing an SSM formulation of a complex autoregressive model with noise

V_{t}

. By using non-linear autoregressive modeling with a fixed number of delayed samples, the model can be utilized to predict time series. In addition, if the state-transition function

f_{t}

is defined using kernels as in (39), we obtain the proposed KSSM for autoregressive time series.

\begin{matrix} [\begin{matrix} x_{t + 1} \\ x_{t} \\ ⋮ \\ x_{t - τ + 2} \end{matrix}] & = [\begin{matrix} \sum_{i = 1}^{N} w_{i} k_{i} (x_{i} (t_{k})) \\ x_{t} \\ ⋮ \\ x_{t - τ + 2} \end{matrix}] + [\begin{matrix} ξ_{t} \\ 0 \\ ⋮ \\ 0 \end{matrix}] \\ Y_{t} & = h ({\bar{x}}_{t}) + V_{t}, \end{matrix}

(39)

where

Y_{t}

is the observed process,

h ({\bar{x}}_{t})

is the observation function,

V_{t}

is observation noise, and

w = [w_{1}, \dots w_{N}]

is the weight vector. Periodic time series are widely encountered in physics, engineering and biology, so we use Fourier kernel functions in the KSSM. Consider 5 candidate kernel functions for

k_{i} (\cdot)

:

sin x_{i}

,

cos x_{i}

,

x_{i}

,

sin 2 x_{i}

and

cos 2 x_{i} .

Algorithm 1 is applied to identify the parameter w in (39). The RMSE of w is 0.16, which is a satisfactory result. The estimation data in the experiment consist of 500 sample points, and Figure 5 shows the simulated outputs of the two processes evaluated on the validation set. In Figure 5, we compare the true and estimated values of w using the probability distribution and dispersoid distribution. It can be seen that the sparsity effect of the proposed algorithm is evident.
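One step of the state update in (38) and (39) can be sketched in Python as follows. The sketch assumes that all five candidate kernels are evaluated at the most recent sample x_t, which is one possible reading of the kernel argument in (39); the names `KERNELS`, `kernel_features` and `kssm_step` are hypothetical.

```python
import numpy as np

# The five candidate kernel functions listed in the text.
KERNELS = (
    np.sin,
    np.cos,
    lambda x: x,
    lambda x: np.sin(2.0 * x),
    lambda x: np.cos(2.0 * x),
)


def kernel_features(x):
    """Evaluate every candidate kernel at the scalar sample x."""
    return np.array([kernel(x) for kernel in KERNELS])


def kssm_step(x_bar, w, noise=0.0):
    """Map [x_t, ..., x_{t-tau+1}] to [x_{t+1}, x_t, ..., x_{t-tau+2}].

    The new sample is a weighted sum of kernel evaluations plus noise;
    the remaining entries are the previous samples shifted by one position.
    """
    x_next = w @ kernel_features(x_bar[0]) + noise
    return np.concatenate(([x_next], x_bar[:-1]))
```

With a sparse weight vector such as w = [1, 0, 0, 0, 0], only the first kernel contributes, so the new sample is sin(x_t) while the older samples are shifted down unchanged.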

6. Conclusions

This work addresses the parameter estimation of non-linear discrete-time state-space systems from noisy state data. For parameter estimation and prediction, a novel sparse Bayesian convex optimisation method (NSSI-SBSA) is presented, which considers the approximation method, the parameter prior, and the posterior. The identification problem is divided into two parts. First, the improved Stein approach is used to construct a new optimisation objective function. Second, a reweighted

L_{1}

-regularized least squares solver is developed, with the regularization value chosen from the optimization point. Compared with the previous study, the new objective function is more information-rich and can more easily extract critical information from the raw data. In the three examples, the NSSI-SBSA algorithm usually captures more information about the dependence of the data indicators than the methods discussed in the introduction.

Author Contributions

Methodology, L.Z. and J.L.; Formal analysis, W.Z.; Writing—review and editing, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation (NNSF) of China under Grant (61703149) and the Natural Science Foundation of Hebei Province of China (F2019111009).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Ljung, L. Perspectives on system identification. Annu. Rev. Control 2010, 34, 1–12.
2. Luo, G.; Yang, Z.; Zhan, C.; Zhang, Q. Identification of nonlinear dynamical system based on raised-cosine radial basis function neural networks. Neural Process. Lett. 2021, 53, 355–374.
3. Yakoub, Z.; Naifar, O.; Ivanov, D. Unbiased Identification of Fractional Order System with Unknown Time-Delay Using Bias Compensation Method. Mathematics 2022, 10, 3028.
4. Yakoub, Z.; Amairi, M.; Aoun, M.; Chetoui, M. On the fractional closed-loop linear parameter varying system identification under noise corrupted scheduling and output signal measurements. Trans. Inst. Meas. Control. 2019, 41, 2909–2921.
5. Yakoub, Z.; Aoun, M.; Amairi, M.; Chetoui, M. Identification of continuous-time fractional models from noisy input and output signals. In Fractional Order Systems—Control Theory and Applications; Springer: Cham, Switzerland, 2022; pp. 181–216.
6. Kumpati, S.N.; Kannan, P. Identification and control of dynamical systems using neural networks. IEEE Trans. Neural Netw. 1990, 1, 4–27.
7. Leontaritis, I.J.; Billings, S.A. Input-output parametric models for non-linear systems part II: Stochastic non-linear systems. Int. J. Control. 1985, 41, 329–344.
8. Rangan, S.; Wolodkin, G.; Poolla, K. New results for Hammerstein system identification. In Proceedings of the 1995 34th IEEE Conference on Decision and Control, New Orleans, LA, USA, 13–15 December 1995; IEEE: Piscataway, NJ, USA, 1995; Volume 1, pp. 697–702.
9. Billings, S.A. Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains; John Wiley and Sons: Hoboken, NJ, USA, 2013.
10. Haber, R.; Unbehauen, H. Structure identification of nonlinear dynamic systems—A survey on input-output approaches. Automatica 1990, 26, 651–677.
11. Barahona, M.; Poon, C.S. Detection of nonlinear dynamics in short, noisy time series. Nature 1996, 381, 215–217.
12. Frigola, R.; Lindsten, F.; Schon, T.B.; Rasmussen, C.E. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. Adv. Neural Inf. Process. Syst. 2013, 26.
13. Frigola, R.; Chen, Y.; Rasmussen, C.E. Variational Gaussian process state-space models. Adv. Neural Inf. Process. Syst. 2014, 27.
14. Karl, M.; Soelch, M.; Bayer, J.; Van der Smagt, P. Deep variational bayes filters: Unsupervised learning of state space models from raw data. arXiv 2016, arXiv:1605.06432.
15. Raiko, T.; Tornio, M. Variational Bayesian learning of nonlinear hidden state-space models for model predictive control. Neurocomputing 2009, 72, 3704–3712.
16. Ljung, L. Theory for the User. In System Identification; Prentice Hall: Hoboken, NJ, USA, 1987.
17. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 1977, 39, 1–22.
18. Duncan, S.; Gyongy, M. Using the EM algorithm to estimate the disease parameters for smallpox in 17th century London. In Proceedings of the 2006 IEEE International Symposium on Intelligent Control, Munich, Germany, 4–6 October 2006; IEEE: Piscataway, NJ, USA, 2006; pp. 3312–3317.
19. Solin, A.; Sarkka, S. Hilbert space methods for reduced-rank Gaussian process regression. Stat. Comput. 2020, 30, 419–446.
20. Svensson, A.; Schon, T.B. A flexible state–space model for learning nonlinear dynamical systems. Automatica 2017, 80, 189–199.
21. Frigola, R. Bayesian Time Series Learning with Gaussian Processes; University of Cambridge: Cambridge, UK, 2015.
22. Wilson, A.G.; Hu, Z.; Salakhutdinov, R.R.; Xing, E.P. Stochastic variational deep kernel learning. Adv. Neural Inf. Process. 2016, 2586–2594.
23. Cerone, V.; Piga, D.; Regruto, D. Enforcing stability constraints in set-membership identification of linear dynamic systems. Automatica 2011, 47, 2488–2494.
24. Zavlanos, M.M.; Julius, A.A.; Boyd, S.P.; Pappas, G.J. Inferring stable genetic networks from steady-state data. Automatica 2011, 47, 1113–1122.
25. Pan, W.; Yuan, Y.; Goncalves, J.; Stan, G.B. A sparse Bayesian approach to the identification of nonlinear state space systems. IEEE Trans. Autom. Control. 2015, 61, 182–187.
26. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288.
27. Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2009.
28. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction To Statistical Learning; Springer: New York, NY, USA, 2013.
29. Donoho, D.L. Compressed sensing. IEEE Trans. Inf. Theory 2006, 52, 1289–1306.
30. Candes, E.J.; Romberg, J.; Tao, T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 2006, 52, 489–509.
31. Tropp, J.A.; Gilbert, A.C. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. Inf. Theory 2007, 53, 4655–4666.
32. Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory, University of California, Berkeley, CA, USA, 21 June–18 July 1970, 9–12 April, 16–21 June and 19–22 July 1971; The Regents of the University of California: Berkeley, CA, USA, 1972.
33. Brunton, S.L.; Tu, J.H.; Bright, I.; Kutz, J.N. Compressive sensing and low-rank libraries for classification of bifurcation regimes in nonlinear dynamical systems. Siam J. Appl. Dyn. Syst. 2014, 13, 1716–1732.
34. Bai, Z.; Wimalajeewa, T.; Berger, Z.; Wang, G.; Glauser, M.; Varshney, P.K. Low-dimensional approach for reconstruction of airfoil data via compressive sensing. AIAA J. 2015, 53, 920–933.
35. Arnaldo, I.; O’Reilly, U.M.; Veeramachaneni, K. Building predictive models via feature synthesis. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, Madrid, Spain, 11–15 July 2015; pp. 983–990.
36. Berntorp, K. Online Bayesian inference and learning of Gaussian-process state–space models. Automatica 2021, 129, 109613.
37. Imani, M.; Ghoreishi, S.F. Two-stage Bayesian optimization for scalable inference in state-space models. IEEE Trans. Neural Netw. Learn. Syst. 2021, 1–12.
38. Efron, B.; Hastie, T.; Johnstone, I.; Tibshirani, R. Least angle regression. Ann. Stat. 2004, 32, 407–499.
39. Brunton, S.L.; Proctor, J.L.; Kutz, J.N. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. USA 2016, 113, 3932–3937.
40. Chen, S.S.; Donoho, D.L.; Saunders, M.A. Atomic decomposition by basis pursuit. SIAM Rev. 2001, 43, 129–159.
41. Bishop, C.M. Pattern recognition. Mach. Learn. 2006, 128.
42. Ma, Z.; Lai, Y.; Kleijn, W.B.; Song, Y.Z.; Wang, L.; Guo, J. Variational Bayesian learning for Dirichlet process mixture of inverted Dirichlet distributions in non-Gaussian image feature modeling. IEEE Trans. Neural Networks And Learn. Syst. 2018, 30, 449–463.
43. Tipping, M.E. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 2001, 1, 211–244.
44. Tobar, F.; Djuric, P.M.; Mandic, D.P. Unsupervised state-space modeling using reproducing kernels. IEEE Trans. Signal Process. 2015, 63, 5210–5221.
45. Roll, J.; Nazin, A.; Ljung, L. Nonlinear system identification via direct weight optimization. Automatica 2005, 41, 475–490.
46. Stenman, A. Model on Demand: Algorithms, Analysis And Applications; Department of Electrical Engineering, Linköping University: Linköping, Sweden, 1999.
47. Xu, J.; Huang, X.; Wang, S. Adaptive hinging hyperplanes and its applications in dynamic system identification. Automatica 2009, 45, 2325–2332.

Figure 1. RMSE of state

x_{1}

and

x_{2}

.

Figure 2. Output of the state model for 60 testing data in example 1: (a)

ξ_{i} = 0.1

and (b)

ξ_{i} = 0.3

.

Figure 3. The distribution of w: top, the true w; bottom, the estimated w.

Figure 4. Output of the state model for 60 testing data in example 2: (a)

ξ = 0.1

and (b)

ξ = 0.3

.

Figure 5. Comparison of the distribution of w with sparsity: (a) sparse value and (b) true value.

Table 1. Accuracy comparison of different methods (

ξ_{i} = 0.1

).

Method	RMSE	Number of data points
Our paper	0.039	1000
Bayesian Learning [20]	0.06	2000
DWO [45]	0.43	50,000
MOD [46]	0.46	50,000
AHH [47]	0.31	2000
MARS [47]	0.49	2000

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhang, L.; Li, J.; Zhang, W.; Yang, J. Identification of Nonlinear State-Space Systems via Sparse Bayesian and Stein Approximation Approach. Mathematics 2022, 10, 3667. https://doi.org/10.3390/math10193667

