Article

Time Series Prediction Based on Adaptive Weight Online Sequential Extreme Learning Machine

Junjie Lu, Jinquan Huang and Feng Lu
Jiangsu Province Key Laboratory of Aerospace Power Systems, College of Energy and Power Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2017, 7(3), 217; https://doi.org/10.3390/app7030217
Submission received: 5 January 2017 / Accepted: 20 February 2017 / Published: 2 March 2017

Abstract

In this paper, a novel adaptive weight online sequential extreme learning machine (AWOS-ELM), based on the online sequential extreme learning machine (OS-ELM), is proposed for time series prediction problems. In real-world online applications, the sequentially arriving data chunks usually possess varying confidence coefficients, and a data chunk with a low confidence coefficient tends to mislead the subsequent training process. The proposed AWOS-ELM improves the training process by assessing the confidence coefficient of each chunk adaptively and determining its training weight accordingly. Experiments on six time series prediction data sets verify that the AWOS-ELM algorithm outperforms the OS-ELM algorithm in generalization performance, stability, and prediction accuracy. In addition, a real-world mechanical system identification problem is considered to test the feasibility and efficacy of the AWOS-ELM algorithm.


1. Introduction

Time series prediction has been studied extensively over the past few decades, and a large number of applications have been reported in a wide range of fields, such as weather forecasting [1], stock market prediction [2], communication signal processing [3], sales forecasting [4], and so on. On account of these frequent applications, plenty of prediction methods have been developed. Gooijer and Hyndman gave an overview of various prediction methods and indicated future directions for time series prediction problems [5]. In particular, the classical statistical linear method based on the autoregressive integrated moving average (ARIMA) model is still widely adopted, and the complete methodology can be found in Box's and Jenkins's remarkable contribution [6]. However, the prediction accuracy of classical statistical methods suffers from the nonlinearity and complexity of many real-world time series. For this reason, computational intelligence methods, which may outperform classical statistical methods on many complex nonlinear problems, have emerged [7,8].
Artificial neural network (ANN) methods have attracted extensive attention in the time series prediction field [9]. ANNs are universal nonlinear regression techniques and can be applied to time series prediction conveniently [10]. Furthermore, compared with classical methods, ANNs relax the assumptions required for prediction, such as Gaussian noise and model linearity. Whereas the hyperparameters of an ARIMA model must be fine-tuned to obtain good predictions, ANNs avoid this complication [11]. The hyperparameters of the ARIMA model are quite often adjusted according to domain knowledge, while ANN models can usually obtain competitive results without any domain knowledge [12].
For situations where intensive human adjustment is not affordable, ANNs and other computational intelligence methods are therefore usually better suited than classical statistical methods. Kumar investigated neural network models for forecasting the Indian stock market index [13]. Yoon discussed integrating ANN and support vector machine (SVM) methods for long-term prediction and improved their stability and accuracy [14]. Nevertheless, since the neural networks used by Kumar and Yoon are trained iteratively by means of gradient-based algorithms, the resulting prediction systems suffer from long training times.
Extreme learning machine (ELM) is a high-efficiency learning algorithm for single-hidden-layer feedforward neural networks (SLFNs), and it has been proven to have classification capacity and universal approximation capacity [15]. In addition, Huang has shown that the hidden-layer parameters can be assigned randomly, after which the output weight can be computed analytically [16]. It has been verified that ELM requires much less training time and achieves similar or better generalization performance than SVM and traditional neural networks [17]. Hosseinioun used wavelet transforms and an adaptive ELM to forecast outlier occurrence in stock market time series [18]. Dash presented an optimized ELM for predicting financial time series [19]. ELM-based prediction methods offer high accuracy and fast learning in off-line cases, but they are not suitable for online applications. Liang et al. proposed OS-ELM by incorporating a sequential learning algorithm into ELM [20]. Compared with conventional online training methods, OS-ELM tends to have a faster training speed and better generalization performance. However, in many real-world online applications, the confidence coefficient of a sequential data chunk may be degraded by measurement noise and external disturbances. If a data chunk with a low confidence coefficient is used in the learning process in the normal way, the accuracy of the trained network is likely to be reduced. In this paper, the AWOS-ELM algorithm is proposed to reduce the negative influence of data chunks with low confidence coefficients: the confidence coefficient of each data chunk is assessed before the chunk is used to train the network, and the weight of each data chunk is determined according to the assessed confidence coefficient. Experiments on six time series prediction problems and a mechanical system identification problem verify that the AWOS-ELM algorithm performs better than the OS-ELM algorithm in generalization performance, stability, and prediction accuracy.
This manuscript is organized as follows. In Section 2, the basic concepts of and related work on the optimization ELM and OS-ELM algorithms are reviewed briefly. The integrated structure of the proposed algorithm and its formula derivation are given in Section 3. In Section 4, the performance of AWOS-ELM is evaluated on six time series prediction problems and a mechanical system identification problem. Conclusions are drawn in Section 5.

2. Preliminaries

In this section, with the purpose of offering preliminaries pertinent to the proposed AWOS-ELM algorithm, the optimization ELM and OS-ELM are reviewed briefly. OS-ELM, an online learning algorithm based on the classical ELM, was proposed by Liang in 2006 for training on sequential data. In 2012, Huang further developed the classical ELM into the optimization ELM according to the Karush-Kuhn-Tucker (KKT) theorem and optimization theory [21]. Compared to the classical ELM, a regularization parameter is used in the optimization ELM to improve accuracy and generalization performance. The OS-ELM in this paper also employs this regularization parameter according to Huang's theory.

2.1. Optimization Extreme Learning Machine

ELM is a high-efficiency SLFN learning algorithm in which the hidden node parameters can be assigned randomly. Assume that there are $N$ distinct training samples $\{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N} \subset \mathbb{R}^n \times \mathbb{R}^m$ for the supervised learning process, where $\mathbf{x}_i = [x_{i1}, x_{i2}, \dots, x_{in}]^T \in \mathbb{R}^n$ and $\mathbf{t}_i = [t_{i1}, t_{i2}, \dots, t_{im}]^T \in \mathbb{R}^m$ are the input vector and output vector, respectively. The mathematical model of the SLFN is described as:

$$f_L(\mathbf{x}) = \sum_{i=1}^{L} \beta_i g(\omega_i, b_i, \mathbf{x}), \quad \mathbf{x} \in \mathbb{R}^n \quad (1)$$

where $\omega_i \in \mathbb{R}^n$, $b_i \in \mathbb{R}$, and $\beta_i = [\beta_{i1}, \beta_{i2}, \dots, \beta_{im}]^T \in \mathbb{R}^m$ denote the parameters of the $i$th hidden node, $L$ is the number of hidden nodes, and $g(\omega_i, b_i, \mathbf{x})$ represents the output of the $i$th hidden node for the input $\mathbf{x}$. If the SLFN with $L$ hidden nodes approximates the $N$ training samples exactly, then:

$$\sum_{i=1}^{L} \beta_i g(\omega_i, b_i, \mathbf{x}_j) = \mathbf{t}_j, \quad j = 1, 2, \dots, N \quad (2)$$

Equation (2) can be written compactly as:

$$\mathbf{H}\beta = \mathbf{T} \quad (3)$$

where

$$\mathbf{H}(\omega_1, \dots, \omega_L, b_1, \dots, b_L, \mathbf{x}_1, \dots, \mathbf{x}_N) = \begin{bmatrix} g(\omega_1, b_1, \mathbf{x}_1) & \cdots & g(\omega_L, b_L, \mathbf{x}_1) \\ \vdots & \ddots & \vdots \\ g(\omega_1, b_1, \mathbf{x}_N) & \cdots & g(\omega_L, b_L, \mathbf{x}_N) \end{bmatrix}_{N \times L} \quad (4)$$

$$\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix}_{L \times m} \quad \text{and} \quad \mathbf{T} = \begin{bmatrix} \mathbf{t}_1^T \\ \vdots \\ \mathbf{t}_N^T \end{bmatrix}_{N \times m} \quad (5)$$
Traditionally, training an SLFN requires finding specific values of $\omega_i$, $b_i$, and $\beta_i$, $i = 1, \dots, L$, such that $\|\mathbf{H}\beta - \mathbf{T}\|$ is minimized. When $\mathbf{H}$ is unknown, gradient-based approaches are usually employed to adjust $\omega_i$, $b_i$, and $\beta_i$ iteratively. However, for most applications the gradient-based methods are extremely time-consuming and often stop at a local minimum. According to the theory of Huang, the hidden-layer parameters $\omega_i$ and $b_i$ can be assigned randomly, and the SLFN can still approximate any target function universally, provided that the activation function is nonzero, the target function is continuous, and the input sets are compact [17]. If $L \leq N$, the column rank of $\mathbf{H}$ is full with probability one, and in real-world applications the condition $L \leq N$ is easily satisfied. Taking the norm of the output weight $\beta$ as part of the cost function [22], the optimization ELM model can be formulated as:

$$\min:\; L_{elm} = \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N} \varepsilon_i^2 \quad \text{s.t.}:\; h(\mathbf{x}_i)\beta = \mathbf{t}_i - \varepsilon_i, \quad i = 1, 2, \dots, N \quad (6)$$

where $\varepsilon_i$ is the prediction error of the $i$th training sample. Because this convex optimization problem has no inequality constraints, Slater's condition is satisfied and strong duality holds. Consequently, on the basis of the KKT theorem, Equation (6) can be rewritten as:

$$L_{elm} = \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N} \varepsilon_i^2 - \sum_{i=1}^{N} \alpha_i\left(h(\mathbf{x}_i)\beta - \mathbf{t}_i + \varepsilon_i\right) \quad (7)$$

where $\alpha_i$ is the Lagrange multiplier. Optimizing Equation (7) yields the output weight:

$$\beta = \left(\frac{\mathbf{I}}{C} + \mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{T} \quad (8)$$

Since the output weight $\beta$ is computed analytically, the optimization ELM achieves generalization performance similar to traditional iterative implementations of SLFNs while running dramatically faster.
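For illustration, the following is a minimal sketch of the batch optimization ELM of Equation (8), written in Python with NumPy rather than the MATLAB environment used for the experiments in this paper; the function names and the choice of a sigmoid hidden layer are illustrative assumptions.

```python
import numpy as np

def elm_train(X, T, L, C=1e5, seed=None):
    """Batch optimization ELM: random hidden-layer parameters, analytic output weights.

    X: (N, n) inputs, T: (N, m) targets, L: number of hidden nodes,
    C: regularization parameter, as in Equation (8)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], L))   # omega_i drawn uniformly from [-1, 1]
    b = rng.uniform(-1.0, 1.0, size=L)                 # b_i
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))             # (N, L) hidden-layer output matrix
    # beta = (I/C + H^T H)^{-1} H^T T  -- Equation (8)
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Evaluate the trained SLFN of Equation (1) on new inputs."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Solving a single L-by-L linear system, rather than iterating a gradient descent, is what gives the optimization ELM its speed advantage.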

2.2. Online Sequential Extreme Learning Machine

In many practical instances, the training samples $\aleph = \{(\mathbf{x}_i, \mathbf{t}_i)\,|\,\mathbf{x}_i \in \mathbb{R}^n, \mathbf{t}_i \in \mathbb{R}^m, i = 1, 2, \dots\}$ are produced sequentially, chunk by chunk, and the chunk size may be fixed or varying. Assume that the $j$th data chunk has $N_j$ samples; the chunk at time $k$ can then be represented as $\aleph_k = \{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=(\sum_{j=0}^{k-1} N_j)+1}^{\sum_{j=0}^{k} N_j}$. The learning process is initialized with a small data chunk $\aleph_0 = \{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N_0}$, where $N_0$, the number of samples in $\aleph_0$, ought to be equal to or greater than $L$. With the hidden-layer parameters $(\omega_i, b_i)$, $i = 1, 2, \dots, L$, assigned random values, the initial $\mathbf{H}_0$ is computed as:

$$\mathbf{H}_0 := \mathbf{H}(\omega_1, \dots, \omega_L, b_1, \dots, b_L, \mathbf{x}_1, \dots, \mathbf{x}_{N_0}) \quad (9)$$

and the initial $\beta_0$ is then obtained according to ELM as:

$$\beta_0 = \mathbf{P}_0 \mathbf{H}_0^T \mathbf{T}_0 \quad (10)$$

where $\mathbf{P}_0 = \left(\frac{\mathbf{I}}{C} + \mathbf{H}_0^T \mathbf{H}_0\right)^{-1}$ and $\mathbf{T}_0 = [\mathbf{t}_1, \dots, \mathbf{t}_{N_0}]^T$.
The partial hidden-layer output matrix $\mathbf{h}_{k+1}$ and the partial target matrix $\mathbf{t}_{k+1}$ corresponding to the data chunk at time $k+1$ are defined respectively as:

$$\mathbf{h}_{k+1} := \mathbf{H}\left(\omega_1, \dots, \omega_L, b_1, \dots, b_L, \mathbf{x}_{(\sum_{j=0}^{k} N_j)+1}, \dots, \mathbf{x}_{\sum_{j=0}^{k+1} N_j}\right) \quad (11)$$

$$\mathbf{t}_{k+1} := \left[\mathbf{t}_{(\sum_{j=0}^{k} N_j)+1}, \dots, \mathbf{t}_{\sum_{j=0}^{k+1} N_j}\right]^T \quad (12)$$

Then $\mathbf{H}_k$ and $\mathbf{T}_k$ can be expressed respectively as:

$$\mathbf{H}_k = \mathbf{H}\left(\omega_1, \dots, \omega_L, b_1, \dots, b_L, \mathbf{x}_1, \dots, \mathbf{x}_{\sum_{j=0}^{k} N_j}\right), \quad \mathbf{T}_k = \begin{bmatrix} \mathbf{t}_1^T \\ \vdots \\ \mathbf{t}_{\sum_{j=0}^{k} N_j}^T \end{bmatrix} \quad (13)$$

and we have

$$\mathbf{H}_{k+1} = \begin{bmatrix} \mathbf{H}_k \\ \mathbf{h}_{k+1} \end{bmatrix}, \quad \mathbf{T}_{k+1} = \begin{bmatrix} \mathbf{T}_k \\ \mathbf{t}_{k+1} \end{bmatrix} \quad (14)$$

The output weight at time $k+1$, $\beta_{k+1}$, is the least squares solution of $\mathbf{H}_{k+1}\beta = \mathbf{T}_{k+1}$, and it can be computed iteratively as follows:

$$\beta_{k+1} = \beta_k + \mathbf{P}_{k+1}\mathbf{h}_{k+1}^T\left(\mathbf{t}_{k+1} - \mathbf{h}_{k+1}\beta_k\right) \quad (15)$$

$$\mathbf{P}_{k+1} = \mathbf{P}_k - \mathbf{P}_k\mathbf{h}_{k+1}^T\left(\mathbf{I} + \mathbf{h}_{k+1}\mathbf{P}_k\mathbf{h}_{k+1}^T\right)^{-1}\mathbf{h}_{k+1}\mathbf{P}_k \quad (16)$$
OS-ELM consists of an initialization phase and a sequential learning phase, and it need not retain all of the historical data. In the initialization phase, $\mathbf{H}_0$, $\beta_0$, $\mathbf{P}_0$, and $\mathbf{T}_0$ are initialized for use in the sequential learning phase; the number of samples in the initialization chunk should be equal to or greater than the number of hidden nodes. In the sequential learning phase, the sequential data chunks are processed iteratively: once training on the latest data chunk is completed, that chunk can be discarded and is not used any more. From the derivation of OS-ELM, it is easy to see that OS-ELM and ELM have similar generalization performance. In fact, the ELM algorithm is a special case of the OS-ELM algorithm in which all of the training samples are processed in the initialization phase.
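As a concrete illustration of the two phases described above, the following Python/NumPy sketch implements Equations (9), (10), (15), and (16); it assumes the hidden-layer output matrices $\mathbf{H}_0$ and $\mathbf{h}_{k+1}$ have already been computed, and the function names are illustrative.

```python
import numpy as np

def oselm_init(H0, T0, C=1e5):
    """Initialization phase: P0 = (I/C + H0^T H0)^{-1}, beta0 = P0 H0^T T0 (Equations (9)-(10))."""
    L = H0.shape[1]
    P = np.linalg.inv(np.eye(L) / C + H0.T @ H0)
    beta = P @ H0.T @ T0
    return P, beta

def oselm_update(P, beta, h, t):
    """Sequential phase for one chunk: h is (N_{k+1}, L), t is (N_{k+1}, m).

    Implements Equations (15)-(16); the chunk can be discarded afterwards."""
    N = h.shape[0]
    P_new = P - P @ h.T @ np.linalg.inv(np.eye(N) + h @ P @ h.T) @ h @ P   # Equation (16)
    beta_new = beta + P_new @ h.T @ (t - h @ beta)                         # Equation (15)
    return P_new, beta_new
```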

3. The Proposed Adaptive Weight Online Sequential Extreme Learning Machine

In many real-world online applications, owing to measurement noise and unexpected external disturbances, the sequential data chunks often have varying confidence coefficients, and OS-ELM cannot handle these varying confidence coefficients well. If a data chunk with a low confidence coefficient is used to train the network in the normal way, the accuracy of the trained network is likely to be reduced. In this section, we propose the novel AWOS-ELM, in which the confidence coefficient of each data chunk is assessed before the chunk is used to train the network, and the weight of each data chunk is determined accordingly.

3.1. Integrated Structure

The block diagram of the AWOS-ELM algorithm is given in Figure 1. When the new sequential data chunk $\aleph_{k+1} = \{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=(\sum_{j=0}^{k} N_j)+1}^{\sum_{j=0}^{k+1} N_j}$ arrives, the weight estimator assesses the confidence coefficient of $\aleph_{k+1}$ and determines the corresponding weight $\lambda_{k+1}$. The training module of the AWOS-ELM algorithm then uses $\lambda_{k+1}$ and $\aleph_{k+1}$ to train the network. The weight estimator consists of an AWOS-ELM testing module, a residual generator, and a sigmoid mapper. The testing module produces the prediction values $\{\hat{\mathbf{t}}_i\}$ for the chunk inputs $\{\mathbf{x}_i\}$. The residual generator compares the prediction values $\{\hat{\mathbf{t}}_i\}$ with the target values $\{\mathbf{t}_i\}$ and produces the residual $r_{k+1}$, defined as:

$$r_{k+1} = \sqrt{\frac{\sum_{i=1}^{N_{k+1}} \left\|\hat{\mathbf{t}}_{i+\sum_{j=0}^{k} N_j} - \mathbf{t}_{i+\sum_{j=0}^{k} N_j}\right\|_F^2}{N_{k+1} \times m}} \quad (17)$$

where $m$ is the dimension of the output vector $\mathbf{t}_i$. The sigmoid mapper then produces the assessed weight $\lambda_{k+1}$ according to the difference between $\tau_r$ and $r_{k+1}$:

$$\lambda_{k+1} = \frac{1}{1 + e^{-\varphi(\tau_r - r_{k+1})}} \quad (18)$$

where $\tau_r$ is the threshold value and $\varphi$ is a scaling factor that sets the gradient of the sigmoid function at the threshold point: the greater $\varphi$ is, the greater the difference between the weights calculated for samples whose residuals lie close to the threshold $\tau_r$. In this paper, $\varphi$ is set manually to 500 according to the observed performance. As can be seen from Equation (18), the greater the residual $r_{k+1}$ is, the smaller the assessed weight $\lambda_{k+1}$ is, and $0 < \lambda_{k+1} < 1$. The network of the testing module is updated from the training module before the next sequential data chunk arrives. Abnormal samples that do not match the normal model well receive a low assessed weight, so their negative impact on the subsequent learning process is reduced. Thus, the AWOS-ELM algorithm can properly handle the varying confidence coefficient of each data chunk.
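To make the weight estimator concrete, the following is a small Python/NumPy sketch of Equations (17) and (18); it assumes the current network is summarized by the output weight beta and that the chunk's hidden-layer output h has already been computed, and the default values of tau_r and phi are only placeholders.

```python
import numpy as np

def chunk_weight(beta, h, t, tau_r=0.1, phi=500.0):
    """Assess the confidence of a new data chunk before it is used for training.

    h: (N, L) hidden-layer output of the chunk, t: (N, m) targets,
    beta: current output weights of the testing module."""
    t_hat = h @ beta                                    # prediction of the current network
    r = np.sqrt(np.sum((t_hat - t) ** 2) / t.size)      # residual r_{k+1}, Equation (17)
    lam = 1.0 / (1.0 + np.exp(-phi * (tau_r - r)))      # sigmoid mapping, Equation (18)
    return lam, r
```

A chunk whose residual exceeds the threshold receives a weight close to zero and therefore contributes little to the subsequent update.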

3.2. Formula Derivation

Let the assessed weight of the data chunk at time $k$ be $\lambda_k$; then $\beta_{k+1}$ is the least squares solution of the following equation:

$$\begin{bmatrix} \lambda_0 \mathbf{H}_0 \\ \lambda_1 \mathbf{h}_1 \\ \vdots \\ \lambda_{k+1} \mathbf{h}_{k+1} \end{bmatrix} \beta = \begin{bmatrix} \lambda_0 \mathbf{T}_0 \\ \lambda_1 \mathbf{t}_1 \\ \vdots \\ \lambda_{k+1} \mathbf{t}_{k+1} \end{bmatrix} \quad (19)$$

Let $\mathbf{H}_{k+1} := \begin{bmatrix} \mathbf{H}_k \\ \lambda_{k+1}\mathbf{h}_{k+1} \end{bmatrix}$, $\mathbf{H}_0 := \lambda_0 \mathbf{H}_0$, $\mathbf{T}_{k+1} := \begin{bmatrix} \mathbf{T}_k \\ \lambda_{k+1}\mathbf{t}_{k+1} \end{bmatrix}$, and $\mathbf{T}_0 := \lambda_0 \mathbf{T}_0$; then Equation (19) can be written compactly as:

$$\mathbf{H}_{k+1}\beta = \mathbf{T}_{k+1} \quad (20)$$
Theorem 1.
The solution of Equation (20) in the least squares sense can be obtained iteratively as follows:

$$\beta_{k+1} = \beta_k + \mathbf{K}_{k+1}\left(\mathbf{t}_{k+1} - \mathbf{h}_{k+1}\beta_k\right) \quad (21)$$

$$\mathbf{P}_{k+1} = \left(\mathbf{I} - \mathbf{K}_{k+1}\mathbf{h}_{k+1}\right)\mathbf{P}_k \quad (22)$$

$$\mathbf{K}_{k+1} = \mathbf{P}_k\mathbf{h}_{k+1}^T\left(\frac{\mathbf{I}}{\lambda_{k+1}^2} + \mathbf{h}_{k+1}\mathbf{P}_k\mathbf{h}_{k+1}^T\right)^{-1} \quad (23)$$

where $\mathbf{P}_k := \left(\mathbf{H}_k^T\mathbf{H}_k\right)^{-1}$ and $\beta_k = \mathbf{P}_k\mathbf{H}_k^T\mathbf{T}_k$.
Proof. 
According to the definition of $\mathbf{P}_k$, $\mathbf{P}_{k+1}$ can be written as:

$$\mathbf{P}_{k+1} = \left(\begin{bmatrix} \mathbf{H}_k \\ \lambda_{k+1}\mathbf{h}_{k+1} \end{bmatrix}^T \begin{bmatrix} \mathbf{H}_k \\ \lambda_{k+1}\mathbf{h}_{k+1} \end{bmatrix}\right)^{-1} = \left(\mathbf{H}_k^T\mathbf{H}_k + \lambda_{k+1}^2\mathbf{h}_{k+1}^T\mathbf{h}_{k+1}\right)^{-1} \quad (24)$$

Applying the Sherman-Morrison-Woodbury formula [23] to Equation (24), $\mathbf{P}_{k+1}$ can be determined iteratively as:

$$\mathbf{P}_{k+1} = \mathbf{P}_k - \mathbf{P}_k\mathbf{h}_{k+1}^T\left(\frac{\mathbf{I}}{\lambda_{k+1}^2} + \mathbf{h}_{k+1}\mathbf{P}_k\mathbf{h}_{k+1}^T\right)^{-1}\mathbf{h}_{k+1}\mathbf{P}_k = \mathbf{P}_k - \mathbf{K}_{k+1}\mathbf{h}_{k+1}\mathbf{P}_k \quad (25)$$

According to Equation (25) and $\beta_{k+1} = \mathbf{P}_{k+1}\mathbf{H}_{k+1}^T\mathbf{T}_{k+1}$, the output weight $\beta_{k+1}$ can be obtained from:

$$\begin{aligned} \beta_{k+1} &= \left(\mathbf{P}_k - \mathbf{K}_{k+1}\mathbf{h}_{k+1}\mathbf{P}_k\right)\begin{bmatrix} \mathbf{H}_k \\ \lambda_{k+1}\mathbf{h}_{k+1} \end{bmatrix}^T \begin{bmatrix} \mathbf{T}_k \\ \lambda_{k+1}\mathbf{t}_{k+1} \end{bmatrix} \\ &= \left(\mathbf{P}_k - \mathbf{K}_{k+1}\mathbf{h}_{k+1}\mathbf{P}_k\right)\left(\mathbf{H}_k^T\mathbf{T}_k + \lambda_{k+1}^2\mathbf{h}_{k+1}^T\mathbf{t}_{k+1}\right) \\ &= \mathbf{P}_k\mathbf{H}_k^T\mathbf{T}_k - \mathbf{K}_{k+1}\mathbf{h}_{k+1}\mathbf{P}_k\mathbf{H}_k^T\mathbf{T}_k + \lambda_{k+1}^2\left(\mathbf{P}_k\mathbf{h}_{k+1}^T - \mathbf{K}_{k+1}\mathbf{h}_{k+1}\mathbf{P}_k\mathbf{h}_{k+1}^T\right)\mathbf{t}_{k+1} \\ &= \beta_k - \mathbf{K}_{k+1}\mathbf{h}_{k+1}\beta_k + \lambda_{k+1}^2\left(\mathbf{P}_k\mathbf{h}_{k+1}^T - \mathbf{K}_{k+1}\mathbf{h}_{k+1}\mathbf{P}_k\mathbf{h}_{k+1}^T\right)\mathbf{t}_{k+1} \end{aligned} \quad (26)$$

The term $\mathbf{P}_k\mathbf{h}_{k+1}^T - \mathbf{K}_{k+1}\mathbf{h}_{k+1}\mathbf{P}_k\mathbf{h}_{k+1}^T$ in Equation (26) can be simplified as:

$$\begin{aligned} \mathbf{P}_k\mathbf{h}_{k+1}^T - \mathbf{K}_{k+1}\mathbf{h}_{k+1}\mathbf{P}_k\mathbf{h}_{k+1}^T &= \mathbf{P}_k\mathbf{h}_{k+1}^T - \mathbf{K}_{k+1}\mathbf{h}_{k+1}\mathbf{P}_k\mathbf{h}_{k+1}^T - \frac{\mathbf{K}_{k+1}}{\lambda_{k+1}^2} + \frac{\mathbf{K}_{k+1}}{\lambda_{k+1}^2} \\ &= \mathbf{P}_k\mathbf{h}_{k+1}^T - \mathbf{K}_{k+1}\left(\frac{\mathbf{I}}{\lambda_{k+1}^2} + \mathbf{h}_{k+1}\mathbf{P}_k\mathbf{h}_{k+1}^T\right) + \frac{\mathbf{K}_{k+1}}{\lambda_{k+1}^2} = \frac{\mathbf{K}_{k+1}}{\lambda_{k+1}^2} \end{aligned} \quad (27)$$

Substituting Equation (27) into Equation (26), $\beta_{k+1}$ can be written compactly as:

$$\beta_{k+1} = \beta_k + \mathbf{K}_{k+1}\left(\mathbf{t}_{k+1} - \mathbf{h}_{k+1}\beta_k\right) \quad (28)$$
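The weighted recursive update of Theorem 1 can be sketched in a few lines of Python/NumPy; the function name is illustrative, and setting lam = 1 recovers the plain OS-ELM step of Equations (15) and (16).

```python
import numpy as np

def awoselm_update(P, beta, h, t, lam):
    """One AWOS-ELM step for a chunk with assessed weight lam (Equations (21)-(23)).

    P: (L, L), beta: (L, m), h: (N, L), t: (N, m), with 0 < lam < 1 from Equation (18)."""
    N, L = h.shape
    K = P @ h.T @ np.linalg.inv(np.eye(N) / lam**2 + h @ P @ h.T)   # Equation (23)
    P_new = (np.eye(L) - K @ h) @ P                                 # Equation (22)
    beta_new = beta + K @ (t - h @ beta)                            # Equation (21)
    return P_new, beta_new
```

As lam approaches zero the gain K vanishes, so a low-confidence chunk barely changes the network, which is exactly the behavior the weight estimator is designed to produce.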
Proposed AWOS-ELM Algorithm: 
Given the number of hidden nodes $L$ and the activation function $g: \mathbb{R} \rightarrow \mathbb{R}$, the AWOS-ELM algorithm can be summarized as follows:
Step 1. Initialization phase: choose the initial data chunk $\aleph_0 = \{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=1}^{N_0}$, where $N_0 \geq L$.
(1)
Configure the learning parameters $(\omega_i, b_i)$, $i = 1, 2, \dots, L$, randomly, and set $\lambda_0 = 1$;
(2)
Calculate H 0 and β 0 according to Equations (9) and (10);
(3)
Set k = 0 .
Step 2. Sequential learning phase: iteratively train the network using the data chunk at time $k+1$, $\aleph_{k+1} = \{(\mathbf{x}_i, \mathbf{t}_i)\}_{i=(\sum_{j=0}^{k} N_j)+1}^{\sum_{j=0}^{k+1} N_j}$:
(1)
Assess the confidence coefficient of this new data chunk according to the test module in Figure 1, and determine the corresponding weight λ k + 1 ;
(2)
Calculate the partial matrices $\mathbf{h}_{k+1}$ and $\mathbf{t}_{k+1}$ according to Equations (11) and (12);
(3)
Compute β k + 1 in an iterative way in accordance with Equations (21)–(23);
(4)
Set k = k + 1 and go to Step 2 until all the training data chunks are used for the learning process.
Remark 1.
The AWOS-ELM is essentially the OS-ELM algorithm with an adaptive weight. As a new data chunk arrives, the proposed algorithm need not repeat the full ELM training process: AWOS-ELM uses the newly arriving data chunk and the information learned previously to update the trained network, whereas ELM would have to use all of the data chunks seen so far to retrain the network. Therefore, for sequential prediction problems, the AWOS-ELM algorithm tends to produce a much more efficient training process than the ELM algorithm.
Remark 2.
In the AWOS-ELM learning process, since the confidence coefficient of each data chunk varies with time, the weight of each data chunk is assessed as soon as the new training data arrive at the next time step, and the SLFN is then trained accordingly. Therefore, the learning process can aptly deal with the confidence coefficient of each data chunk.
Remark 3.
If $\lambda_0 = \lambda_1 = \dots = \lambda_k = 1$, that is, each training data chunk has the same assessed weight, then AWOS-ELM is clearly equivalent to OS-ELM, indicating that OS-ELM is a special case of the AWOS-ELM algorithm.
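The following end-to-end sketch ties the two phases together, reusing the chunk_weight and awoselm_update helpers sketched in Sections 3.1 and 3.2; it is an illustrative Python/NumPy outline, not the MATLAB implementation used for the experiments, and it assumes the data are supplied as a list of (inputs, targets) chunks.

```python
import numpy as np

def hidden_output(X, W, b):
    """Sigmoid hidden-layer mapping shared by the initialization and sequential phases."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

def awoselm_train(chunks, L, C=1e5, tau_r=0.1, phi=500.0, seed=None):
    """chunks: list of (X_k, T_k) pairs; the first chunk must contain at least L samples."""
    rng = np.random.default_rng(seed)
    X0, T0 = chunks[0]
    W = rng.uniform(-1.0, 1.0, size=(X0.shape[1], L))     # Step 1(1): random (omega_i, b_i)
    b = rng.uniform(-1.0, 1.0, size=L)
    H0 = hidden_output(X0, W, b)
    P = np.linalg.inv(np.eye(L) / C + H0.T @ H0)          # Step 1(2): Equations (9)-(10), lambda_0 = 1
    beta = P @ H0.T @ T0
    for Xk, Tk in chunks[1:]:                             # Step 2: sequential learning phase
        h = hidden_output(Xk, W, b)
        lam, _ = chunk_weight(beta, h, Tk, tau_r, phi)    # Step 2(1): assess the chunk first
        P, beta = awoselm_update(P, beta, h, Tk, lam)     # Step 2(2)-(3): weighted update
    return W, b, beta
```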

4. Experiments

For the purpose of verifying the validity of the proposed AWOS-ELM algorithm, six benchmark data sets and an identification problem on the dynamics of a flexible robot arm are considered in this section for the performance comparison between the OS-ELM and AWOS-ELM algorithms. The attributes of each data set are normalized into the range [−1, 1], and the corresponding outputs are normalized into [0, 1]. The software environment for all experiments is MATLAB 7.11 (MathWorks, Natick, MA, USA), and the hardware environment is an ordinary PC with an Intel Core i5-3210M processor. In order to obtain reliable statistical results, fifty trials are carried out for each case. For the performance comparison, the root mean squared error (RMSE) is defined as follows:
$$RMSE = \sqrt{\frac{\sum_{i=1}^{N_{testing}} \left\|\hat{\mathbf{t}}_i - \mathbf{t}_i\right\|_F^2}{N_{testing} \times m}} \quad (29)$$
where $\hat{\mathbf{t}}_i$ is the prediction corresponding to the target value $\mathbf{t}_i$, and $N_{testing}$ denotes the number of testing samples. For a learning algorithm, a smaller RMSE usually indicates better generalization performance. In addition, $\sigma$, the standard deviation of the RMSEs over the 50 trials, reflects the reliability of the algorithm. Two classical hidden node functions, the radial basis function $h(\mathbf{x}) = e^{-b_i\|\mathbf{x} - \mathbf{a}_i\|^2}$ and the sigmoid function $h(\mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{a}_i \cdot \mathbf{x} + b_i)}}$, are chosen for each learning algorithm. For the sigmoid function, $\omega_i$ and $b_i$ are drawn from the uniform distribution on the closed interval [−1, 1]; for the RBF function, $\omega_i$ is also drawn from the uniform distribution on [−1, 1], while $b_i$ is drawn from the uniform distribution on the open interval (0, 0.5) [15]. For the AWOS-ELM algorithm, the threshold $\tau_r$ is selected from the set $\{x\,|\,x = 0.02 + 0.01k,\ k = 0, 1, 2, \dots, 18\}$, and the best threshold is chosen manually according to the prediction performance. In addition, the regularization parameter is set to $C = 10^5$ [24]. In order to verify the prediction performance of AWOS-ELM in the presence of disturbance, Gaussian noise with a standard deviation of 0.0015 is added to the training samples. For a fair comparison, the number of hidden nodes of OS-ELM is optimized by cross validation, and the optimized number of hidden nodes is then also used for the AWOS-ELM algorithm.
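As a small illustration of the evaluation protocol, the RMSE of Equation (29) and a min-max scaling of the kind described above can be written as follows; the scaling function is an assumption about how the normalization was carried out, since the paper does not state it explicitly.

```python
import numpy as np

def rmse(T_hat, T):
    """Root mean squared error over the testing set, Equation (29)."""
    return np.sqrt(np.sum((T_hat - T) ** 2) / T.size)

def scale_to(X, lo, hi):
    """Column-wise min-max scaling (attributes to [-1, 1], outputs to [0, 1])."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return lo + (hi - lo) * (X - xmin) / (xmax - xmin)
```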

4.1. Benchmark Data Sets

In this subsection, we compare AWOS-ELM and OS-ELM on six time series prediction problems consisting of three artificial time series and three real-world time series. Among them, a monthly milk production data set, with 100 training and 44 testing data, and an electricity production data set, with 350 training and 102 testing data, are obtained from the well-known Time Series Data Library. The sunspot time series, with 2500 training and 690 testing data, is the monthly mean total sunspot number from January 1749 to December 2015 and is obtained from the Solar Influences Data Analysis Center. The pseudo periodic synthetic time series, with 8000 training and 1801 testing data, is obtained from the University of California, Irvine (UCI) repository. The other two chaotic series, the Mackey-Glass and Logistic time series, are generated from mathematical equations. The Mackey-Glass series $\{x_k^{mg}\,|\,k = 1, 2, 3, \dots\}$ is generated according to the following differential delay equation [25]:

$$\frac{dx^{mg}(t)}{dt} = \frac{a\,x^{mg}(t-\tau)}{1 + \left[x^{mg}(t-\tau)\right]^{10}} - b\,x^{mg}(t) \quad (30)$$
where $\tau = 17$, $a = 0.2$, $b = 0.1$, and $x^{mg}(0) = 1.2$. The Logistic series $\{x_k^{lo}\,|\,k = 1, 2, 3, \dots\}$ is produced by the following recursive equation [26]:

$$x_{k+1}^{lo} = \lambda\,x_k^{lo}\left(1 - x_k^{lo}\right) \quad (31)$$

where $\lambda = 3.5$.
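For reference, the two artificial series can be generated with a short script; the Euler integration step, the constant-history initialization of the Mackey-Glass equation, and the initial value of the Logistic map are illustrative assumptions not specified in the paper.

```python
import numpy as np

def mackey_glass(n, tau=17, a=0.2, b=0.1, x0=1.2, dt=1.0):
    """Generate a Mackey-Glass series by Euler integration of Equation (30)."""
    hist = int(tau / dt)
    x = np.full(n + hist, x0)                 # history assumed constant at x0
    for k in range(hist, n + hist - 1):
        x_tau = x[k - hist]
        x[k + 1] = x[k] + dt * (a * x_tau / (1.0 + x_tau ** 10) - b * x[k])
    return x[hist:]

def logistic(n, lam=3.5, x0=0.5):
    """Generate the Logistic series of Equation (31)."""
    x = np.empty(n)
    x[0] = x0                                 # initial value chosen for illustration
    for k in range(n - 1):
        x[k + 1] = lam * x[k] * (1.0 - x[k])
    return x
```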
The prediction performance comparison between AWOS-ELM and OS-ELM on the benchmark data sets is given in Table 1, and the mean and standard deviation of the adaptive weights for the AWOS-ELM algorithm are listed in Table 2. The results in Table 1 and Table 2 are mean values over 50 trials.
As observed from Table 1, the training time for AWOS-ELM is close to that for OS-ELM on the various data sets, as expected. When the number of hidden nodes and the chunk size are kept the same, AWOS-ELM has a lower RMSE and a lower standard deviation than OS-ELM on most data sets, which implies higher prediction accuracy and superior prediction stability for the AWOS-ELM algorithm owing to the assessed confidence coefficient. Furthermore, from Table 2 we can see that the mean weights are all less than 0.92, which implies that the proportion of low-confidence data is considerable. Additionally, the standard deviations of the weights are quite large, which implies that there is a great difference between the weights of normal data and those of low-confidence data in the AWOS-ELM algorithm. Thus, AWOS-ELM can effectively improve the prediction performance.

4.2. Robot Arm Example

In this subsection, the performance of AWOS-ELM and OS-ELM is compared on an identification problem concerning the dynamics modeling of a robot arm. The system input and output of this flexible robot arm are the measured reaction torque and the acceleration, respectively [27]. There are two attributes in this flexible robot arm example, and all 1024 pairs of data are shown in Figure 2. In order to learn the model, the input and output of the SLFN, $\mathbf{x}_i$ and $t_i$, are defined as $\mathbf{x}_i = [u_i, u_{i-1}, u_{i-2}, u_{i-3}, u_{i-4}, d_{i-1}, d_{i-2}, d_{i-3}, d_{i-4}]^T$ and $t_i = d_i$, respectively, where $u_i$ and $d_i$ are the measured reaction torque and the acceleration of the robot arm. Thus, the number of samples is 1019. The training set contains the first 819 samples, and the remaining 200 samples are used to test the network.
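A minimal sketch of how the regression samples can be assembled from the measured torque and acceleration sequences is given below; the exact index alignment is an assumption (the construction here yields 1020 rows, whereas the paper reports 1019, so the original offsets may differ slightly).

```python
import numpy as np

def build_arm_samples(u, d):
    """Form (x_i, t_i) pairs for the robot arm model:
    x_i = [u_i, ..., u_{i-4}, d_{i-1}, ..., d_{i-4}]^T and t_i = d_i."""
    X, T = [], []
    for i in range(4, len(u)):
        X.append(np.r_[u[i-4:i+1][::-1], d[i-4:i][::-1]])   # 5 torque lags + 4 acceleration lags
        T.append(d[i])
    return np.asarray(X), np.asarray(T).reshape(-1, 1)
```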
Figure 3 illustrates the relationship between the number of hidden nodes and the testing RMSE of the OS-ELM and AWOS-ELM algorithms. It shows that OS-ELM and AWOS-ELM have similar testing RMSE when the number of hidden nodes is low, and that the testing RMSE of the AWOS-ELM algorithm tends to be lower than that of OS-ELM as the number of hidden nodes increases. When the number of hidden nodes is too small, the prediction accuracy of the SLFN trained by either the OS-ELM or the AWOS-ELM algorithm is very low. As described in Equations (17) and (18), the computed adaptive weight depends on the prediction value; if the prediction accuracy is too low, AWOS-ELM cannot differentiate low-confidence data from normal data very well. Thus, in Figure 3, the performance of AWOS-ELM is similar to, or even slightly worse than, that of OS-ELM when the number of hidden nodes is low. For a fair comparison, the number of hidden nodes is set to 45, which ensures that the algorithms perform well with both the sigmoid and RBF functions. Figure 4 illustrates the identification effectiveness on the flexible robot arm dynamics modeling problem, and it is easy to see that the AWOS-ELM algorithm has higher prediction accuracy than the OS-ELM algorithm for both the sigmoid function case and the RBF function case. Table 3 shows that the proposed AWOS-ELM algorithm and the OS-ELM algorithm have similar training times and standard deviations, implying similar learning speed and stability, and that the AWOS-ELM algorithm has a lower RMSE than the OS-ELM algorithm, which implies better generalization performance and higher prediction accuracy.

5. Conclusions

In many online learning applications, the sequentially arriving data usually have varying confidence coefficients. OS-ELM trains the neural network chunk by chunk, but it cannot deal well with the varying confidence coefficient of each data chunk. Based on the OS-ELM algorithm, we propose a novel algorithm, AWOS-ELM, which assesses the confidence coefficient and determines the weight of each data chunk before using the chunk to train the network. The proposed AWOS-ELM algorithm improves the learning process by picking out the abnormal samples that may mislead the subsequent learning process and reducing their impact. Thus, AWOS-ELM learns sequentially in the same way as OS-ELM while properly handling the varying confidence coefficients of the data chunks. Compared with OS-ELM, simulations on benchmark databases demonstrate that AWOS-ELM performs better in generalization performance, stability, and prediction accuracy. In addition, experimental results on a flexible robot arm identification problem demonstrate that AWOS-ELM achieves higher prediction accuracy than OS-ELM, while the two algorithms have similar training speed and stability.

Acknowledgments

This research was funded by the National Natural Science Foundation of China (under Grant 51276087).

Author Contributions

Jinquan Huang and Feng Lu conceived and designed the main idea, Junjie Lu wrote the program and carried out the experiments, Junjie Lu and Jinquan Huang analyzed the data and interpreted the results, Junjie Lu and Feng Lu wrote the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Shukla, A.K.; Garde, Y.A.; Jain, I. Forecast of Weather Parameters Using Time Series Data. MAUSAM 2014, 65, 209–520.
2. Kato, R.; Nagao, T. Stock Market Prediction Based on Interrelated Time Series Data. In Proceedings of the IEEE Symposium on Computers & Informatics, Penang, Malaysia, 18–20 March 2012.
3. Soni, S.K.; Chand, N.; Singh, D.P. Reducing the Data Transmission in WSNs Using Time Series Prediction Model. In Proceedings of the IEEE International Conference on Signal Processing, Communications and Computing, Hong Kong, China, 13–15 August 2012.
4. Hulsmann, M.; Borscheid, D.; Friedrich, C.M.; Reith, D. General Sales Forecast Models for Automobile Markets and Their Analysis. Trans. Mach. Learn. Data Min. 2011, 5, 65–86.
5. Gooijer, J.G.D.; Hyndman, R.J. 25 Years of Time Series Forecasting. Int. J. Forecast. 2006, 22, 443–473.
6. Box, G.E.P.; Jenkins, G.M. Time Series Analysis: Forecasting and Control. J. Oper. Res. Soc. 1976, 22, 199–201.
7. Smith, C.; Jin, Y. Evolutionary Multi-Objective Generation of Recurrent Neural Network Ensembles for Time Series Prediction. Neurocomputing 2014, 143, 302–311.
8. Domanska, D.; Wojtylak, M. Application of Fuzzy Time Series Models for Forecasting Pollution Concentrations. Expert Syst. Appl. 2012, 39, 7673–7679.
9. Kumar, N.; Jha, G.K. A Time Series ANN Approach for Weather Forecasting. Int. J. Control Theory Comput. Model. 2013, 3, 19–25.
10. Claveria, O.; Torra, S. Forecasting Tourism Demand to Catalonia: Neural Networks vs. Time Series Models. Econ. Model. 2014, 36, 220–228.
11. Flores, J.J.; Graff, M.; Rodriguez, H. Evolutive Design of ARMA and ANN Models for Time Series Forecasting. Renew. Energy 2012, 44, 225–230.
12. Adhikari, R. A Neural Network Based Linear Ensemble Framework for Time Series Forecasting. Neurocomputing 2015, 157, 231–242.
13. Kumar, D.A.; Murugan, S. Performance Analysis of Indian Stock Market Index Using Neural Network Time Series Model. In Proceedings of the International Conference on Pattern Recognition, Informatics and Mobile Engineering, Salem, India, 21–22 February 2013.
14. Yoon, H.; Hyun, Y.; Ha, K.; Lee, K.; Kim, G. A Method to Improve the Stability and Accuracy of ANN- and SVM-Based Time Series Models for Long-Term Groundwater Level Predictions. Comput. Geosci. 2016, 90, 144–155.
15. Huang, G.B.; Chen, L.; Siew, C.K. Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes. IEEE Trans. Neural Netw. 2006, 17, 879–892.
16. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks. In Proceedings of the International Joint Conference on Neural Networks, Budapest, Hungary, 25–29 July 2004.
17. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme Learning Machine: Theory and Applications. Neurocomputing 2006, 70, 489–501.
18. Hosseinioun, N. Forecasting Outlier Occurrence in Stock Market Time Series Based on Wavelet Transform and Adaptive ELM Algorithm. J. Math. Financ. 2016, 6, 127–133.
19. Dash, R.; Dash, P.K.; Bisoi, R. A Self Adaptive Differential Harmony Search Based Optimized Extreme Learning Machine for Financial Time Series Prediction. Swarm Evol. Comput. 2014, 19, 25–42.
20. Liang, N.Y.; Huang, G.B.; Saratchandran, P.; Sundararajan, N. A Fast and Accurate Online Sequential Learning Algorithm for Feedforward Networks. IEEE Trans. Neural Netw. 2006, 17, 1411–1423.
21. Huang, G.B.; Zhou, H.M.; Ding, X.J.; Zhang, R. Extreme Learning Machine for Regression and Multiclass Classification. IEEE Trans. Syst. Man Cybern. B 2012, 42, 513–529.
22. Huang, G.; Ding, X.; Zhou, H. Optimization Method Based Extreme Learning Machine for Classification. Neurocomputing 2010, 74, 155–163.
23. Deng, C.Y. A Generalization of the Sherman-Morrison-Woodbury Formula. Appl. Math. Lett. 2011, 24, 1561–1564.
24. Guo, W.; Xu, T.; Tang, K. M-Estimator-Based Online Sequential Extreme Learning Machine for Predicting Chaotic Time Series with Outliers. Neural Comput. Appl. 2016, 1–18.
25. El-Sayed, A.M.; Salman, S.M.; Elabd, N.A. On a Fractional-Order Delay Mackey-Glass Equation. Adv. Differ. Equ. 2016, 2016, 1–11.
26. Berezowski, M.; Grabski, A. Chaotic and Non-chaotic Mixed Oscillations in a Logistic System with Delay. Chaos Solitons Fractals 2002, 14, 1–6.
27. Balasundaram, S.; Kapil, D.G. Lagrangian Support Vector Regression via Unconstrained Convex Minimization. Neural Netw. 2014, 51, 67–79.
Figure 1. Block diagram of the adaptive weight online sequential extreme learning machine (AWOS-ELM) algorithm.
Figure 2. Flexible robot arm attributes. (a) System input; (b) system output.
Figure 3. Relationship between testing RMSE and hidden nodes number. (a) Sigmoid node; (b) RBF node.
Figure 4. Comparison of the prediction ability of the OS-ELM algorithm and the AWOS-ELM algorithm. (a) Sigmoid function; (b) RBF function.
Table 1. Performance comparison between AWOS-ELM and OS-ELM on benchmark data sets. RMSE, root mean squared error.

| Data Set | Hidden Node Type | Algorithm | RMSE | σ | Hidden Nodes | τ_r | Chunk Size | Training Time/s |
| Logistic | Sigmoid function | OS-ELM | 0.0164 | 0.0046 | 50 | - | 10 | 0.0924 |
| Logistic | Sigmoid function | AWOS-ELM | 0.0041 | 0.0010 | 50 | 0.09 | 10 | 0.0914 |
| Logistic | Radial basis function | OS-ELM | 0.0099 | 0.0023 | 50 | - | 10 | 0.1919 |
| Logistic | Radial basis function | AWOS-ELM | 0.0039 | 0.0007 | 50 | 0.09 | 10 | 0.1913 |
| Mackey-Glass | Sigmoid function | OS-ELM | 0.0370 | 0.0021 | 25 | - | 5 | 0.1638 |
| Mackey-Glass | Sigmoid function | AWOS-ELM | 0.0234 | 0.0026 | 25 | 0.1 | 5 | 0.1675 |
| Mackey-Glass | Radial basis function | OS-ELM | 0.0362 | 0.0028 | 25 | - | 5 | 0.2764 |
| Mackey-Glass | Radial basis function | AWOS-ELM | 0.0199 | 0.0024 | 25 | 0.1 | 5 | 0.2845 |
| Sunspot | Sigmoid function | OS-ELM | 0.0859 | 0.0006 | 40 | - | 10 | 0.0612 |
| Sunspot | Sigmoid function | AWOS-ELM | 0.0833 | 0.0006 | 40 | 0.15 | 10 | 0.0596 |
| Sunspot | Radial basis function | OS-ELM | 0.0844 | 0.0005 | 40 | - | 10 | 0.1251 |
| Sunspot | Radial basis function | AWOS-ELM | 0.0831 | 0.0004 | 40 | 0.15 | 10 | 0.1236 |
| Pseudo periodic synthetic series | Sigmoid function | OS-ELM | 0.0342 | 0.0007 | 20 | - | 15 | 0.1760 |
| Pseudo periodic synthetic series | Sigmoid function | AWOS-ELM | 0.0094 | 0.0004 | 20 | 0.1 | 15 | 0.1847 |
| Pseudo periodic synthetic series | Radial basis function | OS-ELM | 0.0365 | 0.0016 | 20 | - | 15 | 0.2577 |
| Pseudo periodic synthetic series | Radial basis function | AWOS-ELM | 0.0219 | 0.0025 | 20 | 0.1 | 15 | 0.2671 |
| Milk production | Sigmoid function | OS-ELM | 0.0561 | 0.0096 | 15 | - | 1 | 0.0062 |
| Milk production | Sigmoid function | AWOS-ELM | 0.0394 | 0.0110 | 15 | 0.12 | 1 | 0.0056 |
| Milk production | Radial basis function | OS-ELM | 0.0638 | 0.0101 | 15 | - | 1 | 0.0125 |
| Milk production | Radial basis function | AWOS-ELM | 0.0506 | 0.0097 | 15 | 0.12 | 1 | 0.0103 |
| Electricity production | Sigmoid function | OS-ELM | 0.0301 | 0.0053 | 12 | - | 1 | 0.0168 |
| Electricity production | Sigmoid function | AWOS-ELM | 0.0265 | 0.0034 | 12 | 0.08 | 1 | 0.0165 |
| Electricity production | Radial basis function | OS-ELM | 0.0566 | 0.0233 | 12 | - | 1 | 0.0315 |
| Electricity production | Radial basis function | AWOS-ELM | 0.0406 | 0.0124 | 12 | 0.08 | 1 | 0.0303 |
Table 2. Mean and standard deviation of the adaptive weights for the AWOS-ELM algorithm.

| Data Set | Sigmoid Hidden Node Mean | Sigmoid Hidden Node Standard Deviation | RBF Hidden Node Mean | RBF Hidden Node Standard Deviation |
| Logistic | 0.7366 | 0.4403 | 0.7855 | 0.4080 |
| Mackey-Glass | 0.8750 | 0.3311 | 0.8748 | 0.3313 |
| Sunspot | 0.6440 | 0.4729 | 0.6447 | 0.4726 |
| Pseudo periodic synthetic series | 0.8726 | 0.3338 | 0.8592 | 0.3476 |
| Milk production | 0.8615 | 0.3435 | 0.9009 | 0.2959 |
| Electricity production | 0.8857 | 0.3152 | 0.9176 | 0.2741 |
Table 3. Performance comparison between AWOS-ELM and OS-ELM on the robot arm example.

| Hidden Node Type | Algorithm | RMSE | σ | Hidden Nodes | τ_r | Chunk Size | Training Time/s |
| Sigmoid function | OS-ELM | 0.0207 | 0.0004 | 45 | - | 5 | 0.1401 |
| Sigmoid function | AWOS-ELM | 0.0081 | 0.0004 | 45 | 0.04 | 5 | 0.1426 |
| Radial basis function | OS-ELM | 0.0198 | 0.0006 | 45 | - | 5 | 0.3738 |
| Radial basis function | AWOS-ELM | 0.0084 | 0.0007 | 45 | 0.04 | 5 | 0.4115 |
