*3.2. Deep Belief Network Based TSA Learning-Aided Model*

Given the structure of the data, the deep belief network (DBN) is a suitable choice for our goal. A DBN is a probabilistic generative model that stacks multiple restricted Boltzmann machines (RBMs) and a fully connected layer. An RBM is an unsupervised network composed of a visible layer and a hidden layer; it can probabilistically reconstruct input features through the bidirectional connections between the two layers.

Since the RBM is an energy-based model, its energy function is defined as [25]:

$$E(\mathbf{v}, \mathbf{h}) = -\mathbf{h}^T \mathbf{w} \mathbf{v} - \mathbf{a}^T \mathbf{v} - \mathbf{b}^T \mathbf{h}, \quad \mathbf{w} \in \mathbb{R}^{n_h \times n_v}, \ \mathbf{a} \in \mathbb{R}^{n_v}, \ \mathbf{b} \in \mathbb{R}^{n_h} \tag{10}$$

where *v*, *h* are the state vectors of the visible and hidden layers; *a*, *b* are the bias vectors of *v*, *h*, respectively; and *w* is the weight matrix between the two layers. The joint probability distribution *P*(*v*, *h*) of *v* and *h* is formulated as:

$$P(\mathbf{v}, \mathbf{h}) = Z^{-1} e^{-E(\mathbf{v}, \mathbf{h})}, \quad Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} \tag{11}$$

where *Z* is the normalization factor that ensures the probability distribution sums to 1. The marginal probabilities of *v* and *h*, which are also called the likelihood functions, can be formulated as:

$$P(\mathbf{v}) = Z^{-1} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}, \quad P(\mathbf{h}) = Z^{-1} \sum_{\mathbf{v}} e^{-E(\mathbf{v}, \mathbf{h})} \tag{12}$$
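For concreteness, the sketch below evaluates Equations (10)–(12) for a toy binary RBM in Python. The layer sizes, random weights, and variable names are illustrative assumptions, not values from the paper; the normalization factor *Z* is computed by brute-force enumeration, which is feasible only for very small networks.

```python
import numpy as np
from itertools import product

# Hypothetical tiny RBM, sized so that Z in Eq. (11) is enumerable.
rng = np.random.default_rng(0)
nv, nh = 4, 3                      # numbers of visible / hidden units
w = rng.normal(0.0, 0.1, (nh, nv))  # weight matrix w (n_h x n_v)
a = np.zeros(nv)                   # visible-layer biases a
b = np.zeros(nh)                   # hidden-layer biases b

def energy(v, h):
    """Energy function of Eq. (10): E(v,h) = -h^T w v - a^T v - b^T h."""
    return -h @ w @ v - a @ v - b @ h

# Normalization factor Z of Eq. (11): sum over all 2^(nv+nh) binary states.
states_v = [np.array(s) for s in product([0, 1], repeat=nv)]
states_h = [np.array(s) for s in product([0, 1], repeat=nh)]
Z = sum(np.exp(-energy(v, h)) for v in states_v for h in states_h)

# Joint probability P(v,h) of Eq. (11) and marginal P(v) of Eq. (12).
v0, h0 = states_v[5], states_h[2]
p_joint = np.exp(-energy(v0, h0)) / Z
p_v = sum(np.exp(-energy(v0, h)) for h in states_h) / Z
print(p_joint, p_v)
```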

Due to the lack of intra-layer connections in the RBM, the activations of the units within each layer are conditionally independent given the other layer. Therefore, when the states of the visible (or hidden) units are given, the conditional probability that an individual unit of the hidden (or visible) layer is activated can be deduced as:

$$\begin{array}{l} P(h_i = 1 \mid \mathbf{v}) = M\big(b_i + \sum_j w_{ij} v_j\big), \quad h_i \in \mathbf{h}, \ v_j \in \mathbf{v}, \ w_{ij} \in \mathbf{w} \\ P(v_j = 1 \mid \mathbf{h}) = M\big(a_j + \sum_i w_{ij} h_i\big), \quad a_j \in \mathbf{a}, \ b_i \in \mathbf{b} \end{array} \tag{13}$$

where *M*(·) is the activation function; in this paper, it is the *Sigmoid* function. Then, the conditional probability of *h* (or *v*) given *v* (or *h*) can be obtained:

$$P(\mathbf{h} \mid \mathbf{v}) = \prod_{i=1}^{n_h} P(h_i \mid \mathbf{v}), \quad P(\mathbf{v} \mid \mathbf{h}) = \prod_{j=1}^{n_v} P(v_j \mid \mathbf{h}), \quad h_i \in \mathbf{h}, \ v_j \in \mathbf{v} \tag{14}$$

where *n*<sub>*h*</sub>, *n*<sub>*v*</sub> are the numbers of units in the hidden and visible layers, respectively. Training an RBM amounts to maximizing the following log-likelihood *L*:

$$\ln L = \ln \prod_{\mathbf{v} \in S_{train}} P(\mathbf{v}) \tag{15}$$

where *S*<sub>train</sub> is the training sample set. The commonly used numerical method for maximizing (15) is gradient ascent, which updates the parameters iteratively. Taking *w* as an example, the weight *w*<sub>*ij*</sub> is updated via Equations (16) and (17):

$$w_{ij} \leftarrow w_{ij} + \eta \cdot \frac{\partial \ln P(\mathbf{v})}{\partial w_{ij}}, \quad w_{ij} \in \mathbf{w} \tag{16}$$

$$\frac{\partial \ln P(\mathbf{v})}{\partial w_{ij}} = P(h_i = 1 \mid \mathbf{v})\, v_j - \sum_{\mathbf{v}} P(\mathbf{v})\, P(h_i = 1 \mid \mathbf{v})\, v_j, \quad h_i \in \mathbf{h}, \ v_j \in \mathbf{v} \tag{17}$$

where *η* is the learning rate.
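The second term of Equation (17) is an expectation over all visible configurations and is intractable for realistic network sizes. A standard workaround, which the paper does not spell out and is therefore an assumption here, is the contrastive divergence (CD-1) approximation, sketched below with hypothetical shapes and learning rate.

```python
import numpy as np

def sigmoid(x):
    """The Sigmoid activation M(.) of Equation (13)."""
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, w, a, b, eta=0.1, rng=None):
    """One CD-1 step approximating the gradient ascent of Eqs. (16)-(17).
    CD-1 replaces the intractable expectation in Eq. (17) with a single
    Gibbs sampling step; the paper itself only states gradient ascent."""
    if rng is None:
        rng = np.random.default_rng(0)
    ph0 = sigmoid(b + w @ v0)                  # Eq. (13): P(h_i = 1 | v)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0   # sample h ~ P(h | v0)
    pv1 = sigmoid(a + w.T @ h0)                # Eq. (13): P(v_j = 1 | h)
    ph1 = sigmoid(b + w @ pv1)                 # re-infer hidden probabilities
    # Positive ("data") phase minus negative ("model") phase of Eq. (17).
    w += eta * (np.outer(ph0, v0) - np.outer(ph1, pv1))
    a += eta * (v0 - pv1)
    b += eta * (ph0 - ph1)
    return w, a, b

# Illustrative usage with hypothetical sizes n_h = 3, n_v = 4.
rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.1, (3, 4))
a, b = np.zeros(4), np.zeros(3)
v_sample = rng.integers(0, 2, 4).astype(float)
w, a, b = cd1_update(v_sample, w, a, b)
```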

DBN training consists of two parts: pre-training and fine-tuning. In the pre-training part, any two connected layers except the fully connected layer can be regarded as an RBM. These RBMs are trained to obtain good initial weights and to alleviate the vanishing gradient problem. In the fine-tuning part, the trained RBMs are connected with the fully connected layer. The sample sets [*X*, *Y*] and a global learning algorithm are then used for supervised fine-tuning of the DBN, learning the mapping between input data and labels. Hence, the mathematical model of an *l*-layer DBN can be written compactly as Equation (18):

$$\Psi(\mathbf{X}) = O(M(D_{l-1}(\cdots M(D_1(\mathbf{X})) \cdots))) \tag{18}$$

$$D_i(\mathbf{x}_i) = \mathbf{w}_i \mathbf{x}_i + \mathbf{b}_i, \quad \mathbf{w}_i \in \mathbb{R}^{n_i \times n_{i-1}}, \ i = 1, \dots, l \tag{19}$$

$$\begin{array}{c} \mathbf{w}_i = [\mathbf{w}_i^1; \dots; \mathbf{w}_i^{n_i}], \quad \mathbf{w}_i^k \in \mathbb{R}^{1 \times n_{i-1}} \\ \mathbf{b}_i = [b_i^1, \dots, b_i^{n_i}]^T \end{array} \tag{20}$$

where *n*<sub>*i*</sub> is the number of units in layer *i* (*n*<sub>0</sub> being the input dimension), and *O*(·) is the output function of the fully connected layer, with *O*(*x*<sub>*l*</sub>) = *D*<sub>*l*</sub>(*x*<sub>*l*</sub>). The loss function can be defined as the weighted sum of the estimation error and the *L*<sup>2</sup> norm of the weights, i.e.:

$$\min \ \alpha \, \| \mathbf{Y} - \Psi(\mathbf{X}) \|_2^2 + (1 - \alpha) \sum_{i=1}^{l} \| \mathbf{w}_i \|_2^2 \tag{21}$$

Equation (21) can be solved by pre-training and fine-tuning the DBN model [24]. After training, the learning-aided model is reformulated as Equation (22):

$$\Gamma_{\mathcal{E}} = \Psi_{\mathcal{E}}(\mathbf{X}), \quad \mathcal{E} \in \mathcal{S}_{\mathcal{E}} \tag{22}$$
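As a minimal sketch of Equations (18)–(21), the following Python code builds the forward map Ψ(*X*) from the affine layers *D*<sub>*i*</sub> and the *Sigmoid* activation *M*(·), and evaluates the weighted loss of Equation (21). The layer sizes, *α*, and random initialization are illustrative assumptions; in the paper the weights come from RBM pre-training followed by supervised fine-tuning.

```python
import numpy as np

def sigmoid(x):
    """The Sigmoid activation M(.) applied after each hidden layer."""
    return 1.0 / (1.0 + np.exp(-x))

def psi(X, weights, biases):
    """Forward pass of Eqs. (18)-(20): x <- M(w_i x + b_i) for the hidden
    layers, with the plain affine map D_l as the output function O."""
    x = X
    for w_i, b_i in zip(weights[:-1], biases[:-1]):
        x = sigmoid(w_i @ x + b_i)          # M(D_i(x_i)), Eq. (19)
    return weights[-1] @ x + biases[-1]     # O(x_l) = D_l(x_l)

def loss(Y, X, weights, biases, alpha=0.9):
    """Weighted loss of Eq. (21): estimation error plus L2 regularization."""
    err = np.sum((Y - psi(X, weights, biases)) ** 2)
    reg = sum(np.sum(w_i ** 2) for w_i in weights)
    return alpha * err + (1 - alpha) * reg

# Hypothetical sizing n_0..n_l, e.g. 8 inputs -> 16 -> 16 -> 2 outputs.
sizes = [8, 16, 16, 2]
rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 0.1, (n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
print(loss(np.zeros(2), rng.random(8), weights, biases))
```

In practice, the `weights` list above would be initialized from the pre-trained RBMs (each *w*<sub>*i*</sub> taken from the corresponding RBM) before the supervised fine-tuning minimizes Equation (21).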
