3.2.2. Feature Learning

The feature extractor architecture is composed of the LSTM module and a fully convolutional module. The goal of this phase is to learn effective time series features in a parallel manner through multiple pairs of recurrent layers and convolutional layers in advance.

1. LSTM module: This module contains an LSTM layer followed by a dropout layer. We employ an LSTM feature extractor to capture temporal patterns of CCS time series across multiscale and multifrequency dimensions. Specifically, at time step *t*, the mold level fluctuation input $X = [x\_1, x\_2, \dots, x\_T]$ and the hidden state $H\_{t-1}$ of the previous time step are given. The input gate $i\_t$, forget gate $f\_t$, and output gate $o\_t$ are defined as follows. The input gate controls the extent to which a new value flows into the cell.

$$i\_t = \sigma(X\_t W\_{xi} + H\_{t-1} W\_{hi} + b\_i) \tag{3}$$

The forget gate decides what information should be dropped.

$$f\_t = \sigma(X\_t W\_{xf} + H\_{t-1} W\_{hf} + b\_f) \tag{4}$$

The output gate determines which parts are useful.

$$o\_t = \sigma(X\_t W\_{xo} + H\_{t-1} W\_{ho} + b\_o) \tag{5}$$

The candidate memory cell $\tilde{C}\_t$ at time step *t* is calculated as

$$\tilde{C}\_t = \tanh(X\_t W\_{xc} + H\_{t-1} W\_{hc} + b\_c) \tag{6}$$

The memory cell $C\_t$ at the current time step combines the information of the previous memory cell $C\_{t-1}$ and the current candidate memory cell, with the forget gate and input gate controlling the flow of information.

$$C\_t = f\_t \odot C\_{t-1} + i\_t \odot \tilde{C}\_t \tag{7}$$

The output gate controls the flow of information from memory cells to the hidden state *Ht*, which can be calculated as:

$$H\_t = o\_t \odot \tanh(C\_t) \tag{8}$$
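Taken together, Eqs. (3)–(8) define one LSTM step. As a minimal sketch for a single hidden unit, with illustrative scalar weights (all parameter names and values below are hypothetical, not the trained model's):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update for a single hidden unit, mirroring Eqs. (3)-(8).
    p holds scalar weights/biases (illustrative, not learned values)."""
    i_t = sigmoid(x_t * p["w_xi"] + h_prev * p["w_hi"] + p["b_i"])        # input gate,  Eq. (3)
    f_t = sigmoid(x_t * p["w_xf"] + h_prev * p["w_hf"] + p["b_f"])        # forget gate, Eq. (4)
    o_t = sigmoid(x_t * p["w_xo"] + h_prev * p["w_ho"] + p["b_o"])        # output gate, Eq. (5)
    c_tilde = math.tanh(x_t * p["w_xc"] + h_prev * p["w_hc"] + p["b_c"])  # candidate,   Eq. (6)
    c_t = f_t * c_prev + i_t * c_tilde                                    # memory cell, Eq. (7)
    h_t = o_t * math.tanh(c_t)                                            # hidden state, Eq. (8)
    return h_t, c_t

# Run over a toy mold-level sequence X = [x_1, ..., x_T].
params = {k: 0.5 for k in ["w_xi", "w_hi", "b_i", "w_xf", "w_hf", "b_f",
                           "w_xo", "w_ho", "b_o", "w_xc", "w_hc", "b_c"]}
h, c = 0.0, 0.0
for x in [0.1, -0.2, 0.4]:
    h, c = lstm_step(x, h, c, params)
# h now plays the role of H_T, the LSTM feature used downstream.
```

Because $\tanh$ is bounded in $(-1, 1)$ and the output gate lies in $(0, 1)$, the hidden state is always bounded, which keeps the extracted feature on a stable scale.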

We feed the raw or transformed mold level fluctuation into the LSTM and get the output vector $O\_v = [H\_1, H\_2, \dots, H\_T]$ from the last layer of the LSTM. We use the output at the final time step, $O\_v^T = H\_T$, as the feature extracted by the LSTM. To prevent overfitting, the output of the LSTM layer is followed by a dropout layer with a dropout rate of 0.8, as shown in Figure 2. With dropout, the final feature vector $F\_v$ can be denoted as:

$$F\_v = r \ast O\_v^T \tag{9}$$

$$r\_i \sim \text{Bernoulli}(p) \tag{10}$$

Here, ∗ denotes an element-wise product. For the output vector at time step *t*, **r** is a vector of independent Bernoulli random variables, each of which takes the value 1 with probability *p*.
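Equations (9) and (10) amount to masking each feature with an independent Bernoulli draw. A small sketch in plain Python (the feature values and seed are illustrative; note that a dropout rate of 0.8 corresponds to a keep probability of p = 0.2):

```python
import random

def dropout_mask(n, p, rng):
    """Draw r_i ~ Bernoulli(p): each entry is 1 with probability p, Eq. (10)."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def apply_dropout(features, p, rng):
    """Element-wise product F_v = r * O_v^T, Eq. (9)."""
    r = dropout_mask(len(features), p, rng)
    return [r_i * f_i for r_i, f_i in zip(r, features)]

rng = random.Random(0)              # fixed seed so the example is reproducible
o_T = [0.3, -0.7, 1.2, 0.05]        # hypothetical LSTM output H_T
f_v = apply_dropout(o_T, p=0.2, rng=rng)  # dropout rate 0.8 => keep prob. 0.2
```

Each surviving entry keeps its original value and each dropped entry becomes zero; at inference time dropout is disabled, so the full feature vector passes through unchanged.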

2. Fully convolutional module: Each convolutional block in this module consists of:
	- A convolutional layer with 128 or 256 filters, a kernel size of 8, 5, or 3, and a stride of 1.
	- A batch normalization layer with a momentum of 0.99 and an epsilon of 0.001.
	- A ReLU activation at the end of the block.

In this module, we utilize a convolution kernel $w \in \mathbb{R}^m$ that slides over the input sequence to extract local features. The output $c\_i$ of the $i$th node in the feature map is defined by

$$c\_i = \sigma(w^T \ast x\_{i:i+m-1} + b) \tag{11}$$

where $x\_{i:i+m-1}$ represents the $m$-length subsequence from the $i$th time step to the $(i+m-1)$th time step of the input sequence, ∗ denotes the convolution operator, $b$ denotes the bias term, and $\sigma(\cdot)$ is a nonlinear activation function.

Accordingly, the convolution kernel slides from the first time step to the last, and we get the feature map of the $j$th kernel as

$$c\_j = [c\_1, c\_2, \dots, c\_{T-m+1}] \tag{12}$$
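Equations (11) and (12) describe a standard valid (unpadded) 1-D convolution with stride 1. A self-contained sketch, with tanh standing in for the nonlinearity $\sigma$ and toy kernel weights (all values hypothetical):

```python
import math

def conv1d_feature_map(x, w, b):
    """Slide a kernel w of length m over the input x and apply a nonlinearity,
    producing the feature map c_j = [c_1, ..., c_{T-m+1}] of Eqs. (11)-(12)."""
    m = len(w)
    feature_map = []
    for i in range(len(x) - m + 1):
        # c_i = sigma(w^T * x_{i:i+m-1} + b); tanh used here as sigma
        s = sum(w_k * x_k for w_k, x_k in zip(w, x[i:i + m])) + b
        feature_map.append(math.tanh(s))
    return feature_map

x = [0.1, 0.5, -0.3, 0.8, 0.2, -0.1]                      # toy input, T = 6
c_j = conv1d_feature_map(x, w=[0.2, -0.1, 0.4], b=0.05)   # kernel size m = 3
# The feature map has T - m + 1 = 4 entries.
```

With stride 1 and no padding, each kernel of size $m$ yields $T - m + 1$ outputs, which is exactly the length of $c\_j$ in Eq. (12).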

After convolution, batch normalization followed by a ReLU activation accelerates training and improves the model's generalization ability. The fully convolutional module contains three convolutional blocks that serve as a feature extractor. A one-dimensional global average pooling operation is then performed on the feature map of the last block, which reduces the feature dimension while increasing the receptive field of the kernel. The vector obtained by global average pooling over the final output channels can be expressed as

$$F\_c = [a\_1, a\_2, \dots, a\_k] \tag{13}$$

$$a\_j = \frac{1}{T - m + 1} \sum\_{i=1}^{T-m+1} c\_i \tag{14}$$

where *k* represents the filter size (i.e., the number of filters) of the last convolutional block. We concatenate the features extracted by the LSTM module with those from the fully convolutional module. As mentioned in the previous section, the original input is transformed at different time scales and frequencies, so we apply the feature extractors to the different input representations and feed the final concatenated features into the next stage as input.
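The pooling of Eqs. (13)–(14) and the final concatenation can be sketched as follows; the feature values here are hypothetical placeholders for the LSTM and convolutional outputs:

```python
def global_average_pool(feature_maps):
    """Eq. (14): average each feature map c_j into a scalar a_j, giving
    F_c = [a_1, ..., a_k] of Eq. (13), where k is the number of filters."""
    return [sum(c) / len(c) for c in feature_maps]

# Hypothetical feature maps from k = 3 kernels of the last block.
maps = [[0.2, 0.4, 0.6], [1.0, -1.0, 0.0], [0.5, 0.5, 0.5]]
f_c = global_average_pool(maps)   # approximately [0.4, 0.0, 0.5]

# Hypothetical LSTM feature after dropout; the concatenation of both
# extractors' outputs is what the next stage receives as input.
f_v = [0.3, -0.7]
features = f_v + f_c
```

Global average pooling collapses each feature map to one scalar regardless of its length, so the concatenated feature dimension depends only on the number of LSTM units and the number of filters, not on the sequence length $T$.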
