*3.3. Computational Complexity*

Compared to training on non-enriched data, SFC includes an additional model-creation step. This raises the question of how the computational complexity of this first SFC step compares to the overall complexity of network training.

The total number of parameters in a LSTM layer can be calculated as follows [67]:

$$\mathcal{W} = n_c^2 \times 4 + n_i \times n_c \times 4 + n_c \times n_o + n_o \times 3$$

where *nc* is the number of memory cells, *ni* is the number of input units, and *no* is the number of output units. With the optimizers used, the computational complexity of training the LSTM model is *O*(1) per weight and per time step. This gives an overall computational complexity of *O*(*W*) per time step.
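As an illustration, the parameter count above can be evaluated directly. This is a minimal sketch of the formula; the layer sizes chosen here are arbitrary examples, not the configurations used in the experiments.

```python
def lstm_param_count(n_c: int, n_i: int, n_o: int) -> int:
    """Total number of parameters W in an LSTM layer, term by term
    as in the formula above: four recurrent weight matrices, four
    input weight matrices, an output projection, and per-output terms."""
    return 4 * n_c * n_c + 4 * n_i * n_c + n_c * n_o + 3 * n_o

# Example: n_c = 64 memory cells, a window of N = 200 inputs,
# and a small output of 12 values.  The 4 * n_c * (n_c + n_i)
# contribution dominates the total.
print(lstm_param_count(n_c=64, n_i=200, n_o=12))
```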

Given the window length *N* and a relatively small prediction size, the number of parameters is dominated by the *nc* × (*nc* + *N*) factor. Finally, given a total time series length *L*, the number of windows scales linearly with *L*. Assuming a constraint on the maximum number of epochs, we may postulate that the computational complexity of training an LSTM model is *O*(*L* × *nc* × (*nc* + *N*)). The calculations are similar for GRU and RNN layers.

At the same time, the computational complexity of the MSM algorithm (see Algorithm A1) on one window of length *N* is *O*(*K* × *N*), where *K* is the number of mixture components; the dominant cost lies in the updating of the auxiliary matrix *g*. This gives a complexity of *O*(*L* × *K* × *N*) for the MSM analysis of the whole time series, which is comparable to the complexity of neural network training. Moreover, the MSM algorithm can be reformulated in terms of matrix operations, leading to a performance improvement on GPU-assisted systems [68].
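Algorithm A1 is not reproduced here, but the origin of the *O*(*K* × *N*) cost per window can be sketched with a single EM-type update of the *K* × *N* responsibility matrix *g* for a finite normal mixture. This is an illustrative sketch of the cost structure only, not the customized MSM algorithm itself.

```python
import numpy as np

def em_step(x, w, mu, sigma):
    """One EM iteration for a K-component normal mixture on a window
    of N observations.  The responsibility matrix g has shape (K, N),
    so building and updating it costs O(K * N), the dominant term."""
    K, N = len(w), len(x)
    # E-step: unnormalized responsibilities g[k, i] ~ w_k * phi_k(x_i)
    g = np.empty((K, N))
    for k in range(K):
        g[k] = w[k] * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2) / sigma[k]
    g /= g.sum(axis=0, keepdims=True)
    # M-step: re-estimate weights, means, and standard deviations
    nk = g.sum(axis=1)
    w_new = nk / N
    mu_new = (g @ x) / nk
    sigma_new = np.sqrt((g @ (x ** 2)) / nk - mu_new ** 2)
    return w_new, mu_new, sigma_new
```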

These results are confirmed by the practical application of SFC in GPU-assisted computing: MSM model construction required significantly less time than model training, with the difference reaching a factor of ten or more. Additionally, the SFC statistical models on different windows are independent of each other, so already computed models can be cached and do not change when new observations are appended to the series. This allows SFC to be applied to real-time tasks with continuous data flows.

## **4. Examples of Real Data Analysis**

#### *4.1. Test Data and Neural Networks' Configurations*

The analyzed data consist of two distinct sets. The first set contains data obtained in physical experiments carried out on the L-2M stellarator [31]. Each time series consists of plasma density fluctuations recorded after the medium had been agitated with an energy discharge. A total of five time series were analyzed. Each series consists of 60,000 observations corresponding to the time interval from 48 to 60 ms of each experiment, with a time gap of 0.2 microseconds (μs) between consecutive observations. The time series in this set proved to be non-stationary: the *p*-value of the Dickey–Fuller test [69] reaches 0.56. For model correctness, it is therefore necessary to analyze not the entire series but windows, that is, subsamples for which the necessary assumptions can be considered satisfied; in other words, to use the MSM approach. A typical waveform as well as the empirical distribution is presented in Figure 3.

**Figure 3.** Physical time series A19692 (on the **left**) and a corresponding histogram (on the **right**).

The experiment consists of three stages: the initiation stage, when the impulse agitates the plasma; the main phase; and the relaxation phase. It is worth noting that the distribution of the time series has a strongly non-Gaussian form: excess kurtosis and asymmetry are clearly present, so complicated models would be required to describe such data. The second dataset consists of air–sea fluxes [70]; see Figure 4.

**Figure 4.** Tropical-1 time series (on the **left**) and a corresponding histogram (on the **right**).

For each spot, two separate time series were collected for latent and sensible heat fluxes. Each time series consists of approximately 14,600 observations, with a time gap of six hours between consecutive observations. The Tropical-1 time series and its distribution are shown in Figure 4. These data are highly seasonal in nature, and their distribution is non-Gaussian.

In order to measure the effect of statistical enrichment on accuracy, two different predictions were made for each set. Short-term prediction outputs *M* = 12 (see Figure 1) consecutive values given the 200 previous values. For the oceanographic data, short-term prediction corresponds to predicting three days of data after 50 days of observations.

Medium-term prediction outputs *M* = 12 consecutive values given the 200 previous values with a skip of 28 observations. For the oceanographic data, medium-term prediction corresponds to predicting three days in the following week after 50 days of observations.
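The two prediction regimes can be sketched as a windowing routine. This is a minimal illustration; the function name and interface are assumptions, not the authors' implementation.

```python
import numpy as np

def make_windows(series, n_in=200, m_out=12, skip=0):
    """Slice a series into (input, target) pairs: n_in observations in,
    m_out consecutive values out, with `skip` observations between the
    input window and the target (skip=0: short-term prediction,
    skip=28: medium-term prediction)."""
    X, y = [], []
    for start in range(len(series) - n_in - skip - m_out + 1):
        X.append(series[start : start + n_in])
        y.append(series[start + n_in + skip : start + n_in + skip + m_out])
    return np.array(X), np.array(y)
```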

For the purposes of this research, the window size *N* = 200 (see Figure 1) was chosen to be the same as the size of the input data for short- and medium-term predictions. This allows for a proper comparison of enriched and non-enriched accuracy values, as the enrichment process receives no data beyond what the non-enriched networks see. The number of components *K* = 3 was selected for all time series, as outlined in Section 3. Based on the constructed models, four moments were used as additional statistical features for the neural networks.
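Given a fitted finite normal mixture, its first four moments can be computed in closed form from the mixture parameters. This is a sketch using the standard mixture-moment formulas; the actual feature-extraction pipeline of the paper is not specified here.

```python
import numpy as np

def mixture_moments(w, mu, sigma):
    """Mean, variance, skewness, and kurtosis of a finite normal
    mixture with weights w, component means mu, and std devs sigma."""
    w, mu, sigma = map(np.asarray, (w, mu, sigma))
    m = np.sum(w * mu)                               # mean
    d = mu - m                                       # centered means
    var = np.sum(w * (sigma ** 2 + d ** 2))          # variance
    m3 = np.sum(w * (d ** 3 + 3 * d * sigma ** 2))   # 3rd central moment
    m4 = np.sum(w * (d ** 4 + 6 * d ** 2 * sigma ** 2 + 3 * sigma ** 4))
    return m, var, m3 / var ** 1.5, m4 / var ** 2    # skew, kurtosis
```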

All data are normalized to the range [0, 1]. The error decrease is measured with the root mean squared error (RMSE) metric over the normalized data forecasts:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(d_i - f_i\right)^2}.$$

Here, *n* denotes the number of data points, *di* is the true value of the *i*-th data point, and *fi* is its forecast value. Such an approach allows for comparison of the relative error decrease among all analyzed data sets despite their different physical nature.
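The normalization and the error metric above can be sketched as follows (a minimal illustration of the two formulas; the helper names are chosen for this example):

```python
import numpy as np

def minmax(x):
    """Normalize an array to the [0, 1] range."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def rmse(d, f):
    """Root mean squared error between true values d and forecasts f."""
    d, f = np.asarray(d, dtype=float), np.asarray(f, dtype=float)
    return np.sqrt(np.mean((d - f) ** 2))
```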

In order to demonstrate the efficiency of SFC, two sets of neural networks were constructed for each time series and prediction type. The original set accepts the initial observations as input data. In the enriched set, the time series are supplemented by hidden-state initialization with statistical moments. In both cases, the output is either a short- or a medium-term prediction, giving four sets in total for each time series.

It is known that random search may provide better results, but, in order to make a proper comparison of the accuracy increase, the grid-search method was used for hyperparameter optimization [30]. Each set contains neural networks with all possible hyperparameter combinations, about 700 networks per set in total. For each time series, the error value is compared between the best neural networks in the original and enriched short-term sets and between the best neural networks in the original and enriched medium-term sets.
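Exhaustive grid search can be sketched as below. The hyperparameter grid shown is hypothetical; the actual search space of roughly 700 combinations per set is not specified in the text.

```python
from itertools import product

# Hypothetical hyperparameter grid for illustration only.
grid = {
    "units":   [32, 64, 128],
    "layers":  [1, 2],
    "dropout": [0.0, 0.2],
    "lr":      [1e-3, 1e-4],
}

def grid_search(train_and_eval, grid):
    """Exhaustive grid search: evaluate every hyperparameter
    combination and keep the one with the lowest error score."""
    best, best_score = None, float("inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = train_and_eval(params)  # e.g., averaged validation RMSE
        if score < best_score:
            best, best_score = params, score
    return best, best_score
```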

The input data were divided into training, validation, and test sets in a 60%/30%/10% proportion. The customized MSM algorithm and the estimation of finite normal mixture parameters were implemented in the MATLAB programming language. The neural networks were created, trained, and evaluated with the TensorFlow/Keras Python libraries. Every network was run several times, and the RMSE value was averaged over all runs.
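The 60%/30%/10% split can be sketched as follows (a minimal illustration; it assumes the windows are kept in temporal order, which is the usual choice for time series and is not stated explicitly in the text):

```python
def split_60_30_10(X, y):
    """Chronological split of paired inputs/targets into training,
    validation, and test sets in a 60%/30%/10% proportion
    (no shuffling, so the temporal order of windows is preserved)."""
    n = len(X)
    i, j = int(0.6 * n), int(0.9 * n)
    return (X[:i], y[:i]), (X[i:j], y[i:j]), (X[j:], y[j:])
```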

The choice of optimizer varied among the observed data sets, but learning speed and accuracy were mostly better with the Adam optimizer. No strong overfitting was observed in any of the constructed neural networks. A non-zero dropout rate affected learning negatively. In all observed cases, LSTM recurrent layers provided better results than GRU/RNN layers.
