#### **2. Finite Normal Mixtures and the MSM Method**

The success of approximating distributions of heterogeneous data by arbitrary mixtures of normal distributions is based on results for generalized Cox processes [1] and essentially relies on the finiteness of the variance of the process increments. The main task in this area is the statistical estimation of the random parameters of the mixing distribution.

It is well known that arbitrary continuous normal mixtures are not identifiable, whereas finite normal mixtures are [42,43]. Therefore, the original ill-posed problem of parameter estimation can be replaced with the search for the solution closest to the true one in the space of finite normal mixtures. Such a solution exists and is unique due to the aforementioned identifiability property.

However, the heterogeneity of data arising from the reasons mentioned at the beginning of Section 1 means that no single mixing distribution remains valid over a sufficiently long timescale. Therefore, the initial time series is divided into possibly intersecting subsamples called windows. The mixing distribution parameters are then estimated for each of these intervals as the window moves along the time axis of the series. This procedure is the essence of the method of moving separation of mixtures (MSM).

It can be seen that the mixture itself evolves as the subsamples move in time. This, in turn, allows us to observe the dynamics of the statistical patterns in the behavior of the studied process.

The resulting statistical models can serve as high-quality approximations of the distributions of various processes. We propose to use the first four moments of the corresponding distributions as additional features for machine learning algorithms.

Let us consider a subsample (*n*th window) **X** with size 1 × *N* and a cumulative distribution function (a finite normal mixture) of its elements:

$$F(x, k(n), \boldsymbol{\mu}\_n, \boldsymbol{\sigma}\_n, \mathbf{p}\_n) = \sum\_{i=1}^{k(n)} p\_i(n)\, \Phi\!\left(\frac{x - \mu\_i(n)}{\sigma\_i(n)}\right),\tag{1}$$

where $x \in \mathbb{R}$, $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int\_{-\infty}^{x} e^{-t^2/2}\, dt$ is the standard normal distribution function, and the standard constraints on the parameters

$$\mu\_i(n) \in \mathbb{R}, \quad \sigma\_i(n) > 0, \quad \sum\_{i=1}^{k(n)} p\_i(n) = 1, \quad p\_i(n) \geqslant 0,$$

hold for all *i* = 1, . . . , *k*(*n*).
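
As a minimal illustration (in Python; the function name `mixture_cdf` is ours, and `scipy.stats.norm` supplies $\Phi$), the mixture distribution function (1) can be evaluated as a weighted sum of normal CDFs:

```python
import numpy as np
from scipy.stats import norm

def mixture_cdf(x, p, mu, sigma):
    """Finite normal mixture CDF (1): sum_i p_i * Phi((x - mu_i) / sigma_i)."""
    x = np.asarray(x, dtype=float)[..., None]  # broadcast x over the k components
    return np.sum(np.asarray(p) * norm.cdf(x, loc=mu, scale=sigma), axis=-1)
```

For example, `mixture_cdf(0.0, [0.3, 0.7], [-1.0, 2.0], [1.0, 0.5])` returns approximately 0.25, i.e., $0.3\,\Phi(1) + 0.7\,\Phi(-4)$.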

Let a random variable $X\_n$ have the cumulative distribution function (1). We will assume that it is an arbitrary element of the sample **X**. To each window vector, we can assign the set of values $\left(\mathbb{E}\_X^{(n)}, \mathbb{D}\_X^{(n)}, \gamma\_X^{(n)}, \kappa\_X^{(n)}\right)$ defined by the following formulas [44]:

• expectation:

$$\mathbb{E}\_X^{(n)} = \mathbb{E} X\_n = \sum\_{i=1}^{k(n)} p\_i(n)\mu\_i(n);\tag{2}$$

• variance:

$$\mathbb{D}\_X^{(n)} = \mathbb{D}X\_n = \sum\_{i=1}^{k(n)} p\_i(n) \left( \mu\_i(n) - \sum\_{j=1}^{k(n)} p\_j(n)\mu\_j(n) \right)^2 + \sum\_{i=1}^{k(n)} p\_i(n)\sigma\_i^2(n);\tag{3}$$

• skewness:

$$\gamma\_X^{(n)} = \frac{\mathbb{E}X\_n^3 - 3\,\mathbb{E}\_X^{(n)} \cdot \mathbb{D}\_X^{(n)} - \left(\mathbb{E}\_X^{(n)}\right)^3}{\left(\mathbb{D}\_X^{(n)}\right)^{3/2}} =$$

$$= \left[\sum\_{i=1}^{k(n)} p\_i(n)\left(\mu\_i^3(n) + 3\mu\_i(n)\sigma\_i^2(n)\right) - 3\left(\sum\_{i=1}^{k(n)} p\_i(n)\mu\_i(n)\right)\left(\sum\_{i=1}^{k(n)} p\_i(n)\left(\mu\_i(n) - \sum\_{j=1}^{k(n)} p\_j(n)\mu\_j(n)\right)^2 + \sum\_{i=1}^{k(n)} p\_i(n)\sigma\_i^2(n)\right) - \right.$$

$$\left. - \left(\sum\_{i=1}^{k(n)} p\_i(n)\mu\_i(n)\right)^3\right] \times \left[\sum\_{i=1}^{k(n)} p\_i(n)\left(\mu\_i(n) - \sum\_{j=1}^{k(n)} p\_j(n)\mu\_j(n)\right)^2 + \sum\_{i=1}^{k(n)} p\_i(n)\sigma\_i^2(n)\right]^{-3/2}; \tag{4}$$

• kurtosis:

$$\kappa\_X^{(n)} = \frac{\mathbb{E}X\_n^4 - 4\,\mathbb{E}\_X^{(n)} \cdot \mathbb{E}X\_n^3 + 6\left(\mathbb{E}\_X^{(n)}\right)^2 \cdot \mathbb{E}X\_n^2 - 3\left(\mathbb{E}\_X^{(n)}\right)^4}{\left(\mathbb{D}\_X^{(n)}\right)^2} - 3 =$$

$$= \left[\sum\_{i=1}^{k(n)} p\_i(n)\left(\mu\_i^4(n) + 6\mu\_i^2(n)\sigma\_i^2(n) + 3\sigma\_i^4(n)\right) - 4\left(\sum\_{i=1}^{k(n)} p\_i(n)\mu\_i(n)\right)\left(\sum\_{i=1}^{k(n)} p\_i(n)\left(\mu\_i^3(n) + 3\mu\_i(n)\sigma\_i^2(n)\right)\right) + \right.$$

$$\left. + \; 6\left(\sum\_{i=1}^{k(n)} p\_i(n)\mu\_i(n)\right)^2\left(\sum\_{i=1}^{k(n)} p\_i(n)\left(\mu\_i^2(n) + \sigma\_i^2(n)\right)\right) - 3\left(\sum\_{i=1}^{k(n)} p\_i(n)\mu\_i(n)\right)^4\right] \times$$

$$\times \left[\sum\_{i=1}^{k(n)} p\_i(n)\left(\mu\_i(n) - \sum\_{j=1}^{k(n)} p\_j(n)\mu\_j(n)\right)^2 + \sum\_{i=1}^{k(n)} p\_i(n)\sigma\_i^2(n)\right]^{-2} - 3. \tag{5}$$

The argument *n* in each of the values (2)–(5) indicates an explicit dependence on the step number of the MSM method. That is, these moments are determined not for the entire time series but only for a subsample of it, namely the observations covered by the moving window of the MSM method at its *n*th position in the analyzed series.

It is well known that, for the initial moments of a random variable $X$ with a normal distribution with parameters $a$ and $\sigma^2$ (that is, $X \sim N(a, \sigma^2)$), the following equations hold:

$$\mathbb{E}X^{m} = \begin{cases} a, & m = 1; \\ a^2 + \sigma^2, & m = 2; \\ a^3 + 3a\sigma^2, & m = 3; \\ a^4 + 6a^2\sigma^2 + 3\sigma^4, & m = 4. \end{cases} \tag{6}$$

For the initial moments of a random variable $X\_n$ with the cumulative distribution function $F(x, k(n), \boldsymbol{\mu}\_n, \boldsymbol{\sigma}\_n, \mathbf{p}\_n)$ from (1), we have ($m = 1, 2, \ldots$):

$$\mathbb{E}X\_n^m = \sum\_{i=1}^{k(n)} \frac{p\_i(n)}{\sigma\_i(n)\sqrt{2\pi}} \int\_{-\infty}^{+\infty} z^m \exp\left\{-\frac{(z-\mu\_i(n))^2}{2\sigma\_i^2(n)}\right\} dz = \sum\_{i=1}^{k(n)} p\_i(n) \mathbb{E}X\_{[i]}^m,$$

where $X\_{[i]} \sim N(\mu\_i(n), \sigma\_i^2(n))$. Thus, the analogues of the expressions (6) are as follows:

$$\mathbb{E}X\_n^m = \begin{cases} \sum\_{i=1}^{k(n)} p\_i(n)\mu\_i(n), & m=1; \\ \sum\_{i=1}^{k(n)} p\_i(n) \left(\mu\_i^2(n) + \sigma\_i^2(n)\right), & m=2; \\ \sum\_{i=1}^{k(n)} p\_i(n) \left(\mu\_i^3(n) + 3\mu\_i(n) \sigma\_i^2(n)\right), & m=3; \\ \sum\_{i=1}^{k(n)} p\_i(n) \left(\mu\_i^4(n) + 6\mu\_i^2(n) \sigma\_i^2(n) + 3\sigma\_i^4(n)\right), & m=4. \end{cases} \tag{7}$$

Substituting these expressions into the formulas for variance (3), skewness (4) and kurtosis (5) leads to formulas that depend only on the distribution parameters, namely the values $p\_i(n)$, $\mu\_i(n)$ and $\sigma\_i(n)$.
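
For illustration, the following minimal Python sketch (the function name `mixture_moments` and the NumPy interface are ours, not part of the MSM method itself) evaluates (2)–(5) directly from the mixture parameters via the initial moments (7):

```python
import numpy as np

def mixture_moments(p, mu, sigma):
    """Expectation (2), variance (3), skewness (4) and excess kurtosis (5)
    of a finite normal mixture with weights p, means mu, st. devs sigma."""
    p, mu, sigma = (np.asarray(a, dtype=float) for a in (p, mu, sigma))
    m1 = np.sum(p * mu)                                  # E X_n, (2)
    var = np.sum(p * (mu - m1) ** 2 + p * sigma ** 2)    # D X_n, (3)
    m2 = np.sum(p * (mu ** 2 + sigma ** 2))              # (7), m = 2
    m3 = np.sum(p * (mu ** 3 + 3 * mu * sigma ** 2))     # (7), m = 3
    m4 = np.sum(p * (mu ** 4 + 6 * mu ** 2 * sigma ** 2 + 3 * sigma ** 4))
    skew = (m3 - 3 * m1 * var - m1 ** 3) / var ** 1.5    # (4)
    kurt = (m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4) / var ** 2 - 3
    return m1, var, skew, kurt
```

For instance, `mixture_moments([0.5, 0.5], [-1, 1], [1, 1])` returns `(0.0, 2.0, 0.0, -0.5)`: zero mean and skewness and negative excess kurtosis, as expected for a symmetric two-component mixture.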

Modern computing systems are optimized for matrix computations, including the parameter estimation problems that can be solved with EM-type algorithms. Therefore, expressions (2)–(5) can be represented in an equivalent matrix form [30,45]:

• expectation:

$$\mathbb{E}X\_n = \mathbf{p}\_n \boldsymbol{\mu}\_n^T;\tag{8}$$

• variance:

$$\mathbb{D}X\_n = \mathbf{p}\_n \left( D\_{\mu\_n} \boldsymbol{\mu}\_n^T + D\_{\sigma\_n} \boldsymbol{\sigma}\_n^T \right) - (\mathbf{p}\_n \boldsymbol{\mu}\_n^T)^2;\tag{9}$$

• skewness:

$$\gamma\_{X\_n} = \frac{\mathbf{p}\_n D\_{\mu\_n}^2 \boldsymbol{\mu}\_n^T + 3\,\mathbf{p}\_n D\_{\mu\_n} D\_{\sigma\_n} \boldsymbol{\sigma}\_n^T + 2\,(\mathbf{p}\_n \boldsymbol{\mu}\_n^T)^3 - 3\,\mathbf{p}\_n \boldsymbol{\mu}\_n^T \left(\mathbf{p}\_n D\_{\mu\_n} \boldsymbol{\mu}\_n^T + \mathbf{p}\_n D\_{\sigma\_n} \boldsymbol{\sigma}\_n^T\right)}{\left(\mathbf{p}\_n \left(D\_{\mu\_n} \boldsymbol{\mu}\_n^T + D\_{\sigma\_n} \boldsymbol{\sigma}\_n^T\right) - (\mathbf{p}\_n \boldsymbol{\mu}\_n^T)^2\right)^{3/2}}; \tag{10}$$

• kurtosis:

$$\kappa\_{X\_n} = \left[\mathbf{p}\_n \left(D\_{\mu\_n}^3 \boldsymbol{\mu}\_n^T + 6\,D\_{\sigma\_n}^2 D\_{\mu\_n} \boldsymbol{\mu}\_n^T + 3\,D\_{\sigma\_n}^3 \boldsymbol{\sigma}\_n^T\right) - 4\,\mathbb{E}X\_n\, \mathbf{p}\_n D\_{\mu\_n} \left(D\_{\mu\_n} \boldsymbol{\mu}\_n^T + 3\,D\_{\sigma\_n} \boldsymbol{\sigma}\_n^T\right) + \right.$$

$$\left. + \; 6\,(\mathbb{E}X\_n)^2\, \mathbf{p}\_n \left(D\_{\mu\_n} \boldsymbol{\mu}\_n^T + D\_{\sigma\_n} \boldsymbol{\sigma}\_n^T\right) - 3\,(\mathbb{E}X\_n)^4\right] \times \left[\mathbf{p}\_n \left(D\_{\mu\_n} \boldsymbol{\mu}\_n^T + D\_{\sigma\_n} \boldsymbol{\sigma}\_n^T\right) - (\mathbf{p}\_n \boldsymbol{\mu}\_n^T)^2\right]^{-2} - 3, \tag{11}$$

where

$$\mathbf{p}\_n = \left(p\_1(n), \ldots, p\_{k(n)}(n)\right), \quad \boldsymbol{\mu}\_n = \left(\mu\_1(n), \ldots, \mu\_{k(n)}(n)\right), \quad \boldsymbol{\sigma}\_n = \left(\sigma\_1(n), \ldots, \sigma\_{k(n)}(n)\right),$$

$$D\_{\mu\_n} = \operatorname{diag}\left\{\mu\_1(n), \ldots, \mu\_{k(n)}(n)\right\}, \quad D\_{\sigma\_n} = \operatorname{diag}\left\{\sigma\_1(n), \ldots, \sigma\_{k(n)}(n)\right\},$$

and $\operatorname{diag}\{\ldots\}$ denotes a diagonal matrix with the corresponding elements.

To obtain relations (8)–(11), it is enough to use the matrix representation of expressions (7):

$$\mathbb{E}X\_n^m = \begin{cases} \mathbf{p}\_n \boldsymbol{\mu}\_n^T, & m=1; \\ \mathbf{p}\_n \left(D\_{\mu\_n} \boldsymbol{\mu}\_n^T + D\_{\sigma\_n} \boldsymbol{\sigma}\_n^T\right), & m=2; \\ \mathbf{p}\_n D\_{\mu\_n} \left(D\_{\mu\_n} \boldsymbol{\mu}\_n^T + 3\,D\_{\sigma\_n} \boldsymbol{\sigma}\_n^T\right), & m=3; \\ \mathbf{p}\_n \left(D\_{\mu\_n}^3 \boldsymbol{\mu}\_n^T + 6\,D\_{\sigma\_n}^2 D\_{\mu\_n} \boldsymbol{\mu}\_n^T + 3\,D\_{\sigma\_n}^3 \boldsymbol{\sigma}\_n^T\right), & m=4. \end{cases}$$
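
The matrix form maps directly onto array operations; the following sketch (the function name and the row/column conventions are ours) reproduces the four initial moments with explicit diagonal matrices:

```python
import numpy as np

def initial_moments_matrix(p, mu, sigma):
    """Initial moments E X_n^m, m = 1..4, in the matrix form above;
    p is the row vector p_n, mu and sg are the columns mu_n^T, sigma_n^T."""
    p = np.asarray(p, dtype=float).reshape(1, -1)
    mu = np.asarray(mu, dtype=float).reshape(-1, 1)
    sg = np.asarray(sigma, dtype=float).reshape(-1, 1)
    D_mu, D_sg = np.diagflat(mu), np.diagflat(sg)  # D_{mu_n}, D_{sigma_n}
    # For diagonal matrices, the elementwise power ** equals the matrix power.
    m1 = p @ mu
    m2 = p @ (D_mu @ mu + D_sg @ sg)
    m3 = p @ D_mu @ (D_mu @ mu + 3 * D_sg @ sg)
    m4 = p @ (D_mu ** 3 @ mu + 6 * D_sg ** 2 @ D_mu @ mu + 3 * D_sg ** 3 @ sg)
    return [m.item() for m in (m1, m2, m3, m4)]
```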

To estimate the parameters in expressions (2)–(5) and (8)–(11) for every position of the moving window, various modifications of the EM algorithm can be used [7]. For example, these include grid modifications of the EM algorithm previously implemented by the authors in the form of a computing service [46]. In this article, we use modifications with a random selection of initial approximations [32].
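
As a rough stand-in for these EM modifications (not the authors' implementations from [32,46]), the sketch below uses scikit-learn's `GaussianMixture`, where several random restarts play the role of the random initial approximations, to fit a mixture at every window position:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def msm_parameters(sample, window, k, n_init=10, seed=0):
    """EM estimates (weights, means, st. devs) of a k-component normal
    mixture for every position of the moving window over the sample."""
    sample = np.asarray(sample, dtype=float)
    estimates = []
    for start in range(len(sample) - window + 1):
        chunk = sample[start:start + window].reshape(-1, 1)
        gm = GaussianMixture(n_components=k, n_init=n_init,
                             random_state=seed).fit(chunk)
        estimates.append((gm.weights_,
                          gm.means_.ravel(),
                          np.sqrt(gm.covariances_.ravel())))
    return estimates
```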

#### **3. Methodology of Statistical Feature Construction**

#### *3.1. Approach for Feature Construction*

The Statistical Feature Construction (SFC) method is a two-step data enrichment algorithm. The first step of SFC is the creation of statistical models, that is, the estimation of the parameters of finite normal mixtures. It is worth noting that the time series of physical processes are often non-stationary. Instead of creating one complex statistical model encompassing the whole time series, we build a set of models: a sequence of mixtures (1) that describes the evolution of the analyzed process.

Time series are split into shorter pseudo-stationary windows on which the models are constructed. The process of window separation is as follows. The initial data vector $V = \{V\_1, V\_2, \ldots, V\_L\}$ of *L* observations serves as input. Let us choose some window length *N* ($1 \leqslant N \leqslant L$) and divide *V* into shorter window vectors $X\_1, X\_2, X\_3, \ldots$, where $X\_i = \{V\_i, V\_{i+1}, \ldots, V\_{i+N-1}\}$ is the sequence of *N* consecutive observations starting at position *i* of *V*. Note that the window vector $X\_i$ differs from the window vector $X\_{i+1}$ only by two observations, namely the first observation of $X\_i$ and the last observation of $X\_{i+1}$.

Once the collection of window vectors is obtained, new difference window vectors $Y\_1, Y\_2, Y\_3, \ldots$ may be constructed, with elements $Y\_i^{(j)} = X\_i^{(j+1)} - X\_i^{(j)}$, $j = 1, \ldots, N-1$. Applying this transformation to all window vectors yields a collection of difference window vectors, each of length $N - 1$. The difference window vectors serve as input data for the MSM algorithm; a sketch of both constructions is given below.
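
In code, both constructions are short NumPy operations; a minimal sketch under the notation above (the function name is ours):

```python
import numpy as np

def windows_and_differences(v, n):
    """Overlapping windows X_1, X_2, ... of length n (stride 1) and the
    within-window difference vectors Y_i of length n - 1, as defined above."""
    v = np.asarray(v, dtype=float)
    # sliding_window_view requires NumPy >= 1.20
    X = np.lib.stride_tricks.sliding_window_view(v, n)  # shape (L - n + 1, n)
    Y = np.diff(X, axis=1)       # Y_i^(j) = X_i^(j+1) - X_i^(j)
    return X, Y
```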

After the window vector and difference vector sets are created, they can be used to estimate statistical parameters for data enrichment. This process, as applied to neural network forecasting, was previously described and explored in [29].

Hyperparameters of the first step of SFC are the following:

• the window length *N*;
• the number of mixture components *k*(*n*).
The exact choice of window length is open to debate. Windows that are too long lead to a loss of stationarity across the window vector. Additionally, larger windows may contain observations that have little to no effect on the prediction, introducing additional noise into the model. Windows that are too short provide insufficient input data both for the machine learning part of the algorithm and for the construction of statistical models: a *K*-component mixture requires the estimation of 3*K* − 1 parameters, which is hard to perform accurately on small windows.

The choice of the number of components is also open to debate. Both empirical approaches and classical statistical approaches based on information criteria (AIC [47], BIC [48]) can be used. For physical and oceanographic data, we analyzed mixtures of 3–5 components at each step; a BIC-based selection sketch is given below.
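
For the information-criterion route, a hedged sketch of BIC-based selection on a single window (using scikit-learn's `bic` method; the surrounding code and names are ours):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_by_bic(window_data, k_range=range(2, 6), seed=0):
    """Pick the component number minimizing BIC on one window."""
    x = np.asarray(window_data, dtype=float).reshape(-1, 1)
    bics = {k: GaussianMixture(n_components=k, n_init=5,
                               random_state=seed).fit(x).bic(x)
            for k in k_range}
    return min(bics, key=bics.get)
```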

The second step of SFC is the feature expansion: given a statistical model, its characteristics can be used to enrich the feature set. As outlined in the previous section, the first four moments are used as additional features for the data enrichment process; the implementation of this approach is discussed in the next section. Note that these moments contain no information about the behavior of the series after the last window observation and can therefore be used legitimately when making forecasts.
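
Putting the two SFC steps together (reusing `windows_and_differences` and `mixture_moments` from the sketches above; the toy series and all names are ours, for illustration only), the enrichment amounts to stacking the window observations with the per-window moments:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
series = rng.normal(size=500)                # toy stand-in for real data

X, Y = windows_and_differences(series, n=100)           # step 1: windows
fits = [GaussianMixture(n_components=3, n_init=5, random_state=0)
        .fit(y.reshape(-1, 1)) for y in Y]              # one mixture per window
moments = np.array([mixture_moments(f.weights_, f.means_.ravel(),
                                    np.sqrt(f.covariances_.ravel()))
                    for f in fits])                     # moments (2)-(5)
enriched = np.hstack([X, moments])                      # windows + 4 new features
```

Here each row of `enriched` holds one window of raw observations followed by the four moment features estimated on the corresponding difference window.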

Algorithmic representation of SFC is presented in Appendix A, see Algorithms A1–A3. It can be implemented in computing services [49,50].
