**Segmentation of High Dimensional Time-Series Data Using Mixture of Sparse Principal Component Regression Model with Information Complexity**

#### **Yaojin Sun and Hamparsum Bozdogan \***

Department of Business Analytics and Statistics, University of Tennessee, Knoxville, TN 37996, USA; ysun52@vols.utk.edu

**\*** Correspondence: bozdogan@utk.edu

Received: 29 August 2020; Accepted: 13 October 2020; Published: 17 October 2020

**Abstract:** This paper presents a novel hybrid modeling method for the segmentation of high dimensional time-series data using the mixture of sparse principal components regression (*MIX-SPCR*) model with the information complexity (ICOMP) criterion as the fitness function. Our approach performs dimension reduction in high dimensional time-series data and, at the same time, determines the number of component clusters (i.e., the number of segments across the time-series) and selects the best subset of predictors. A large-scale Monte Carlo simulation is performed to show the capability of the *MIX-SPCR* model to identify the correct structure of the time-series data. The *MIX-SPCR* model is also applied to high dimensional Standard & Poor's 500 (S&P 500) index data to uncover the hidden structure of the time-series and identify its structural change points. The approach presented in this paper determines both the relationships among the predictor variables and how the predictor variables contribute to the explanatory power of the response variable through cluster-wise sparsity settings.

**Keywords:** high dimensional time-series; segmentation; mixture regression; sparse PCA; entropy-based robust EM; information complexity criteria

#### **1. Introduction**

This paper presents a novel method for the segmentation and dimension reduction of high dimensional time-series data. We develop a hybrid of *mixture-model cluster analysis* and the *sparse principal components regression* (*MIX-SPCR*) model as an unsupervised classification methodology with the *information complexity* (ICOMP) criterion as the fitness function. This new approach performs dimension reduction in high dimensional time-series data and, at the same time, determines the number of component clusters.

The segmentation of time-series and the positioning of change points have been active research topics for a long time. Different research groups have provided solutions with various approaches in this area, including, but not limited to, Bayesian methods (Barber et al. [1]), fuzzy systems (Abonyi and Feil [2]), and complex system modeling (Spagnolo and Valenti [3], Valenti et al. [4], S Lima [5], Ding et al. [6]). We group these approaches into two branches, one based on complex systems modeling and the other on statistical models fit through parameter estimation and inference. Among the complex systems-based modeling approaches, it is worth noting a series of papers that use the stochastic volatility model by Spagnolo and Valenti [3]. For example, these authors used a nonlinear Heston model to analyze 1071 stocks on the New York Stock Exchange (1987–1998). After accounting for the stochastic nature of volatility, the model is well suited to extracting the escape time distribution from financial time-series data. The authors also identified the NES (Noise Enhanced Stability) effect to measure the stabilizing effect of market dynamics. The approach we propose in this paper belongs to the other branch, which uses statistical models on the time scale. Along with the empirical analysis, we show a broader view of how different companies/sectors behaved across different periods. In particular, we use a mixture-model based statistical methodology to segment the time-series and determine change points.

The mixture-model cluster analysis of regression models is not new. These models are also known as *"cluster-wise regression"*, *"latent models"*, and *"latent structure models of choice"*. They have been well studied by statisticians, machine learning researchers, and econometricians over the last several decades to construct time-series segmentation models and identify change points, and they have many useful theoretical and applied properties. Mixture-model cluster analysis of regression models is a natural extension of the standard multivariate Gaussian mixture-model cluster analysis. These models are well suited to studying heterogeneous data sets that involve not just one response variable but several responses or target-dependent variables simultaneously, together with a given set of independent variables. Recently, they have proven to be a valuable class of models in various disciplines, including *behavioral and economic research*, *ecology*, *financial engineering*, *process control and monitoring*, *market research*, and *transportation systems*. We also witness the mixture model's usage in the *analysis of scanner panel, survey, and other choice data to study consumer choice behavior and dynamics* (Dillon et al. [7]).

In reviewing the literature, we note that Quandt and Ramsey [8] and Kiefer [9] studied data sets by applying a mixture of two regression models, using moment generating function techniques to estimate the unknown model parameters. Later, De Veaux [10] developed an EM algorithm to fit a mixture of two regression models. DeSarbo and Cron [11] used similar estimating equations and extended the earlier work on a mixture of two regression models to a mixture of K-component regression models. For an excellent review article on this problem, we refer the readers to Wedel and DeSarbo [12].

In terms of these models' applications to the segmentation of time-series, they can be seen in the early work of Sclove [13], where the author applied the mixture model to the segmentation of the US gross national product time-series. Specifically, Sclove [13] used statistical model selection criteria to choose the number of classes.

With the currently existing challenges in the segmentation of time-series data in mind, our objective in this paper is to develop a new methodology which can:

- perform dimension reduction in high dimensional time-series data;
- determine the number of component clusters (i.e., the number of segments across the time-series) and locate the structural change points;
- select the best subset of predictors within each segment.
We aim to achieve these objectives by developing the information complexity (ICOMP) criteria as our fitness function throughout the paper for the segmentation of high-dimensional time-series data.

Our approach involves a two-stage procedure. We first perform variable selection using SPCA, with the benefit of sparsity. We then fit the sparse principal component regression (SPCR) model by transforming the original high dimensional data into several main principal components and estimating the relationships between the sparse component loadings and the response variable. In this way, the mixture model not only handles the curse of dimensionality but also maintains the model's explanatory power. In this manner, we choose the best subset of predictors and determine the number of time-series segments in the *MIX-SPCR* model simultaneously using ICOMP.

The rest of the paper is organized as follows. In Section 2, we present the model and methods. In particular, we first briefly explain sparse principal component analysis (SPCA), due to Zou et al. [14], in Section 2.1. In Section 2.2, we modify SPCA and develop the mixture of sparse principal component regression (*MIX-SPCR*) model for the segmentation of time-series data. In Section 3, we present a regularized entropy-based Expectation and Maximization (EM) clustering algorithm. As is well known, the EM algorithm works by maximizing the likelihood of the mixture model. However, to make the conventional EM algorithm robust (not sensitive to initial values) and able to converge to the global optimum, we use a robust version of the EM algorithm for the *MIX-SPCR* model based on the work of Yang et al. [15]. These authors addressed the robustness issue by adding an entropy term on the mixture proportions to the conventional EM algorithm's objective function. While our EM algorithm is in the same spirit as the Yang et al. [15] approach, there are significant differences between our approach and theirs. Yang's robust EM algorithm deals only with the usual clustering problem, without involving any response (or dependent) variable or time factor in the data. We extend it to the *MIX-SPCR* model in the context of time-series data. In Section 4, we discuss various information criteria, specifically the information complexity-based criteria (ICOMP). We derive ICOMP for the *MIX-SPCR* model based on Bozdogan's previous research ([16–20]). In Section 5, we present our Monte Carlo simulation study: Section 5.2 involves an experiment on the detection of structural change points, and Section 5.3 presents a large-scale Monte Carlo simulation verifying the advantage of the *MIX-SPCR* model with statistical information criteria. We provide a real data analysis in Section 6 using the daily adjusted closing S&P 500 index and stock prices from the Yahoo Finance database, spanning the period from January 1999 to December 2019. Finally, our conclusions and discussion are presented in Section 7.

#### **2. Model and Methods**

In this section, we briefly present *sparse principal component analysis* (*SPCA*) and *sparse principal component regression* (*SPCR*) as background. Then, by hybridizing these two methods within the mixture model, we propose the *mixture-model cluster analysis of sparse principal component regression* (abbreviated as the *MIX-SPCR* model hereafter) for the segmentation of high dimensional time-series data sets. Compared with a simple linear combination of all explanatory variables (i.e., the dense PCA model), the new approach is more interpretable because it maintains a sparsity specification.

Referring to Figure 1, we first show the overall structure of the model in this paper. After obtaining the time-series data, we clean and standardize them. Subsequently, we specify the number of time-series segments and how many Sparse Principal Components (SPCs) each segment contains. Using the robust EM algorithm (Section 3), we estimate the model parameters, especially the boundaries (also known as *change points*) of each time segment. The information criterion values are then calculated using the method of Section 4. By testing different numbers of time segments/SPCs, we obtain multiple criterion values, and according to these values we choose the most appropriate model with its estimated parameters.

**Figure 1.** The flowchart of the MIX-SPCR method.

#### *2.1. Sparse Principal Component Analysis (SPCA)*

Given the input data matrix **X** with *n* observations and *p* variables, we decompose **X** using the singular value decomposition (SVD). We write the decomposition as $\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$, where **D** is a diagonal matrix of singular values, and **U** and **V** are orthogonal matrices whose columns are the left and right singular vectors, respectively. When we perform the SVD of a data matrix **X** that has been centered by subtracting each column's mean, the procedure is the well-known *principal component analysis* (*PCA*). As discussed by Zou et al. [14], PCA has several advantages compared with other dimensionality reduction techniques. For example, PCA sequentially identifies the sources of variability by considering linear combinations of all the variables. Because of the orthonormality constraint imposed during the computation, all the calculated *principal components* (*PCs*) have a clear geometrical interpretation in the original data space. Because PCA can deal with *"the curse of dimensionality"* in high-dimensional data sets, it has been widely used in real-world scenarios, including biomedical and financial applications.
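To make the decomposition concrete, the following minimal Python/NumPy sketch (our own illustration, not the authors' code; the function and variable names are hypothetical) performs PCA via the SVD of a column-centered data matrix:

```python
import numpy as np

def pca_via_svd(X, k):
    """PCA of an (n x p) data matrix X via SVD of the column-centered data.

    Returns the first k principal component scores and loading vectors.
    """
    Xc = X - X.mean(axis=0)                    # center each column (variable)
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * d[:k]                  # (n x k) principal component scores
    loadings = Vt[:k].T                        # (p x k) loadings: columns of V
    return scores, loadings
```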

Even though PCA has excellent properties that are desirable in real-world applications and statistical analysis, the PCs are often difficult to interpret because each of them is a linear combination of all the original variables. In practice, the principal components typically have a large number of non-zero coefficients for the corresponding variables. To resolve this drawback, researchers have proposed various improvements that enforce sparsity in PCA while losing minimal information. Shen and Huang [21] designed an algorithm to iteratively extract the top PCs using the so-called *penalized least sum of squares* (*PLSS*) criterion. Zou et al. [14] utilized the lasso penalty (via the Elastic Net) to maintain sparse loadings of the principal components, a method named *sparse principal component analysis* (*SPCA*).

In this paper, we use the sparse principal component analysis (SPCA) proposed by Zou et al. [14]. Given the data matrix **X**, we minimize the objective function to obtain the SPCA results:

$$(\hat{\mathbf{A}}, \hat{\mathbf{B}}) = \underset{\mathbf{A},\mathbf{B}}{\arg\min} \; \sum_{i=1}^{n} \left\| \mathbf{x}_i^T - \mathbf{A}\mathbf{B}^T\mathbf{x}_i^T \right\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \left\| \mathbf{B}_{(j)} \right\|_1 + \lambda_2 \sum_{j=1}^{k} \left\| \mathbf{B}_{(j)} \right\|_2^2 \tag{1}$$

subject to

$$\mathbf{A}^T \mathbf{A} = I_k, \tag{2}$$

where $I_k$ is the $(k \times k)$ identity matrix. The hyperparameters $\lambda_{1,j}$ and $\lambda_2$ are non-negative. The matrices **A** and **B**, each of size $(p \times k)$, are given by

$$\mathbf{B} = \begin{bmatrix} B_{1,1} & \cdots & B_{1,k} \\ \vdots & \ddots & \vdots \\ B_{p,1} & \cdots & B_{p,k} \end{bmatrix} = \begin{bmatrix} \mathbf{B}_{(1)} \mid \cdots \mid \mathbf{B}_{(k)} \end{bmatrix} = \begin{bmatrix} \mathbf{B}_1 \\ \vdots \\ \mathbf{B}_p \end{bmatrix} \tag{3}$$

and

$$\mathbf{A} = \begin{bmatrix} A_{1,1} & \cdots & A_{1,k} \\ \vdots & \ddots & \vdots \\ A_{p,1} & \cdots & A_{p,k} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{(1)} \mid \cdots \mid \mathbf{A}_{(k)} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_1 \\ \vdots \\ \mathbf{A}_p \end{bmatrix}. \tag{4}$$

If we choose the first *k* principal components of the data matrix **X**, then the estimate $\hat{\mathbf{B}}_{(j)}$ contains the sparse loading vectors, which are no longer orthogonal.

A bigger $\lambda_{1,j}$ means a greater penalty for having non-zero entries in $\hat{\mathbf{B}}_{(j)}$. By using different $\lambda_{1,j}$, we control the number of zeros in the *j*th loading vector. If $\lambda_{1,j} = 0$ for $j = 1, 2, \ldots, k$, this problem reduces to the usual PCA.

Zou et al. [14] proposed a generalized SPCA algorithm to solve the optimization problem in Equation (1). The algorithm applies the Elastic Net (EN) to estimate $\mathbf{B}_{(j)}$ iteratively and then updates the matrix **A**. However, this algorithm is not the only available approach for extracting principal components with sparse loadings. SPCA can also be computed through the dictionary learning of Mairal et al. [22]. By introducing a probability model for principal component analysis, SPCA is equivalent to *sparse probabilistic principal component analysis* (*SPPCA*) if each element of the weight matrix has a Laplace prior (Guan and Dy [23], Williams [24]). For further discussion of SPPCA, we refer readers to those publications.
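As a hedged illustration of how such sparse loadings can be computed in practice, the sketch below uses scikit-learn's `SparsePCA`, which implements the dictionary-learning formulation of Mairal et al. [22] rather than the Elastic Net algorithm of Zou et al. [14]; its single `alpha` parameter plays the role of the $\ell_1$ penalties $\lambda_{1,j}$. The toy data and all names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 12))          # toy (n x p) data matrix
X -= X.mean(axis=0)                         # SPCA assumes centered columns

spca = SparsePCA(n_components=4, alpha=1.0, random_state=0)
scores = spca.fit_transform(X)              # (n x k) sparse PC scores
B_hat = spca.components_.T                  # (p x k) sparse loading matrix
print("fraction of zero loadings:", (B_hat == 0).mean())
```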

Next, we introduce the *MIX-SPCR* model for the segmentation of time-series data.

#### *2.2. Mixtures of SPCR Model for Time-Series Data*

Suppose the continuous response variable is denoted as $\mathbf{y} = \{y_i \mid 1 \le i \le n\}$, where *n* represents the number of observations (time points). Similarly, we denote the predictors as $\mathbf{X} = \{\mathbf{x}_i \mid 1 \le i \le n\}$. Each observation $\mathbf{x}_i$ has *p* dimensions and is represented as $\mathbf{x}_i = [x_{1,i}, x_{2,i}, \cdots, x_{p,i}]^T$. Both the response variable and the independent variables are collected sequentially, labeled by time points $T = [t_1, t_2, \cdots, t_n]$.

The finite mixture model allows cluster analysis to partition conditionally dependent data into several classes. In the time-series data scenario, researchers cluster the data $((t_1, \mathbf{x}_1, y_1), (t_2, \mathbf{x}_2, y_2), \cdots, (t_n, \mathbf{x}_n, y_n))$ into several homogeneous groups, where the number of groups *G* is unknown in general. Within each group, we apply the SPCA to extract the top *k* principal components, each of which has a sparse loading of *p* variable coefficients. The extracted top *k* PCs are denoted as the matrix $\mathbf{P}_{p \times k}$. We also use $\mathbf{P}_g$ to represent the principal component matrix obtained from the group indexed by $g = 1, 2, \ldots, G$.

The SPCR model assumes that each pair $(\mathbf{x}_i, y_i)$ is independently drawn from a cluster, combining the SPCA with the regression model as follows:

$$y_i = \mathbf{x}_i^T \mathbf{P}_g \beta_g + \varepsilon_{i,g}, \quad i = 1, 2, \cdots, n, \tag{5}$$

where $\beta_g = \left[\beta_{g,1}, \beta_{g,2}, \cdots, \beta_{g,k}\right]^T$.

For each group *g*, the random error is assumed to be Gaussian distributed, that is, $\varepsilon_{i,g} \sim \mathcal{N}(0, \sigma_g^2)$. If the response variable is multivariate, then the random error is usually also assumed to follow a multivariate Gaussian distribution. Thus, the probability density function (pdf) of the SPCR model is

$$f(y_i \mid \mathbf{x}_i, \mathbf{P}_g, \beta_g) = \mathcal{N}\left(y_i \mid \mathbf{x}_i^T \mathbf{P}_g \beta_g, \sigma_g^2\right). \tag{6}$$

We emphasize here that the noise (i.e., the error term) included in the statistical model is drawn from a normal distribution independently for each time-series segment, with a different value of $\sigma_g^2$ for each period. Since we use the EM algorithm to estimate the parameters of the model, the noise parameter $\sigma_g^2$ can be estimated accurately as well. Future studies will consider introducing different noise distributions, such as *α*-stable Lévy noise [25] and other non-Gaussian noise distributions, to further extend the current model.

We also consider the time factor $t_i$ in the SPCR model of time-series data to be continuous. The pdf of the time factor is

$$f(t_i \mid v_g, \sigma_g^{2,\text{time}}) = \mathcal{N}\left(t_i \mid v_g, \sigma_g^{2,\text{time}}\right), \tag{7}$$

where $v_g$ is the mean and $\sigma_g^{2,\text{time}}$ is the variance of the time segment *g*. Apart from the normal distribution, our approach can also be generalized to other distributions for the time factor, such as skewed distributions, Student's t-distribution, ARCH and GARCH time-series models, and so on.

As a result, if we use the *MIX-SPCR* model to perform segmentation of time-series data, the likelihood function of the whole data $((t_1, \mathbf{x}_1, y_1), (t_2, \mathbf{x}_2, y_2), \cdots, (t_n, \mathbf{x}_n, y_n))$ with *G* clusters (or segments) is given by

$$L = \prod_{i=1}^{n} \prod_{g=1}^{G} \left[ \pi_g f(y_i \mid \mathbf{x}_i, \mathbf{P}_g, \beta_g) f(t_i \mid v_g, \sigma_g^{2,\text{time}}) \right]^{z_{g,i}}, \tag{8}$$

where $\pi_g$ is the mixing proportion with the constraints $\pi_g \ge 0$ and $\sum_{g=1}^{G} \pi_g = 1$. We follow the definition of missing values by Yang et al. [15] and let $\mathbf{Z} = \{Z_1, Z_2, \cdots, Z_n\}$. If $Z_i = g$, then $z_{g,i} = 1$; otherwise, $z_{g,i} = 0$. Then the log-likelihood function of the *MIX-SPCR* model is

$$\mathcal{L}_{\text{mix}} = \log(L) = \sum_{i=1}^{n} \sum_{g=1}^{G} z_{g,i} \log \left[ \pi_g f(y_i \mid \mathbf{x}_i, \mathbf{P}_g, \beta_g) f(t_i \mid v_g, \sigma_g^{2,\text{time}}) \right] = \sum_{i=1}^{n} \sum_{g=1}^{G} z_{g,i} \left[ \log \pi_g + \log f(y_i \mid \mathbf{x}_i, \mathbf{P}_g, \beta_g) + \log f(t_i \mid v_g, \sigma_g^{2,\text{time}}) \right] \tag{9}$$

$$\mathcal{L}_{\text{mix}} = \underbrace{\sum_{i=1}^{n} \sum_{g=1}^{G} z_{g,i} \log \pi_g}_{\mathcal{L}_{\pi}} + \underbrace{\sum_{i=1}^{n} \sum_{g=1}^{G} z_{g,i} \log f(y_i \mid \mathbf{x}_i, \mathbf{P}_g, \beta_g)}_{\mathcal{L}_{\text{SPCR}}} + \underbrace{\sum_{i=1}^{n} \sum_{g=1}^{G} z_{g,i} \log f(t_i \mid v_g, \sigma_g^{2,\text{time}})}_{\mathcal{L}_{\text{time}}}. \tag{10}$$

We denote $\mathbf{z} = \left\{ z_{g,i} \right\}$, where $g = 1, 2, \cdots, G$ and $i = 1, 2, \cdots, n$.

Given the number of segments, researchers usually apply the EM algorithm to determine the optimal segmentation by setting the objective function as $J_{\text{EM}} = \mathcal{L}_{\text{mix}}$ (Gaffney and Smyth [26], Esling and Agon [27], Gaffney [28]).

#### **3. Regularized Entropy-Based EM Clustering Algorithm**

The EM algorithm is a method for iteratively optimizing an objective function. As discussed in Section 2.2, by setting the objective function to the log-likelihood function, we can use the EM algorithm to identify the optimal segmentation of the time series.

However, in practice, the EM algorithm is sensitive to model initialization conditions and cannot estimate the number of clusters appropriately. To deal with the initialization problem, in 2012, Yang et al. [15] proposed using an entropy penalty to stabilize the computation of each step. The improved method is called the *robust EM algorithm*. In this paper, we extend the robust EM algorithm to deal with time-series data for the *MIX-SPCR* model.

In Section 3.1, we discuss the entropy term of the robust EM algorithm. Then, we show the extension of the robust EM algorithm for the *MIX-SPCR* model in Sections 3.2 and 3.3.

#### *3.1. The Entropy of EM Mixture Probability*

As introduced in Equation (8), $\pi_g$ represents the mixture probability of each cluster or segment. In other words, the value of $\pi_g$ is the probability that a data point belongs to group *g*. The clustering complexity is determined by the number of clusters and the corresponding probability values, which can be measured using entropy. Given $\{\pi_g \mid 1 \le g \le G\}$, the entropy of $Z_i$ is

$$H(Z_i \mid \{\pi_g \mid 1 \le g \le G\}) = -\sum_{g=1}^{G} \pi_g \log(\pi_g), \quad i = 1, 2, \ldots, n. \tag{11}$$

Then the entropy of **Z** is written as,

$$\begin{aligned} H(\mathbf{Z} \mid \{\pi_g \mid 1 \le g \le G\}) &= \sum_{i=1}^{n} H(Z_i \mid \{\pi_g \mid 1 \le g \le G\}) \\ &= -\sum_{i=1}^{n} \sum_{g=1}^{G} \pi_g \log(\pi_g) \\ &= -n \sum_{g=1}^{G} \pi_g \log(\pi_g). \end{aligned} \tag{12}$$
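As a quick numerical illustration (our own sketch, not taken from the paper), the entropy penalty of Equation (12) is a one-line computation given the mixing proportions:

```python
import numpy as np

def mixture_entropy(pi, n):
    """Entropy penalty H(Z | {pi_g}) = -n * sum_g pi_g * log(pi_g), Equation (12)."""
    pi = np.asarray(pi, dtype=float)
    return -n * np.sum(pi * np.log(pi))

print(mixture_entropy([0.7, 0.3], n=4000))   # entropy of a two-segment mixture
```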

The objective function of the robust EM algorithm is

$$\mathcal{L}_{\text{Robust-EM}} = \mathcal{L}_{\text{mix}} - \lambda_{\text{Robust-EM}} H(\mathbf{Z} \mid \{\pi_g \mid 1 \le g \le G\}), \tag{13}$$

where $\lambda_{\text{Robust-EM}} \ge 0$. The log-likelihood term $\mathcal{L}_{\text{mix}}$ from Equation (9) measures the goodness-of-fit. Next, we present the steps of the EM algorithm for maximizing the objective function in Equation (13).

#### *3.2. E-Step (Expectation)*

From a Bayesian perspective, we let $\hat{z}_{g,i}$ denote the posterior probability that the data triplet $(t_i, \mathbf{x}_i, y_i)$ is drawn from group *g*. Using Bayes' theorem, we have

$$\hat{z}_{g,i} = \mathbb{E}(Z_i = g \mid y_i, \mathbf{x}_i, \mathbf{P}_g, \beta_g) \tag{14}$$

$$= \frac{\pi_g \, \mathcal{N}\left(y_i \mid \mathbf{x}_i^T \mathbf{P}_g \beta_g, \sigma_g^2\right) \mathcal{N}\left(t_i \mid v_g, \sigma_g^{2,\text{time}}\right)}{\sum_{h=1}^{G} \pi_h \, \mathcal{N}\left(y_i \mid \mathbf{x}_i^T \mathbf{P}_h \beta_h, \sigma_h^2\right) \mathcal{N}\left(t_i \mid v_h, \sigma_h^{2,\text{time}}\right)}. \tag{15}$$
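A minimal Python sketch of this E-step follows. It is our own illustration, assuming the per-segment parameters are stored in the lists/arrays named below; none of these names come from the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def e_step(y, X, t, P, beta, sigma2, v, sigma2_time, pi):
    """Posterior responsibilities z_hat[g, i] of Equation (15).

    P[g] is the (p x k) sparse loading matrix of segment g, beta[g] its
    (k,) regression coefficients; all argument names are illustrative.
    """
    G, n = len(pi), len(y)
    z_hat = np.zeros((G, n))
    for g in range(G):
        mu_y = X @ P[g] @ beta[g]            # fitted means x_i^T P_g beta_g
        z_hat[g] = (pi[g]
                    * norm.pdf(y, loc=mu_y, scale=np.sqrt(sigma2[g]))
                    * norm.pdf(t, loc=v[g], scale=np.sqrt(sigma2_time[g])))
    z_hat /= z_hat.sum(axis=0, keepdims=True)  # normalize over segments
    return z_hat
```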

#### *3.3. M-Step (Maximization)*

Using the robustified derivation of $\hat{\pi}_g$, the estimated mixture proportion, we have

$$\hat{\pi}_g^{\text{new}} = \hat{\pi}_g^{\text{EM}} + \hat{\lambda}_{\text{Robust-EM}} \, \hat{\pi}_g^{\text{old}} \left( \log(\hat{\pi}_g^{\text{old}}) - \sum_{h=1}^{G} \hat{\pi}_h^{\text{old}} \log(\hat{\pi}_h^{\text{old}}) \right), \tag{16}$$

where

$$\hat{\pi}_g^{\text{EM}} = \frac{1}{n} \sum_{i=1}^{n} \hat{z}_{g,i}. \tag{17}$$

We follow the recommendation of Yang et al. [15] for the value of $\hat{\lambda}_{\text{Robust-EM}}^{\text{new}}$:

$$\hat{\lambda}_{\text{Robust-EM}}^{\text{new}} = \min \left\{ \frac{\sum_{h=1}^{G} \exp\left( -\eta n \left| \hat{\pi}_h^{\text{new}} - \hat{\pi}_h^{\text{old}} \right| \right)}{G}, \; \frac{1 - \max\left\{ \sum_{i=1}^{n} \hat{z}_{h,i}^{\text{old}} / n \;\middle|\; h = 1, 2, \cdots, G \right\}}{-\max\left\{ \hat{\pi}_h^{\text{old}} \;\middle|\; h = 1, 2, \cdots, G \right\} \sum_{h=1}^{G} \hat{\pi}_h^{\text{old}} \log \hat{\pi}_h^{\text{old}}} \right\}, \tag{18}$$

where

$$\eta = \min\left\{1, \, 0.5^{\lfloor p/2 - 1 \rfloor}\right\}, \tag{19}$$

and *p* is the number of variables in the model.

We iterate the E-step and M-step until convergence to obtain the parameter estimates. In particular, the $\beta_g$ values are updated by maximizing $\mathcal{L}_{\text{Robust-EM}}$ from Equation (13). Since we fix the number of segments and principal components during each E-step and M-step, the updated values of $\beta_g$ and $\sigma_g$ can be calculated using $\mathcal{L}_{\text{mix}}$ directly. The estimated values of $\beta_g$ and $\sigma_g$ are given as follows:

$$\begin{aligned} \hat{\beta}_g^{\text{new}} &= \left[\sum_{i=1}^{n} \hat{z}_{g,i}^{\text{old}} (\mathbf{x}_i^T \mathbf{P}_g)^T (\mathbf{x}_i^T \mathbf{P}_g)\right]^{-1} \sum_{i=1}^{n} \hat{z}_{g,i}^{\text{old}} (\mathbf{x}_i^T \mathbf{P}_g)^T y_i \\ &= \left[\sum_{i=1}^{n} \hat{z}_{g,i}^{\text{old}} \mathbf{P}_g^T \mathbf{x}_i \mathbf{x}_i^T \mathbf{P}_g\right]^{-1} \sum_{i=1}^{n} \hat{z}_{g,i}^{\text{old}} \mathbf{P}_g^T \mathbf{x}_i y_i, \end{aligned} \tag{20}$$

$$\hat{\sigma}_g^{2,\text{new}} = \frac{\sum_{i=1}^{n} \hat{z}_{g,i}^{\text{old}} \left\| y_i - \mathbf{x}_i^T \mathbf{P}_g \hat{\beta}_g^{\text{new}} \right\|_2^2}{\sum_{i=1}^{n} \hat{z}_{g,i}^{\text{old}}}. \tag{21}$$

For the time factor, the estimated mean $\hat{v}_g$ and variance $\hat{\sigma}_g^{2,\text{time}}$ are

$$\hat{v}_g = \frac{\sum_{i=1}^{n} \hat{z}_{g,i} t_i}{\sum_{i=1}^{n} \hat{z}_{g,i}}, \tag{22}$$

$$\hat{\sigma}_g^{2,\text{time}} = \frac{\sum_{i=1}^{n} \hat{z}_{g,i} \left( t_i - \hat{v}_g \right)^2}{\sum_{i=1}^{n} \hat{z}_{g,i}}. \tag{23}$$
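The updates in Equations (20)–(23) are closed-form weighted least-squares and weighted-moment computations, so they translate directly into code. The following sketch (our illustration, with hypothetical names) performs the M-step for a single segment $g$:

```python
import numpy as np

def m_step_segment(y, X, t, P_g, z_g):
    """M-step updates for one segment g (Equations (20)-(23)).

    z_g holds the current responsibilities z_hat[g, :] from the E-step.
    """
    S = X @ P_g                               # (n x k) transformed design x_i^T P_g
    W = z_g[:, None] * S                      # responsibility-weighted rows
    beta_new = np.linalg.solve(S.T @ W, W.T @ y)            # Equation (20)
    resid = y - S @ beta_new
    sigma2_new = np.sum(z_g * resid**2) / z_g.sum()         # Equation (21)
    v_new = np.sum(z_g * t) / z_g.sum()                     # Equation (22)
    sigma2_time_new = np.sum(z_g * (t - v_new)**2) / z_g.sum()  # Equation (23)
    return beta_new, sigma2_new, v_new, sigma2_time_new
```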

As discussed above, our approach is flexible in considering other distributional models for the time-series factor, which we will pursue in separate research work.

#### **4. Information Complexity Criteria**

Model selection is now recognized as an important technical area in the statistical literature. In this area, entropy and the Kullback–Leibler [29] information (or KL distance) play a crucial role and serve as an analytical basis for deriving model selection criteria. In this paper, we use information criteria to evaluate a portfolio of competing models and select the best-fitting model with the minimum criterion value.

One of the first information criteria for model selection in the literature is due to the seminal work of Akaike [30]. Following the entropy maximization principle (EMP), Akaike developed the Akaike Information Criterion (AIC) to estimate the expected KL distance or divergence. The form of AIC is

$$\text{AIC} = -2 \log L(\hat{\theta}) + 2k, \tag{24}$$

where $L(\hat{\theta})$ is the maximized likelihood function, and *k* is the number of estimated free parameters in the model. The model with the minimum AIC value is chosen as the best model to fit the data.

Motivated by Akaike's work, Bozdogan [16–20,31] developed new information complexity (ICOMP) criteria based on Van Emden's [32] entropic complexity index in parametric estimation. Instead of penalizing the number of free parameters directly, ICOMP penalizes the covariance complexity of the model. There are several forms of ICOMP. In this section, we present two general forms of the ICOMP criteria based on the estimated inverse Fisher information matrix (IFIM). The first form is

$$\text{ICOMP(IFIM)} = -2 \log L(\hat{\theta}) + 2 C(\hat{\Sigma}_{\text{model}}) = -2 \log L(\hat{\theta}) + 2 C_1(\hat{\mathcal{F}}^{-1}), \tag{25}$$

where $L(\hat{\theta})$ is the maximized likelihood function, and $C_1(\hat{\mathcal{F}}^{-1})$ represents the entropic complexity of IFIM. We define $C_1(\hat{\mathcal{F}}^{-1})$ as

$$C_1(\hat{\mathcal{F}}^{-1}) = \frac{s}{2} \log \left( \frac{\operatorname{tr}(\hat{\mathcal{F}}^{-1})}{s} \right) - \frac{1}{2} \log \left| \hat{\mathcal{F}}^{-1} \right|, \tag{26}$$

where $s = \operatorname{rank}(\hat{\mathcal{F}}^{-1})$. We can also express $C_1(\hat{\mathcal{F}}^{-1})$ in terms of eigenvalues:

$$C_1(\hat{\mathcal{F}}^{-1}) = \frac{s}{2} \log \left( \frac{\bar{\lambda}_a}{\bar{\lambda}_g} \right), \tag{27}$$

where $\bar{\lambda}_a$ is the arithmetic mean of the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_s$, and $\bar{\lambda}_g$ is their geometric mean.

We note that ICOMP penalizes both the lack of parsimony and the profusion of the model's complexity through IFIM. It offers a new perspective beyond counting and penalizing the number of estimated parameters in the model: ICOMP takes into account the interaction (i.e., correlation) among the estimated parameters through the model fitting process.

We define the second form of ICOMP as

$$\text{ICOMP(IFIM)}_{C_{1F}} = -2 \log L(\hat{\theta}) + 2 C_{1F}(\hat{\mathcal{F}}^{-1}), \tag{28}$$

where $C_{1F}(\hat{\mathcal{F}}^{-1})$ is given by

$$C_{1F}(\hat{\mathcal{F}}^{-1}) = \frac{s}{4} \cdot \frac{\frac{1}{s} \operatorname{tr}\left( (\hat{\mathcal{F}}^{-1})^T (\hat{\mathcal{F}}^{-1}) \right) - \left( \frac{\operatorname{tr}(\hat{\mathcal{F}}^{-1})}{s} \right)^2}{\left( \frac{\operatorname{tr}(\hat{\mathcal{F}}^{-1})}{s} \right)^2}. \tag{29}$$

In terms of the eigenvalues of IFIM, we write $C_{1F}(\hat{\mathcal{F}}^{-1})$ as

$$C_{1F}(\hat{\mathcal{F}}^{-1}) = \frac{1}{4 \bar{\lambda}_a^2} \sum_{j=1}^{s} \left( \lambda_j - \bar{\lambda}_a \right)^2. \tag{30}$$

We want to highlight some features of $C_{1F}(\hat{\mathcal{F}}^{-1})$ here. The term $C_{1F}(\hat{\mathcal{F}}^{-1})$ is a second-order equivalent measure of complexity to the original $C_1(\hat{\mathcal{F}}^{-1})$. Additionally, $C_{1F}(\hat{\mathcal{F}}^{-1})$ is scale-invariant and satisfies $C_{1F}(\hat{\mathcal{F}}^{-1}) \ge 0$, with $C_{1F}(\hat{\mathcal{F}}^{-1}) = 0$ only when all $\lambda_j = \bar{\lambda}_a$. Furthermore, $C_{1F}(\hat{\mathcal{F}}^{-1})$ measures the relative variation in the eigenvalues.
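Since both complexity measures are simple functions of the eigenvalues of the estimated IFIM, they are easy to compute. A minimal sketch (ours, assuming a symmetric positive-definite input so that all eigenvalues are positive) follows:

```python
import numpy as np

def c1(F_inv):
    """Entropic complexity C1 of Equation (26), computed via Equation (27)."""
    lam = np.linalg.eigvalsh(F_inv)          # eigenvalues of the symmetric IFIM
    s = len(lam)
    arith = lam.mean()                       # arithmetic mean of eigenvalues
    geom = np.exp(np.mean(np.log(lam)))      # geometric mean (lam > 0 assumed)
    return 0.5 * s * np.log(arith / geom)

def c1f(F_inv):
    """Second-order (Frobenius-norm) complexity C1F of Equation (30)."""
    lam = np.linalg.eigvalsh(F_inv)
    arith = lam.mean()
    return np.sum((lam - arith) ** 2) / (4.0 * arith**2)
```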

These two forms of ICOMP provide easy-to-use computational tools for high dimensional modeling. Next, we derive the analytical forms of ICOMP for the *MIX-SPCR* model.

#### *4.1. Derivation of Information Complexity in* MIX-SPCR *Model for Time-Series Data*

We first consider the log-likelihood function of the *MIX-SPCR* model given in Equation (9),

$$\mathcal{L}_{\text{mix}} = \mathcal{L}_{\pi} + \mathcal{L}_{\text{SPCR}} + \mathcal{L}_{\text{time}}. \tag{31}$$

After some work, the estimated inverse Fisher information matrix (IFIM) of the mixture probabilities is

$$\hat{\mathcal{F}}_{\pi}^{-1} = \begin{bmatrix} \hat{\pi}_1^{-1} & 0 & \cdots & 0 \\ 0 & \hat{\pi}_2^{-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{\pi}_G^{-1} \end{bmatrix}. \tag{32}$$

Similarly, for each segment *g*, the estimated IFIM, $\hat{\mathcal{F}}_{g,\text{SPCR}}^{-1}$, is

$$\hat{\mathcal{F}}_{g,\text{SPCR}}^{-1} = \begin{bmatrix} \hat{\sigma}_g^2 \left[ \sum_{i=1}^{n} \hat{z}_{g,i} \left( \mathbf{x}_i^T \mathbf{P}_g \right)^T \left( \mathbf{x}_i^T \mathbf{P}_g \right) \right]^{-1} & \mathbf{0} \\ \mathbf{0}^T & 2 \hat{\sigma}_g^4 \left( \sum_{i=1}^{n} \hat{z}_{g,i} \right)^{-1} \end{bmatrix}, \quad g = 1, 2, \ldots, G. \tag{33}$$

Note that the IFIM should include both the SPCR model block $\hat{\mathcal{F}}_{g,\text{SPCR}}^{-1}$ and the time factor block $\hat{\mathcal{F}}_{g,\text{time}}^{-1}$ for each segment.

For each segment *g*, the time factor follows a univariate Gaussian distribution. As a result, the IFIM of the time factor is

$$\hat{\mathcal{F}}_{g,\text{time}}^{-1} = \begin{bmatrix} \hat{\sigma}_g^{2,\text{time}}/n & 0 \\ 0 & \frac{2}{n} \hat{\sigma}_g^{4,\text{time}} \end{bmatrix}. \tag{34}$$

By combining the two IFIMs for the SPCR model and the time factor, we have the inverse Fisher information matrix for segment *g*:

$$\hat{\mathcal{F}}_g^{-1} = \begin{bmatrix} \hat{\mathcal{F}}_{g,\text{SPCR}}^{-1} & \mathbf{0} \\ \mathbf{0}^T & \hat{\mathcal{F}}_{g,\text{time}}^{-1} \end{bmatrix}. \tag{35}$$

Overall, the inverse of the estimated Fisher information matrix (IFIM) for the *MIX-SPCR* model becomes

$$\hat{\mathcal{F}}^{-1} \cong \begin{bmatrix} \hat{\mathcal{F}}_{\pi}^{-1} & 0 & 0 & \cdots & 0 \\ 0 & \hat{\mathcal{F}}_1^{-1} & 0 & \cdots & 0 \\ 0 & 0 & \hat{\mathcal{F}}_2^{-1} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \hat{\mathcal{F}}_G^{-1} \end{bmatrix}. \tag{36}$$

Using the above definition of ICOMP(IFIM) and the properties of block-diagonal matrices with their trace and determinant, we have

$$\text{ICOMP(IFIM)} = -2 \mathcal{L}_{\text{mix}} + 2 C_1(\hat{\mathcal{F}}^{-1}), \tag{37}$$

where

$$C_1(\hat{\mathcal{F}}^{-1}) = \frac{s}{2} \log \left[ \frac{\operatorname{tr}(\hat{\mathcal{F}}_{\pi}^{-1}) + \sum_{g=1}^{G} \operatorname{tr}(\hat{\mathcal{F}}_g^{-1})}{s} \right] - \frac{1}{2} \left[ \log \left| \hat{\mathcal{F}}_{\pi}^{-1} \right| + \sum_{g=1}^{G} \log \left| \hat{\mathcal{F}}_g^{-1} \right| \right], \tag{38}$$

and where $s = \operatorname{rank}(\hat{\mathcal{F}}^{-1}) = r_{\pi} + \sum_{g=1}^{G} r_g = \dim(\hat{\mathcal{F}}^{-1})$.

Similarly, we derive the second equivalent form, $\text{ICOMP(IFIM)}_{C_{1F}}$, as

$$\text{ICOMP(IFIM)}_{C_{1F}} = -2 \mathcal{L}_{\text{mix}} + 2 C_{1F}(\hat{\mathcal{F}}^{-1}). \tag{39}$$

Using the properties of the block-diagonal matrices, we have

$$\operatorname{tr}\left( (\hat{\mathcal{F}}^{-1})^T (\hat{\mathcal{F}}^{-1}) \right) = \operatorname{tr}\left( (\hat{\mathcal{F}}_{\pi}^{-1})^2 \right) + \sum_{g=1}^{G} \operatorname{tr}\left( (\hat{\mathcal{F}}_g^{-1})^2 \right). \tag{40}$$

Thus, an open computational form of $\text{ICOMP(IFIM)}_{C_{1F}}$ becomes

$$\text{ICOMP(IFIM)}_{C_{1F}} = -2 \mathcal{L}_{\text{mix}} + \frac{s}{2} \cdot \frac{\frac{1}{s} \left[ \operatorname{tr}\left( (\hat{\mathcal{F}}_{\pi}^{-1})^2 \right) + \sum_{g=1}^{G} \operatorname{tr}\left( (\hat{\mathcal{F}}_g^{-1})^2 \right) \right] - \left[ \frac{\operatorname{tr}(\hat{\mathcal{F}}_{\pi}^{-1}) + \sum_{g=1}^{G} \operatorname{tr}(\hat{\mathcal{F}}_g^{-1})}{s} \right]^2}{\left[ \frac{\operatorname{tr}(\hat{\mathcal{F}}_{\pi}^{-1}) + \sum_{g=1}^{G} \operatorname{tr}(\hat{\mathcal{F}}_g^{-1})}{s} \right]^2}. \tag{41}$$

We note that in computing both forms of ICOMP above, we do not need to build the full inverse of the estimated Fisher information matrix (IFIM) for the *MIX-SPCR* model given in Equation (36). All one requires is the computation of IFIM for each segment, which is appealing.
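A sketch of this block-wise computation is given below (our illustration, with hypothetical names): because the trace and log-determinant of the block-diagonal IFIM in Equation (36) decompose additively over the blocks, ICOMP(IFIM) of Equations (37) and (38) never requires assembling the full matrix.

```python
import numpy as np

def icomp_ifim(loglik_mix, blocks):
    """ICOMP(IFIM) of Equation (37) from the diagonal blocks of Equation (36).

    blocks = [F_pi_inv, F_1_inv, ..., F_G_inv], the per-block IFIMs.
    """
    s = sum(b.shape[0] for b in blocks)                      # total rank
    total_trace = sum(np.trace(b) for b in blocks)
    total_logdet = sum(np.linalg.slogdet(b)[1] for b in blocks)
    c1_val = 0.5 * s * np.log(total_trace / s) - 0.5 * total_logdet  # Equation (38)
    return -2.0 * loglik_mix + 2.0 * c1_val
```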

We also use AIC and CAIC (Bozdogan [33]) for comparison purposes, given by

$$\text{AIC} = -2 \mathcal{L}_{\text{mix}} + 2 s^*, \tag{42}$$

and

$$\text{CAIC} = -2 \mathcal{L}_{\text{mix}} + s^* \left( \log n + 1 \right), \tag{43}$$

where $s^* = G(k + 3)$ is the number of estimated parameters in the *MIX-SPCR* model, and $\log n$ denotes the natural logarithm of the sample size *n*.
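For completeness, a one-function sketch (ours) of Equations (42) and (43):

```python
import numpy as np

def aic_caic(loglik_mix, G, k, n):
    """AIC and CAIC of Equations (42)-(43) with s* = G(k + 3) free parameters."""
    s_star = G * (k + 3)
    aic = -2.0 * loglik_mix + 2.0 * s_star
    caic = -2.0 * loglik_mix + s_star * (np.log(n) + 1.0)
    return aic, caic
```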

Next, we show our numerical examples starting with a detailed Monte Carlo simulation study.

#### **5. Monte Carlo Simulation Study**

We perform numerical experiments in a unified computing environment: Ubuntu 18.04 operating system, Intel I7-8700, and 32 GB of RAM. We use the programming language Python and the scientific computing package NumPy [34] to build a computational platform. The size of the input data directly affects the running time of the program. At *n* = 4000 time-series observations, the execution time for each EM iteration is about 0.9 s. Parameter estimation can reach convergence within 40 steps of iterations, with a total machine run time of 37 s.

#### *5.1. Simulation Protocol*

In this section, we present the performance of the proposed *MIX-SPCR* model using synthetic data generated from a segmented regression model. Our simulation protocol has $p = 12$ variables and four actual latent variables. Two segmented regression models determine the dependent variable *y*; each segment is continuous and has its own specified coefficients ($\beta_1$ and $\beta_2$). Our simulation setup is as follows.


We set the total number of time-series observations to $n = 4000$. The first segment has $n_1 = 2800$, and the second segment has $n_2 = 1200$ time-series observations. We randomly draw the error term from a Gaussian distribution with zero mean and $\sigma^2 = 9$. Among all the variables, the first six observable variables explain the first segment, and the remaining six explanatory variables primarily determine the second segment. We set the mixing proportions $\pi_1 = 0.7$ and $\pi_2 = 0.3$ for the two time-series segments, respectively.
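A sketch of a data-generating process in this spirit is shown below. The exact latent-factor loadings and regression coefficients of the original protocol are not reproduced in this excerpt, so the values here are illustrative stand-ins consistent with the stated dimensions, noise level, and segment sizes.

```python
import numpy as np

rng = np.random.default_rng(42)
n1, n2, p = 2800, 1200, 12                  # segment sizes and dimension

# Hypothetical latent-factor design: four Gaussian latent variables drive
# the twelve observed predictors.
Z = rng.standard_normal((n1 + n2, 4))
L = rng.standard_normal((4, p))
X = Z @ L + 0.1 * rng.standard_normal((n1 + n2, p))

# Segment-specific coefficients (illustrative values): the first six
# predictors drive segment 1, the last six drive segment 2.
beta1 = np.r_[rng.uniform(1, 3, 6), np.zeros(6)]
beta2 = np.r_[np.zeros(6), rng.uniform(1, 3, 6)]
eps = rng.normal(0.0, 3.0, n1 + n2)         # Gaussian noise, sigma^2 = 9
y = np.r_[X[:n1] @ beta1, X[n1:] @ beta2] + eps
```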

#### *5.2. Detection of Structural Change Point*

In the first simulation study, we limit the actual number of segments to two, which means that the first segment expands from the starting point to a structural change point, and the second segment expands from the change point to the end. By design, each segment is continuous on the time scale, and different sets of independent variables explain the trend and volatility. We run the *MIX-SPCR* model to see if it can successfully determine the position of the change point using the information criteria. If a change point is correctly selected, we expect the information criteria to be minimized at this change point.

Figures 2 and 3 show our results from the *MIX-SPCR* model; Figure 3 shows the sample path of the information criteria at each candidate time point. We note that all the information criteria are minimized between $t = 2800$ and $t = 3000$, which covers the actual change point position of the time-series. As the *MIX-SPCR* model selects different change points, the penalty terms of AIC and CAIC remain the same because neither the number of model parameters nor the number of observations changes. In this simulation scenario, the fixed penalty term means that AIC and CAIC reflect only the changes in the "lack of fit" term across models, without considering model complexity. This indicates that AIC-type criteria, which just count and penalize the number of parameters, may be necessary but not sufficient for model selection. In contrast, the penalty terms of the information complexity-based criteria, $C_1$ and $C_{1F}$, adjust as different change points are selected; they vary rather than remaining fixed.

**Figure 2.** The plot of two-segment simulated time-series data. We show the plot of the simulated time-series data through the whole-time scale. Note that the first segment is from the starting point *t* = 1 to the change point *t* = 2800, and the second time segment expands from the change point *t* = 2801 to the end *t* = 4000.

**Figure 3.** Sample path of information criteria for the simulated time-series data. The horizontal coordinate represents the position of the possible change points, and the vertical coordinate represents the corresponding information criterion (IC) values. The lower the IC value, the more likely the selected position of the change point is the real position. The real change point is *t* = 2800.

#### *5.3. A Large-Scale Monte Carlo Simulation*

Next, we perform a large-scale Monte Carlo simulation to illustrate the *MIX-SPCR* model's performance in choosing the correct number of segments and latent variables. In this simulation, we pretend a priori that we do not know the actual structure of the data and use the information criteria to recover the true construction of the *MIX-SPCR* model. To achieve this, we follow the above simulation protocol with a varying number of time points, $n = 1000, 2000, 4000$. As before, there are twelve explanatory variables driven by four latent variables generated from a multivariate Gaussian distribution, as given in Equation (47). The simulated data again consist of two time-series segments with mixing proportions $\pi_1 = 0.7$ and $\pi_2 = 0.3$, respectively. For each data generating process, we replicate the simulation one hundred times and record both the information complexity-based criteria ($\text{ICOMP(IFIM)}$ and $\text{ICOMP(IFIM)}_{C_{1F}}$) and the classic AIC-type criteria (AIC and CAIC). In Table 1, we report how many times the *MIX-SPCR* model selects each candidate model in the one hundred simulations. In this way, we can assess the different information criteria by measuring their hit rates.

Looking at Table 1, we see that when the sample size is small ($n = 1000$), AIC selects the correct model ($G = 2$, $k = 4$) 69 times, CAIC 80 times, ICOMP(IFIM) 48 times, and $\text{ICOMP(IFIM)}_{C_{1F}}$ 76 times in 100 replications of the Monte Carlo simulation. When the sample size is small, ICOMP(IFIM) tends to choose a sparser regression model and is sensitive to the sample size. However, as the sample size increases to $n = 2000$ and $n = 4000$, ICOMP(IFIM) consistently outperforms the other information criteria in terms of hit rates, with the percentage of correctly identified models above 90%.


**Table 1.** Frequency of the choice of the true model with information criteria in 100 replications of the experiment for each sample size (*n*) of time-series observations. The true model is *G* = 2 and *k* = 4.

Our results show that the *MIX-SPCR* model works well in all settings to estimate the number of time-series segments and the number of latent variables.

Figure 4 illustrates how the *MIX-SPCR* model performs if the number of segments and the number of sparse principal components are unknown beforehand.

The choice of the number of segments ($G$) has a significant impact on the results. For all the simulation scenarios, the correct choice of the number of segments ($G = 2$) yields information criterion values lower than the incorrect choice ($G = 3$). This pattern emerges consistently across all the sample sizes, for both the classical and the information complexity-based criteria.

In summary, the large-scale Monte Carlo simulation analysis highlights the performance of the *MIX-SPCR* model. As the sample size increases, the *MIX-SPCR* model improves its performance. As shown in Figure 3, the *MIX-SPCR* model can efficiently determine the structural change point and estimate the mixture proportions when the number of segments is unknown beforehand. Another key finding is that, by using the appropriate information criteria, the *MIX-SPCR* model can correctly identify the number of segments and the number of latent variables from the data. In other words, our approach can not only extract the main factors from the intercorrelated variables but also classify the data into several clearly defined segments on the time scale.

**Figure 4.** Plot of average and 1SD (standard deviation) of information criterion values over different sample sizes in all simulations with three Sparse Principal Components (SPCs) and *G* = 2 segments. The red line indicates the estimated *MIX-SPCR* model based on two groups (*G* = 2). Correspondingly, the black line indicates the estimated *MIX-SPCR* model for three groups (*G* = 3). Horizontal coordinates represent different numbers of SPCs.

#### **6. Case Study: Segmentation of the S&P 500 Index**

#### *6.1. Description of Data*

The financial market generates a large amount of time-series data, and in most cases the generated data are high-dimensional. In this paper, we use the S&P 500 index and hundreds of its constituent company stocks, categorized into eleven sectors, which together form a high dimensional time-series data set. The index value is the response variable, mixing many companies' variations at each time point. These long time-series often consist of different regimes and states. For example, the stock market experienced a boom period from 2017 to 2019, a dramatic change compared with the market during the 2008 financial crisis. If we analyze each sector or company, some industries perform more actively than others during a particular period.

In this section, we implement the *MIX-SPCR* model on the adjusted closing price of the S&P 500 (^GSPC) as a case study. We extract the daily adjusted closing prices from the Yahoo Finance database (https://finance.yahoo.com/), spanning the period from 1 January 1999 to 31 December 2019. By removing weekends and holidays, there are $n = 5292$ tradable days in total. The main focus of this section is to split the time-series into several self-contained segments. In addition, we expect the extracted sparse principal components to explain the variance and volatility in each segment.
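For readers who wish to reproduce the data pull, a hypothetical retrieval using the open-source `yfinance` package is sketched below; the paper names only the Yahoo Finance database, not a specific client, so the package choice is our assumption.

```python
import yfinance as yf

# end is exclusive, so request one day past 31 December 2019;
# auto_adjust=False keeps the "Adj Close" column used in the study.
gspc = yf.download("^GSPC", start="1999-01-01", end="2020-01-01",
                   auto_adjust=False)["Adj Close"]
print(len(gspc))   # roughly 5292 tradable days
```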

#### *6.2. Computational Results*

To give a big picture of how the S&P 500 index values reflect the changes in 506 company stock prices, Figure 5 shows the plot of the normalized adjusted closing prices. We use the *MIX-SPCR* model with the information criteria to determine the number of segments and the number of sparse principal components. To achieve interpretable results, we limit our search space to a maximum of seven time-series segments and six sparse principal components. Table 2 shows that the optimal combination is three self-contained segments with three sparse principal components for each segment, selected by the information complexity criterion ICOMP(IFIM). The other three information criteria also choose this combination as the best-fitting model. Figure 6 illustrates the probability and time range of each segment. We can see that the first segment is from 1 January 1999 to 26 October 2007. The second time-series segment spans from 29 October 2007 to the end of 2016. The last segment extends from 30 December 2016 to 31 December 2019.

**Figure 5.** Normalized S&P 500 index and stock prices from January 1999 to December 2019.

**Figure 6.** Segmented periods and probability. The plot's vertical coordinate indicates the probability that an individual time-series data point belongs to each segment.


**Table 2.** The ICOMP(IFIM) values of segmentation results for S&P 500 index data (Lower is better).

We emphasize that many factors may explain the stock market variation, and this is not a study of how socioeconomic events influence the S&P 500 index. Nevertheless, the positions of the two structural change points found by the segmentation are of interest. The first change point is October 2007, the early stage of the 2008 financial crisis. The second structural change point is December 2016, the transitional period of the US presidential election. Identification of these two change points shows that our proposed method can detect the underlying physical and structural changes from the available time-series data.

Table 3 lists the estimated coefficients ($\beta_g$) from the sparse principal component regression. Because all the collected stock prices and S&P 500 index values are standardized before implementing the *MIX-SPCR* model, we perform dimension reduction, remove the constant term, and run the regression analysis using the SPCR model. The $R^2$ values are above 0.8 across all three time segments.


**Table 3.** SPCR coefficients (*βg*) of three different segments.

#### *6.3. Interpretation of Computational Results*

One may ask, "Can the *MIX-SPCR* model identify the key variables among the hundreds of companies?" If the constructed model were dense, the selected companies would include all the sectors, and the dense model would limit the interpretation of the data. Our analysis identifies all the companies with non-zero coefficient values and maps them back to their sectors in Tables A1–A3. Each calculated sparse principal component vector consists of around fifty companies, much smaller than the original data dimension ($p = 506$). We observe that these selected companies are grouped into a few sectors within different time segments. For example, energy companies load on the first sparse principal component vector from 1999 to 2007 (segment 1) and diminish after that.

For a detailed analysis of how different sectors perform across the three segments, we use stem plots to show the sparse principal component coefficients $\mathbf{P}_g$ of four sectors: financials, real estate, energy, and information technology (IT). Figures 7 and 8 indicate similar behavior for financial and real estate companies. Both sectors play an essential role in the first two time-series segments but make no contribution in the third segment, the period after December 2016. Notice that in Figure 9, energy companies act as essential players before 2016. However, during the recession in 2008, energy company loadings are negated and move from the first SPC to the second SPC. Compared with other industries, the variation in energy company stock prices does not contribute to the S&P 500 index after 2016.

Another question is, "Which sector/industry is the main contributing factor after the 2016 United States presidential election?" A possible answer, as shown in Figure 10, lies in the SPC coefficients of information technology companies. From 1999 to the recession in 2008, IT companies load mainly on the second and third SPCs, which do not contribute much to the main variation. After the recession, the variations of IT companies contribute little compared with other sectors. However, after December 2016, companies from the IT industry play an essential role in the primary stock price volatility.

**Figure 7.** Stem plot of SPC coefficients **P***g* for financial companies within each time segment. From top to bottom, the three panels represent different segmented periods, respectively. The horizontal axis of each panel indicates the company in the industrial sector. The vertical axis shows the SPC coefficient values.

**Figure 8.** Stem plot of SPC coefficients **P***g* for real estate companies within each time segment. From top to bottom, the three panels represent different segmented periods, respectively. The horizontal axis of each panel indicates the company in the industrial sector. The vertical axis shows the SPC coefficient values.

**Figure 9.** Stem plot of SPC coefficients **P***g* for energy companies within each time segment. From top to bottom, the three panels represent different segmented periods, respectively. The horizontal axis of each panel indicates the company in the industrial sector. The vertical axis shows the SPC coefficient values.

**Figure 10.** Stem plot of SPC coefficients **P***g* for information technology companies within each time segment. From top to bottom, the three panels represent different segmented periods, respectively. The horizontal axis of each panel indicates the company in the industrial sector. The vertical axis shows the SPC coefficient values.

As discussed above, Figures 7–10 provide a clear picture of how different sectors perform (via the coefficients $\mathbf{P}_g$) without considering the effects on the S&P 500 index. It is also of interest how the SPCR coefficients $\mathbf{P}_g \beta_g$ change before/after certain socioeconomic events. We follow the research of Aït-Sahalia and Xiu [35] on how the Federal Reserve's addressing of heightened liquidity from 10 March to 14 March 2008 affected the stock market. The data analyzed by Aït-Sahalia and Xiu [35] are the S&P 100 index values using traditional PCA, and the authors grouped stocks into financial and non-financial categories. Instead of PCA, we apply the SPCR model to the S&P 500 index and analyze how the eleven sectors reacted before/after the Federal Reserve operations. Figure 11 shows that financials, consumer discretionary, real estate, and industrials experienced more significant perturbations than other sectors in terms of the SPCR coefficients $\mathbf{P}_g \beta_g$. This conclusion is consistent with the results of Aït-Sahalia and Xiu [35] that the average loadings of the first and second principal components of financial companies are distinct from those of non-financial companies. However, considering that we have 506 companies in the raw data and compute a sparse loading of companies for comparison, substantial explanatory power is still maintained in this high-dimensional case using the SPCR model, which is more interpretable.


**Figure 11.** Overlay plot of the SPCR coefficients before/after 2008 financial crisis.

#### **7. Conclusions and Discussions**

In this paper, we presented a novel method to segment high-dimensional time-series data into clusters or segments using the mixture of sparse principal components regression (*MIX-SPCR*) model. The *MIX-SPCR* model considers both the relationships among the predictor variables and how the predictor variables contribute explanatory power to the response variable through the sparsity settings. Information criteria were introduced and derived for the *MIX-SPCR* model, and their performance was studied under different sample sizes to select the best-fitting model.

Our large-scale Monte Carlo simulation exercise showed that the *MIX-SPCR* model can successfully identify the real structure of the time-series data using the information criteria as the fitness function. In particular, based on our results, the information complexity-based criteria (i.e., $\text{ICOMP(IFIM)}$ and $\text{ICOMP(IFIM)}_{C_{1F}}$) outperformed the conventional information criteria, such as the AIC-type criteria, as the data dimension and the sample size increase.

We then empirically applied the *MIX-SPCR* model to the S&P 500 index data (from 1999 to 2019) and identified two change points in this data set.

We observe that the first change point physically coincides with the early stages of the 2008 financial crisis. The second change point is immediately after the 2016 United States presidential election. This structural change point coincides with the election of President Trump and his transition.

Our findings showed how the S&P 500 index and company stock prices react within each time-series segment. The *MIX-SPCR* model demonstrates strong explanatory power by identifying how different sectors fluctuated before/after the Federal Reserve's addressing of heightened liquidity from 10 March to 14 March 2008.

Although this is not a traditional event study paper, it is the first paper to use the sparse principal component regression model with mixture models in time-series analysis. The proposed new and novel *MIX-SPCR* model encourages us to explore more interpretable results on how macroeconomic factors/events influence stock prices on the time scale. In a separate paper, we will incorporate the event study into the *MIX-SPCR* model as our future research initiative.

This paper's time segmentation model builds on time-series data, constructs likelihood functions, and performs parameter estimation by introducing error information unique to each period. Researchers have recently realized that environmental background noise can positively affect model building and analysis under certain circumstances [36–42]. For example, in Azpeitia and Wagner [40], the authors highlighted that the introduction of noise is necessary to obtain information about the system. In our next study, we would like to explore this positive effect of environmental noise further and use it to build better statistical models for analyzing high-dimensional time-series data.

**Author Contributions:** Conceptualization, H.B. and Y.S.; methodology, H.B. and Y.S.; software, Y.S.; validation, H.B. and Y.S.; formal analysis, H.B. and Y.S.; investigation, H.B. and Y.S.; resources, H.B. and Y.S.; data curation, Y.S.; writing–original draft preparation, H.B. and Y.S.; writing–review and editing, H.B. and Y.S.; visualization, H.B. and Y.S.; supervision, H.B.; project administration, H.B. and Y.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The first author expresses his gratitude to Bozdogan for bringing this challenging problem to his attention as part of his doctoral thesis chapter and for spending valuable time with him, which resulted in this joint work. We also express our thanks to Ejaz Ahmed for inviting us to contribute to the Special Issue of Entropy. We extend our thanks and gratitude to the anonymous reviewers, whose constructive comments further improved the paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Appendix A. Tables**


**Table A1.** Sparse Principal Component (SPC) of Segment 1 (1 January 1999 ∼ 26 October 2007).

**Table A2.** Sparse Principal Component (SPC) of Segment 2 (27 October 2007 ∼ 29 December 2016).




#### **References**



**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Forecasting Financial Time Series through Causal and Dilated Convolutional Neural Networks**

#### **Lukas Börjesson and Martin Singull \***

Department of Mathematics, Linköping University, 581 83 Linköping, Sweden; lukbo072@student.liu.se

**\*** Correspondence: martin.singull@liu.se

Received: 31 August 2020; Accepted: 25 September 2020; Published: 29 September 2020

**Abstract:** In this paper, predictions of future price movements of a major American stock index were made by analyzing past movements of the same and other correlated indices. A model that has shown very good results in audio and speech generation was modified to suit the analysis of financial data and was then compared to a base model, restricted by assumptions made for an efficient market. The performance of any model, trained by looking at past observations, is heavily influenced by how the division of the data into train, validation and test sets is made. This is further exacerbated by the temporal structure of the financial data, which means that the causal relationship between the predictors and the response is dependent on time. The complexity of the financial system further increases the struggle to make accurate predictions, but the model suggested here was still able to outperform the naive base model by more than 20% and 37%, respectively, when predicting the next day's closing price and the next day's trend.

**Keywords:** deep learning; financial time series; causal and dilated convolutional neural networks

#### **1. Introduction**

Deep learning has brought a new paradigm into machine learning in the past decade and has shown remarkable results in areas such as computer vision, speech recognition and natural language processing. However, one of the areas where it is yet to become a mainstream tool is in forecasting financial time series. This is despite the fact that time series provide a suitable data representation for deep learning methods such as a convolutional neural network (CNN) [1]. Researchers and market participants (market participants is a general expression for individuals or groups who are active in the market, such as banks, investors, investment funds, or traders (for their own account); often, we use the term *trader* as a synonym for *market participant* [2]) are still, for the most part, sticking to more historically well known and tested approaches, but there has been a slight shift of interest to deep learning methods in the past years [3]. The reason behind the shift, apart from the structure of the time series, is that the financial market is an increasingly complex system. This means that there is a need for more advanced models, such as deep neural networks, that do a better job in finding the nonlinear relations in the data.

Omar Berat Sezer et al. [3] give a very informative review of the published literature on the subject between 2005 and 2019 and state that there has been a trend towards more usage of deep learning methods in the past five years. The review covers a wide range of deep learning methods, applied to various time series such as stock market indices, commodities and forex. From the review, it is clear that CNNs are not the most widely used method and that developers have focused more on recurrent neural networks (RNNs) and long short-term memory (LSTM) networks. The CNNs are, however, very good at building up high-level features from lower-level features that were originally found in the input data, which is not something an LSTM network is primarily designed to do.

However, the CNNs and LSTM networks do not need to be used as two separate models, but they could be used as two separate parts of the same network. An example is to use a CNN to preprocess the data, in order to extract suitable features, which could then be used as the input to the LSTM part of the network [4].

Furthermore, the WaveNet structure considered in [5] suggests that the model can catch long- and short-term dependencies, which is what the LSTM is designed to do as well. Although this is something that will not be further explored in this paper, it does provide further research questions, such as whether the WaveNet can be used as a sort of preprocessing to an LSTM network. Another example would be to process the data through a CNN and an LSTM separately and then combine them, before the final output, in a suitable manner. This is something that is explored, with satisfactory results, in [6], and the CNN part of the network is in fact an implementation of the WaveNet here as well. However, only the LSTM part of the network handles the exogenous series; therefore, for future work, it would be interesting to see if the performance could be improved by making the WaveNet handle the exogenous series as well.

There are also papers where the WaveNet composes the whole network instead of being just one component. Two examples are [7,8], and these models take exogenous series into consideration as well. However, they used ReLU as the activation function instead of SeLU, but they implemented normalization of the network in a similar way as will be done in this paper. Furthermore, when considering the exogenous series, their approach regarding the output from each residual layer was different. Instead of extracting the residual from each exogenous series, which will be done in this paper, only the combined residual was used. See Section 3.2 for more details.

In contrast to the approach taken in this paper, and by all who try to fit statistical models on financial time series, there are those who state that complexity is not the issue, but instead advocate for the Efficient Market Hypothesis (EMH) [9], a theory that essentially suggests that no model, no matter how complex, can outperform the market, since the price is based on all available information. The theory rests upon three key assumptions—(1) no transaction costs, (2) cost is not a barrier to obtaining available information and (3) all market participants agree on the implications of the available information—which are stated to be sufficient, but not necessary (sufficient but not necessary means, in this context, that the assumptions do not need to be fulfilled at all times; for example, the second assumption might be loosened from including all to only a sufficient number of traders). These assumptions, even with modifications, are very bold, and there are many who have criticized the theory over the years. However, whether one agrees with the theory or not, one would probably agree with the statement that a model which satisfies the assumptions made in the EMH would indeed be suitable as a base model. This means that such a model can be used as a benchmark in order to assess the accuracy of other models.

Traders and researchers alike would furthermore agree that the price of any asset is, apart from its inner utility, based on the expectation of its future value. For example, the price of a stock is partially determined by the company's current financials, but also by the expectation of future revenues or future dividends. This expectation is, by the neoclassical economics, seen as completely rational, giving rise to the area of rational expectation [10]. However, the emergence of behavioral economics has questioned this rationality and proposes that traders (or more generally, decision-makers who act under risk and uncertainty) are sometimes irrational and many times affected by biases [11].

A trader that sets out to exploit this irrationality and these biases can only do so by looking into the past and, thereby, also go against the hypothesis of the efficient market. Upon reading this, it should be fairly clear that making predictions in the financial markets is no trivial task, and it should be approached with humility. However, one should not be discouraged since the models proposed in [3] do provide promising or, oftentimes, positive results.

An important note about the expectation mentioned above is that the definition of a trader, provided by Paul and Baschnagel, is not limited to a human being; it might as well be an algorithm. This is important since the majority of transactions in the market are now made by algorithms. These algorithms are used in order to decrease the impact of biases and irrationality in decision making. However, the algorithms are programmed by people and are still making predictions under uncertainty, based on historical data, which means that they are by no means free of biases. Algorithms are also more prone to get stuck in a feedback loop, which has been exploited by traders in the past. An interesting example is the two Norwegians, Svend Egil Larsen and Peder Veiby, who in 2010 were accused of manipulating algorithmic trading systems. They were, however, acquitted in 2012, since the court found no wrongdoing.

The aim of this paper is to expand the research in forecasting financial time series, specifically with a deep learning approach. To achieve this, two models, which greatly differ in the approach towards the effectiveness of the market, are compared. The first model is restricted by the assumptions made on the market by the EMH and is seen as the base model. The second model is a CNN, inspired by the WaveNet structure [5], and is influenced by a model developed for audio and speech generation by researchers at Google. The models set out to predict the next day's closing price as well as the trend (either up or down) of Standard and Poor's 500 (S&P 500), a well-known stock market index comprising 500 large companies in the US.

The outline of this paper is as follows. In Section 2 we give a brief background to the theory needed and in Section 3 formulate the considered models and discuss the methodology. The results of the study are given in Section 4 with a discussion in Section 5. We conclude the paper in Section 6.

#### **2. Theoretical Background**

#### *2.1. Time Series*

When using a time series model for forecasting, one makes the assumption that future events, such as the next day's closing price of a stock, can be determined by looking at past values in the time series. Most models, however, include a random error as a factor, meaning that there is some noise in the data which cannot be explained by past values in the series.

Furthermore, the models can be categorized as parametric or non-parametric, where the parametric models are the ones most regularly used. In the parametric approach, each data point in the series is accompanied by a coefficient, which determines the impact the past value has on the forecast of future values in the series.

The linear autoregressive model of order *p*, written as *AR*(*p*), is given by

$$X\_t = c + \sum\_{i=1}^p \varphi\_i X\_{t-i} + \varepsilon\_t, \tag{1}$$

with unknown parameters *ϕ<sup>i</sup>* and *ε<sup>t</sup>* as white noise. This is one of the most well known time series models, and it is a model where the variable is regressed against itself (auto meaning "oneself", when used as a prefix). It is often used as a building block to more advanced time series models, such as the autoregressive moving average (ARMA) or the generalized autoregressive conditional heteroskedasticity (GARCH) models. However, the AR process will not be considered as a building block in the models proposed in this paper. Instead, the proposed CNN models in this paper can be represented as the nonlinear version of the AR model, or *NAR*(*p*) for short, given as

$$X\_t = c + f(X\_{t-1}, \dots, X\_{t-p}) + \varepsilon\_t, \tag{2}$$

with a nonlinear function *f*(·, . . . , ·). In fact, a large number of machine learning models, when applied to time series, can be seen as AR or NAR models. This might seem obvious to some, but it is something that is seldom mentioned in the scientific literature. Furthermore, the models can be generalized to a nonlinear autoregressive exogenous (NARX) model. Given *Z<sub>k</sub>* as an exogenous time series and *ψ<sub>k</sub>* its accompanying coefficient, we have the *NARX*(*p*,*r*) model

$$X\_t = c + f(X\_{t-1}, \dots, X\_{t-p}, Z\_{t-1}, \dots, Z\_{t-r}) + \varepsilon\_t. \tag{3}$$

When determining the coefficients in the autoregressive models, most models need the underlying stochastic processes {*X<sup>t</sup>* : *t* ≥ 0} to be stationary or at least weak-sense stationary. This means that we are assuming that the mean and the variance of *X<sup>t</sup>* are constant over time. However, when looking at historical prices in the stock market, one can clearly see that this is not the case, for either the variance or the mean. All of the above models can be generalized to handle this non-stationarity by applying suitable transformations to the series. These transformations, or integrations, are a necessity when determining the values for the coefficients, for most of the well-known methods, although this need not be the case when using a neural network [12].
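As a concrete illustration of such a transformation, the following minimal Python sketch (using only NumPy; the price values are hypothetical) shows first-order differencing and log-returns, two standard ways to bring a price series closer to stationarity:

```python
import numpy as np

# Hypothetical daily closing prices (non-stationary in mean).
prices = np.array([100.0, 101.5, 101.0, 103.2, 104.8, 104.1])

# First-order differencing removes a (locally) linear trend in the mean.
increments = np.diff(prices)            # X_t - X_{t-1}

# Log-returns additionally help stabilize the variance of a price series.
log_returns = np.diff(np.log(prices))   # log(X_t) - log(X_{t-1})

print(increments)
print(log_returns)
```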

#### *2.2. Neural Networks*

The neural network, when applied on a supervised problem, sets out to minimize a certain predefined loss function. The loss function used in this paper, when calculating the next day's closing price, was the mean absolute percentage error (MAPE)

$$
\epsilon(\mathbf{w}) = \frac{100}{n} \sum\_{i=1}^{n} \left| \frac{y\_i - t\_i}{t\_i} \right|,
$$

where *t<sub>i</sub>* is the *i*th target value and *y<sub>i</sub>* is the model's prediction. The reason for this choice is that the errors are made proportional with respect to the target value. This is important, since the mean and variance of financial series cannot be assumed to be stationary, and an unscaled error would otherwise skew the accuracy of the model, disproportionately, towards times characterized by a low mean. Meanwhile, the loss function used for classifying the next day's trend was the binary cross-entropy

$$\epsilon(\mathbf{w}) = -\frac{1}{n} \sum\_{i=1}^{n} \left( t\_i \log(y\_i) + (1 - t\_i) \log(1 - y\_i) \right).$$
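To make the two loss functions concrete, here is a minimal NumPy sketch of both (the function names and toy values are ours, purely for illustration):

```python
import numpy as np

def mape(y, t):
    """Mean absolute percentage error, as in the regression loss above."""
    y, t = np.asarray(y, float), np.asarray(t, float)
    return 100.0 / len(t) * np.sum(np.abs((y - t) / t))

def binary_crossentropy(y, t, eps=1e-12):
    """Mean binary cross-entropy; y are predicted probabilities, t in {0, 1}."""
    y, t = np.asarray(y, float), np.asarray(t, float)
    y = np.clip(y, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

# Toy example: closing-price predictions and up/down trend probabilities.
print(mape([101.0, 102.5], [100.0, 103.0]))      # ~0.743
print(binary_crossentropy([0.9, 0.2], [1, 0]))   # ~0.164
```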

The loss function is a function of the weights **w**, and the loss is minimized by choosing the weights that solve the equation

$$\frac{\partial \epsilon(\mathbf{w})}{\partial \mathbf{w}} = 0.$$

However, this algebraic solution is seldom achievable, and numerical solutions are more often used. These numerical methods set out to find points in close proximity to a local (hopefully global, but probably not) optimum.

Moreover, instead of calculating the gradient with respect to each weight individually, backpropagation uses the chain rule, where each derivative can be computed layerwise backward. This leads to a decrease in complexity, which is very important, since it is not unusual that the number of weights might be counted in thousands or in tens of thousands.

The neural network is, unless stated otherwise, considered to be a fully connected network, which means that all nodes in two adjacent layers are connected to each other. Although backpropagation did a remarkable job in decreasing the complexity, fully connected models are not good at scaling to many hidden layers. This problem can be solved by having a sparse network, which means that not all nodes are connected. The CNN model, further explained in the next section, is an example of a sparse network, where not all units are connected and where some units also share weights.

#### *2.3. Convolutional Neural Networks*

The input to the CNN, when modeling time series, is a three-dimensional tensor, i.e., (nr of observations) × (width of the input) × (nr of series). The number of series comprises the main series, over which the predictions will be made, plus optional exogenous series.

Furthermore, in the CNN model, there is an array of hyperparameters that defines the structure and complexity of the network. Below is a short explanation of the most important parameters to be acquainted with in order to understand the networks proposed in this paper.

#### 2.3.1. Activation Function

In its simplest form, when it only takes on binary values, the activation function determines if the artificial neuron fires or not. More complex activation functions are often used, and the sigmoid and tanh functions

$$\begin{aligned} g(x) &= \frac{e^{x}}{e^{x} + 1}, \\ g(x) &= \tanh(x), \end{aligned} \tag{4}$$

are two examples, which have been used to a large extent in the past. They are furthermore two good examples of activation functions that can cause the problem of vanishing gradients (studied by Sepp Hochreiter in 1991, but further analyzed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber [13]), which of course is something that should be avoided. A function that does not exhibit this problem is the rectified linear unit (ReLU) function

$$g(x) = \begin{cases} 0 & \text{if } x \le 0, \\ x & \text{otherwise}, \end{cases}$$

which has gained a lot of traction in recent years, and is today the most popular one for deep neural networks. One could easily understand why ReLU avoids the vanishing gradient problem, by looking at its derivative

$$g'(x) = \begin{cases} 0 & \text{if } x \le 0, \\ 1 & \text{otherwise}, \end{cases}$$

and from it conclude that the gradient is either equal to 0 or 1. However, the derivative also shows a different problem that comes with the ReLU function, which is that the gradient might equal zero, and that the output from many of the nodes might in turn become zero. This problem is called the dead ReLU problem, and it might cause many of the nodes to have zero impact on the output. This can be solved by imposing minor modifications on the function, and it therefore now comes in an array of different flavors. One such flavor is the exponential linear unit (ELU)

$$g(x) = \begin{cases} \alpha(e^{x} - 1) & \text{if } x \le 0, \\ x & \text{otherwise}, \end{cases}$$

where the value of *α* is often chosen to be between 0.1 and 0.3. The ELU solves the dead ReLU problem, but it comes with a greater computational cost. A variant of the ELU is the scaled exponential linear unit (SELU)

$$g(x) = \lambda \begin{cases} \alpha(e^{x} - 1) & \text{if } x \le 0, \\ x & \text{otherwise,} \end{cases} \tag{5}$$

which is a relatively new activation function, first proposed in 2017 [14]. The values of *α* and *λ* have been predefined by the authors, and the activation also needs the weights in the network to be initialized in a certain way, called lecun\_normal. With lecun\_normal initialization, the start value for each weight is drawn from a normal distribution with mean zero and a standard deviation scaled by the number of input units of the layer.

Normalization can be used as a preprocessing of the data, due to its often positive effect on the model's accuracy, and some networks also implement batch normalization at some point or points inside the network. This is what is called external normalization. However, the beauty of SeLU is that the output of each node is normalized, and this process is fittingly called internal normalization. Internal normalization proved to be more useful than external normalization for the models in this paper, which is why SeLU was used throughout the network in the final models.
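For reference, a small NumPy sketch of the three activation functions discussed above (the SELU constants are the ones fixed in [14]; the sample inputs are arbitrary):

```python
import numpy as np

def relu(x):
    return np.where(x <= 0, 0.0, x)

def elu(x, alpha=0.1):                    # alpha typically between 0.1 and 0.3
    return np.where(x <= 0, alpha * (np.exp(x) - 1.0), x)

def selu(x, alpha=1.67326, lam=1.05070):  # constants predefined in [14]
    return lam * np.where(x <= 0, alpha * (np.exp(x) - 1.0), x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))
print(elu(x))
print(selu(x))
```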

#### 2.3.2. Learning Rate

The learning rate, often denoted by *η*, plays a large role during the training phase of the models. After each iteration, the weights are updated by a predefined update rule such as gradient descent

$$\mathbf{w}\_{i+1} = \mathbf{w}\_i - \eta \nabla \epsilon(\mathbf{w}\_i),$$

where ∇*ε*(**w**<sub>*i*</sub>) is the gradient of the loss function at the *i*th iteration. The learning rate, *η*, can here be seen as determining the rate of change in every iteration. Gradient descent is but one of many update rules, or optimizers (as they are more often called), and it is by far one of the simplest. More advanced optimizers are often used, such as the adaptive moment estimation (Adam) [15], which has, as one of its perks, individual learning rates for each weight. The discussion about optimizers will not continue further in this paper, but it should be clear that the value of the learning rate and the choice of optimizer have a great impact on the overall performance of the model.
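As a minimal sketch of the update rule above (the quadratic toy loss is our own choice, purely to show the iteration converging):

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, n_iter=100):
    """Plain gradient descent: w_{i+1} = w_i - eta * grad(w_i)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iter):
        w = w - eta * grad(w)
    return w

# Toy loss eps(w) = (w - 3)^2 with gradient 2(w - 3);
# the iterates converge towards the minimizer w = 3.
print(gradient_descent(lambda w: 2.0 * (w - 3.0), w0=[0.0]))   # ~[3.]
```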

#### 2.3.3. Filters

The filter dimensions need to be determined before training the model, and the appropriate dimensions depend on the underlying data and the model of choice. When analyzing time series, the filter needs to be one dimensional, since the time series is just an ordered sequence. The developer needs to determine just two things: the width of the filters (Figure 1 shows a filter with the width equal to two) and then how many filters to use for each convolutional layer. The types of features that the convolutional layer "searches" for are highly influenced by the filter dimensions, and having multiple filters means that the network can search for more features in each layer.

**Figure 1.** Dilated convolutional layers for an input series of length 16.

#### 2.3.4. Dilation

A dilated convolutional filter is a filter that, not surprisingly, is widened but still uses the same number of parameters. The filter is widened by neglecting certain inputs, and an example of this can be observed in Figure 1. The bottom layer represents the input, in the form of a time series *x* = (*x*<sub>1</sub>, *x*<sub>2</sub>, . . . , *x<sub>n</sub>*) for some time *n*, onto which repeated dilated convolutions, with increasing dilation rates, are applied; the filter width is again set to equal two in the observed model. The first hidden layer applies dilated convolutions with the dilation rate equal to one, meaning that the layer applies the filter onto two adjacent elements, *x<sub>i</sub>* and *x*<sub>*i*+1</sub>, of the input series. The second layer applies dilated convolutions, with the rate now set to equal two, which means that the filter is applied onto elements *x<sub>i</sub>* and *x*<sub>*i*+2</sub> (notice here that the number of parameters remains the same, but the filter has been "widened"). Lastly, the third and fourth layers have rates equal to four and eight, so the filter is applied onto elements *x<sub>i</sub>* and *x*<sub>*i*+4</sub>, and *x<sub>i</sub>* and *x*<sub>*i*+8</sub>, respectively.
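A sketch of the dilated structure in Figure 1, written with the Keras API that the paper later uses (the layer sizes here are illustrative assumptions, not the authors' settings):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Four stacked dilated causal convolutions over a 16-step input series,
# mirroring Figure 1: dilation rates 1, 2, 4 and 8 with a filter width of two.
inputs = keras.Input(shape=(16, 1))
x = inputs
for rate in (1, 2, 4, 8):
    x = layers.Conv1D(filters=32, kernel_size=2, dilation_rate=rate,
                      padding="causal", activation="selu")(x)

model = keras.Model(inputs, x)
model.summary()   # the receptive field of the top layer covers all 16 inputs
```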

#### 2.3.5. Dropout

The dropout factor is a way to prevent the model from overfitting to the training data, and it does this by setting a fraction of the weights in a certain layer to zero. This leads to a decrease in complexity, but the developer does not have control over which nodes will be set to zero (i.e., the weights are chosen at random). Hence, dropout is not the same as changing the number of nodes in the network. For more details, see [16].

#### *2.4. WaveNet*

The CNN models proposed in this paper are inspired by the WaveNet structure, modeled by van den Oord et al. in 2016 [5]. The main part of the network in a WaveNet can be visualized in Figure 2, which incorporates a dilated (and causal) convolution and a 1 × 1 convolution (i.e., the width of the filter set to equal one). The input from the left side is the result of a causal convolution, with filter size equal to two, which has been applied to the original input series as a sort of preprocessing. The output on the right side of the layer is the residual, which can be used as the input to a new layer, with an identical set up. The number of residual connections must be predetermined by the developer, but the dilated convolution also sets an upper limit on how many connections can be used. Figure 1 displays repeated dilations on an input series with length equal to 16, and we can see that the number of layers has an upper limit of four.

**Figure 2.** Overview of the residual layer, when only the main series is used as input.

Furthermore, the output from the bottom of each layer is the skip, which is the output that is passed on to the following layers in the network. If four layers are used, as in Figure 1, then the network would end up with four skip connections. These skip connections are then added (element-wise) together to form a single output series. This series is then passed through two 1 × 1 convolutions, and the result of this will be the output of the model.

The WaveNet has three important characteristics: it is dilated, causal and has residual connections. This means that the network is sparsely connected, that calculations can only include previous values in the input series (which can be observed in Figure 1) and that information is preserved across multiple layers. The sparsity is also further increased by having the width of the filters equal to only either one or two.

The WaveNet sets out to maximize the joint probability of the series **x** = (*x<sub>t</sub>*, . . . , *x*<sub>*t*−*p*</sub>)<sup>*T*</sup>, for any time *t* and length equal to *p*, which is factorized as a product of conditional probabilities

$$p(\mathbf{x}) = \prod\_{i=1}^{t} p(x\_i | x\_1, \dots, x\_{i-1}),$$

where the conditional probability distributions are modeled by convolutions. Furthermore, the joint probability can be generalized to include exogenous series

$$p(\mathbf{x}|\mathbf{h}) = \prod\_{i=1}^{t} p(x\_i | x\_1, \dots, x\_{i-1}, h\_1, \dots, h\_{i-1}),$$

where **h** = (*h<sub>t</sub>*, *h*<sub>*t*−1</sub>, . . . , *h*<sub>1</sub>)<sup>*T*</sup> is the exogenous series.

The WaveNet, as proposed by the authors, uses a gated activation unit on the output from the dilated convolution layer in Figure 2

$$\mathbf{z} = \tanh(\mathbf{w}\_{t,k} \ast \mathbf{x}) \odot \sigma(\mathbf{w}\_{s,k} \ast \mathbf{x}),$$

where ∗ is a convolution operator, ⊙ is an element-wise multiplication operator, *σ* is a sigmoid function, **w**<sub>∗,*k*</sub> are the weights for the filters and *k* denotes the layer index. However, the model proposed in this paper will be restricted to use only a single activation function

$$\mathbf{z} = \mathrm{SeLU}(\mathbf{w}\_k \ast \mathbf{x}) \tag{6}$$

and the reason behind this is, again, that the gated activation function did not generalize well to the analyzed time series data.

When using an exogenous series to help improve the predictions, the authors introduce two different ways to condition the main series by the exogenous series. The first way, termed global conditioning, uses a conditional latent vector **l** (not dependent on time), accompanied by a filter **v**<sub>*k*</sub>, and can be seen as a type of bias that influences the calculations across all timesteps

$$\mathbf{z} = \mathrm{SeLU}(\mathbf{w}\_k \ast \mathbf{x} + \mathbf{v}\_k \ast \mathbf{l}).$$

The other way, termed local conditioning, uses one or more conditional time series **h** = (*h<sub>t</sub>*, *h*<sub>*t*−1</sub>, . . . , *h*<sub>1</sub>)<sup>*T*</sup>, which again influence the calculations across all timesteps

$$\mathbf{z} = \mathrm{SeLU}(\mathbf{w}\_k \ast \mathbf{x} + \mathbf{v}\_k \ast \mathbf{h}) \tag{7}$$

and this is the approach that has been taken in this paper. This approach can further be observed in Figure 3.

**Figure 3.** Overview of the residual layer, when the main series is conditioned by an exogenous series.

Lastly, the WaveNet originally used a softmax activation function on the last output of the network, with the target values (raw audio) quantized into 256 different values. However, the softmax did not generalize well to the predictions for the financial time series used here, where using no activation function performed better when predicting the next day's closing price of the S&P 500. In addition, when classifying the trend, the activation on the last output was seen as a hyperparameter, where the sigmoid and SeLU activations were compared against each other.

#### *2.5. Walk-Forward Validation*

Walk-forward validation, or walk-forward optimization, was suggested by Robert Pardo [17] and was brought forward since the ordinary cross-validation strategy is not well suited for time series data. The reason cross-validation is not optimal for time series data is that temporal correlations exist in the data, and it should then be considered as "cheating" if one were to use future data points to predict past data points. This, most likely, leads to a lower training error, but should result in a higher validation/test error, i.e., it leads to poorer generalization due to overfitting. In order to avoid overfitting, the model should then, when making predictions at (or past) time *t*, only be trained on data points that were recorded before time *t*.

Depending on the data and the suggested model, one may choose between using all past observations (until the time of prediction) or using a fixed number of most recent observations as training data. The walk-forward scheme, using only a fixed number of observations, can be observed in Figure 4.

**Figure 4.** Walk-forward validation with five folds.

#### *2.6. Efficient Market Hypothesis*

Apart from the three sufficient assumptions, Eugene Fama (who can be seen as the father of the modern EMH) lays out in [9] three different types of tests for the theory: the weak form, where only past price movements are considered; the semi-strong form, where other publicly available information is included, such as quarterly or annual reports; and the strong form, where some actors might have monopolistic access to relevant information. The tests done in this paper, outlined in the introduction, are clearly of the weak form.

Fama also brings to light three models that have historically been used to explain the movements of asset prices in an efficient market: the fair game model, the submartingale and the random walk. The fair game is by far the most general of the three, followed by the submartingale and then the random walk. However, this paper does not seek to explain the movements of the market, but merely to predict them, which means that any of the models can be used as the base model in the tests ahead.

Given the three assumptions on the market, the theory indicates that the best guess for any price in the future is the last known price (i.e., the best guess for tomorrow's price of an asset is the price of that asset today). This can be altered to include a drift term, which can be determined by calculating the mean increment for a certain number of past observations, and the best guess then changes to be the last known price added with that mean increment.

#### **3. Model Formulations and Methodology**

#### *3.1. Base Model*

The base model in this paper, when predicting the next day's closing price, was chosen to be that of a random walk, which can be expressed as an AR(1) process

$$X\_t = c + \varphi\_1 X\_{t-1} + \varepsilon\_t,$$

where *ϕ*<sub>1</sub> needs to equal one and *ε<sub>t</sub>* is here again white noise, which accounts for the random fluctuations of the asset price. The *c* parameter is the drift of the random walk, and it can be determined by taking the mean of the *k* previous increments

$$c = \frac{1}{k} \sum\_{j=1}^{k} \left( X\_{t-j} - X\_{t-j-1} \right).$$

The best guess of the next day's closing price is obtained by taking the expectation of the random walk model (*ϕ*<sub>1</sub> equal to one)

$$E(X\_t) = E(c + X\_{t-1} + \varepsilon\_t) = c + X\_{t-1}, \tag{8}$$

which is the prediction that the base model used.
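A minimal sketch of this base model in Python (the price values and the choice of *k* are hypothetical):

```python
import numpy as np

def random_walk_forecast(x, k=10):
    """Best guess for X_{t+1} under the random-walk model with drift:
    the last observed price plus the mean of the k previous increments."""
    x = np.asarray(x, dtype=float)
    c = np.mean(np.diff(x[-(k + 1):]))   # drift from the k most recent increments
    return c + x[-1]                     # E(X_{t+1}) = c + X_t, Equation (8)

# Hypothetical closing prices; the forecast extrapolates the recent drift.
prices = np.array([100.0, 100.4, 101.1, 100.9, 101.6, 102.0])
print(random_walk_forecast(prices, k=5))   # 102.0 + 0.4 = 102.4
```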

In the case of predicting the trend, the base model implemented a passive buy-and-hold strategy, which means that the model always predicts the trend to be up.

#### *3.2. CNN Model*

As stated in the introduction, a CNN model, inspired by the WaveNet, was compared to the base model, and two different approaches were needed in order to answer the research questions. The first approach was to structure the CNN as a univariate model, which only needed to be able to handle a single series (the series to make predictions over). This model can be expressed as a NAR model, which can be observed by studying Equation (2). Each element *x<sub>t</sub>* in the sequence is determined by a non-linear function *f* (the CNN in this case), which takes the past *p* elements in the series as input. The second approach was to structure the CNN as a multivariate model, which needed to be able to handle multiple series (the series to make predictions over, together with exogenous series). This model, on the other hand, can be expressed as a NARX model, which can be observed by studying Equation (3). Again, each *x<sub>t</sub>* is determined by a non-linear function *f*, which here takes the past *p* and *r* elements in the main and exogenous series as inputs.

The NAR and NARX models were here, for convenience, named the single- and multi-channel models. However, two different variants of the multi-channel model were tested in order to compare the different structures implemented in the original WaveNet paper and in [7,8].

As was mentioned in the introduction, the difference between the two variants is how the residuals from the exogenous series are handled. The implementation found in the WaveNet paper takes into account both the main series residuals as well as the exogenous series residuals, while the implementation in the second variant only takes into account the main series residuals. This can be visualized by observing Figure 3 and then ignoring the residual for each exogenous series. This, in turn, leads to each residual layer beyond the first layer having a similar structure to that of Figure 3. The two variants were named the multi-channel-sparsely-connected model (multi-channel-sc) and the multi-channel-fully-connected model (multi-channel-fc) in this paper. Furthermore, all models adopted a dropout factor of 0.2 for all layers, since this led to better performance for all three models.

The three models (the single-channel as well as the two variants of the multi-channel) can be further observed in Figure 5, which is a side view of the dilated layer shown in Figure 1. (a) represents the single-channel model, where no exogenous series are considered; therefore, the model only has to handle a single residual. (b) represents the multi-channel-sc model, where exogenous series are considered, but again, only a single residual is taken into account, while (c) represents the multi-channel-fc, where all residuals are taken into account.

**Figure 5.** Side view of the dilated convolutional layers in Figure 1. (**a**) Only the main series is used; (**b**,**c**) when the main series is conditioned by an exogenous series, as in the model proposed in [7,8] and in the original WaveNet [5].

#### *3.3. Data Sampling and Structuring*

The financial data, for the single-channel model as well as the more complex multi-channel models, were collected from Yahoo! Finance. The time interval between the observations was chosen to equal a single day, since the objective was to predict, for any given time, the next day's closing price and the next day's trend (i.e., the time series **x** = (*x<sub>t</sub>*, . . . , *x*<sub>*t*−*p*</sub>)<sup>*T*</sup>, at any time *t*, was used to predict *x*<sub>*t*+1</sub> in the first case and *y*<sub>*t*+1</sub> in the second case, where *y*<sub>*t*+1</sub> ∈ {0, 1}).

For the single-channel model, the series under consideration, at any time *t*, was **x** = (*x<sub>t</sub>*, . . . , *x*<sub>*t*−*p*</sub>)<sup>*T*</sup>, which is composed of ordered closing prices from the S&P 500. In the multi-channel models, different combinations of ordered OHLC (open, high, low and close) prices of the S&P 500, VIX (implied volatility index of the S&P 500), TY (10 year treasury note) and TYVIX (implied volatility index of TY) were considered. As mentioned before, the closing price of the S&P 500 was the main series, while the other series **Z** = (**z**<sub>1</sub>, . . . , **z**<sub>*m*</sub>) were the exogenous series, where **z**<sub>*i*</sub> = (*z*<sub>*i*,*t*</sub>, . . . , *z*<sub>*i*,*t*−*r*</sub>)<sup>*T*</sup> for every *i*. The values of *p* and *r* determine the orders of the NAR and NARX models, and different values will be tested during the validation phase. However, only combinations where *p* and *r* are equal will be tested, and *p* will therefore be used to denote the length of both the main and exogenous series in the continuation of this paper.

The time span of the observations was chosen to be between the first day in 2010 and the last day in 2019, which resulted in 2515 × *m* observations, where again *m* denotes the number of exogenous input series. Furthermore, since the models require *p* preceding observations, **x** = (*x<sub>t</sub>*, . . . , *x*<sub>*t*−*p*</sub>)<sup>*T*</sup>, to make a prediction and then an additional observation, *x*<sub>*t*+1</sub>, to evaluate this prediction, the number of time series that could be used for predictions was decreased to (2515 − *p* − 1) × *m*. These observations were then structured into time series, resulting in a tensor with dimension (2515 − *p* − 1) × *p* × *m*.
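A sketch of this windowing step in Python (the column order and the value of *p* are our assumptions for illustration):

```python
import numpy as np

def build_windows(series, p):
    """Turn an array of shape (T, m) into inputs of shape (T - p - 1, p, m)
    and next-day targets for the main series (assumed to be column 0)."""
    T, m = series.shape
    X = np.stack([series[i:i + p] for i in range(T - p - 1)])
    y = series[p:T - 1, 0]   # observation immediately after each window
    return X, y

# Hypothetical data: 2515 trading days, m = 2 series (S&P 500 close + VIX).
data = np.random.rand(2515, 2)
X, y = build_windows(data, p=12)
print(X.shape, y.shape)   # (2502, 12, 2) (2502,)
```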

The resulting tensor was then divided into folds of equal size, which were used in order to implement the walk-forward scheme. The complete horizontal bars in Figure 4 should here be seen as the whole tensor, while the subsets consisting of the blue and red sections are the folds. The blue and red sections (training and test set of that particular fold) should be seen as a sliding window that "sweeps" across the complete set of time series. Whenever the model is done evaluating a certain fold, the window sweeps a specific number of steps (determined by the size of the test set) in time in order to evaluate the next fold.

By further observing Figure 4, it should become clear that the number of folds is influenced by the size of the training and test sets. The size of each fold could (and most likely should) be seen as a hyperparameter. However, in the interest of time, the size of each fold was set to 250 series, which means that each fold had a dimension of 250 × *p* × *m*. Each fold was then further divided into a training set (the first 240 series) and a test set (the last 10 series), where the test set was used to evaluate the model for that particular fold.

The sizes chosen for the training and test sets resulted in 226 folds. These folds were then split in half, where the first half was used in order to validate the models (i.e., to determine the most appropriate hyperparameters), and the second half was used to test the generalization of the optimal model found during the validation phase.
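As a sketch, the walk-forward folds described above can be generated as index ranges (the function is ours; with 2502 windowed series, i.e., *p* = 12, it reproduces the 226 folds mentioned in the text):

```python
def walk_forward_folds(n_series, fold_size=250, test_size=10):
    """Yield (train, test) index ranges for a sliding walk-forward scheme:
    each fold holds fold_size consecutive series, and the window then
    advances by test_size steps, as in Figure 4."""
    train_size = fold_size - test_size
    start = 0
    while start + fold_size <= n_series:
        train = range(start, start + train_size)
        test = range(start + train_size, start + fold_size)
        yield train, test
        start += test_size

# With 2502 windowed series this yields 226 folds of 240/10 series each.
folds = list(walk_forward_folds(2502))
print(len(folds))   # 226
```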

One last note about the data sampling is that when predicting the closing price, the time series were the original prices collected from Yahoo!, while when predicting the trend, the time series were changed to model the increments each day.

#### *3.4. Validation and Backtesting*

During the validation phase, different values for the length of the input series (i.e., the value of *p*), the number of residual connections (i.e., number of layers stacked upon each other, see Figures 1 and 2) and the number of filters (explained in the theory section for CNNs) in each convolutional layer were considered. The values considered for *p* were 4, 6, 8 and 12, while the number of layers considered were 2 and 3, and the number of filters considered were 32, 64 and 96. When classifying the trend, the activation function applied to the last output was seen as a hyperparameter as well, and the sigmoid and SeLU activations were considered.

For the multi-channel models, all permutations of different combinations of the exogenous input series were considered. However, it was only for the multi-channel-fc model that the exogenous series were seen as a hyperparameter. The optimal combination of exogenous series for the multi-channel-fc model was then chosen for the multi-channel-sc model as well. One final note regarding the hyperparameters is that the dilation rate was set to a fixed value equal to two, which is the same rate as was proposed in the original WaveNet model, and the resulting dilated structure can be observed in Figure 1.

As was stated in the previous section, the validation was made on the first 113 folds. The overall mean for the error of these folds, for each combination of the hyperparameters above, was used in order to compare the different models, and the model with the lowest error was then used during the backtesting.

The batch size was set to equal one for all models, while the number of epochs was set to 300 in the single-channel model and 500 in the multi-channel models. The difference in epochs is here due to the added complexity that the exogenous series bring. An important note regarding the epochs and the evaluation of the models is that the model state associated with the epoch with the lowest validation/test error was ultimately chosen. This means that if a model made predictions over 300 epochs, but the lowest validation/test error was found during epoch 200, the model state (i.e., the value of the model's weights) associated with epoch 200 was chosen as the best performing model for that particular fold.

#### *3.5. Developing the Convolutional Neural Network*

The networks were implemented using the Keras API, from the well known open-source library TensorFlow. Keras provides a range of different models to work with, where the most intuitive might be the Sequential model, where developers can add/stack layers and then compile them into a network. However, the Sequential model does not provide enough freedom to construct the complexity introduced in the residual and skip parts of the WaveNet. The Keras functional API (more information regarding the Keras functional API can be found in the official Keras documentation, https://keras.io/models/model/) might be less intuitive at first, but it does provide more freedom, since the order of the layers in the network is defined by having the output of every layer explicitly passed as an input parameter to the next layer in the network.

Furthermore, Keras comes with TensorFlow's library of optimizers, which are used in order to estimate the parameters in the model and are passed as an input parameter when compiling the model. The optimizer used here was the Adam optimizer, and the learning rate was set to equal 0.0001.
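To make this concrete, here is a simplified sketch of one residual layer (cf. Figures 2 and 3) wired together with the Keras functional API and compiled with Adam at the stated learning rate; the layer sizes, the single skip path per layer and the output head are our own illustrative simplifications, not the authors' exact architecture:

```python
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, rate, n_filters=32):
    # Dilated causal convolution with SeLU activation, as in Equation (6).
    z = layers.Conv1D(n_filters, kernel_size=2, dilation_rate=rate,
                      padding="causal", activation="selu",
                      kernel_initializer="lecun_normal")(x)
    skip = layers.Conv1D(n_filters, kernel_size=1)(z)   # 1 x 1 convolution
    residual = layers.Add()([x, skip])                  # residual output
    return residual, skip

inputs = keras.Input(shape=(12, 1))                     # p = 12 time steps
x = layers.Conv1D(32, 2, padding="causal")(inputs)      # preprocessing convolution
skips = []
for rate in (1, 2, 4):                                  # three residual layers
    x, s = residual_block(x, rate)
    skips.append(s)

out = layers.Add()(skips)                               # element-wise sum of skips
out = layers.Conv1D(1, 1, activation="selu")(out)
out = layers.Conv1D(1, 1)(out)                          # no final activation
out = layers.Dense(1)(layers.Flatten()(out))            # next-day closing price

model = keras.Model(inputs, out)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0001),
              loss="mean_absolute_percentage_error")
model.summary()
```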

#### **4. Results**

#### *4.1. Validation*

Table 1 displays the validation error when predicting the next day's closing price, while Table 2 displays the validation accuracy when classifying the trend. The *p* is again the length of the input time series, while *l* is the number of layers in the residual part of the network (see Figure 1), and *f* is the number of filters used in each convolutional layer. In Table 2, *a* denotes the used activation function.


**Table 1.** Validation error for the different models, where *p* is the length of the time series, *l* is the number of residual layers and *f* is the number of filters.


**Table 2.** Validation accuracy for the different models, where *a* is the activation function, *p* is the length of the time series, *l* is the number of residual layers and *f* is the number of filters.

The lowest validation error was achieved with *p*, *l* and *f* equal to 12, 3 and 96 for the single-channel model, while 8, 3, 64 and 8, 3, 32 were the optimal parameters for the multi-channel-sc model and the multi-channel-fc model, respectively. The highest validation accuracy was achieved with SeLU as the activation for all models and *p*, *l* and *f* equal to 12, 3 and 96 for the single-channel model, while 8, 3, 96 and 12, 3, 96 were the optimal parameters for the multi-channel-sc model and the multi-channel-fc model, respectively.

The validation error and validation accuracy for the multi-channel models are displayed only for the best combination of exogenous series found for the multi-channel-fc model, which proved to be just the highest daily value of the VIX in both cases.

#### *4.2. Testing*

Figure 6 shows the cumulative mean of the test error for all 113 test folds, while Figure 7 shows the cumulative mean for the last 50. These two figures paint two different pictures of the single-channel and multi-channel models. The means across all 113 folds were 0.5793, 0.4877, 0.4707 and 0.4621 for the base, single-channel, multi-channel-sc and multi-channel-fc models, respectively, while the means across the last 50 were 0.6572, 0.5468, 0.5250 and 0.5416. By looking at these numbers, one can see that the performance of the multi-channel-fc model relative to the base model, when predicting the next day's closing price, was worse in the last 50 folds than for all 113 folds, while the reverse can be said about the single-channel and multi-channel-sc models.

**Figure 6.** Cumulative mean of the MAPE for all 113 test folds.

**Figure 7.** Cumulative mean of the MAPE for the last 50 test folds.

Figures 8 and 9, on the other hand, show the cumulative mean of the classification accuracy for all 113 test folds and for the last 50 test folds. Again, the multi-channel-fc model started out well, but its performance compared to the single-channel and multi-channel-sc models worsened over time. The means across all 113 test folds were 0.5442, 0.7504, 0.7451 and 0.7496 for the base, single-channel, multi-channel-sc and multi-channel-fc models, respectively, while the means for the last 50 test folds were 0.5600, 0.7580, 0.7600 and 0.7360.

**Figure 8.** Cumulative mean for the accuracy for all 113 test folds.

**Figure 9.** Cumulative mean for the accuracy for the last 50 test folds.

Both the single-channel and multi-channel models outperformed the base model over the test folds, which accounts for almost five years of observations. Furthermore, the multi-channel models clearly performed better than the single-channel model, when looking at the performance across all test folds. However, the positive effects of including the exogenous series seemed to wear off in the last folds for the complex multi-channel-fc model, while it actually increased for the simpler multi-channel-sc model. This suggests that the problem of generalization for the multi-channel-fc model probably lies in that the relationship between the main series and the exogenous series has been altered, which, interestingly enough, only affects the more complex model.

Figure 10 shows the gains for the models when applied to the test data. Both the single- and multi-channel models clearly outperformed the passive buy-and-hold strategy, which is to be expected, since the test accuracy for the base model was well below that of the other models. The gains were 1.5596, 26.3642, 25.0388 and 26.9360 for the base, single-channel, multi-channel-sc and multi-channel-fc models, respectively, meaning that if one were to implement any of the WaveNet inspired models, during the specified period, he or she would have a profit of more than 24 times the original amount.

**Figure 10.** The calculated gain across all test folds, for all of the four models.
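The paper does not spell out the exact trading rule behind the gain metric; a minimal sketch under one plausible reading (take a long position on days predicted "up", stay out otherwise, and compound the realized daily returns) could look as follows:

```python
import numpy as np

def cumulative_gain(trend_pred, returns):
    """Compound daily returns, taking a long position only on 'up' predictions.
    This trading rule is an assumption, not the authors' stated definition."""
    gain = 1.0
    for up, r in zip(trend_pred, returns):
        if up:
            gain *= 1.0 + r
    return gain

# Toy example: two correct 'up' calls and one day spent out of the market.
print(cumulative_gain([1, 0, 1], [0.02, -0.01, 0.03]))   # ~1.0506
```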

Lastly, while the multi-channel-fc model outperformed all other models across all folds, it is also of interest, for further work, to see in which settings the multivariate model performed the best and the worst. Figures 11 and 12 give an example of these settings, showing the folds for which the multi-channel-fc model outperformed (fold 139, i.e., test fold 25) and underperformed (fold 148, i.e., test fold 34) the most against the base model.

**Figure 11.** Predictions for the 10 test observations in test fold 25.

**Figure 12.** Predictions for the 10 test observations in test fold 34.

#### **5. Discussion**

There is no real scientific basis for having the training size equal to 240 and the validation/test size equal to 10, although this did perform better than having the sizes equal to 950 and 50, respectively. It might seem odd to someone with little or no experience in analyzing financial data that one would choose to put a limit on the training size, and that the models evidently perform better using fewer observations, since having a larger set of training observations is generally seen as a good thing. However, the financial markets are ever-changing, and the predictors (the past values in the time series) usually change with them. New paradigms find their way into the markets, while old paradigms may lose their impact over time. These paradigms can be imposed by certain events, such as an increase in monetary spending, the infamous Brexit or the current Covid-19 pandemic (especially the response, by the governments and central banks, to the pandemic). Paradigms can also be recurrent, such as the ones that are imposed by where we are in the short- and long-term debt cycle. Because of these shifts, developers are restricted in how far back in time they can look and, therefore, need to put restrictions on the training size.

The paper brings forward two very positive results, which are that the predictions for the next day's closing price as well as the trend are made significantly better by the CNN models. However, the fact that the performance of the more complex multi-channel model decreases over time, against the other models, for both predicting the closing price and the trend, is indeed concerning. This became obvious when studying the change in performance between all 113 and the last 50 test folds. If both the multi-channel-sc and multi-channel-fc models performed worse in the last 50 folds, then it would have been easy to again "blame" the temporal structure of the financial data and, more specifically, the temporal dependencies between the main and exogenous series. However, only the more complex multi-channel model's performance degraded, which means that the complexity (i.e., the intermingling between the series in all residual layers) is the primary issue. A solution might then be to treat the complexity as a hyperparameter as well and not differentiate between the two structures, as was done here. In other words, the two models might more appropriately be seen as two extreme cases of the same model, in a similar way as having the number of filters set to 32 and 96 (see Table 1; 32 and 96 are the extreme cases for the number of filters). By looking at Figure 5 (with *p* equal to 16 in this case), one can see that the hyperparameter for the complexity has two more values to choose from (having the exogenous series directly influence the second and third hidden layers). It would also be appropriate to compare the models against different time frames and asset classes to see if the less complex model indeed generalizes better over time, or if the result here was just a special case. However, viewing the complexity as a hyperparameter could prove to be beneficial in both cases.

The tests made in this paper were not primarily intended to judge the suitability of the models as trading systems, but rather whether a deep learning approach could perform better than a very naive base model. However, the multi-channel models outperformed the base model by more than 20%, and this difference is quite significant and begs the question of what changes could be made in order for the model to be used as a trading system. While most of the predictions in fold 25, Figure 11, are indeed very accurate, the predictions in fold 34, Figure 12, would be disheartening for any trader to see if the model were to be used as a trading system. This suggests that one should try to look for market conditions, or patterns, similar to the ones that were associated with low error rates in the training data. This could be done by clustering the time series, in an unsupervised manner, and then assigning a score to each class represented by the clusters, where the score can be seen as the probability that the model makes good predictions during the conditions specific to that class. A condition classed in a cluster with a high score, such as the pattern in Figure 11, would probably prompt the trader to trust the system and to take a position (either long or short, depending on the prediction), while a condition classed in a cluster with a lower score would prompt the trader to stay out of the market or to follow another trading system that particular day.

#### **6. Conclusions**

The deep learning approach, inspired by the WaveNet structure, proved successful in extracting information from past price movements of the financial data. The result was a model that outperformed a naive base model by more than 20%, when predicting the next day's closing price, and by more than 37% when predicting the next day's trend.

The performance of the deep learning approach is most likely due to its exceptional ability to extract non-linear dependencies from the raw input data. However, as the field of deep learning applied to the financial market progresses, the predictive patterns found in the data might become increasingly hard to find. This would suggest that the fluctuations in the market would increasingly come to mirror a system where the only predictive power lies in the estimation of the forces acting on the objects, which are heavily influenced by the current sentiment in the market. A way to extract the sentiment, at any given moment, might be to analyze unstructured data extracted from, for example, multiple news sources or social media feeds. Further study in text mining, applied to financial news sources, might therefore be merited and might be an area that will become increasingly important to the financial sector in the future.

**Author Contributions:** Conceptualization, M.S. and L.B.; methodology, L.B.; software, L.B.; validation, L.B.; data curation, L.B.; writing—original draft preparation, L.B.; writing—review and editing, M.S.; project administration, M.S. and L.B.; funding acquisition, M.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors would like to thank the anonymous reviewers for several valuable and helpful suggestions and comments to improve the presentation of the paper.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
