*Proceeding Paper* **Price Dynamics and Measuring the Contagion between Brent Crude and Heating Oil (US-Diesel) Pre and Post COVID-19 Outbreak †**

**Claudio Marcio Cassela Inacio, Jr. \*,‡ and Sergio Adriani David \*,‡**

Institute of Mathematics and Computer Science, University of São Paulo, Av. Trabalhador São-Carlense 400, São Carlos 13566-590, Brazil


**Abstract:** The objective of this work is to analyze the price dynamics and the level of association between Brent crude oil prices and heating oil (HO), i.e., US diesel. The data series are obtained from daily future contract prices of the Chicago Mercantile Exchange (CME) group exchanges and the Intercontinental Exchange (ICE). A continuous evaluation of the Detrended Cross-Correlation Analysis (DCCA) between Brent crude oil prices vis-a-vis HO is proposed by means of the rolling window approach, allowing a dynamic analysis of their cross-correlations covering two periods, namely from January 2018 to December 2019 (before the COVID-19 pandemic) and from January 2020 to December 2021 (during the COVID-19 pandemic). The results indicate that there is strong evidence of contagion in the cross-correlation due to the initial impact of the pandemic, but the HO–Brent correlation fully recovered after approximately 200 days. However, lower time scales (*n*) are also sensitive to supply shortages in the short term and can be more reliable for agents that might not take long positions. Measuring this dynamic cross-correlation can provide useful information for investors and agents in the oil and energy markets.

**Keywords:** cross-correlation; DCCA method; oil derivatives; energy

#### **1. Introduction**

Since the first propositions about the relationship between oil prices and economic activity by Hamilton [1], a significant number of researchers have dedicated themselves to exploring the connection between variations in oil prices and their effects on global economic activity. According to Zhang, Lai and Wang [2], oil is a resource known for large price fluctuations, where price increases usually cause an increase in inflation and harm the economies of importing countries. On the other hand, price drops usually cause economic recessions and political instability in exporting countries, as their economic development can be jeopardized or delayed. In addition to price levels, another relevant factor is their volatility, since a relatively small increase can cause considerable economic losses [3]. Oil price variations are influenced by several factors. The dynamics between supply and demand is one of the main factors that affect price movement, which is also sensitive to exogenous factors such as the weather and irregular events [4,5] and also to political aspects and the expectations of market agents [6,7]. Such factors make the price movement non-linear and non-stationary, which makes its analysis more challenging and an important strategy for importers, exporters, investors and governments. While crude oil prices have historically been a fundamental component of economic analysis, the variation in crude oil prices also affects a country's economy and politics [8]. For this reason, it is pertinent to understand how crude oil prices relate to those of its derivatives.


In this context, the objective of this work is to analyze the relationship between Brent crude and heating oil (US diesel) prices, covering two periods. The first period (P1) precedes the COVID-19 crisis and includes data from January 2018 to December 2019. The second (P2) addresses the period from January 2020 to December 2021, covering the COVID-19 crisis period. The present study expands the existing literature, empirically examining the relationship between the price of oil and its derivatives in light of a continuous evaluation by means of the adoption of the rolling window approach [9–11] applied to the DCCA (Detrended Cross-Correlation Analysis) method [11–20]. Such a perspective becomes relevant as the price of oil and its derivatives is the basis for decision-making in many countries. Indeed, in practical terms, knowledge of the level of association between the prices of these products can help in the anticipation and formulation of strategies for companies and consumers. This paper is organized as follows: in Section 2, the data are introduced and the DCCA method and statistical test are presented. In Section 3, the results of the DCCA analysis are discussed. Finally, in Section 4, the main conclusions are outlined.

#### **2. Methods**

#### *2.1. Data Characteristics*

In this study, we use time series (TS) to represent daily prices of future market settlements related to the first available contract (C1) from CME group exchanges (NYMEX) and the ICE exchange for the HO and Brent, respectively. Each contract of the selected pair represents the most negotiated future contracts for diesel and crude oil worldwide. In general, price imbalances in the crude oil market tend to rapidly transfer to its derivatives. The reason is that the HO–Brent differential, also known as 'crack-spread', can be applied as a representation of the refinery margin to buy crude oil and produce diesel/heating oil.

In order to analyze the price dynamics of this pair, we considered two distinct periods, *P*1 and *P*2, where the first denotes the two-year period prior to the COVID-19 outbreak (January 2018 to December 2019) and the second denotes the two-year period after the COVID-19 outbreak (January 2020 to December 2021).

#### *2.2. Detrended Cross-Correlation Analysis*

In recent years, the concept of fractals in TS has been investigated by means of the Hurst exponent (*H*) and Auto-Regressive Fractional Integrated Moving Average (ARFIMA) processes [7,21–30]. Several computational algorithms have been proposed to explore this field [31–36]. For example, when it comes to non-stationary TS, the Detrended Fluctuation Analysis (DFA) and its respective scaling coefficients yield satisfactory results to avoid the spurious detection of correlations or self-similarity [31,32]. This process is related to the Brownian and fractional Brownian motions, which allow us to quantify the long-range dependence in the analyzed TS.

A generalization of the DFA method was proposed by Podobnik and Stanley in 2008 [37], the so-called Detrended Cross-Correlation Analysis (DCCA), which is based on the detrended covariance between two TS. This method provides the quantification of long-range cross-correlations in the presence of non-stationarity. Considering two long-range cross-correlated TS *yi* and *y'i* of equal length *N*, the series can be expressed in integrated form:

$$Y\_k = \sum\_{i=1}^k y\_i \tag{1}$$

$$Y\_{\;k}^{\prime} = \sum\_{i=1}^{k} y\_{\;i}^{\prime} \tag{2}$$

where *k* = 1, ... , *N*. The entire TS are divided into *N* − *n* overlapping boxes, each containing *n* + 1 values. The box starting at position *i* and ending at position *i* + *n* defines the "local trend". Moreover, we can define *Ŷk,i* and *Ŷ'k,i* (*i* ≤ *k* ≤ *i* + *n*) as the ordinate

points of the linear least-squares fit. For each box, it is possible to calculate the covariance of the residual as follows:

$$f_{DCCA}^{2}(n,i) = \frac{1}{(n-1)} \sum_{k=i}^{i+n} (Y_k - \hat{Y}_{k,i})(Y'_k - \hat{Y}'_{k,i}) \tag{3}$$

Hence, the detrended covariance is calculated by summing over all overlapping *N* − *n* boxes of size *n* as:

$$F_{DCCA}^2(n) = \frac{1}{(N-n)} \sum_{i=1}^{N-n} f_{DCCA}^2(n, i) \tag{4}$$

When a long-range cross-correlation appears between the two TS, then *FDCCA* ∼ *n^λ*, where *λ* ≈ (*HDFA* + *H'DFA*)/2. The *λ* exponent quantifies the long-range power-law correlations, but does not quantify the level of cross-correlation [37–39]. For this matter, Zebende [39] proposed the DCCA cross-correlation coefficient, defined by:

$$\rho_{DCCA} \equiv \frac{F_{DCCA}^{2}}{F_{DFA}\{y_i\}\,F_{DFA}\{y'_i\}} \tag{5}$$

These coefficient values are interpreted similarly to Pearson's correlation and can be summarized as follows: (a) −1 ≤ *ρDCCA* ≤ 1, (b) *ρDCCA* = 1 for a perfect cross-correlation, (c) *ρDCCA* = 0 for no cross-correlation presented between the TS, and (d) *ρDCCA* = −1 for a perfect anti-cross-correlation.
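To make the procedure concrete, the following is a minimal Python sketch of Equations (1)–(5); the function names, the use of NumPy, and the linear local detrending are our own assumptions rather than the authors' exact implementation.

```python
import numpy as np

def detrended_fluctuations(y1, y2, n):
    """Return F^2_DCCA(n) and the two F_DFA(n) values following Eqs. (1)-(4).

    y1, y2 : 1-D arrays of equal length N (e.g., price returns).
    n      : box (time-scale) size.
    """
    Y1, Y2 = np.cumsum(y1), np.cumsum(y2)           # integrated series, Eqs. (1)-(2)
    N = len(Y1)
    f2_dcca, f2_dfa1, f2_dfa2 = [], [], []
    for i in range(N - n):                           # N - n overlapping boxes of n + 1 points
        k = np.arange(i, i + n + 1)
        t1 = np.polyval(np.polyfit(k, Y1[k], 1), k)  # linear least-squares local trend
        t2 = np.polyval(np.polyfit(k, Y2[k], 1), k)
        r1, r2 = Y1[k] - t1, Y2[k] - t2
        f2_dcca.append(np.sum(r1 * r2) / (n - 1))    # Eq. (3)
        f2_dfa1.append(np.sum(r1 * r1) / (n - 1))
        f2_dfa2.append(np.sum(r2 * r2) / (n - 1))
    # Eq. (4): average over all boxes
    return np.mean(f2_dcca), np.sqrt(np.mean(f2_dfa1)), np.sqrt(np.mean(f2_dfa2))

def rho_dcca(y1, y2, n):
    """DCCA cross-correlation coefficient of Eq. (5)."""
    F2, F1_dfa, F2_dfa = detrended_fluctuations(y1, y2, n)
    return F2 / (F1_dfa * F2_dfa)
```

For instance, `rho_dcca(brent_returns, ho_returns, n=16)` would return the coefficient for a 16-day time scale, with values interpreted on the same −1 to 1 range described above.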

#### *2.3. Rolling Window Approach and the Statistical Test for* Δ*ρDCCA*

Different statistical tests have been adopted to evaluate the detrended cross-correlation coefficients [30,38,40,41]. In this work, we applied the statistical test proposed by Guedes et al. [9] to evaluate Δ*ρDCCA*. This test allows us to analyze two distinct moments separated by a phenomenon, such as the economic crisis caused by the COVID-19 pandemic. The coefficient is represented by:

$$\Delta\rho_{DCCA}(n) = \rho_{DCCA}^{P_2}(n) - \rho_{DCCA}^{P_1}(n) \tag{6}$$

where *ρ^(P1)_DCCA*(*n*) and *ρ^(P2)_DCCA*(*n*) are the DCCA coefficients for periods *P*1 and *P*2, respectively. The subsequent test consists in calculating the probability distribution function (PDF) of Δ*ρDCCA*(*n*), supposing that it obeys a normal distribution, following the steps described in [9].


In general, the PDF of Δ*ρDCCA*(*n*) converges to a normal distribution, as shown by [9]. However, we decided to conduct D'Agostino and Pearson's normality test [42,43] to verify the normality of the distribution. Hereafter, the following contagion hypothesis is tested with a T-test for the mean of the Δ*ρDCCA*(*n*) parametric group and the Wilcoxon signed-rank test for the non-parametric group:

*H*0: Δ*ρDCCA*(*n*) = ⟨Δ*ρDCCA*⟩ (contagion does not exist);

*H*1: Δ*ρDCCA*(*n*) ≠ ⟨Δ*ρDCCA*⟩ (contagion exists);

where ⟨Δ*ρDCCA*⟩ is the sample mean, which is approximately equal to zero. Thus, for each PDF defined by window size *W* (in this study, W1 = 50 days, W2 = 100 days, W3 = 150 days, W4 = 200 days, W5 = 250 days) and time scale *n*, we can obtain the positive critical point defined as Δ*ρc*(*n*) for 90%, 95%, and 99% confidence levels as follows:

$$\langle \Delta \rho_{DCCA} \rangle \pm Z_{\alpha/2} \frac{SD}{\sqrt{N}} \tag{7}$$

where *Zα/2* is the value for the chosen confidence level *α*, *SD* is the standard deviation, and *N* is the sample size.
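Under the same assumptions, the rolling-window comparison of Equation (6) and the significance test around Equation (7) could be sketched as follows; `rho_dcca` is the helper sketched in Section 2.2, and the inputs (pairs of Brent and HO series for *P*1 and *P*2) and parameter values are illustrative.

```python
import numpy as np
from scipy import stats

def delta_rho_series(p1, p2, window, n):
    """Delta rho_DCCA(n) of Eq. (6): rolling-window rho over P2 minus rho over P1.

    p1, p2 : tuples (brent_series, ho_series) for the two periods.
    """
    length = min(len(p1[0]), len(p2[0])) - window
    rho = lambda pair, i: rho_dcca(pair[0][i:i + window], pair[1][i:i + window], n)
    return np.array([rho(p2, i) - rho(p1, i) for i in range(length)])

def contagion_test(delta, alpha=0.05):
    """D'Agostino-Pearson normality check, then t-test or Wilcoxon for a zero mean."""
    _, p_norm = stats.normaltest(delta)
    if p_norm >= alpha:                               # parametric case
        _, p_value = stats.ttest_1samp(delta, 0.0)
    else:                                             # non-parametric case
        _, p_value = stats.wilcoxon(delta)
    z = stats.norm.ppf(1 - alpha / 2)                 # critical band of Eq. (7)
    band = delta.mean() + np.array([-1.0, 1.0]) * z * delta.std(ddof=1) / np.sqrt(len(delta))
    return p_value, band
```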

#### **3. Results and Discussion**

Figure 1 shows the *ρDCCA*(*n*) behavior for HO–Brent during periods *P*1 and *P*2 for every presented time scale (*n*) and different sliding window sizes (W1–W5). From Figure 1a,c, one can note that, for window sizes of 50 and 100 days, the prices showed a weaker relation at the beginning of 2019; this is not observed for larger sizes of *W*, which is an indication of short-term effects. Moreover, we can notice that all the window sizes (W1–W5) exhibited a fall in cross-correlation in the period that preceded the COVID-19 outbreak.

Regarding the COVID-19 period (*P*2), Figure 1b,d,f,h,j allow us to observe a loss of cross-correlation from March to April of 2020, when both markets presented an intense fall in prices due to lockdowns worldwide, especially in the US market. Moreover, a considerable number of market agents took a bearish (selling) position in these contracts due to the lack of global demand predictability during this period. However, one of the reasons for the price dissolution may have come from the specific characteristics of the diesel market. For example, heating oil, as the name suggests, can be used for heating purposes during severe US cold winters. In contrast, the equivalent product in Europe, namely gasoil, is used for driving, much like gasoline in the US market. Therefore, during the lockdowns and with a lack of driving demand for fuel, the HO price movement may have diverged from that of crude oil, gasoline, and gasoil.

Moreover, the 50-day and 100-day rolling windows, shown in Figure 1b,d, display another strong price dissolution between May and June of 2020. In addition, one can observe that shorter window sizes are sensitive to short-term effects, which can be noted during the year 2021. These effects are related to the US Gulf diesel supply shortage during the cold weather at the beginning of 2021 and also during the effects of Hurricane Ida in the second half of 2021 [44]. This suggests that short-term supply shortages of diesel in the US Gulf can affect the HO–Brent cross-correlation, similarly to the restricted demand period caused by COVID-19. However, these short-term supply effects are not observed when using larger rolling window sizes, which is not the case for the initial pandemic effects, which are displayed for every tested window. In general, the larger windows presented a cross-correlation recovery for the pair after the first half of 2020 until the end of 2021. One can also note that the greater time scales (*n*) diverge from the lower time scales and cannot encapsulate the complete price dynamics of both periods, since both markets are mostly interdependent in the long term compared to the short term [10].

Table 1 summarizes the descriptive statistics for the Δ*ρDCCA* distributions as a function of *n* with different sizes of *W*. As suggested by Guedes et al. [9], the observed mean values are approximately zero and the standard deviation (SD) decreases for greater *W* sizes. However, skewness and kurtosis mostly diverged from the values expected for a normal distribution (i.e., *Kurtosis* ≈ 3 and *Skewness* ≈ 0) for different combinations of *n* and *W*, which tends to affect the normality of the distributions. For this reason, we conducted D'Agostino and Pearson's normality test, and the results are shown in Table 2. It can be seen that all the applied window sizes (*W*) presented non-normality for most tested time scales (*n*).

**Figure 1.** The Brent–HO *ρDCCA* TS comparison of *P*<sup>1</sup> vs. *P*<sup>2</sup> for W1 to W5.


**Table 1.** The *Brent-HO* descriptive summary of Δ*ρDCCA* for W1 to W5.

**Table 2.** The *Brent-HO* normality test of Δ*ρDCCA* for W1 to W5. Significance level of 95% (*p*-value < 0.05) rejects the null hypothesis of normality.


Thus, the contagion hypothesis can be tested for each Δ*ρDCCA* distribution. Table 3 depicts the significance test, where the *T*-test is applied to parametric (normal) distributions and the Wilcoxon signed-rank test to non-parametric (non-normal) distributions. One can note that there is evidence of a contagion predominance for time scales *n* < 32, which suggests short-term effect spillover when comparing *P*1 and *P*2. However, there is no strong evidence of a contagion effect for values of *n* ≥ 32 days, which suggests that the market imbalances caused by COVID-19 did not affect the HO–Brent cross-correlation in the long term as much as in the short term. Figure 2a–c confirms the alternative hypothesis (Δ*ρDCCA*(*n*) ≠ 0), where it is possible to notice a prevalence of Δ*ρDCCA*(*n*) > 0 for the first 150 days of comparison. The Δ*ρDCCA*(*n*) exceeds the critical limits for most of the period (see Table 4). On the other hand, from Figure 2d,e, one can observe that greater values of *n* and *W* tend to smooth the curves, with no clear pattern. However, for every time scale (*n*), the correlations are shown to be lower during the beginning of *P*2 than in the same period of *P*1, in addition to the lower Δ*ρDCCA*(*n*) in the last 50 days of the curves, which indicates that HO–Brent fully recovered in terms of correlation after 200 days of the COVID-19 outbreak.

**Table 3.** The *Brent-HO* significance test of Δ*ρDCCA* for W1 to W5. Significance level of 95% (*p*-value < 0.05) rejects the null hypothesis of Δ*ρDCCA* = 0.


**Table 4.** The *Brent-HO* critical values of Δ*ρDCCA* with 90%, 95% and 99% confidence level (CL) for W1 to W5.


**Figure 2.** The *Brent-HO* Δ*ρDCCA* TS for different time scales (*n*).

#### **4. Conclusions**

This work employed Detrended Cross-Correlation Analysis to study the future contract price dynamics between US diesel (HO) and Brent crude oil during the periods pre- and post-COVID-19. The results indicate that there is strong evidence of contagion in the cross-correlation due to the initial impact of the pandemic, but the HO–Brent correlation fully recovered after approximately 200 days. However, lower time scales (*n*) are also sensitive to supply shortages in the short term and can be more reliable for agents that might not take long positions. Therefore, this indicates that, despite the pair being highly correlated, the initial global lack of crude oil demand generated by the lockdowns caused a fall in crude oil prices, while the same dynamics appeared in the US diesel market only after a delay.

**Author Contributions:** Conceptualization, S.A.D.; methodology, C.M.C.I.J. and S.A.D.; software, C.M.C.I.J.; formal analysis, C.M.C.I.J. and S.A.D.; writing—original draft preparation, C.M.C.I.J. and S.A.D.; writing—review and editing, C.M.C.I.J. and S.A.D.; visualization, C.M.C.I.J.; supervision, S.A.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors wish to acknowledge the Institute of Mathematics and Computer Science of the University of São Paulo (ICMC-USP), CeMEAI-USP and MECAI-USP for the general and financial support.

**Conflicts of Interest:** The authors declare no conflicts of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


### *Proceeding Paper* **Hybrid K-Mean Clustering and Markov Chain for Mobile Network Accessibility and Retainability Prediction †**

**Amel Salem Omer 1,\*, Tesfaye Addisie Yemer <sup>2</sup> and Dereje Hailemariam Woldegebreal <sup>1</sup>**


**Abstract:** To provide reliable services, mobile network operators (MNOs) continuously collect vital mobile network performance data to monitor and analyze the functioning of their radio access networks (RANs). The RAN is a critical infrastructure for mobile networks, and its performance is measured by key performance indicators (KPIs) such as accessibility, retainability, availability, integrity, and mobility. The standard practice is that network managers utilize KPIs to identify failures or unusual events that can significantly degrade the quality of service delivery and the end-users' experiences. However, taking corrective steps based on monitored performance parameters is a reactive approach that contributes to network and service degradation until corrective actions are taken. With the monitoring and automation of RAN infrastructure performance in mind, this paper presents the Markov chain, a widely used probabilistic modeling approach, as a systematic method for jointly predicting network accessibility and retainability status, two of the crucial RAN performance measures. The novel joint prediction is proposed to have a single operation for both accessibility and retainability. Real-time hourly KPI data were collected from 1530 cells (base stations) run by an operator's network in Addis Ababa, Ethiopia, for 4 months, from 1 November 2020 to 28 February 2021. The cells are scattered across the capital city, where factors such as land use, settlement patterns, and customer behaviors differ. To capture the spatial variation of the KPIs without escalating the computational complexity much, the dataset is separated into six clusters using the K-mean clustering approach. The Markov chain KPI status prediction models are formulated at the cluster level. The results reveal that the proposed models can predict the KPI status with 94.61 percent accuracy. Because the data are already available and can be collected at any time using the operator's network management system (NMS), this is a cost-effective technique to proactively improve mobile network performance.

**Keywords:** accessibility; retainability; Markov chain; K-mean clustering

#### **1. Introduction**

The demand for dependable mobile network services is growing and is projected to continue to grow in the coming years. To meet this rising service demand, mobile network operators (MNOs) are expanding their networks and use a centralized network management system (NMS) to monitor the performance of the radio access network (RAN) and core network, two critical components of mobile network infrastructure. The NMS is a network monitoring and control tool, with fault management and performance management as its two essential functionalities [1]. Fault management addresses the need for fault-free operation and has three aspects, namely fault identification, fault isolation, and fault correction. Fault identification is conducted with the help of network alarms, while fault isolation separates the network's remaining components from the failure so that the isolated network can continue to function normally. Fault correction requires repairing or replacing failed components.


Performance management, on the other hand, includes network monitoring to observe network activities and network control to take mitigations that increase network performance. Some of the network manager's performance concerns include determining the amount of capacity utilization, traffic monitoring, throughput status, and reaction time status, among others [1].

Although the NMS offers critical information through various management subsystems, most operators find it challenging to manage the data collected from the system and take corrective actions in a timely manner. Operators select key performance indicators (KPIs) and monitor them at hourly, daily, weekly, or monthly intervals to discover problems or unusual events that might drastically affect service delivery and end-user experience. KPIs are further grouped to assess network performance, and the widely used performance measures are accessibility, retainability, availability, integrity, and mobility. NMS holds vast amounts of historical network performance data, from which possible trends and patterns can be revealed using cutting-edge data mining techniques.

In mobile networks, the Markov chain is used for call admission control [2], quality of experience (QoE) modeling [3], quality of service (QoS) modeling [4], efficient resource utilization [5], prediction of user mobility [6], handover management, and network operation status monitoring [7]. In [8,9], the Markov chain is proposed to forecast Radio Resource Control (RRC) setup success and call setup success rates (CSSR) for the Long-Term Evolution (LTE) mobile network. The status, or state in Markov chain terminology, of a cell (base station) is classified as "Good/High," "Moderate/Acceptable," or "Bad/Low" based on the RRC success rate. Data collected from an operator's network are used to create the Markov chain-based models for RRC and CSSR future state predictions. A given cell is in one of the three states depending on the time of day, the cell's geographic location, network capacity, and other user- and network-related factors. In addition to the Markov chain, cluster-based approaches, decision trees, and artificial neural networks were employed in [10–12] to estimate network accessibility-related parameters. These and papers such as [13–15] addressed related performance measures for various generations of cellular mobile systems.

This paper's primary goal is to forecast mobile network accessibility and retainability status using real-time data gathered from the NMS of a major network operator in Addis Ababa, the capital city of Ethiopia. Specifically, the data were collected on an hourly basis from 1530 cells over a 4-month period, from 1 November 2020 to 28 February 2021. The states of these two critical RAN performance parameters are defined based on the International Telecommunication Union's (ITU's) recommendations for network accessibility and retainability. As the cells are scattered across different geographic regions of the capital city, the K-mean clustering technique is used to group cells with spatially correlated performance. The per-cluster averaged data are used to construct the Markov chain prediction model. Two approaches are used for model formulation: a separate approach, in which two Markov models are built for accessibility and retainability, and a joint approach, in which a single model predicts both parameters. Using either of the two approaches, we can compute the state of the network and the number of transitions until a steady state is reached. The essential contributions of the research are mentioned here.


Rather than running the prediction as many times as the number of cells, we employed K-mean clustering to identify related cells. The Markov chain is then applied to the per-cluster averaged data. Prediction aids in analyzing the status of the considered mobile network.

The remainder of the paper is organized as follows. Section 2 discusses fundamental concepts and formulas for accessibility and retainability. Section 3 introduces some basic concepts of discrete Markov chains. Section 4 presents and discusses the results obtained. Finally, Section 5 concludes the paper by identifying possible future directions.

#### **2. Accessibility and Retainability KPIs**

KPIs obtained from network counters can be grouped into accessibility, retainability, integrity, mobility, and other factors in order to manage and track the performance of the network [17]. According to ITU, service accessibility is "*the ability of a service to be obtained, within specified tolerances and other given conditions, when requested by the user*." Service retainability is "*the ability of a service, once obtained, to continue to be provided under given conditions for a requested duration*" [18].

#### *2.1. Accessibility*

The accessibility KPI is expressed as a probability, which indicates how likely a user is to access the mobile service during specific service times and conditions. Accessibility measures the network's performance during call setup or before establishing a bearer [19]. For data availability reasons, this paper focuses on Third Generation (3G) mobile networks. RRC, radio access bearer (RAB), Enhanced Universal Terrestrial RAN RAB (ERAB), and CSSR are critical accessibility parameters, as presented below.

• RRC setup success rate (RRC SSR) evaluates the call success rate in a cell or cluster. The formula for this KPI is:

$$RRC\ SSR = \frac{Number\ of\ RRC\ setup\ success}{number\ of\ RRC\ connection\ attempt} \times 100\%.\tag{1}$$

• RAB setup success rate (RAB SSR) evaluates the success rate of assigning a RAB during a call setup procedure. The formula for this KPI is given as follows:

$$RAB\ SSR = \frac{Number\ of\ RAB\ setup\ success}{number\ of\ RAB\ setup\ attempt} \times 100\%.\tag{2}$$

• CSSR is used to evaluate the call setup success at the cell or cluster level. This KPI is calculated based on RRC SSR and RAB SSR for the case of third generation (3G) networks and ERAB SSR for the case of LTE networks.

$$Accessibility = CSSR = RRC\ SSR \times RAB\ SSR \times 100\%.\tag{3}$$

#### *2.2. Retainability*

Retainability assesses a network's performance after the RAB is established and indicates the proportion of calls that deliver the requested service without being dropped.

$$\text{Retainability} = \left(1 - \frac{\text{Number of RAB abnormal releases}}{\text{Total number of RAB releases}}\right) \times 100\%.\tag{4}$$

The fraction in Equation (4) corresponds to the call drop rate (CDR).
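As an illustration only, the KPIs above could be computed from hourly counter data roughly as follows; the DataFrame column names are hypothetical placeholders, not the counters used by the operator's NMS.

```python
import pandas as pd

def accessibility_retainability(df: pd.DataFrame) -> pd.DataFrame:
    """Compute the KPIs of Eqs. (1)-(4) from hourly counters (hypothetical column names)."""
    out = pd.DataFrame(index=df.index)
    out["rrc_ssr"] = df["rrc_setup_success"] / df["rrc_conn_attempt"] * 100      # Eq. (1)
    out["rab_ssr"] = df["rab_setup_success"] / df["rab_setup_attempt"] * 100     # Eq. (2)
    out["accessibility"] = out["rrc_ssr"] * out["rab_ssr"] / 100                 # Eq. (3), CSSR in %
    out["retainability"] = (1 - df["rab_abnormal_release"]
                            / df["rab_total_release"]) * 100                     # Eq. (4)
    return out
```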

#### **3. Discrete-Time Markov Chain**

A Markov chain is a particular class of stochastic process with random variables designating the states or outputs of the system [7,20]. The probability of the system transitioning from its current state to a future state depends only on the current state. The collection of states forms a state space of alphabet size *N*. Let {*a*1, *a*2, ..., *aN*} designate the set of possible states.

For the Markov chain fulfilling the memoryless assumption, the transition probability is expressed as [21]:

$$P(S_{n+1} = a_j \mid S_n = a_i, S_{n-1} = a_h, S_{n-2} = a_g, \dots) = P(S_{n+1} = a_j \mid S_n = a_i) \tag{5}$$

where 1 ≤ *i*, *j* ≤ *N*. We learn from the Markov property that only the most recent state matters to predict the next or future state. From Equation (5), the transition probability from state *ai* to state *aj* is designated as:

$$P_{ij} = P(S_{n+1} = a_j \mid S_n = a_i). \tag{6}$$

For all *i*, the transition probabilities in a row must sum to one, i.e.,

$$\sum_{j=1}^{N} P_{ij} = 1.\tag{7}$$

#### *3.1. Transition Probability Matrix*

The collection of the transition probabilities *Pij* forms the transition probability matrix (TPM), *P* (see Equation (8)). Each entry of the matrix shows the probability that the system will transition to another state or remain in the same state. *P* is a square matrix with the same dimension as the number of states.

$$P = \begin{bmatrix} P\_{11} & P\_{12} & P\_{13} & \dots & P\_{1N} \\ P\_{21} & P\_{22} & P\_{23} & \dots & P\_{2N} \\ \cdot & \cdot & \cdot & \cdots & \cdot \\ \cdot & \cdot & \cdot & \cdots & \cdot \\ P\_{N1} & P\_{N2} & P\_{N3} & \dots & P\_{NN} \end{bmatrix} \tag{8}$$

The transition probability *Pij* is computed from empirical data by counting the number of transitions from state *i* to state *j* and dividing the result by the count of all transitions from state *i* [7].
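A minimal sketch of this counting procedure is given below; the state labels and the example sequence are illustrative.

```python
import numpy as np

def transition_matrix(states, labels=("I", "G", "A", "B")):
    """Estimate the TPM of Eq. (8) by counting observed transitions i -> j."""
    idx = {s: k for k, s in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)))
    for cur, nxt in zip(states[:-1], states[1:]):
        counts[idx[cur], idx[nxt]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # normalize each row; rows with no observed transitions stay at zero
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# e.g., hourly accessibility states of one cluster
P = transition_matrix(["G", "G", "A", "G", "B", "G", "G", "I", "G"])
```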

#### *3.2. Initial (Probability or State) Distribution*

The initial state distribution is usually expressed as a probability distribution vector *U* of dimension 1 × *N*, as shown in Equation (9), with entries that indicate the probability that the system is in a given state at the initial time. Each entry of the vector is non-negative and the sum of all entries must be unity.

$$U = \begin{bmatrix} P_1 & P_2 & \dots & P_N \end{bmatrix}. \tag{9}$$

Without accurate knowledge of the initial distribution, the system can be considered to be in one state with absolute certainty, i.e., probability of unity.

#### *3.3. Steady-State Distribution*

One of the fascinating aspects of systems that obey the Markov chain is that, after a sufficient number of iterations/transitions, the chain converges to a steady-state, stable, equilibrium, or static distribution [7]. A steady-state condition is one in which the probability of the next state is the same regardless of the present state.

With knowledge of the transition matrix *P* and the initial probability vector *U*, the probability distribution of the chain after *k* transitions in the future is given by [7].

$$U^{(k)} = U P^k \tag{10}$$

*P^k* is the result of multiplying the transition matrix by itself *k* times. Each element of *P^k*, designated as *P^(k)_ij*, is the probability of going from state *i* to state *j* in *k* iterations. As we keep iterating through state transitions by applying *P^k*, the probability vectors *U*^(*k*) converge to a fixed distribution, say *π*. This is called the *steady-state distribution* and is written mathematically as in Equation (11) below.

$$\lim_{k \to \infty} U^{(k)} = \lim_{k \to \infty} U P^k = \pi.\tag{11}$$

We note from Equation (11) that the Markov chain probabilistically predicts the system's future state based on knowledge of state space, initial distribution, and transition matrix.
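The k-step prediction of Equation (10) and the convergence of Equation (11) can be sketched as below; the stopping tolerance is an assumption.

```python
import numpy as np

def predict_k_steps(U, P, k):
    """U^(k) = U P^k, Eq. (10)."""
    return U @ np.linalg.matrix_power(P, k)

def steady_state(U, P, tol=1e-9, max_iter=10_000):
    """Iterate Eq. (10) until consecutive distributions coincide (Eq. (11))."""
    cur = np.asarray(U, dtype=float)
    for _ in range(max_iter):
        nxt = cur @ P
        if np.allclose(nxt, cur, atol=tol):
            return nxt
        cur = nxt
    return cur
```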

#### *3.4. Transition Diagram*

A transition diagram, which illustrates all of the system's transitions, is another way to display the TPM. Each node represents a state of the Markov chain, and a directed arrow shows the presence of a transition from one state to another. The tail of an edge represents the current state, and the arrowhead points towards the next state [7].

#### **4. Results and Discussions**

This section covers data collection and accessibility and retainability status prediction using a four- and sixteen-state Markov chain model.

#### *4.1. Data Collection and Preprocessing*

The performance report system (PRS) installed in the operator's network was used to collect real-time hourly data from 1530 cells for 4 months' duration.


**Figure 1.** One-week RRC and RAB attempts.


#### *4.2. Clustering*

It takes time to analyze individual cell performance and patterns. In this research, we suggest K-mean clustering as a method for grouping cells with similar accessibility and retainability properties. Model construction and prediction are based on per-cluster averaged accessibility and retainability. The Elbow approach in K-mean clustering is used to identify the number of clusters by varying the parameter from 2 to 18. Each cell was initially assigned to a cluster at random, and the centroid of each cluster was then updated. The procedure was repeated until the cluster variation in the data could no longer be reduced by adjusting the cluster centroids. We found that six clusters are adequate. Hourly data acquired from each cell vary from 0% to 100%; however, if no voice or data service requests are received in a cell for 1 h, all counter values for that hour are zero, as illustrated in Figure 1.
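A sketch of this clustering step with scikit-learn is shown below; the feature matrix (one row per cell, e.g., its averaged hourly accessibility and retainability profile) and the candidate range of clusters are assumptions.

```python
from sklearn.cluster import KMeans

def elbow_inertia(X, k_range=range(2, 19)):
    """Within-cluster variation (inertia) for each candidate k, used by the Elbow method."""
    return {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in k_range}

# inertia = elbow_inertia(X)          # inspect the "elbow"; here k = 6 was adequate
# labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
```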

#### *4.3. KPI Threshold for States Definition*

Operators set threshold values for several KPIs based on the ITU's recommendations, considering variables such as capital expenditures, operational expenses, QoS, and customer satisfaction. Tables 1 and 2 display a threshold value for the considered operator's accessibility and retainability. Based on the values in the two tables, the states of accessibility and retainability are generated.

**Table 1.** Possible values of call setup attempt and CSSR.


**Table 2.** Possible values of RAB setup success and CDR.


#### *4.4. Separate Prediction*

As indicated above, the accessibility and retainability predictions at the cluster level can be made separately or jointly. Four states are required for the separate case; hence, the corresponding transition matrices are 4 × 4. The state transition probability diagram for the sixth cluster, obtained after developing the model, is given in Figure 2 below.

**Figure 2.** Transition probability diagram of cluster 6. (**a**) Accessibility. (**b**) Retainability.

Note that some arrows are missing in the two diagrams. As an example, there is no arrow in Figure 2a pointing from state A to state I, indicating that such a transition does not exist, i.e., the system never moved to the Idle state when it was in the Acceptable state.

#### *4.5. Joint Prediction*

The different state combinations of accessibility and retainability can be captured via joint estimation. For example, a Bad state in accessibility and a Bad state in retainability can occur at the same time. When all possible combinations are considered, the total number of states rises to 16, and the resulting transition matrix is shown in Figure 3.

**Figure 3.** Sixteen-state TPM of cluster 6.

#### *4.6. State Prediction*

After creating the transition matrices and knowing the current/initial state distribution, the next state and steady state distributions are predicted using Equation (10). If the current state is (assumed to be) in a Good state, then the value of *π*<sup>0</sup> is,

$$\pi_0 = \begin{bmatrix} I & G & A & B \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 & 0 \end{bmatrix} \tag{12}$$

Then, using Equations (10) and (12), the next accessibility state distribution *π*1 for one of the clusters is computed, with the result:

$$\pi_1 = \begin{bmatrix} 0 & 1 & 0 & 0 \end{bmatrix} \times \begin{bmatrix} 0.1111 & 0.8333 & 0.0000 & 0.0556 \\ 0.0095 & 0.9780 & 0.0089 & 0.0036 \\ 0.0000 & 0.8333 & 0.1667 & 0.0000 \\ 0.0000 & 1.0000 & 0.0000 & 0.0000 \end{bmatrix} = \begin{bmatrix} 0.0095 & 0.9780 & 0.0089 & 0.0036 \end{bmatrix} \tag{13}$$

According to the result, the system has a 0.95 percent chance of going to the Idle state, a 97.80 percent chance of staying in a Good state, a 0.89 percent chance of going to the Acceptable state, and a 0.36 percent chance of going to the Bad state.
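As a quick check, the multiplication in Equation (13) can be reproduced directly:

```python
import numpy as np

pi0 = np.array([0.0, 1.0, 0.0, 0.0])              # Good state with certainty, Eq. (12)
P = np.array([[0.1111, 0.8333, 0.0000, 0.0556],
              [0.0095, 0.9780, 0.0089, 0.0036],
              [0.0000, 0.8333, 0.1667, 0.0000],
              [0.0000, 1.0000, 0.0000, 0.0000]])
pi1 = pi0 @ P                                      # -> [0.0095, 0.9780, 0.0089, 0.0036]
```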

Equation (11) is used to find the steady-state distribution, which is calculated iteratively until the next and previous state distributions are equal. Tables 3 and 4 display the steady-state results for the four-state Markov chain regarding accessibility and retainability, respectively. Table 5 depicts the cluster 1 steady-state distribution using the sixteen-state Markov chain. For both scenarios, 70% of the data are used as a training set.


**Table 3.** Steady-state vector of accessibility using four-state Markov chain.

**Table 4.** Steady-state vector of retainability using four-state Markov chain.


**Table 5.** Steady-state vector of Cluster 1 using sixteen-state Markov chain.


In Table 3, the maximum value of the Good state across the six clusters is 99.26% in cluster 1, and the minimum value is 97.22% in clusters 2 and 6. The maximum value of the Bad state is 1.29% in cluster 2, and the minimum value is 0.1% in cluster 3. Thus, cluster 1 cells rank highest in the Good state, and cluster 2 cells rank highest in the Bad state. Though cluster 6 cells are the lowest in the Good state, they are not at the top in the Bad state because, next to the Good state, cluster 6 cells have a high probability (1.09%) of being in the Idle state. So, if optimization or maintenance work is needed, the schedule and priority should be based on the steady-state vector values of each cluster.

The steady-state distribution for the sixteen-state Markov chain follows the same approach. Cluster 1's steady-state outcome is shown in Table 5, where the first letter stands for accessibility and the second for retainability. For 99.26% of the time, both accessibility and retainability were in the Good state, while for 0.4% of the time, accessibility was in the Acceptable state and retainability was in the Good state. Furthermore, for 0.2% of the time, accessibility and retainability were both in the Acceptable state, while for 0.15% of the time, accessibility was Bad and retainability was in the Good state. As a result, the table provides cell information relating to accessibility and retainability, allowing operators to quickly sort cells that perform poorly in either or a combination of the two performance measures.

#### *4.7. Evaluation Metric*

The accuracy of a model was assessed using Equation (14) [21], which calculates the percentage of correctly forecasting the next state given the current state.

$$Accuracy = \frac{\text{Correct predictions}}{\text{Total number of examples}} \times 100\%. \tag{14}$$

Table 6 below shows the accuracy results for different combinations of training data proportion, clusters, four- versus sixteen-state modeling, and the two KPIs considered. As an example, we note that a minimum prediction accuracy of 96.09% is achieved in cluster 2 when predicting accessibility with a 60% training set, while a prediction accuracy of 96.87% is achieved in cluster 5 when predicting retainability with an 80% training set. A prediction accuracy of 94.61% is achieved in cluster 6 when jointly predicting accessibility and retainability with 80% of the data used for training and the sixteen-state Markov chain model.

**Table 6.** Prediction accuracy for sixteen-state Markov chain.


#### **5. Conclusions**

In this paper, the two important mobile network KPI parameters of accessibility and retainability are predicted by formulating the Markov chain in four states and sixteen states. The sixteen-state Markov chain is formulated in a bid to jointly estimate both KPIs in a single operation. Moreover, in order to capture the spatial behaviors of these KPIs, K-mean clustering is applied to cluster the data from 1530 cells into 6 clusters. States are created based on threshold values set by operators and the developed models are validated by splitting the data for training and testing. We hope the approach provides significant insight on how to use data available within an operator's NMS to better understand the status of a network.

This work might be improved in some ways. Conducting the prediction for a large number of cells in a computationally efficient manner and to obtain per-cell level information is one research area. The clustering and joint approach may not scale well as the number of cells grows. Moreover, applying the approach for other KPIs, network types, and services is an area worth exploring. Finally, future research should employ the hidden Markov model for status modeling and prediction.

**Author Contributions:** Conceptualization, T.A.Y. and D.H.W.; methodology, T.A.Y. and D.H.W.; software, T.A.Y. and A.S.O.; validation, A.S.O. and T.A.Y.; formal analysis, A.S.O., T.A.Y. and D.H.W.; investigation, A.S.O., T.A.Y. and D.H.W.; resources, T.A.Y.; data curation, T.A.Y.; writing—original draft preparation, A.S.O. and T.A.Y.; writing—review and editing, A.S.O. and D.H.W.; visualization, A.S.O.; supervision, D.H.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Restrictions apply to the availability of the data used for the research. Data was obtained from ethio telecom and are available from the second author with the permission of ethio telecom.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **A Multivariate Approach for Spatiotemporal Mobile Data Traffic Prediction †**

**Bethelhem S. Shawel 1,2,\* , Endale Mare 3, Tsegamlak T. Debella 1,4 , Sofie Pollin <sup>2</sup> and Dereje H. Woldegebreal <sup>1</sup>**

- <sup>3</sup> Ethio Telecom, Addis Ababa 1047, Ethiopia; endale.mare@ethiotelecom.et
- <sup>4</sup> ENSISA, Institute for Research in Informatics, Mathematics, Automation and Signal, Université de Haute Alsace, 68093 Mulhouse, France
- **\*** Correspondence: bethelhem.seifu@aait.edu.et
- † Presented at the 8th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 27–30 June 2022.

**Abstract:** The widespread deployment of spectrally efficient mobile networks, advancements in mobile devices, and the proliferation of attractive applications have led to an exponential increase in mobile data traffic. Mobile Network Operators (MNOs) benefit from the associated revenue generation while striving to meet customers' expectations of the delivered services. Having a clear knowledge of the traffic demand is critical for network dimensioning, optimization, resource allocation, market planning, and the like. As the traffic demand, among others, is a function of customers' behavior and settlement patterns, land use, and time of the day, capturing traffic characteristics in both the temporal and spatial dimensions is needed. Moreover, other parameters, such as the number of users and data throughput, inherently contain traffic-related information, necessitating a multivariate approach for understanding the traffic demand. Recognizing the multidimensional and multivariate nature of mobile data traffic, in this paper we propose a multivariate and hybrid Convolutional Neural Network and Long Short-Term Memory network (CNN-LSTM) data traffic prediction model. The model is built on mobile traffic data collected from a network operator's Long-Term Evolution (LTE) network. The results confirm that the proposed model outperforms its univariate counterparts in Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE) by 58% and 50%, respectively. Moreover, the model is further compared with CNN-only univariate and multivariate models, which it also outperforms. The comparisons substantiate the improvements achievable thanks to the hybrid and multivariate nature of the prediction algorithm.

**Keywords:** mobile data traffic; multivariate prediction; temporal; spatial; CNN; LSTM

### **1. Introduction**

The global need for mobile data traffic is increasing for a variety of reasons, including the continuous growth of smarter mobile phones, the emergence of machine-to-machine connections, and the availability of appealing and data-intensive applications [1]. Constant optimization, capacity enhancement, and efficient utilization of scarce resources are approaches taken by Mobile Network Operators (MNOs) to maintain service quality and avoid a capacity crunch in the face of this ever-growing data demand. Moreover, network densification, traffic offloading, spectral efficiency improvement, and using more radio spectrum are techniques to improve the poor quality of service (QoS) that arises due to a capacity crunch [2]. MNOs select the appropriate method based on their customer demand and financial capability. Knowledge of the current and future data traffic demand is one critical input for the design and implementation of the above-mentioned approaches.


Time-series prediction methods play a vital role in forecasting future demand for several real-world applications, including mobile data traffic demand. Data prediction models are broadly grouped into conventional and computational intelligence models [3]. Autoregressive Integrated Moving Average (ARIMA) and its extensions, such as Seasonal ARIMA (SARIMA), are conventional methods. Computational intelligence techniques, on the other hand, include machine learning and deep learning-based models such as Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) networks. Time-series prediction methods can be developed based on univariate or multivariate variables (or features). In the univariate case, there is one observation (a dependent feature, which in our case is data traffic) available at different time instants, while in the multivariate case there are multiple observations over different time instants. Multivariate time series prediction has become popular in many real-world applications such as energy, finance, and weather, and it eases model building by improving the model's performance [4].

Several researchers have used machine learning methods, such as deep learning and data clustering, and multivariate approaches to model the dynamics of mobile data traffic in the temporal and spatiotemporal domains. Based on data collected from an operator's network, ref. [3] proposed LSTM and Gated Recurrent Unit (GRU) models to capture the dynamics of mobile data traffic. By comparing with the Adaptive Neuro-Fuzzy Inference System (ANFIS) and Artificial Neural Network (ANN), the authors demonstrated the performance gain of the proposed model. Similar to [3], ref. [5] also applied LSTM and Recurrent Neural Network (RNN) models to predict the data traffic demand of a 4G network run by an operator. In both papers, the prediction is done on a per-base-station basis and in the temporal dimension only.

To separately estimate the linear and non-linear parts of mobile data traffic, ref. [6] proposed a hybrid model using Double SARIMA (DSARIMA) and LSTM, in which the DSARIMA handles the linear part whereas the LSTM predicts the nonlinear part of the data traffic. To capture the correlation among temporal traffic data taken from different base stations that are spatially separated, K-Means clustering is used to group the base stations having similar data traffic. The results show that the hybrid model outperforms the DSARIMA-only and LSTM-only models. A similar clustering-based approach was also used in [7] to assess the effectiveness of different time series prediction models for the efficient deployment of base stations.

A multivariate and LSTM-based prediction approach is proposed in [8] based on the scheduling information of users. The multivariate features considered are the number of resource blocks, the transport block size, and the modulation and coding schemes. The results show the effectiveness of the LSTM network in capturing temporal variation for multivariate input features. Though for different applications, refs. [9–11] demonstrated the capability of multivariate and hybrid CNN-LSTM models to predict residential energy consumption and to forecast particulate matter, respectively. Univariate models are used as a benchmark for comparison, and the results confirm that multivariate features greatly improve the model performance.

In summary, in a bid to improve prediction accuracy, from the survey we understood the need to incorporate multiple variables and data clustering, and to blend LSTM and CNN to capture traffic dynamics in the spatiotemporal dimensions. In this work, a hybrid CNN-LSTM mobile data traffic prediction model that takes multiple traffic-related variables is proposed. A total of 4 months of Long-Term Evolution (LTE) network data traffic collected from the network operator is used to build and validate the model. To the best of our knowledge, there is no prior work that applies a multivariate hybrid CNN-LSTM model to this type of mobile data traffic prediction. Understandably, the multivariate features are technology- and application-dependent. Hence, we used our experience and the availability of data to determine the features.

The remainder of the paper is organized as follows. The characteristics of mobile data traffic and associated data preprocessing are described in Section 2, followed by the discussions of mobile data traffic prediction approaches in Section 3. Section 4 contains the results and discussion, while the conclusion of the paper is presented in Section 5.

#### **2. Analysis of Mobile Data Traffic**

#### *2.1. Mobile Data Traffic Characteristics*

Mobile data traffic exhibits different properties in both the time and spatial domains. Trend and seasonality are used to demonstrate the temporal properties of time-series data. The trend shows a long-term increase or decrease in the data, whereas seasonality is a repeating pattern with a fixed period, such as daily, weekly, or yearly. Figure 1 illustrates sample downlink data traffic, measured in Gigabytes, from two LTE radio base stations, called eNodeBs, with measurements taken over a duration of 9 days. We observe that, even if the average daily traffic differs between days, there is a daily seasonality in the data.

**Figure 1.** Data traffic pattern for sample LTE sites located at A and B.

#### *2.2. Data Traffic in Spatial Dimension*

We observe from Figure 1 a variation in data traffic demand at the two locations, motivating the need for additional investigation of the traffic pattern in the spatial dimension. Since mobile users constantly move within a given cellular network, the traffic patterns across neighboring base stations are correlated or complementary, such that modeling in both the spatial and temporal dimensions would provide better information for telecom operators [12]. Spatiotemporal data traffic prediction incorporates different aspects of user behavior, such as mobility, and of network behavior, such as the number of handovers in the network [13].

For spatial analysis, a grid-based or cluster-based approach can be used. In the former approach, a given service area is partitioned into (usually) uniform grids, and eNodeBs that fall into one cell of a grid are considered as one unit. However, because of the nonuniform distribution of eNodeBs, it is difficult to formulate models for large areas with fine-granularity grids.

The clustering approach is another option to incorporate all eNodeBs. In this approach, eNodeBs with similar traffic load patterns are grouped together, and those eNodeBs within the same cluster have similar characteristics. The eNodeBs can be clustered based on either geographical location, also called spatial clustering, or on temporal behavior [6,7]. The assumption in spatial clustering is that neighboring eNodeBs exhibit similar temporal properties. In temporal-based clustering, the clustering is done based on temporal behavior irrespective of geographical location [6]. Considering more than one eNodeB in time series clustering incorporates the spatial information of the data traffic. After clustering the base stations, the data traffic prediction model is developed at the cluster level. In this paper, we have applied the temporal-based clustering approach.

#### *2.3. Multivariate Features Selection*

The data used in this paper is collected from an operator's LTE network for 4 months from October 2020 to January 2021 in an hourly granularity. The multivariate dataset

incorporates eight features: downlink (DL) traffic, which is the traffic to be predicted; DL throughput; the average and maximum number of users in a cell; the numbers of attempted, successful, and setup-failure Radio Access Bearers (RABs); uplink (UL) data traffic; and the location information of the eNodeBs.

Pearson-based correlation analysis is applied to select features, and the result of the correlation analysis is illustrated in Figure 2. A correlation threshold of 0.5 and above is used to select features. Moreover, for features whose correlation coefficients are close to each other, e.g., the cell average user count (0.83) and the cell maximum user count (0.82), only one is considered. Among the multivariate features, DL traffic, the number of successful RABs, the cell average user count, and UL traffic are selected, as they are highly correlated with the downlink data traffic.

**Figure 2.** Feature correlation results.
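The threshold-based selection described above could be sketched as follows; the column names mirror the listed features but are hypothetical.

```python
import pandas as pd

def select_features(df: pd.DataFrame, target: str = "dl_traffic", threshold: float = 0.5):
    """Keep features whose absolute Pearson correlation with the target is at least the threshold."""
    corr = df.corr(method="pearson")[target].drop(target)
    return corr[corr.abs() >= threshold].index.tolist()

# e.g., columns: dl_traffic, dl_throughput, avg_users, max_users,
#                rab_attempts, rab_success, rab_failures, ul_traffic
# selected = select_features(traffic_df)
```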

#### *2.4. Data Preparation for CNN-LSTM Model*

In the data preparation, missing values in the multivariate dataset are imputed with a Kalman filter, preserving the strong seasonality and trend of the data traffic. The features in the dataset are scaled with a standard scaler so that all data points fall within a certain range. Since some machine learning algorithms that use distance metrics are affected by the span of the values in the dataset, feature scaling is critical for improving model performance. Furthermore, framing the time series prediction problem as supervised learning makes it suitable for training and testing deep learning models.
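The scaling and supervised-learning framing could look as sketched below; the 24-hour input window and the array layout are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def make_supervised(data, n_in=24, target_col=0):
    """Frame a scaled multivariate series as (samples, n_in, features) -> next-step target."""
    X, y = [], []
    for i in range(len(data) - n_in):
        X.append(data[i:i + n_in, :])
        y.append(data[i + n_in, target_col])
    return np.array(X), np.array(y)

scaler = StandardScaler()
# scaled = scaler.fit_transform(multivariate_array)   # rows: hours, cols: selected features
# X, y = make_supervised(scaled, n_in=24)             # past 24 h predicts next-hour DL traffic
```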

#### *2.5. Time Series-Based Clustering*

Clustering a dynamic dataset differs from clustering static data, since the former changes over time. Different approaches such as Hierarchical Clustering, K-Means Clustering, and Fuzzy C-Means Clustering are used for time series data clustering. Each method has its advantages and disadvantages. Among these methods, K-Means Clustering is used in several works for its fast convergence, even for large datasets [14]. In this work, K-Means clustering is used to group the eNodeBs according to their daily data traffic volume, and four distinct clusters are obtained for the dataset.

#### **3. Mobile Data Traffic Prediction Methods**

Deep learning models such as LSTM, GRU, and CNN are becoming popular for dealing with sequential or time-series data such as text, speech, and images [15]. The basics of the LSTM and CNN networks used to develop the proposed model are reviewed in the following subsections.

#### *3.1. One Dimensional CNN Model*

CNN models are typically employed to analyze spatial or multidimensional data. However, one-dimensional CNN (1D CNN) can also be used to analyze texts and time-series data [16]; 1D CNN can extract salient and representative features of time-series data by performing 1D convolution operations using multiple filters [17]. Figure 3 shows the difference between 1D CNN and 2D CNN. The kernel (filter) in 2D CNN moves in both directions, while it moves only in one direction for 1D CNN. The input for 2D CNN is an image, while multivariate time series features can be inputs for 1D CNN.

**Figure 3.** 2D and 1D CNN model input type and kernel slide direction [18].

#### *3.2. LSTM Model*

RNN is designed for handling sequential data by feeding the output of the previous step back as an input to the next step, allowing the network to capture the dependency of sequential data [19]. LSTM is a type of RNN that was designed to capture long-term dependencies and to mitigate the exploding and vanishing gradient problems. An LSTM network has three gates (the *forget gate f_t*, the *input gate i_t*, and the *output gate o_t*) that decide which information to add to or remove from the cell state, while the *cell state C_t* stores the desired information. The mathematical expressions for the LSTM network at time *t* are as follows:

$$f_t = \sigma(W_f \cdot X_t + U_f \cdot h_{t-1} + b_f) \tag{1}$$

$$i_t = \sigma(W_i \cdot X_t + U_i \cdot h_{t-1} + b_i) \tag{2}$$

$$S_t = \tanh(W_c \cdot X_t + U_c \cdot h_{t-1} + b_c) \tag{3}$$

$$C_t = i_t \odot S_t + f_t \odot C_{t-1} \tag{4}$$

$$o_t = \sigma(W_o \cdot X_t + U_o \cdot h_{t-1} + V_o \cdot C_t + b_o) \tag{5}$$

$$h_t = o_t \odot \tanh(C_t) \tag{6}$$

where tanh(·) and *σ* are activation functions while *it*, *ft*, and *ot* represent input gate, the forget gate, and the output gate values at time *t*, whereas *bi*, *bf* , *bc*, and *bo* are bias vectors for the input gate, forget gate, cell state, and output gate, respectively. *Xt* is the input vector to the memory cell at time t while the parameters *Wf* , *Wi*, *Wc*, *Wo*, *Uf* , *Ui*, *Uc*, *Uo*, and *Vo* are weight matrices for gates and cell state.
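
For concreteness, a plain NumPy translation of Equations (1)–(6); the parameter dictionary `p` and its contents are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following Equations (1)-(6); p holds the weights W*, U*, V_o and biases b*."""
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])                    # (1) forget gate
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])                    # (2) input gate
    s_t = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])                    # (3) candidate cell state
    c_t = i_t * s_t + f_t * c_prev                                               # (4) cell state update
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["Vo"] @ c_t + p["bo"])    # (5) output gate
    h_t = o_t * np.tanh(c_t)                                                     # (6) hidden state
    return h_t, c_t
```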

#### *3.3. Proposed CNN-LSTM Model*

The CNN model is well known for its ability to automatically learn and extract features from raw sequence or time-series data. It is possible to combine this capability of the CNN model with the LSTM model. The LSTM network captures long-term and short-term dependency of temporal features more efficiently. The CNN model accepts input data sequences and extracts important feature information, whereas the LSTM model connected in tandem interprets and provides an output [20]. This combination of CNN and LSTM models is called a CNN-LSTM model. The general approach followed in this paper is illustrated in Figure 4.

#### **Figure 4.** Proposed CNN-LSTM Hybrid Model.
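
A possible Keras realization of the architecture in Figure 4, offered only as a sketch: the layer sizes, filter counts, and optimizer are assumptions, since the paper does not list them at this point.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(n_steps: int = 24, n_features: int = 4) -> tf.keras.Model:
    """Conv1D front end extracts local features; the LSTM models their temporal dependency."""
    model = models.Sequential([
        layers.Input(shape=(n_steps, n_features)),
        layers.Conv1D(filters=64, kernel_size=3, activation="relu"),  # 1D convolution over the time axis
        layers.MaxPooling1D(pool_size=2),
        layers.LSTM(50),                                              # long- and short-term dependencies
        layers.Dense(1),                                              # one-step-ahead DL traffic
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# model = build_cnn_lstm(); model.fit(X, y, epochs=20, batch_size=32, validation_split=0.1)
```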

Common performance evaluation metrics for regression models are the Root Mean Square Error (RMSE) and the Mean Absolute Percentage Error (MAPE). In this work, RMSE and MAPE are used as evaluation metrics, and the formulas for these metrics are:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{7}$$

$$MAPE = \frac{1}{n} \sum\_{i=1}^{n} |\frac{y\_i - \hat{y}\_i}{y\_i}| \tag{8}$$

where *y_i* and *ŷ_i* correspond to the actual and predicted values, respectively, and *n* is the number of predicted instances.
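
Both metrics translate directly into a few lines of NumPy; the factor of 100 in MAPE is only there to express the result as a percentage.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))      # Equation (7)

def mape(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100                     # Equation (8), in percent
```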

#### **4. Experimental Results and Discussion**

#### *4.1. Clustering*

Among different clustering methods, in this work K-Means clustering is selected. Elbow method and silhouette score are used to determine the optimal number of clusters in K-Means clustering and to evaluate the goodness of a clustering technique, respectively. For our data, the optimal number of clusters is selected to be four and the sites are grouped into four clusters, as shown in Figure 5. We note how sites from various geographical areas are grouped into the same cluster because of similarities in their traffic patterns. Moreover, some base stations found in the same locations are grouped into different clusters.
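
The selection of the number of clusters could look roughly like the following sketch, which computes the K-Means inertia (for the elbow plot) and the silhouette score over a range of candidate values of k; `daily` stands for the per-eNodeB daily-volume matrix used for clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(daily: np.ndarray, k_range=range(2, 9)):
    """Return (k, inertia, silhouette) triples; pick k at the inertia elbow / highest silhouette."""
    results = []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(daily)
        results.append((k, km.inertia_, silhouette_score(daily, km.labels_)))
    return results
```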

**Figure 5.** 4G eNodeBs geographical distribution in each cluster.

#### *4.2. Per Cluster Time-Series Predictions*

Figure 6 shows the actual and predicted values of mobile data traffic using multivariate features (Figure 6a) and a univariate feature (Figure 6b). In both cases, the predicted data traffic has a pattern similar to the actual data traffic in terms of daily seasonality. Comparing the predicted results in (a) and (b), we note that including multiple features in a multivariate manner helps to capture the irregularities and edges that occur during peak hours. The improved result with multivariate features also demonstrates the ability of the deep learning model, CNN-LSTM, to extract the salient information required for prediction from complex data. Table 1 depicts a comparison in terms of RMSE and MAPE.

**Figure 6.** Mobile data traffic prediction with CNN-LSTM model (**a**) multivariate features (**b**) univariate feature.

The proposed model's performance is compared with that of CNN-only models using univariate and multivariate features, as also summarized in Table 1. The results confirm the performance improvements due to both the hybrid CNN-LSTM architecture and the use of multivariate features.



Furthermore, the impacts of filling the missing values and of the input time steps are analyzed. The results in Figure 7 and Table 2 show the model performance with and without imputing the missing values in the dataset. The model output shows that, while the model captures the traffic variation well for the imputed dataset, not filling in the missing values degrades the prediction result.

**Figure 7.** Model prediction with and without imputation of missing values.

**Table 2.** Effect of filling missing value.


The effect of the input time steps on the prediction model is investigated with two input windows of 24 h and 168 h. Figure 8 and Table 3 illustrate the data traffic prediction of the CNN-LSTM model using 168 h input time steps compared with the actual data traffic; the model captures the data traffic variation well, including the irregular shapes and sharp edges at both ends. However, this modest performance improvement comes at the expense of computational time: the model with 168 input time steps took longer to train.

**Table 3.** Model performance comparison for 24 and 168 input time steps.


**Figure 8.** CNN-LSTM model prediction output for 168 input time steps.

#### **5. Conclusions**

Due to the increasing demand for mobile data traffic, cellular network capacity is changing continuously, and predictive models have become indispensable for capturing the dynamics of mobile data traffic. In this paper, a deep learning-based model, CNN-LSTM, is proposed for mobile data traffic prediction using multivariate features. The hybrid CNN-LSTM network leverages the power of the CNN model to extract salient features from the complex and nonlinear dataset as well as an LSTM to capture long- and short-term dependencies in the time series data. The study shows the improved prediction capability of the CNN-LSTM model for mobile data traffic demand when multivariate input features are used as compared to univariate features.

Future studies could include investigating the impact of other variants of clustering methods on model performance improvement. Furthermore, incorporating more specific multivariate features such as the amount of spectrum used and RAB attributes such as maximum source data, traffic type, and maximum bit rate might increase model performance and further improve the prediction accuracy.

**Author Contributions:** Conceptualization, B.S.S., E.M. and D.H.W.; methodology, B.S.S., E.M. and D.H.W.; software, B.S.S., E.M. and T.T.D.; validation, B.S.S. and T.T.D.; formal analysis, B.S.S., E.M. and D.H.W.; investigation, B.S.S. and E.M.; resources, E.M. and D.H.W.; data curation, E.M.; writing—original draft preparation, B.S.S. and E.M.; writing—review and editing, E.M. and D.H.W.; visualization, B.S.S. and E.M.; supervision, D.H.W. and S.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** Restrictions apply to the availability of these data. Data was obtained from Ethio Telecom (a Mobile Network Operator) and are available from the corresponding author with the permission of Ethio Telecom.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **An Application of Neural Networks to Predict COVID-19 Cases in Italy †**

**Lorena Saliaj \* and Eugenia Nissi**

Department of Economic Studies, School of Economic, Business, Legal and Sociological Sciences, Università degli Studi "G. d'Annunzio", 65127 Pescara, Italy; eugenia.nissi@unich.it

**\*** Correspondence: lorena.saliaj@unich.it

† Presented at the 8th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 27–30 June 2022.

**Abstract:** The COVID-19 pandemic has become the greatest worldwide threat, as it has spread rapidly among individuals in most countries around the world. This study concerns the problem of weekly prediction of new COVID-19 cases in Italy, aiming to find the best predictive model for the daily infection number in countries with a large number of confirmed cases. We compare the forecasting performance of linear and nonlinear forecasting models using weekly COVID-19 data for the period from 24 February 2020 until 16 May 2022. We discuss various forecasting approaches, including a Nonlinear Autoregressive Neural Network (NARNN) model, an Autoregressive Integrated Moving Average (ARIMA) model, a TBATS model, and Exponential Smoothing on the collected data, and compare their accuracy using the data collected from 23 March 2020 to 20 April 2020, choosing the model with the lowest Mean Absolute Percentage Error (MAPE) value. Since linear models do not easily follow the nonlinear patterns of daily confirmed COVID-19 cases, Artificial Neural Networks (ANNs) have been successfully applied to forecast such nonlinear series. The model has been used for weekly prediction of COVID-19 cases for the next 4 weeks without any additional intervention. The prediction model can be applied to other countries struggling with the COVID-19 pandemic and to any possible future pandemics, and it can also help to make better decisions in the future.

**Keywords:** COVID-19; time series forecasting; NARNN; ARIMA

### **1. Introduction**

The World Health Organization has recognized the COVID-19 virus as a global threat, declaring it a universal epidemic. Predicting COVID-19 infections' future trend is very important, as it has been having a significant worldwide negative impact on economics, medicine, finance, life expectancy, etc. The chance of having data in advance on its spread may enhance public health decision-making, allowing countries to avoid possible future crises by better allocating health resources.

Since 2020, different forecasting models have been proposed for predicting the global or local spread of the pandemic.

In our work, we provide forecasts of the new confirmed COVID-19 cases in the Italian regions, using linear and nonlinear time series forecasting models and comparing their accuracy on the daily reported data. Our aim is to forecast new confirmed COVID-19 cases through a comparison of the performance of these models, so as to have clear expectations of future new cases.

The purpose of our work is to determine the best COVID-19 new cases forecasting model. Several studies try to predict the evolution of the COVID-19 pandemic. Batista [1] predicted the number of cases in China, South Korea, and the rest of the world during the first semester of 2020 using a logistic model. Safi and Sanusi [2] applied an ARIMA model on data collected during the first and second pandemic wave. Khan and Gupta [3]

**Citation:** Saliaj, L.; Nissi, E. An Application of Neural Networks to Predict COVID-19 Cases in Italy. *Eng. Proc.* **2022**, *18*, 11. https://doi.org/ 10.3390/engproc2022018011

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 21 June 2022



chose an ARIMA (1,1,0) model for predicting Indian COVID-19 infection cases considering data that followed a linear trend. Abotaleb and Makarovskikh [4] proposed a combined ARIMA, Exponential Smoothing, BATS, and TBATS hybrid model for data collected until March 2021 in Russia. Gecili et al. [5] proposed an ARIMA model for American and Italian data collected from February 2020 until April 2020. Salaheldin and Abotaleb [6] chose the exponential growth model for daily COVID-19 forecasting in China, Italy, and USA.

In this paper, we aim to choose the best model among the most widely known time series forecasting models. Since the COVID-19 new-cases curve follows a nonlinear trend, and considering that we have collected more recent pandemic data, this work emphasizes the importance of using nonlinear methods for modeling these time series, as classical linear models would not be able to identify the traits of nonlinear time series and, subsequently, would not give reliable predicted values. We have considered data from the beginning of the spread of the pandemic in Italy (24 February 2020) to 16 May 2022 in the Italian regions, months which, according to previously proposed forecasting models, were expected to be quiet months from the point of view of the spread of the pandemic.

#### **2. Materials and Methods**

In this work, we considered data published online by the Superior Health Institute (Istituto Superiore di Sanità) Epidemiology for Public Health portal, related to COVID-19 infection cases in the Italian regions for the period between 24 February 2020 and 16 May 2022, considering:


The forecasting was conducted through the R package "forecast", which provides methods and tools for forecasting univariate time series. We implemented an ARIMA model, an NNAR model, a TBATS model, and Holt's linear model, and chose the best model according to the Mean Absolute Percentage Error (MAPE) of each of them, defined as follows:

$$\text{MAPE} = \frac{1}{\text{n}} \sum\_{t=1}^{\text{n}} \left| \frac{\text{A}\_{\text{t}} - \text{F}\_{\text{t}}}{\text{A}\_{\text{t}}} \right| \tag{1}$$

where, n is the total number of observations, At is the actual value at time t, and Ft is the forecast value at time t.

#### *2.1. ARIMA Model*

The first model is ARIMA (Auto-Regressive Integrated Moving Average), which is the most common linear model for time series forecasting. It represents a time series as a function of its own lagged values and lagged forecast errors in order to forecast future values. An ARIMA model is composed of three terms: *p*, *d*, *q*:

$$y_t = \varphi_0 + \varphi_1 y_{t-1} + \varphi_2 y_{t-2} + \dots + \varphi_p y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q} \tag{2}$$

where *p* is the order of the Auto-Regressive (AR) term and refers to the number of lags of *y* used as predictors; *q* is the order of the Moving Average (MA) term and refers to the number of lagged forecast errors used as predictors; and *d* is the order of differencing required to make the time series stationary.

Although ARIMA is widely used for time series analysis, it is not easy to choose appropriate orders for its components, so we proceeded to determine them automatically, using the *auto.arima* function to obtain the best ARIMA model for each region (Table 1).
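
The paper relies on R's `auto.arima`; for readers working in Python, a comparable (but not identical) automatic order search is available in the pmdarima package. The snippet below is an illustrative analogue, with a synthetic series standing in for a region's case counts.

```python
import numpy as np
import pmdarima as pm

cases = np.abs(np.random.randn(400)).cumsum()      # stand-in for one region's new-case series
model = pm.auto_arima(cases, seasonal=False, stepwise=True,
                      suppress_warnings=True, error_action="ignore")
print(model.order)                                  # the (p, d, q) order selected for this series
forecast = model.predict(n_periods=30)              # 30-steps-ahead forecast
```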


**Table 1.** The chosen ARIMA model for each region.

The model estimation concerns the use of statistical techniques to derive the coefficients that best fit the chosen ARIMA model. Once the model was identified and the parameters were estimated, it was used for forecasting. The model is checked using statistical tests and residual plots, which can be used to analyze how well the various models fit the historical data.

#### *2.2. TBATS Model*

The TBATS (Trigonometric Exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend, and Seasonal component) model uses a combination of Fourier terms with an exponential smoothing state space model and a Box-Cox transformation, in an automated manner. The unit of time used in modeling was day.

#### *2.3. Holt's Linear Trend*

This model includes a prediction equation and two smoothing equations. It uses double exponential smoothing parameters to forecast future values: The first parameter is used for the overall smoothing, while the other for the trend smoothing equation. We obtained the current value considering the adjusted last smoothed value for the last period's trend and updated the trend over time, expressing it as the difference between the last two smoothed values.

Holt's forecast equation:

$$\hat{y}_{t+h|t} = l_t + h b_t \tag{3}$$

where

$$l_t = \alpha y_t + (1 - \alpha)(l_{t-1} + b_{t-1}) \tag{4}$$

indicates the first equation (level equation), while

$$b_t = \beta (l_t - l_{t-1}) + (1 - \beta) b_{t-1} \tag{5}$$

indicates the trend equation, where:

0 ≤ α ≤ 1 is the smoothing parameter for the level and 0 ≤ β ≤ 1 is the smoothing parameter for the trend; *l_t* indicates the smoothed level of the time series at time *t*; *b_t* is the estimated trend of the time series at time *t*.
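
Holt's method as described by Equations (3)–(5) is available, for example, in statsmodels; the sketch below fits the two smoothing parameters automatically on a synthetic series (the paper's own computations were done in R, so this is only an analogue).

```python
import numpy as np
from statsmodels.tsa.api import Holt

cases = np.abs(np.random.randn(200)).cumsum()   # stand-in for one region's new-case series
fit = Holt(cases).fit()                         # alpha (level) and beta (trend) estimated from the data
forecast = fit.forecast(30)                     # y_hat_{t+h|t} = l_t + h * b_t, Equation (3)
```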

#### *2.4. ANN Model*

Artificial Neural Network forecasting models are nonlinear models inspired by biological neural networks that identify and model nonlinear relationships between variables. They are composed of a collection of neurons, grouped into input, hidden, and output layers, and they map a set of inputs into a set of output variables through the hidden layers of neurons. Their ability to learn from a training procedure and from previous examples makes them a powerful forecasting tool, and they can analyze new data based on previous results.

An ANN is composed of several layers:


In our study, the NAR network was developed using the "nnetar" function of the R "forecast" package [7], developed by Hyndman and co-authors, which fits a neural network model to a time series. An NNAR (p,k), where p indicates the number of non-seasonal lags used as inputs and k the number of nodes in the hidden layer, can be described as an AR process with nonlinear functions. Considering the traits of the new COVID-19 cases trend for the Italian regions, we chose a (28-5-1) network, with 28 lags as input nodes and 1 hidden layer with 5 nodes. It has the form of a feedforward three-layer ANN, where neurons have a one-way connection with the neurons of the next layer. The data set was divided into a training set (70%) and a testing set (15%), while the last 8 days of data were used for validation.
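
The NNAR(28,5) network amounts to a small feedforward net fed with 28 lagged values. A hedged Keras sketch of such a network is given below; note that R's `nnetar` averages several networks and uses its own defaults, so this is an analogue rather than a reimplementation.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def lag_matrix(series: np.ndarray, p_lags: int = 28):
    """Build (samples, p_lags) inputs and next-value targets from a univariate series."""
    X = np.array([series[i:i + p_lags] for i in range(len(series) - p_lags)])
    y = series[p_lags:]
    return X, y

def build_nnar(p_lags: int = 28, hidden: int = 5) -> tf.keras.Model:
    """Feedforward analogue of NNAR(28,5): 28 lag inputs, one hidden layer of 5 nodes, one output."""
    model = models.Sequential([
        layers.Input(shape=(p_lags,)),
        layers.Dense(hidden, activation="sigmoid"),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```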

The forecasting performance of all these models was evaluated using the Mean Absolute Percentage Error (MAPE), while the model fits were evaluated using AIC (Akaike Information Criterion), reported in Table 2.


**Table 2.** MAPE (%) for forecasting models' accuracy.

Selection and accuracy measures for the forecasting models are reported in Table 2. MAPE was used to measure the performance of the models. We chose the best forecasting model according to the MAPE value, as it is recommended as an accuracy comparing unit when using different methods on a time series, considering as the most accurate model the one with the lowest MAPE value.

In addition to what can be clearly seen in the graphs, the values in the table show that the ANN model gives more accurate forecasting values than the other, linear forecasting models for every region. According to MAPE, the ANN improved the forecasting accuracy compared with ARIMA, TBATS, and Holt's model.

The NARNN model gives better results in almost all the considered regions, with a considerable difference from the indicators of the other models. ANN model has the lowest MAPE for the considered period for all the regions, improving the forecasting performance up to 36.47%, considering Campania. TBATS model has the highest MAPE values for the considered period, indicating that it cannot appropriately follow our data's traits.

In Table 3 we present the MAPE value for the last 6 days data, considered as testing data. Once again, we can observe that the ANN model is the best for forecasting COVID-19 new cases in Italian regions. This fact confirms once again our assumption about choosing the best model for our time series, considering the nonlinear trend our data follow.


**Table 3.** MAPE (%) for 6 days' accuracy of ANN forecasting model for Italian regions.

We performed the forecasting of new COVID-19 cases in the Italian regions using the above models. We conducted a 30-days-ahead forecast (until 16 June 2022) and compared the forecast data with the testing data for 6 days (11 May 2022–16 May 2022), applying the forecasting models to the confirmed cases of the last 8 days of data and comparing the results with the actual COVID-19 data. We calculated the MAPE values from the differences between the actual data and the forecast values. The MAPE values for the ANN forecasting model are reported in Table 3. Based on our analysis, we concluded that the models' predictions were similar to the real data. In particular, the ANN model gave more accurate predictions, as its MAPE values were lower than those of the other models. We observed decreasing MAPE values, in particular for the last 6 days' testing values, as its values decreased from about 7% to 1%. Higher MAPE values were observed for the other predictive models. ARIMA had a worse predicting performance for the first 3 days and the last day, while TBATS was the worst forecasting model when comparing the 6 days' training data MAPE values.

Figure 1 presents the forecasting results of ANN model for the following 30 days for COVID-19 new confirmed cases in Italian regions. The NARNN model values follow very well the time series' trend thanks to the training process, which enables the model to better understand the time series' features. Once trained, the ANN decides itself on the importance of the variables, as it keeps learning continuously, performing quite well with unfamiliar data, thanks to its ability to work with multiple parallel inputs, as well as nonlinearity and plasticity in finding the most suitable model for time-series forecasting [8].

**Figure 1.** Daily COVID-19 Italian regions' new cases prediction with ANN model: (**a**) Abruzzo; (**b**) Basilicata; (**c**) Calabria; (**d**) Emilia-Romagna; (**e**) Lazio; (**f**) Liguria; (**g**) Lombardia; (**h**) Marche; (**i**) Molise; (**j**) Campania; (**k**) Piemonte; (**l**) Puglia; (**m**) Sardegna; (**n**) Friuli-Venezia Giulia; (**o**) Sicilia; (**p**) Toscana; (**q**) Trentino Alto-Adige; (**r**) Umbria; (**s**) Valle D'Aosta; (**t**) Veneto.

Figure 1 shows the trend of the number of new cases predicted by the ANN model for each region, obtained considering 28 lags as inputs and 5 nodes in the hidden layer. From the results obtained by the predictions of NARNN model, we can say that this model's predictions of the new COVID-19 confirmed cases are closer to the observed time series values. This is also emphasized by the value of MAPE for the test set, which is much lower than other forecasting models' MAPE values. According to the ANN (28-5-1) model, there will be an increasing trend in the number of new COVID-19 infections by the end of May, until 16 June in the following regions: Abruzzo, Basilicata, Calabria, Lazio, Liguria, Molise, Piemonte, Trentino Alto-Adige, and Veneto, while for the rest of them the ANN model predicted a constant to decreasing trend for the next 30 days.

#### **4. Discussion**

In this work, we evaluated four different time series forecasting models for predicting the Italian regions' daily confirmed new COVID-19 cases. Using various models allowed us to compare their forecasting accuracy and make an optimal selection. For our time series, the ANN model was preferred over the other, linear forecasting models. It was chosen based on the MAPE value, as it had the lowest value among all the forecasting models. The ANN (28-5-1) model gives better results on all the considered indicators, with a considerable difference from the indicators of the other, linear models. It predicted an increase in the number of new COVID-19 infections by the end of May 2022 in almost all the Italian regions. The results are valid for a short period of time because, in the long run, they can be influenced

by other factors such as vaccination, immunization of the population, and measures taken by government authorities to limit the spread of the infection, etc.

The above-considered models can be implemented on new data as they become available for possible future COVID-19 new confirmed cases forecasting, in order to improve forecasting accuracy, maybe taking into consideration other patients' parameters as possible inputs for the ANN model, since additional data would improve forecasting performance. Predictions about possible future new cases would be very helpful for the allocation of medical resources, handling the spread of the pandemic, and getting more prepared in terms of health care systems. People that deal with decision-making could find it very helpful for future projections regarding intervention for reducing and controlling the spread of the infection.

**Author Contributions:** Conceptualization, E.N. and L.S.; methodology, E.N.; software, L.S.; validation E.N.; formal analysis, E.N. and L.S.; investigation, L.S.; resources, L.S.; data curation, E.N.; writing—original draft preparation, L.S.; writing—review and editing, E.N.; visualization, E.N.; supervision, E.N. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data are available at: https://www.iss.it/ (accessed on 5 May 2022).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **Relationship between Stationarity and Dynamic Convergence of Time Series †**

**Gerardo Covarrubias and Xuedong Liu \***

Faculty of Higher Studies Aragon, National Autonomous University of Mexico, Ciudad Netzahualcóyotl 57000, Mexico; gerardo\_covarrubias\_lopez@hotmail.com

**\*** Correspondence: xdong@comunidad.unam.mx; Tel.: +52-1-5626293891

† Presented at the 8th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 27–30 June 2022.

**Abstract:** In economic science, when a stochastic process is studied from an econometric approach, it focuses on stationarity and the reference approach to dynamic equilibrium is ignored. Although both approaches are theoretically closely linked through the common purpose of making forecasts, in existing methodologies they are studied in a mutually exclusive way. Therefore, in this paper the consistency between dynamic equilibrium and stationarity is analyzed. In practice, the theoretical correspondence between these two important properties could present some inconsistency due to misspecification in an autoregressive model or the presence of spuriousness, generated by the components of the series.

**Keywords:** time series; dynamic convergence; stationarity; unit root

**Citation:** Covarrubias, G.; Liu, X. Relationship between Stationarity and Dynamic Convergence of Time Series. *Eng. Proc.* **2022**, *18*, 12. https://doi.org/10.3390/ engproc2022018012

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 21 June 2022



#### **1. Introduction**

In recent years, economists have been mainly interested in the study of sequences of random variables that change over time and give rise to stochastic processes, in which, from the econometric approach, stationarity, as a particular state of equilibrium, is considered an elementary assumption in the specification of a model whose purpose is forecasting. On the other hand, if stochastic processes are studied from the dynamic stability approach, the fundamental assumption is convergence, which determines the behavior of the time series through the construction of difference equations.

Although both approaches are theoretically linked, forecasting studies are carried out in different ways.

In this regard, when time series are analyzed, the estimated models aim to capture the dynamic structure of the series in order to make a forecast as an extrapolation of the stochastic process or time series [1–4]. The study therefore focuses on understanding the predictable components and time lags in the dynamic analysis, which is based on the estimation of stochastic difference equations.

From a theoretical point of view, it is accepted that there is a correspondence between dynamic convergence and stationarity, since both properties are a key point in the forecast of a realization; that is, a time series is stationary if and only if its dynamics are stable or convergent to equilibrium. For this reason, this research aims to open a line of empirical research on this correspondence.

It should be noted that in all the textbooks on the subject it is assumed that stationarity is a sufficient condition to estimate an autoregressive type model and as a result the projection of a realization is obtained.

In this regard, if the assumption of ergodicity is adopted, there is uncertainty or ambiguity in the estimation due to the high degree of complexity involved in building the probability distribution structure of a stochastic process. Consequently, when doing an empirical study, it is likely that both properties are not consistent, an issue that has not been addressed in the conventional literature.

A solution to resolve the inconsistency may be to extract the components of the series and model only the irregular component.

#### **2. Theoretical and Conceptual Approach on the Stationarity–Convergence Relationship**

From the econometric point of view, when we talk about stationarity, the first difference of a time series is fundamental (Equation (1)), which from the dynamic stability approach represents an equation in differences. This implies that the problem of forecasting a realization in econometric terms is widely linked to the formulation and estimation of difference equations from the dynamic stability approach.

The following equation represents a stochastic process without intercept:

$$
\Delta y\_t = y\_{t+1} - y\_t \tag{1}
$$

In this way, if the series is first order stationary, it should converge towards the dynamic stability of the process, and vice versa.

#### *The Importance of Stationarity and Convergence in a Stochastic Process*

Stationarity is a particular state of statistical equilibrium and is a fundamental part of the study of time series [5]. The probability distributions corresponding to these series remain stable over time, because, for any subset of the time series, the joint probability distribution must remain unchanged [6]. Furthermore, the fact that a series is stationary implies that, if the system is impacted by some type of shock, it will adjust to equilibrium again [7], or the consequences of the impact will gradually disappear; that is, a shock in period *t* will have a small effect at time *t* + 1 and a smaller one at *t* + 2, and so on [8].

Because classical linear regression presumes that the probability distribution does not vary, if the time series is non-stationary, the results derived from classical linear regression are invalid or simply may not have theoretical meaning, which is known as spurious correlation [9].

To solve this problem, as defined by Granger and Newbold [10] based on the study by Yule [11], the application of cointegration models and error correction mechanism (ECM) for the long term emerged, applying the Engle and Granger [12] cointegration model for equations with two or more variables and the Johansen [13,14] model for systems of equations with autoregressive vectors. In these cases, it should be noted that, if the series are not stationary, they must be adjusted by differentiation, and with the same order to guarantee cointegration. Consequently, when we analyze the time series or stochastic process, it is of vital importance to determine stationarity by applying tests such as Phillips– Perron [15] and Augmented Dickey–Fuller [16], which are explained below.

$$Y\_t = \rho Y\_{t-1} + u\_t \tag{2}$$

If *ρ* = 1, it is a random walk model without intercept with the presence of a unit root; that is, the series is not stationary. In contrast, if |*ρ*| < 1, the series is stationary and also convergent.

In this regard, Gujarati and Porter [17] pose an interesting question: why not simply regress *Yt* on its lagged value over a period *Yt*−<sup>1</sup> and find that *ρ* is statistically equal to 1?

If this happens, *Yt* would not be stationary. Nevertheless, it is not convenient to estimate the regression by ordinary least squares (OLS) and test the hypothesis that *ρ* = 1 through the usual t-test, since this test has a very marked bias in the case of a unit root; for this reason, Equation (2) is manipulated by subtracting *Yt*−<sup>1</sup> from both sides simultaneously to obtain Equation (3).

$$Y\_t - Y\_{t-1} = \rho Y\_{t-1} - Y\_{t-1} + u\_t$$

$$\Delta Y\_t = (\rho - 1)Y\_{t-1} + u\_t$$

$$\Delta Y\_t = \delta Y\_{t-1} + u\_t \tag{3}$$

Therefore, the regression is not estimated on Equation (2), but on Equation (3), and the null hypothesis *δ* = 0 is tested.

As Equation (2) is a first-order difference equation, it is required that it meets the stability condition |*ρ*| < 1 as mentioned [4].
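
To make the procedure concrete, the following Python sketch simulates a stationary AR(1) process, estimates δ from the regression in Equation (3), and applies an augmented Dickey–Fuller test; the paper's own computations were carried out in EViews, so this is only an illustration.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(0)
y = np.zeros(300)
for t in range(1, 300):                          # AR(1) with |rho| < 1: stationary and convergent
    y[t] = 0.4 * y[t - 1] + rng.normal()

# Regression of Equation (3): Delta Y_t = delta * Y_{t-1} + u_t
dy, y_lag = np.diff(y), y[:-1]
delta = sm.OLS(dy, y_lag).fit().params[0]
print("delta =", delta, "-> rho =", 1 + delta)   # delta should be close to -0.6, i.e., rho close to 0.4

stat, pvalue, *_ = adfuller(y)                   # H0: the series has a unit root
print("ADF p-value:", pvalue)                    # a small p-value rejects the unit root (stationary series)
```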

From the dynamic stability approach, the solution of a random walk model without drift seen as a difference equation is:

$$y_t = A b^t \tag{4}$$

where *A* = *y*0, *b* = *a*0.

Thus, the path described by Equation (4) depends on the value of *b*. In this case, if the complementary function tends to zero as *t* grows indefinitely, the equilibrium is dynamically stable; that is, the trajectory expressed in Equation (4) must be analyzed as *t* increases, as shown in Table 1 [18].

**Table 1.** Trajectory behaviors.


In the same way, derived from Equation (4), when stipulating the convergence condition of the time trajectory *y_t* towards the equilibrium *y_p*, we must rule out the case *b* = ±1, which means that there is a unit root from the econometric point of view.

In general terms, the necessary and sufficient condition for a trajectory to be convergent towards equilibrium is |*b*| < 1, in agreement with the stationarity condition.

#### **3. Empirical Evidence of Bias in the Stationarity–Convergence Relationship**

In order to find possible empirical inconsistencies between dynamic convergence and stationarity in a time series, three stochastic processes corresponding to trade between Mexico and China were analyzed to which the unit root test was applied and their convergence was analyzed at equilibrium using first-order autoregressive models.

#### *3.1. Case 1: Trade Balance of Mexico with China: An Explosive Deficit Balance*

In this first case, the trade balance between Mexico and China was analyzed, measured as the difference between Mexican exports to and imports from China, in millions of dollars per year, over the period 1993–2020. Figure 1 shows that the deficit for Mexico systematically increases in an explosive fashion over time, so it is evidently a non-stationary and divergent stochastic process.

**Figure 1.** Trade balance between Mexico and China in millions of USD 1993–2020.

Table 2 shows the results obtained from the augmented Dickey–Fuller (DFA) test using the EViews software, with the *t*-statistic in parentheses.


**Table 2.** DFA test results.

\* *t*-statistic for statistical significance in parentheses.

Case 1 is discarded since *δ* = (*ρ* − 1) = 0.0330, i.e., *ρ* = 1.0330 > 1. For the remaining cases, we can verify that, according to the hypothesis test, the null hypothesis is not rejected; consequently, the series has a unit root. Additionally, despite the fact that the coefficients are negative and greater than −2, they are not statistically significant, so we can conclude that it is a divergent and non-stationary stochastic process, in accordance with what the theory stipulates.

#### *3.2. Case 2: Trade Balance of Mexico with China: An Estimate in Relative Terms*

In this second case, it is proposed to use the trade balance in relative terms; that is, the quotient between imports and exports made by Mexico with its second trading partner on a monthly basis (Figure 2).

**Figure 2.** Monthly fluctuation of trade between China and Mexico, 1993–2020.

Similarly, Table 3 shows the results obtained from the unit root test, with δ coefficients between −2 and 0, which is why none of them is discarded. The model that best describes the trajectory is case 2, according to the Akaike, Schwarz, and Hannan–Quinn criteria, and we obtain *ρ* = 0.3907, which shows that the convergence and stability condition |*ρ*| < 1 is met. Likewise, from the econometric approach, *δ* is statistically significant; in addition, the *p*-value of the test is 0.0000, so the existence of a unit root is rejected. We can then conclude that it is a stationary and convergent stochastic process.


**Table 3.** DFA test results.

\* *t*-statistic for statistical significance in parentheses.

#### *3.3. An Empirical Inconsistency in the Fluctuation of Trade between Mexico and China*

In a third case, the behavior of the trade balance between Mexico and China in the period 1993–2020 was analyzed with annualized data.

In the first instance, if we inspect the graph in Figure 3, we can see that the series is possibly stationary and convergent.

**Figure 3.** Annual fluctuation of trade between China and Mexico 1993–2020.

Table 4 shows that, for all three cases, a negative value of δ greater than −2 was obtained.


**Table 4.** DFA test results.

\* *t*-statistic for statistical significance in parentheses.

As can be seen, the *p* value in all three cases is greater than 0.05; therefore, at 95% confidence, the null hypothesis that there is a unit root is not rejected, thus, the analyzed series is not stationary.

However, if we analyze in detail the dynamic equilibrium in the three cases, the value of δ is negative, greater than −2 and statistically significant for cases 2 and 3, which means that the series in question is convergent.

If we opt for case 3, according to the value of the *t*-statistic and the *p*-value, we can assert that the coefficient *δ* is statistically significant at 95% confidence; likewise, we determine the value of *ρ*, which is 0.4254, implying that the series is convergent and stable. It is important to mention that by choosing case 3, convergence is guaranteed, but around a deterministic trend, and consequently the stability of the series is questioned.

For case 2, the coefficient *δ* is also statistically significant, from which a value of *ρ* of 0.5530 is derived; that is, the convergence and stability of the series is confirmed again. However, when applying the unit root test, it was shown that the series is non-stationary, at both the 99% and 95% confidence levels.

From the econometric approach, in the case of the annual figures, the stability of the series would be ambiguous, since the series is not stationary given that it is not possible to reject the null hypothesis of the unit root test.

#### **4. Concluding Remarks**

When theoretically analyzing the stochastic processes, the results obtained were consistent for the divergent case; however, for the convergent case certain ambiguities were found.

We can then assert that, theoretically, there is a double causal relationship between stationarity and dynamic stability. This statement is of great importance because, in autoregressive models, forecasts of the behavior of the series are the result of estimates made using a stationary series. However, the problem arises when there is no stationarity.

We can then conclude that the non-stationarity of a series could indicate a wrong specification of the model, including a spurious correlation. Thus, when the model is applied for series projections or forecasts, the bias with respect to the true values is not necessarily lost over time. In other words, the possible dynamic convergence that follows from a non-stationary series might not be the closest to the true trajectory; in short, it could be classified as a false dynamic convergence.

Let us also consider that, within the process of determining stationarity, if we adopt the ergodicity assumption, stationarity implies a simplification of various assumptions; in general, that a sample preserves the properties of the population.

In other words, the ergodicity assumption implies that the series is stationary, which induces considerable homogeneity in the stochastic process and reduces uncertainty through the simplification of the assumptions. So, if the analyzed series is not stationary, ergodicity is not satisfied; therefore, the assumptions are not simplified and, consequently, there would be no stability and convergence in the series, so the coefficients obtained from the regression are the result of spuriousness or an erroneous specification of the model.

Finally, since the simplification of the assumptions is not met, the mean and variance of the series vary over time; then, being non-stationary, the ergodic condition is not fulfilled, and such series should not be used for forecasts.

**Author Contributions:** Conceptualization, G.C. and X.L.; methodology, G.C. and X.L.; software, G.C.; validation, X.L.; formal analysis, G.C. and X.L.; investigation, G.C. and X.L.; resources, X.L.; data curation, G.C. and X.L.; writing—original draft preparation, G.C. and X.L.; writing—review and editing, G.C. and X.L.; visualization, G.C. and X.L.; supervision, X.L.; project administration, X.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** PAPIIT Program of DGAPA, UNAM Number IN31102.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Sistema de Consulta de Información Estadística por País. Available online: http://www.economia-snci.gob.mx/sic\_php/pages/estadisticas/ (accessed on 21 February 2021).

**Acknowledgments:** For the preparation of this research, the Postdoctoral Scholarship Program at UNAM and the PAPIIT Program of DGAPA, UNAM are widely appreciated and recognized.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **Partitioning of Net Ecosystem Exchange Using Dynamic Mode Decomposition and Time Delay Embedding †**

**Maha Shadaydeh 1,***<sup>∗</sup>* **, Joachim Denzler <sup>1</sup> and Mirco Migliavacca <sup>2</sup>**


**Abstract:** Ecosystem respiration (Reco) represents a major component of the global carbon cycle. An accurate estimation of Reco dynamics is necessary for a better understanding of ecosystem–climate interactions and the impact of climate extremes on ecosystems. This paper proposes a new data-driven method for the estimation of the nonlinear dynamics of Reco using the method of dynamic mode decomposition with control input (DMDc). The method is validated on the half-hourly Fluxnet 2015 data. The model is first trained on the night-time net ecosystem exchange data. The day-time Reco values are then predicted using the obtained model with future values of a control input such as air temperature and soil water content. To deal with unobserved drivers of Reco other than the user control input, the method uses time-delay embedding of the history of Reco and the control input. Results indicate that, on the one hand, the prediction accuracy of Reco dynamics using DMDc is comparable to state-of-the-art deep learning-based methods, yet it has the advantages of being a simple and almost hyper-parameter-free method with a low computational load. On the other hand, the study of the impact of different control inputs on Reco dynamics showed that for most of the studied Fluxnet sites, air temperature is a better long-term predictor of Reco, while using soil water content as control input produced better short-term prediction accuracy.

**Keywords:** ecosystem respiration; dynamic mode decomposition with control; time delay embedding

#### **1. Introduction**

Carbon losses from ecosystems affect climate change. Ecosystem respiration (Reco), the sum of autotrophic and heterotrophic respiration, represents a major component of the global carbon cycle. Accurate estimation of Reco dynamics is necessary for a better understanding of ecosystem–climate interactions and the impact of climate extremes on ecosystems. This paper proposes a new data-driven method for the estimation of the nonlinear dynamics of Reco using the method of dynamic mode decomposition (DMD), an emerging tool for the analysis of nonlinear dynamical systems.

Ecosystem respiration is typically described as an exponential function of temperature based on thermodynamic laws [1]. This function is defined by certain parameters, such as temperature sensitivity, which are assumed to remain constant. However, several studies have pointed to the dependence of these parameters on other drivers of Reco [2]. This issue is partially compensated for in regression models by the use of temporal moving windows for parameter estimation [3].

The Eddy Covariance (EC) technique is widely used to measure the net ecosystem exchange (NEE) which is the difference between Reco, the total CO2 release due to all respiration processes, and the gross carbon uptake by photosynthesis (GPP). The two CO2 fluxes Reco and GPP are derived from NEE by applying partitioning methods. Recently

**Citation:** Shadaydeh, M.; Denzler, J.; Migliavacca, M. Partitioning of Net Ecosystem Exchange Using Dynamic Mode Decomposition and Time Delay Embedding. *Eng. Proc.* **2022**, *18*, 13. https://doi.org/10.3390/ engproc2022018013

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 21 June 2022



deep learning-based methods have been proposed for modeling Reco dynamics [4,5] using EC measurements of night-time NEE, when photosynthesis, and therefore GPP, is assumed to be 0. These approaches provide data-driven, equation-free estimates of Reco with the flexibility to include other meteorological and biological drivers affecting Reco during the daytime to achieve the NEE partitioning task. In spite of their improved performance compared to state-of-the-art empirical methods, they require a sufficient amount of training data as well as extensive tuning of the deep networks' hyperparameters. Accordingly, the trained model cannot take into account some short-term variations in ecosystem respiration.

#### **2. Methods**

The Koopman operator [6] enables the transformation of finite-dimensional nonlinear system dynamics into an infinite-dimensional linear dynamical system. Finding the eigenfunctions of the Koopman operator, however, remains a major obstacle to its implementation. DMD is a simple numerical algorithm that approximates the Koopman operator with a best-fit linear model that advances measurements from one time step to the next [7,8]. It is an equation-free system identification method where the underlying dynamics of the system are learned from snapshots in time of measurement data. DMD decomposes the system dynamics into temporal modes, whereby each mode represents a fixed oscillation frequency and decay/growth rate. It has been extended to deal with dynamical systems with exogenous control input (DMDc) [9].

In this paper, we propose a new data-driven yet physics-aware method for the dynamical modeling of Reco, which can serve as an NEE partitioning method. The proposed approach is based on using DMDc in a sliding temporal window approach. The system state Reco is represented by a linear dynamical model with an autonomous component in addition to an exogenous component, which is a function of the control input. The control input to the system can be soil or air temperature (Tair), in accordance with the thermodynamic law, or any other observed driver such as soil water content (SWC). Such modeling of Reco dynamics makes it possible to disentangle the exogenous effect of the control input, e.g., Tair, from the autonomous dynamics of Reco, and hence to intervene on the control input to study its effect on the system. To deal with unobserved drivers of Reco other than temperature or any user-chosen control input, we make use of time-delay embedding (TDE) of the history of the system's state and control input. According to Takens' theorem [10], such an embedding guarantees, under certain conditions, that the model will reproduce the trajectories of the original system. TDE in DMDc has been shown to facilitate the treatment of nonlinear systems with linear models [11]. Another advantage of using TDE in the proposed method is that it allows Reco dynamics to be learned from short data records, as the lagged measurements compensate for the limited number of snapshots. This is relevant as it enables ecosystem forecasts that take into account short-term variations in the system dynamics.
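
For intuition, a stripped-down version of the fitting step is sketched below: the operator [A B] is obtained by a least-squares fit of the next-step snapshots on the stacked (delay-embedded) state and control input. The full DMDc algorithm of [9] additionally uses truncated SVDs for dimensionality reduction, and the embedding depth and synthetic data here are placeholders.

```python
import numpy as np

def delay_embed(x: np.ndarray, d: int) -> np.ndarray:
    """Stack d lagged copies of a 1-D signal into a (d, len(x)-d+1) state matrix (time-delay embedding)."""
    return np.vstack([x[i:len(x) - d + 1 + i] for i in range(d)])

def dmdc_fit(X: np.ndarray, U: np.ndarray):
    """Least-squares DMD with control: find A, B such that x_{k+1} is approximately A x_k + B u_k."""
    X1, X2 = X[:, :-1], X[:, 1:]            # snapshots at steps k and k+1
    Omega = np.vstack([X1, U])              # stacked state and control input
    G = X2 @ np.linalg.pinv(Omega)          # [A B] via the pseudo-inverse
    n = X.shape[0]
    return G[:, :n], G[:, n:]

reco = np.random.rand(500)                  # stand-in for night-time Reco (from night-time NEE)
tair = np.random.rand(500)                  # stand-in for the control input (air temperature)
d = 8                                       # embedding depth (placeholder)
X = delay_embed(reco, d)
U = delay_embed(tair, d)[:, :-1]            # control aligned with the snapshots at step k
A, B = dmdc_fit(X, U)
X_pred = A @ X[:, :-1] + B @ U              # one-step-ahead reconstruction of the embedded state
```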

#### **3. Results**

We used the half-hourly EC Fluxnet 2015 data [12] measured at multiple Fluxnet sites with different vegetation types and average temperatures to investigate the impact of different control inputs, e.g., Tair and SWC, on Reco dynamics. The model is trained on night-time NEE, which is assumed to provide the ground-truth values of night-time Reco. The day-time Reco values are then predicted using the obtained model with future values of the control input. The method is validated on short-term and long-term Reco forecast periods with different control inputs. The obtained results indicate that: (1) the performance of the proposed method is comparable to the recently proposed deep learning-based NEE partitioning methods, yet it has the advantages of being a simple and almost hyperparameter-free method with a very low computational load; (2) the use of TDE facilitates learning Reco dynamics from very short data records, i.e., as little as one night of NEE samples; (3) for most of the studied Fluxnet sites, Tair is a better long-term predictor of Reco, while using SWC as the control input produced better short-term forecast accuracy.

**Author Contributions:** Conceptualization, M.S., J.D. and M.M.; methodology, M.S.; writing—original draft preparation, M.S.; writing—review and editing, M.S., J.D. and M.M.; funding acquisition, M.S. and J.D. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research is funded by the Carl Zeiss Foundation within the scope of the program line "Breakthroughs: Exploring Intelligent Systems" for "Digitization — explore the basics, use applications" and the German Research Foundation (DFG) grant SH 1682/1-1.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** We used the Fluxnet 2015 datasets publicly available at https://doi.org/10.1038/s41597-020-0534-3, accessed on 15 April 2022.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **An Ordinal Procedure to Detect Change Points in the Dependence Structure between Non-Stationary Time Series †**

**Alexander Schnurr <sup>1</sup> and Svenja Fischer 2,\***


**Abstract:** The presence of non-stationarity can have crucial effects on statistical tests on correlation between two or more data sets. We present a procedure to detect changes in the strength of dependence between data sets that is based solely on a comparison of the ordinal structures within a moving window and thus compares the up-and-down behavior only. Hence, it is not distracted by changes within a single data set, such as change-points and trends or even nonlinear transformations, leading to non-stationarity. The applicability of the method is demonstrated for a hydrological data set of runoff time series which are impacted by a reservoir. It is demonstrated that the method overcomes problems of classical methods when non-stationarity is present.

**Keywords:** ordinal patterns; structural breaks; non-stationary time series; hydrological data

**Citation:** Schnurr, A.; Fischer, S. An Ordinal Procedure to Detect Change Points in the Dependence Structure between Non-Stationary Time Series. *Eng. Proc.* **2022**, *18*, 14. https:// doi.org/10.3390/engproc2022018014

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 21 June 2022



#### **1. Introduction**

In various applications, the co-movement of different data sets is a desirable property. More often, it is a source of risk:


In all of these cases, it is important to monitor the strength of the interdependence between the data sets. This interdependence or co-movement is often modeled using the mathematical concept of correlation. However, sometimes it is not clear whether the time series admit second moments. Hence, it might be the case that correlation between the random variables (of which the time series consist) is not even defined. Another drawback of this mathematical concept is that it is well known to measure mostly *linear* dependence whilst the dependence in our applications could be far from being linear.

Our aim in the present article is to detect changes within the dependence structure between data sets. Since co-monotonic behavior might be a source of risk, we would like to detect whether this becomes (at a certain point in time) stronger or weaker. An additional difficulty arises since some of the data examples we have in mind are known to be non-stationary. Usually, tests for change points in the dependence structure require that the single time series *are* stationary (cf. e.g., [1,2]). If this was not the case, the tests would falsely detect a change in the dependence, which in fact might be a change in one of the single time series while the dependence between the time series remains constantly strong.

Due to climate change, the demand for methods that can detect changes in the dependence structure in non-stationary settings has increased. Several data sets (like temperature) which were known to be stationary (up to some seasonal components) have started to increase systematically. Sometimes, it is not known how strong the incline of the new deterministic component is. Even worse, the incline itself is subject to change and grows ever steeper.

In the following, we will always use the term 'time series' if we consider the mathematical model, that is, a stochastic process $(X_i)_{i \in \mathbb{N}}$ defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. In contrast to this, we write 'data set' for the real or simulated data we are using. Here, and in what follows, we always consider two time series or two data sets at a time. If we want to analyze the dependence of more than two time series or data sets, we consider them pairwise.

Let us start with a simulated example illustrating the above idea. Consider two correlated ARMA(1,1) time series, i.e.,

$$\begin{aligned} X\_i^{(1)} &= \mu\_1 + \phi\_1 X\_{i-1}^{(1)} + \varepsilon\_i^{(1)} - \theta\_1 \varepsilon\_{i-1}^{(1)} \\ X\_i^{(2)} &= \mu\_2 + \phi\_2 X\_{i-1}^{(2)} + \varepsilon\_i^{(2)} - \theta\_2 \varepsilon\_{i-1}^{(2)} \end{aligned}$$

where $\mu_1, \mu_2 \in \mathbb{R}$, $\phi_1, \phi_2, \theta_1, \theta_2 \in (0, 1)$ and $(\varepsilon_i^{(1)}, \varepsilon_i^{(2)}) \sim \mathcal{N}_2(\mu, \Sigma)$ with $\Sigma = \begin{pmatrix} \sigma_1 & \rho \\ \rho & \sigma_2 \end{pmatrix}$. Here, we chose $\mu_1 = 5$, $\mu_2 = 6$, $\phi_1 = 0.7$, $\phi_2 = 0.8$, $\theta_1 = 0.3$, $\theta_2 = 0.2$, $\sigma_1 = \sigma_2 = 1$ and $\rho = 0.9$ to simulate $n = 600$ data points from a short-range dependent time series with a high cross-correlation (Figure 1). Artificially, two change-points were included in the data set: first, an exponential drift $\exp(i/300)$ was added to $X_i^{(2)}$ for $i = 301, \dots, 600$, leaving the first 300 values of $X^{(2)}$ unchanged. Secondly, another 400 data points were appended (third part) by simulating once again two cross-correlated ARMA(1,1) time series $X^{(1)}$ and $X^{(2)}$ with the same parameters as before but $\rho = 0.6$; these were appended to the first two data sets so that each set consists of 1000 data points. Moreover, the mean of the second time series was adjusted to match the exponential drift, that is, $\exp(2)$ was added to $X_i^{(2)}$ for $i = 601, \dots, 1000$. The exponential drift in the second series leads to significant non-stationarity (detected with the Mann–Kendall test for short-range dependent data; [3]). In a way, this is more than just a random simulated example: think of a data set of monthly temperature and the corresponding discharges of a river. While temperature is known to tend to increase in most parts of the world, this has not yet been detected for discharge [4], though both data sets are clearly cross-correlated. Applying classical correlation measures such as Pearson's correlation coefficient, $\rho_P$, or Spearman's Rho, $\rho_S$, leads to a significant drop of the estimated correlation coefficients when the measure is applied to the first and second part of the data. For the first part of the time series, both correlation measures detect significant correlation, with $\rho_P(X^{(1)}_{1:300}, X^{(2)}_{1:300}) = 0.786$ and $\rho_S(X^{(1)}_{1:300}, X^{(2)}_{1:300}) = 0.837$. However, when the part with the exponential drift is also included, significant correlation is no longer detected: $\rho_P(X^{(1)}_{1:600}, X^{(2)}_{1:600}) = 0.201$ and $\rho_S(X^{(1)}_{1:600}, X^{(2)}_{1:600}) = 0.303$. This implies that, although the correlation between both time series is still high, in the sense that they show the same up-and-down behavior, the drift masks this correlation for the classical correlation measures.
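For readers who want to reproduce this toy example, the following minimal numpy sketch simulates the two cross-correlated ARMA(1,1) series with the drift and the appended low-correlation block. It is an illustration only, not the authors' original code; the random seed and the unit-variance innovation covariance are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_arma11_pair(n, rho, mu=(5.0, 6.0), phi=(0.7, 0.8), theta=(0.3, 0.2)):
    """Two ARMA(1,1) series driven by jointly Gaussian innovations with
    unit variances and cross-correlation rho (parameters of the toy example)."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    eps = rng.multivariate_normal(np.zeros(2), cov, size=n + 1)
    x = np.zeros((n + 1, 2))
    for i in range(1, n + 1):
        for k in range(2):
            x[i, k] = mu[k] + phi[k] * x[i - 1, k] + eps[i, k] - theta[k] * eps[i - 1, k]
    return x[1:]  # drop the initial value

part1 = simulate_arma11_pair(600, rho=0.9)
part1[300:, 1] += np.exp(np.arange(301, 601) / 300)   # exponential drift in X^(2)

part2 = simulate_arma11_pair(400, rho=0.6)            # weaker cross-correlation
part2[:, 1] += np.exp(2)                              # level adjustment to match the drift

x1 = np.concatenate([part1[:, 0], part2[:, 0]])       # 1000 observations each
x2 = np.concatenate([part1[:, 1], part2[:, 1]])
```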

**Figure 1.** Simulated cross-correlated ARMA(1,1) series with exponential drift in the middle part of the second time series.

We would like to overcome this problem. To this end, we use so-called ordinal patterns in order to analyze the up-and-down behavior of the data sets (and time series) under consideration.

The most common approach in *ordinal pattern analysis* works as follows: one fixes a small number $d \in \mathbb{N}$, the length of the data windows under consideration. In each window, only the ordinal information is considered. There are various ways to encode the ordinal information of a $d$-dimensional vector in a pattern. Here, we use the following: let $S_d$ denote the set of permutations of $\{1, \dots, d\}$, which we write as $d$-tuples containing each of the numbers $1, \dots, d$ exactly once. By the *ordinal pattern of order d* of the vector $x = (x_1, \dots, x_d)$, we refer to the permutation

$$\Pi(x_1, \dots, x_d) := (\pi_1, \dots, \pi_d) \in S_d$$

which satisfies

$$x_{\pi_1} \ge \dots \ge x_{\pi_d}.$$

When using this definition, it is often assumed that the probability of coincident values within the vector $(x_1, \dots, x_d)$ is zero. Allowing for coincident values would require an additional convention, such as $\pi_{j-1} > \pi_j$ if $x_{\pi_{j-1}} = x_{\pi_j}$ for $j \in \{2, \dots, d\}$, or an approach as in [5].
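As an illustration, a small Python helper that computes the ordinal pattern of a window according to this definition, including the tie-breaking convention just mentioned (a sketch; the 1-based indexing mirrors the notation in the text):

```python
def ordinal_pattern(window):
    """Ordinal pattern (pi_1, ..., pi_d) of a window: the positions sorted so that
    x[pi_1] >= ... >= x[pi_d]; on ties the larger position comes first, implementing
    the convention pi_{j-1} > pi_j when x_{pi_{j-1}} = x_{pi_j}."""
    d = len(window)
    order = sorted(range(d), key=lambda j: (-window[j], -j))
    return tuple(j + 1 for j in order)  # 1-based positions, as in the text

print(ordinal_pattern([2.1, 4.0, 3.5]))  # -> (2, 3, 1): the largest value sits at position 2
```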

Roughly speaking, *ordinal pattern dependence* measures how often we encounter the same patterns at the same time considering our two data sets (cf. Figure 2); the theoretical counterpart is the probability

$$\mathbb{P}(\Pi(X\_1, \ldots, X\_d) = \Pi(Y\_1, \ldots, Y\_d))$$

for the underlying models, that is, the time series we consider.

**Figure 2.** Two data sets with partially co-monotonic behavior.

In fact, one subtracts the probability for the hypothetical case of independence and divides by some normalizing factor in order to obtain values between zero and one in the case of positive dependence. The mathematical definition of ordinal pattern dependence can be found in the subsequent section.

In our simulated toy example above, the ordinal pattern dependence (we have chosen $d = 5$) is not impacted by the exponential drift. In fact, the ordinal pattern dependence (here, we use the sample version $\widehat{\mathrm{OPD}}_d$ defined below Equation (2)) does not change significantly, with $\widehat{\mathrm{OPD}}_5(X^{(1)}_{1:300}, X^{(2)}_{1:300}) = 0.366$ and $\widehat{\mathrm{OPD}}_5(X^{(1)}_{1:600}, X^{(2)}_{1:600}) = 0.300$. However, the change of dependence between both time series in the last part is clearly detected with $\widehat{\mathrm{OPD}}_5(X^{(1)}_{601:1000}, X^{(2)}_{601:1000}) = 0.105$. This change can also be detected by the classical correlation measures $\rho_P$ and $\rho_S$, but less clearly, with $\rho_P(X^{(1)}_{601:1000}, X^{(2)}_{601:1000}) = 0.592$ and $\rho_S(X^{(1)}_{601:1000}, X^{(2)}_{601:1000}) = 0.601$. Recall that, although the up-and-down behavior remains the same in the second part, the classical measures no longer detect this due to the moderate exponential drift in the background.

The origin of the concept of ordinal patterns lies in the theory of dynamical systems [6,7]. They have been used in order to analyze the entropy of data sets [8] and to estimate the Hurst parameter of long-range dependent time series [9,10]. Furthermore, related methods have proved to be useful in the context of EEG data in medicine [11], index data in finance [12] and flood data in extreme value theory [13]. Already in [14], we have suggested using ordinal patterns for the analysis of hydrological time series; changes in the dependence were not considered in that article. However, changes in a *single* data set have been analyzed in [15] via ordinal patterns. Analyzing changes in the dependence between time series or data sets is a topic that has attracted considerable interest over the last few years (cf. [1,2,16]). There exists an approach (cf. [17]) that was developed to detect changes in the dependence structure despite non-stationarity of the single time series, using the concept of local stationarity, but the assumptions are rather technical and make it hard to apply to short time series in practice.

Using ordinal patterns in order to analyze the dependence between time series, instead, has several advantages: we do not need to assume the existence of second moments and the time series do not necessarily have to be stationary. Furthermore, the method is robust against small noise and/or measurement errors and the intuitive concept makes it easy to apply in practice. Since only the order structure is considered, the algorithms are quick.

The notation we are using is more or less standard. $(\Omega, \mathcal{F}, \mathbb{P})$ denotes a probability space in the background. We write $\mathbb{E}$ for the expected value w.r.t. $\mathbb{P}$, while $\mathbb{N}$ denotes the positive integers starting with one.

The paper is structured as follows: first, we describe the methodology in the subsequent section. In Section 3, our method is applied to hydrological data sets. Section 4 consists of a short discussion and an outlook.

#### **2. Methodology**

Let $X = (X_i)_{i \in \mathbb{N}}$ and $Y = (Y_i)_{i \in \mathbb{N}}$ be two time series defined on a common probability space $(\Omega, \mathcal{F}, \mathbb{P})$. We describe our analysis step-by-step:

#### (I) Preparation and first analysis.

Check whether the whole time series are stationary. Possibly even the bivariate time series $(X, Y)$ is stationary; in that case, both single time series are stationary and the dependence structure remains the same over time, and we are done. If the single time series are stationary, but the bivariate one is not, one can use our method or one of those found in the literature (cf. [1,2,16]) to detect changes in the dependency. The most interesting case for us is when the two time series $X$ and $Y$, or at least one of them, are not stationary. Then, the classical methods cannot be applied. Still, one might be interested in the analysis of co-monotonic behavior and in changes of this behavior.

#### (II) Calculate the ordinal pattern dependence of the whole time series.

At first, we have to check whether the two time series admit ordinal pattern dependence.

**Definition 1.** *We define the ordinal pattern dependence between two random vectors* $\mathbf{X} = (X_1, \dots, X_d)$ *and* $\mathbf{Y} = (Y_1, \dots, Y_d)$ *by*

$$\text{OPD}\_d(\mathbf{X}, \mathbf{Y}) = \frac{\mathbb{P}(\Pi(\mathbf{X}) = \Pi(\mathbf{Y})) - \sum\_{\pi \in S\_d} \mathbb{P}(\Pi(\mathbf{X}) = \pi) \mathbb{P}(\Pi(\mathbf{Y}) = \pi)}{1 - \sum\_{\pi \in S\_d} \mathbb{P}(\Pi(\mathbf{X}) = \pi) \mathbb{P}(\Pi(\mathbf{Y}) = \pi)}. \tag{1}$$

This definition of ordinal pattern dependence only considers positive dependence, which is in line with our considerations in the present article. Negative dependence could be included by analyzing the co-movement of **X** and −**Y** or by using an analogous definition taking inverse patterns into account. Usually, one is interested in measuring either positive *or* negative dependence. If one wants to consider both dependencies at the same time, the quantity

$$\text{OPD}\_d(\mathbf{X}, \mathbf{Y})^+ - \text{OPD}\_d(\mathbf{X}, -\mathbf{Y})^+,\tag{2}$$

where $a^+ := \max\{a, 0\}$ for every $a \in \mathbb{R}$, is useful (cf. [18]). For data sets, we use the estimators

$$\frac{1}{n} \sum_{i=1}^{n-d+1} \mathbf{1}_{\{\Pi(X_i, \dots, X_{i+d-1}) = \Pi(Y_i, \dots, Y_{i+d-1})\}}$$

for $\mathbb{P}(\Pi(\mathbf{X}) = \Pi(\mathbf{Y}))$ and, for $\pi \in S_d$,

$$\frac{1}{n} \sum\_{i=1}^{n-d+1} \mathbf{1}\_{\{\Pi(X\_{i}, \dots, X\_{i+d-1}) = \pi\}}$$

for $\mathbb{P}(\Pi(\mathbf{X}) = \pi)$. For $\mathrm{OPD}_d$ in Equation (1), we use these plug-in estimators and denote the resulting sample version by $\widehat{\mathrm{OPD}}_d$.
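A sketch of the resulting plug-in estimator in Python, reusing the `ordinal_pattern` helper from the sketch above and following the $1/n$ normalization of the displayed estimators (an illustration, not the authors' implementation):

```python
from collections import Counter

def opd_hat(x, y, d=5):
    """Sample ordinal pattern dependence of Equation (1) based on the empirical
    pattern frequencies of the two data sets x and y."""
    n = len(x)
    count_x, count_y, coincide = Counter(), Counter(), 0
    for i in range(n - d + 1):
        pat_x = ordinal_pattern(x[i:i + d])
        pat_y = ordinal_pattern(y[i:i + d])
        count_x[pat_x] += 1
        count_y[pat_y] += 1
        coincide += (pat_x == pat_y)
    p_equal = coincide / n
    # estimated probability of coinciding patterns under hypothetical independence
    p_indep = sum(count_x[p] * count_y[p] for p in count_x) / n ** 2
    return (p_equal - p_indep) / (1.0 - p_indep)
```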

#### (III) Check whether the dependence changes over time.

Using a test statistic of the following kind, we analyze whether the ordinal pattern dependence between the time series remains more or less the same

$$T_{\mathrm{OPD}} = \max_{1 \le k \le n} \left| \widehat{\mathrm{OPD}}_d(X_{1:k}, Y_{1:k}) - \widehat{\mathrm{OPD}}_d(X_{1:n}, Y_{1:n}) \right| \tag{3}$$

which is a CUSUM-type test statistic. In the examples of the subsequent section, we already have a good candidate for the time of the structural break. If this were not the case, the argmax over $k$ could be used to locate this point in time.
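A corresponding sketch of the statistic in Equation (3), reusing `opd_hat` from above; the minimal segment length over which the estimator is evaluated is an assumption, and no critical values are provided here:

```python
import numpy as np

def t_opd(x, y, d=5, min_len=50):
    """CUSUM-type statistic: maximal deviation of the OPD estimate on the growing
    initial segment from the full-sample OPD; the argmax marks the candidate change point."""
    n = len(x)
    full = opd_hat(x, y, d)
    deviations = np.full(n + 1, -np.inf)
    for k in range(min_len, n + 1):
        deviations[k] = abs(opd_hat(x[:k], y[:k], d) - full)
    k_hat = int(np.argmax(deviations))
    return deviations[k_hat], k_hat
```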

#### **3. Results**

As a real-life example, we consider six runoff data sets of mesoscale catchments in Bavaria, Southern Germany. Four of the measuring gauges, Windischeschenbach, Unterköblitz, Münchshofen, and Heitzenhofen, are located at the main river Naab, a large tributary of the Danube river (Figure 3a). Between the Unterköblitz and Münchshofen gauges, the tributary river Schwarzach flows into the Naab. Two gauges are located at the Schwarzach river: Rötz and Warnbach. In November 1975, a large reservoir, the Eixendorfer See, was built at the Schwarzach river, directly downstream of the Rötz gauge. It serves flood protection, low water elevation (water regulation), hydropower generation, and recreation, and, with a storage of 19.30 million m³, it is among the largest reservoirs in Southern Germany. We consider the monthly maximum discharges of the period 1959–2018, which often serve as a basis for flood frequency analyses. Due to their temporal resolution, monthly maximum discharges are usually considered to be short-range dependent [19] and affected by seasonality (Figure 3b).

We apply the methods to analyze the data sets as described in Section 2.

**Figure 3.** (**a**) Map of the location of the gauges considered in the application with Bavarian borders (grey) and location of the zoom-in map in Europe (red); (**b**) auto-correlation of the monthly maximum discharges at Münchshofen gauge.

(I) Preparation and first analysis.

The available discharge data sets are checked for consistency and trimmed to a joint time period between 1959 and 2018, therefore consisting of *n* = 730 data points each. For each data series, the Wilcoxon test for short-range dependent data [20] is applied to test whether there is a significant change-point in the mean. For the gauges Münchshofen and Heitzenhofen at the Naab river, as well as Warnbach at the Schwarzach river, such a significant change-point (5% significance level) was detected in the period from 1975 to 1977. This change-point can be directly related to the beginning of operation of the dam in the Schwarzach, which has a large impact on the mean, since the largest discharge peaks are reduced by the reservoir for flood protection. Therefore, all gauges located downstream of the reservoir have to be considered non-stationary. Hence, the classical tests for structural breaks in the dependence structure cannot be applied.

(II) Calculate the ordinal pattern dependence of the whole time series.

In the next step, we calculate the ordinal pattern dependence between the discharge series of all six gauges as given in Definition 1 with *d* = 5. A similar pattern length was chosen before in the case of hydrological data and proved to be sufficient (cf. [5,14]). One could also think about using *d* = 12 and thus taking into account the annual cycle. However, this would increase the number of patterns and thus the computational effort significantly. Moreover, the annual cycle is not equally strong everywhere, e.g., due to different periods of snowmelt, and thus a general definition could be difficult. We can assume a positive dependence in this application since discharges will most likely increase and fall similarly due to spatial extension of weather patterns such as rainfall.

The ordinal pattern dependence between all six gauges is given in Table 1.

**Table 1.** Ordinal pattern dependence of all six gauges for the time period 1959–2018. Since the OPD is symmetric, only the upper half of the table is shown.


The results reveal that there is a high dependence between the gauges of the main river, decreasing with distance. This can be related to weather patterns such as rainfall events, which are spatially extended and affect several catchments at the same time, as well as to routing. Interestingly, the gauges of the tributary are less strongly correlated, both with each other and with the main river. However, dependency increases for the gauges downstream of the tributary, which is hydrologically reasonable.

(III) Check whether the dependence changes over time.

The commissioning of the dam in 1975 may not only affect the mean of the discharge series. Indeed, it is also likely that the dependence structure between the discharge series of the gauges is altered by it. In particular, it can be assumed that the gauges upstream of the dam, i.e., those not affected by the regulation, have a higher dependency to the gauges downstream after the commissioning of the dam, since the incoming tributary discharge is regulated and smoother and thus does not alter the main river discharge as much as before. The dam may also affect the low dependency between the gauges in the tributary given in Table 1. This hypothesis is tested in the following by applying the ordinal pattern dependence to the time series before as well as after the assumed change-point in 1975 (Table 2). Due to the nature of ordinal patterns, this change-point detection is not affected by the previously detected change in mean.

There are clear differences between the dependency before and after the commissioning of the dam. It is not straightforward to determine whether the differences are significant, since classical tests like Fisher's Z-test cannot be applied. However, it was shown in previous studies that such differences appear to be significant ([14]). Indeed, the dependency between the upstream main river gauges, Windischeschenbach and Unterköblitz, increases or stays the same compared to the period before the commissioning of the dam. The dependency between the gauge upstream of the dam, Rötz, and the gauge downstream of the dam, Warnbach, instead decreases, since the discharge is no longer transferred directly between the gauges, but the dam now regulates the runoff. Instead, the discharge of the downstream gauge is mostly impacted by interflow. We also tested possible change-points in 1976 and 1977, taking into account a possible delayed reaction, but the general tendency of all results stayed the same. The ordinal pattern dependence, therefore, was able to detect the changes in the dependence structure caused by the commissioning of the dam despite the change in mean at the exact same position in the time series.

**Table 2.** Ordinal pattern dependence of all six gauges for the time period 1959–1975 (upper table) and 1976–2018 (lower table).


#### **4. Discussion**

Our method shows how the commissioning of the dam has changed the dependence between the discharge data sets. In a similar way, it could be used on the other frameworks described in the Introduction. If we had not known in advance when the dam was built, we could have analyzed this by calculating the argmax of the test statistic (3) above.

The data analysis shows that ordinal pattern dependence can indeed be used in order to derive changes in the dependence structure between time series. The method is robust, easy to implement and has the advantage that the concept behind it has a simple intuitive meaning: if the ordinal patterns are often the same at the same time points, the up-and-down-behavior of the two data sets is similar.

We have chosen here *d* = 5. This value has proven to be useful in the context of discharge data. Smaller values of *d* are more often used in statistical frameworks, while higher values are used in the analysis of dynamical systems. Finding an optimal choice for *d* is part of ongoing research.

In a way, our toy example and the real-world data analysis of Section 3 serve as a pilot study, showing the applicability of the method. As of now, it is not possible to derive critical values or confidence bands. This is not in the scope of the present paper, since it is very technical and still work in progress: in order to leave classical stationarity behind, a standing assumption would be 'ordinal pattern stationarity'. One needs to show that both data sets might be realizations of ordinal pattern stationary time series. Then, and this is a point missing in the existing literature, one needs to derive limit theorems for, say, a short-range dependent time series on the space of ordinal patterns (which is a discrete space without canonical order). Finally, one might be able to derive exact change point tests in this framework. If the original time series can be assumed to be stationary, limit results like those in [18] could be applied. However, the above example shows how nicely the method can deal with *non-stationary* time series.

**Author Contributions:** Conceptualization, A.S. and S.F.; methodology, A.S.; software, S.F.; writing original draft preparation, A.S. and S.F.; writing—review and editing, A.S. and S.F.; visualization, A.S. and S.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the German Research Foundation (DFG), Grant No. FOR 2416 (research unit SPATE) for Svenja Fischer and Grant No. SCHN 1231-3/2 for Alexander Schnurr.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Publicly available datasets were analyzed in this study. These data can be found here: https://www.nid.bayern.de/abfluss (accessed on 12 May 2022).

**Acknowledgments:** We are grateful to Bayerisches Landesamt für Umwelt, www.lfu.bayern.de (accessed on 12 May 2022), for providing the discharge data. We would like to thank three anonymous referees for their valuable comments.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **On the Prospective Use of Deep Learning Systems for Earthquake Forecasting over Schumann Resonances Signals †**

**Carlos Cano-Domingo <sup>1,\*</sup>, Ruxandra Stoean <sup>2</sup>, Nuria Novas-Castellano <sup>1</sup>, Manuel Fernandez-Ros <sup>1</sup>, Gonzalo Joya <sup>3</sup> and Jose A. Gázquez-Parra <sup>1</sup>**


**Abstract:** The relationship between Schumann resonances and earthquakes was proposed more than 50 years ago; however, the experimental support has not been fully established. A considerable number of recent studies have focused on the relationship between a single earthquake and the variation of the Schumann resonance signal around this earthquake, obtaining preliminary support for the existence of the link. Nonetheless, they all lack a systematic and general approach. In this research, we propose a novel methodology to detect the presence of relevant earthquakes based on the Schumann resonance. The methodology is based on a deep learning framework composed of a pretrained variational auto-encoder followed by an LSTM network and a fully connected layer with a sigmoid output. The results reveal a previously uncovered relationship between earthquake activity and the Schumann resonance signal, making the proposed system the first automatic earthquake detector based on the Schumann resonance signal.

**Keywords:** Schumann resonance; earthquake detection; deep learning; autoencoder; LSTM; RNN; forecasting; dimension reduction

#### **1. Introduction**

Schumann resonances (SR) are extremely low frequency (ELF) electromagnetic signals generated mainly by lightning activity, which propagate along the Earth–ionosphere electromagnetic cavity [1]. The electromagnetic cavity is formed by two electromagnetic media with a high level of conductivity (the Earth and the ionosphere) and an insulator (the atmosphere). The electromagnetic resonance produced has multiple modes, in which the intensity of the higher modes is considerably lower than that of the first modes [2]. Experimentally, the average central frequencies for the first six modes are agreed to be around 7.8 Hz, 14 Hz, 20 Hz, 26 Hz, 33 Hz, and 39 Hz. Despite that, no theoretical characterization of the frequency of the SR modes manages to accurately fit the experimental results. For example, in [3], the authors detailed a novel method for estimating the central values; however, the results do not agree accurately with the experimental worldwide data. Other studies have approached the central frequency estimation using a 3D electromagnetic simulator with very promising results [4,5]. Although the results are consistent with the experimental data, they are not completely equal to them. It is important to remark that all these approaches are centered on the central frequency estimation in a steady condition; the analytical or simulated behavior of the SR frequency variation remains unexplored.

In light of the above, it is clear that there is difficulty involved in the estimation of the SR signal parameters, even for one of the simplest parameters, e.g., the central frequency

**Citation:** Cano-Domingo, C.; Stoean, R.; Novas-Castellano, N.; Fernandez-Ros, M.; Joya, G.; Gázquez-Parra, J.A. On the Prospective Use of Deep Learning Systems for Earthquake Forecasting over Schumann Resonances Signals. *Eng. Proc.* **2022**, *18*, 15. https:// doi.org/10.3390/engproc2022018015

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 21 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

for the first mode in the steady condition. This fact has led this team to introduce deep learning (DL) techniques in order to analyze it.

Over the last five years, one of the most studied aspects of SR has been the diurnal and seasonal patterns of SR and their evolution over time. The importance of this pattern is mainly driven by the fact that many observatories have been set up during these years. The diurnal and seasonal patterns of a UK observatory are fully explained in [6]. They explain the specific frequencies registered by their observatory based on the proximity to African thunderstorm activity. In [7], the researchers detailed this regular pattern and its relation with the most important lightning hot spots during specific hours of the day. Slight differences are also expected between observatories. Some aspects of the electromagnetic composition of the Earth–ionosphere cavity can be explained based on the small differences between observatories. For example, in [1], the authors use two distant stations and focus on the comparison between their spectra and the differences between the central frequency of the first mode. In recent years, the relationship with other natural phenomena has received growing attention; for example, the connection with lightning activity [8,9], the relation with geomagnetic storms in [10], or with a solar proton event in [11]. Due to the complexity of the SR signals, the characteristics of the previously mentioned relationships have not been dealt with in depth, although the promising results point to a likely connection.

In relation to the present research, earthquake (EQ) detection using SR signals has become a central issue in SR research over the last five years. Experiments on detecting individual EQs were performed in the last two years with encouraging results. Some papers have centered their investigation on the relationship between individual events and their impact on the SR signal.

The study of the relation with two huge EQs in Japan is presented in [12]. The results discussed the effect of EQs both in the transient domain and in the frequency domain, along with a compelling theoretical explanation. In [13], the researchers point out the statistical link between frequency and intensity values of the first SR mode around individual seismic events. The work was centered on the study of the SR signal variations around important EQs, fifteen days before and five days after each EQ event. The study of the relation between two important EQs and their effect on the SR signal in Greece can be found in [14]. The authors showed the statistical interconnection between these two EQ events and the variation of the ELF background signal. Although this approach is interesting, it does not allow for establishing a systematic methodology that can detect EQ events using the continuous SR signal.

Other authors have focused on describing the prerequisites and modeling the electromagnetic (EM) interaction between the SR signal and EQs. A model treatment to estimate the EQ magnitude based on the variation of the SR spectrum is fully explained in [15]. The results show substantial evidence for the usage of SR signal variation for detecting EQs with a magnitude above 7 and at a distance up to 3 Mn. A model of the electromagnetic manifestation in the SR band caused by close EQ events is described in [16]. The most promising result that emerges from their research is that nearby EQs provoke noticeable modifications of the SR signal. In [17], the authors studied the prerequisites to record seismic activity in the SR band in the time domain. The results using two distant stations are promising; however, the system does not reach the accuracy needed for EQ prediction. The possibility of forecasting EQs using a narrow transient window is explored in [18], with a machine learning (ML) approach.

To the best of our knowledge, the evidence provided by these studies is not conclusive. The detection of EQ events is mainly performed by the analysis of individual EQs; however, a more general framework would be needed to establish an experimental link between SR and EQs.

It is also important to mention the lack of previous studies exploring the usage of the latest advances in computational intelligence applied to SR signals, which is all the more important given the vast amount of data and the difficulty involved in detecting variations in the regular patterns, as mentioned before.

This research presents a new approach to exploring the possibility of using the time evolution of the SR signal to detect the occurrence of high-intensity nearby EQs. The methodology consists of three parts: a pretrained variational auto-encoder (VAE) that reduces the data dimension, followed by a long short-term memory (LSTM) network that takes into account the temporal trend of the SR encoded data, and a fully connected layer that classifies the interaction based on the LSTM neuron states. The aim of this paper is to use DL advances to design a method for establishing a correlation between the SR signal and EQs.

#### **2. Materials and Methods**

The methodology used in this research is composed of a deep encoder followed by an LSTM for studying the correlation between the EQ and the SR signal. A complete description of the methodology can be seen in Figure 1.

**Figure 1.** Summary of the methodology used, composed of a CNN encoder followed by an LSTM network.

The SR signal is obtained at the ELF Sierra de Los Filabres observatory. The observatory has been recording SR signals continuously since 2016, although, due to maintenance problems, some registers were not captured. This observatory contains two sensors, one for H*NS* and the other for H*EW*. The digital part is composed of an ADC with 24 bits of resolution and 187 samples per second. A detailed description of the observatory can be found in [19]. The data are recorded in 30-min segments. Each segment is processed with the Welch algorithm to extract its spectral information. Finally, each segment is reduced to a 256-length vector, which contains the frequency information from 0 Hz to 42 Hz. For the sake of concreteness, we have selected only the H*NS* sensor. Considering the data from January 2016 to December 2020, the total number of 30-min intervals is 87,697; however, due to the maintenance problems, the actual number of 30-min intervals is 79,281.
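A possible sketch of this preprocessing step with scipy, reducing one 30-min record to a 256-value spectral vector covering 0–42 Hz; the Welch window length and the interpolation onto a fixed frequency grid are assumptions, since the paper does not state them:

```python
import numpy as np
from scipy.signal import welch

FS = 187.0  # ADC sampling rate in samples per second

def segment_spectrum(raw_segment, n_out=256, f_max=42.0, fs=FS):
    """Welch power spectrum of one 30-min H_NS record, resampled onto a
    256-point frequency grid between 0 Hz and 42 Hz."""
    freqs, pxx = welch(raw_segment, fs=fs, nperseg=int(fs * 60))  # one-minute windows (assumed)
    mask = freqs <= f_max
    grid = np.linspace(0.0, f_max, n_out)
    return np.interp(grid, freqs[mask], pxx[mask])
```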

The EQ data have been obtained through a public EQ repository [20]. The total number of EQs with a Richter magnitude greater than 4.5 during the five-year period is 35,023; however, not all EQs have been considered for this research, as the system focuses on detecting the EQs that can have the strongest impact on the SR recorded at our observatory. The EQ events have been filtered by an ad hoc criterion based on the expert knowledge of this group and the previous literature:


By selecting only the EQs that fulfill these criteria, the number of EQs is reduced to 161.

The time step used in this paper is 30 min, due to the resolution of the SR signal. Consequently, each EQ event has been adjusted to its corresponding 30-min time step. The duration of the EQ has been widened to 24 h, which means 48 registers around the actual event. This widening is performed to allow the DL model to learn about the influence of the EQ on the SR signal; therefore, the widened registers can no longer be considered the EQ event itself, but rather the positive values represent the influence of the EQ on the SR register.

To sum up, the ground truth comprises 87,697 registers, of which 5459 are labeled as 1 (meaning that an EQ affected the SR register) and 82,238 are labeled as 0 (no effect).

The data set has been split into two parts:


The first step in our methodology is the usage of a pretrained DL encoder, framed in a VAE methodology. A schematic representation can be seen in Figure 2. The deep encoder is composed of three convolutional layers and two fully connected layers. The first convolutional layer takes a 1D vector of 256 positions and outputs 32 vectors of length 128. The second convolutional layer outputs 64 vectors of length 64. The last convolutional layer converts the input to 64 vectors of length 32. The last part of the encoder is composed of two fully connected layers that take the output of the last convolutional layer and output a code vector of 10 components. The code data set is composed of 87,692 codes of 10 values, out of which 79,281 are the output of the deep encoder plus 8000 zero codes, which correspond to the SR segments lost due to maintenance problems. The main purpose of using the encoder is to focus on the variability of the signal rather than on the common structure of the SR signal. The method selected for this purpose is a VAE, which tries to reconstruct the original signal while reducing the dimension of the input from 256 to 10, retaining the most critical information while removing the records' common part. The term 'variational' refers to the randomization of the learned latent space [21]. The VAE is also reinforced by the Lorentzian fit algorithm in the cost function.
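The following PyTorch sketch illustrates an encoder with the layer widths described above (256 → 32 × 128 → 64 × 64 → 64 × 32 → code of 10). Kernel sizes, strides, activations and the variational sampling details are assumptions; only the output shapes follow the text.

```python
import torch
import torch.nn as nn

class SREncoder(nn.Module):
    def __init__(self, code_dim=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=3, stride=2, padding=1),   # 256 -> 32 x 128
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, stride=2, padding=1),  # 128 -> 64 x 64
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, stride=2, padding=1),  # 64 -> 64 x 32
            nn.ReLU(),
        )
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 32, 128), nn.ReLU())
        self.mu = nn.Linear(128, code_dim)       # variational mean
        self.logvar = nn.Linear(128, code_dim)   # variational log-variance

    def forward(self, x):                         # x: (batch, 1, 256) spectral vectors
        h = self.fc(self.conv(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return z, mu, logvar
```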

**Figure 2.** Deep encoder and data description. The encoder is composed of three CNN layers, which output a code of 10 values for each SR segment.

The next step is an LSTM network, which sequentially takes as input the five-year output of the DL encoder. The LSTM architecture is composed of an LSTM layer with 32 features in the hidden state. The LSTM network is fed with 1440 values, which correspond to 30 days of 30-min segments. After the 1440 sequence registers have been introduced into the LSTM network, the hidden state of the network is processed by a fully connected block made of two layers. The first layer takes 32 values as input and produces a vector of 128 values as output. The last layer processes the 128 values and outputs a single value. In order to obtain a binary output, the last layer is followed by a sigmoid function.

The binary output of the LSTM stage is compared to the SR affectation of the corresponding register. The DL LSTM system is trained to predict the affectation of the EQ in the first 25% of the 1440-value sequences. Figure 3 outlines our LSTM methodology. The prediction is made after 1440 values have been fed to the network (30 days), and the result is compared against the EQ affectation of the 360 registers (7.5 days). In order to add more generalization to the model, an L1 regularization and a dropout of 0.4 have been added. The hyper-parameters of the LSTM network have been selected using an automatic tool.
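A companion PyTorch sketch of this LSTM stage (32-dimensional hidden state, fully connected layers of 128 and 1 units, sigmoid output, dropout of 0.4); where the dropout is placed and the choice of binary cross-entropy with an L1 weight penalty are assumptions:

```python
import torch
import torch.nn as nn

class EQClassifier(nn.Module):
    def __init__(self, code_dim=10, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=code_dim, hidden_size=hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Dropout(0.4),
            nn.Linear(hidden, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, codes):            # codes: (batch, 1440, 10), i.e. 30 days of encoder codes
        _, (h_n, _) = self.lstm(codes)   # final hidden state: (1, batch, 32)
        return self.head(h_n[-1])        # probability that an EQ affects the SR register

# training would combine nn.BCELoss with an L1 penalty on the weights, as described above
```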

**Figure 3.** Diagram of the LSTM network used in this research.

The DL system is trained to predict the affectation of the EQ in the middle of the 1440 sequence. A summary of this methodology can be found in Figure 1.

#### **3. Results**

The results have been obtained after applying the pre-trained deep encoder to the previously mentioned training data set, and the encoded output has been fed to the LSTM network. The LSTM network is trained until a 96.5% accuracy is obtained on the training data set. In Figure 4, the results applied to the training data set can be observed. It can be seen that the system has fully learned the training data and is able to recognize the pattern of the encoded SR evolution. As explained in the methodology section, this approach first uses a training data set for supervised learning and then an extensive test data set to evaluate the performance of the model. This test data set is composed of patterns that have not been presented to the network before. The results can be seen in Figure 5. It can be observed that the system is able to generalize the EQ detection when the test data set is used. There are some clear matches, although at other times the system predicts an EQ while there is no clear reference register. It is also important to notice that, towards the end of the test data set, the system is generally not able to detect the EQ effect.

**Figure 4.** Results of the model applied to the training data set. 1st Row: DL output. 2nd Row: reference.

**Figure 5.** Results of the model applied to the test data set. 1st Row: DL output. 2nd Row: reference.

In Table 1, the confusion matrix for both the training and test data sets is exhibited. The results have to be interpreted carefully because the methodology is based on the assumption that the SR effect lasts around 24 h and that the offset between the SR and the EQ is fixed. This leads to two issues: the DL output is not constrained to produce 48 consecutive "1" values, and the detection does not necessarily start exactly twelve hours before the EQ event. Considering these two factors, even the test results are very promising, with a high detection rate. Table 2 gathers the summary of the results obtained for both data sets. Considering that the number of segments labeled as EQ affectation is around 7%, the system is able to recognize a pattern in more than 18% of cases. It is also important to remark that the balanced accuracy strongly supports the view that this methodology can be seen as a first step in EQ detection research with DL.

**Table 1.** Confusion matrix for training and test data sets.



**Table 2.** Summary of results for training and test data sets.

The criteria used for selecting which EQ events are considered for training purposes are based on earlier literature, as explained previously; however, a possible discrepancy in these criteria may explain the false positive detections of the DL algorithm. In Figure 6, the predicted result is presented along with a larger number of EQ events. The 2nd row shows the distance between the observatory and the EQ, and the 3rd row outlines their magnitude. To expose the possible discrepancies in selecting EQs, more flexible criteria have been used: a Richter magnitude of 5 and a maximum distance of 4500 km. The results show that, for some DL predictions with no correspondence in the expected register, the system nevertheless recognizes the pattern as an SR affectation. This is very clear for the last DL forecast.

**Figure 6.** Results of the model applied to the test data set with more EQ considered. 1st Row: DL output. 2nd Row: distance. 3rd Row: magnitude.

A total of 27 EQs can be recognized from the test data set; each EQ event is composed of a set of at least 15 consecutive "1" values. The midpoint of each consecutive segment has been chosen to evaluate the behavior of these EQ events.

In Figure 7, the minimum time difference between the prediction of the EQ event and all the expected ones can be seen. A window of 50 h can be considered a reasonable limit beyond which it can be ensured that the prediction does not match any expected EQ event. The time difference distribution performs substantially better for values close to the training data set; however, for SR registers far from the training period, the model is not able to predict with the same level of accuracy.

**Figure 7.** Relation between the time difference and the days from the last point of the training data set.

The correlation between time difference and intensity of the EQ event is worth mentioning because it means that the proposed DL methodology is effectively learning to detect EQ events with lower error values when the EQ event is more powerful. In Figure 8, the correlation between the time difference and the magnitude or distance for the EQ events with less than 50 h of time difference can be observed. It is interesting to notice that EQ events with lower magnitude values are detected with more delay than higher values, contrary to our suppositions. On the other hand, the relation between time difference and distance follows our expectation. The prediction of closer EQ events is produced sooner than for the more distant ones.

**Figure 8.** Correlation between the time difference and the magnitude or distance. Points: results; dashed line: linear regression.

Finally, a comparison without using the DL encoder was performed for validation purposes. In Table 3, the results are summarized. It can be seen that the LSTM alone is not enough to detect EQ events using raw SR signals. The balanced accuracy is even worse than pure random guessing.


**Table 3.** Summary of results for the training and test data sets using the LSTM with the raw SR registers.

#### **4. Conclusions**

Our work leads us to conclude that the usage of DL for processing SR signals and for exploring their relationship with other phenomena is more than promising. We have managed to develop a novel methodology for studying the SR signal based on two aspects: dimension reduction using a VAE and the study of the SR variation with a recurrent neural network (RNN).

These results have gone some way towards enhancing our understanding of the relationship between EQ events and the variation of the SR signal. This work has highlighted that the encoding of the SR using a VAE extracts valuable information, at least for detecting changes in the SR pattern. Contrary to most of the literature, in this study we have used the central peak frequency of each of the first six SR modes, which could help explain these unexpected results. Our investigation into this area is still ongoing and seems likely to confirm our hypothesis; however, due to the lack of data from other observatories for comparison, this finding might not generalize to other SR registers. To further our research, we plan to use the information from the two sensors of our observatory and also to obtain SR data from observatories substantially distant from the Sierra de Los Filabres one, so as to enhance the accuracy of our prediction. We will also have to explore new DL architectures, such as the temporal convolutional network (TCN) or the temporal fusion transformer (TFT), in order to improve our EQ detector models.

**Author Contributions:** Conceptualization: C.C.-D., R.S., N.N.-C. and J.A.G.-P.; methodology: C.C.- D., R.S. and G.J.; software: C.C.-D.; hardware: J.A.G.-P. and M.F.-R.; investigation: C.C.-D., N.N.-C. and J.A.G.-P.; resources: J.A.G.-P.; writing—original draft preparation: C.C.-D. and R.S.; writing review and editing: N.N.-C., G.J., M.F.-R. and J.A.G.-P.; visualization: C.C.-D. and R.S.; supervision: J.A.G.-P. and N.N.-C.; project administration: J.A.G.-P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received funding from "Proyecto Puente 2021/001" of the University of Almeria (Spain), I+D+I Project UAL18-TIC-A025-A, the University of Almeria, and the European Regional Development Fund (FEDER). This work was also financed by the Ministry of Economics and Competitiveness of Spain under Project TEC2014-60132-P, in part by Innovation, Science and Enterprise, Andalusian Regional Government, through the University of Almeria, Spain, and in part by the European Union FEDER Program.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We thank the Andalusian Institute of Geophysics; the Electronics, Communications, and Telemedicine TIC019 Research Group of the University of Almeria, Spain; the CIAMBITAL Group; the European Regional Development Fund (FEDER); and the Electronics Technologies Department and the Telecommunication Research Institute (TELMA) of the University of Malaga.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **Probabilistic Forecasting for Oil Producing Wells Using Seq2seq Augmented Model †**

**Hadeel Afifi <sup>1,\*</sup>, Mohamed Elmahdy <sup>2</sup>, Motaz El Saban <sup>2,3</sup> and Mervat Abu-Elkheir <sup>1</sup>**


**Abstract:** Time series forecasting is a challenging problem in the field of data mining. Deterministic forecasting has shown limitations in the field. Therefore, researchers are now more inclined towards probabilistic forecasting, which has shown a clear advantage by providing more reliable models. In this paper, we utilize seq2seq machine learning models in order to estimate prediction intervals (PIs) for a large oil production dataset. To evaluate the proposed models, the Prediction Interval Coverage Probability (PICP), Prediction Interval Normalized Average Width (PINAW), and Coverage Width-based Criterion (CWC) metrics are used. Our results show that the proposed model can reliably estimate PIs for production forecasting.

**Keywords:** time series forecasting; prediction intervals; seq2seq; oil production

#### **1. Introduction and Background**

A time series is a sequence of observations that is captured through time, and forecasting is the process of estimating future trends or values based on present and past values. Time series forecasting has applications in various fields, such as electricity consumption and price forecasts [1,2], wind forecasting [3], temperature forecasting [4], and several other real-life applications [5].

There are two main forecasting methods: deterministic forecasting and probabilistic forecasting [6]. Deterministic forecasting, also known as point forecasting, is the process of predicting a single deterministic value in the future, which is then compared against the real target value. However, deterministic forecasting has shown limitations in the field because no information is available about the dispersion of the actual values around the estimated values, and it is hard to tell by how much the actual value will deviate from the predicted value, which can be especially disadvantageous for complex data. Therefore, probabilistic forecasting is being explored as a forecasting method that offers a potentially substantial improvement over deterministic forecasting by providing more reliable models [7]. In probabilistic forecasting, a range within which the target value should lie is predicted. This range is referred to as a prediction interval (PI).

The goal of time series analysis is to create a model that attempts to describe the behavior of the series and predict its future values. In order to facilitate the inference of information about time series, the series should be transformed to be a stationary one [8]. Moreover, most of the statistical approaches to analyzing time series data require the series to be stationary [9]. A stationary series is loosely defined as a series which has statistical properties that do not vary over time, such as mean and variance. A strict stationary series definition is too restricting; thus, a weaker version is usually used instead [10]. In order

**Citation:** Afifi, H.; Elmahdy, M.; El Saban, M.; Abu-Elkheir, M. Probabilistic Forecasting for Oil Producing Wells Using Seq2seq Augmented Model. *Eng. Proc.* **2022**, *18*, 16. https://doi.org/10.3390/ engproc2022018016

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 21 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

to make the series stationary, we need to remove the trend and the seasonality. A trend represents a varying mean, observed in a series whose values keep increasing or decreasing over time. Seasonality, on the other hand, is a pattern which repeats itself over time and can indicate a varying variance.

Both statistical and deep learning models for time series forecasting have been discussed in the literature. Statistical models such as ARIMA are used for more precise prediction, but need experts and deep domain knowledge with rigorous analysis. On the other hand, deep learning models such as long short-term memory (LSTM) networks require less knowledge in the field and less time in investigation, as there is no need to discover optimal features and parameters for the model [11]. Time series models have also adopted an attention mechanism. Attention was first introduced to solve a machine translation task [12]. The goal of attention was to overcome the shortcomings of recurrent neural networks (RNNs), as they struggle to remember long sequences. This is achieved by retaining the hidden states at each step during decoding. Attention gives more importance to some features over others by assigning weights to the features; a weighted sum is then taken using softmax to obtain the sequence context for each feature.

The application of interest in this study is oil production. Oil is a traditional fossil fuel studied by many researchers. Even with the emergence of renewable energy sources such as wind, oil is still an important factor that affects the economy and plays an important role in energy investment, due to the high risk, long payback period, and other factors that accompany investment in renewable energy [13].

State-of-the-art reservoir engineering forecasting techniques rely on Arps' Decline Curve Analysis (DCA) equations [14]. To estimate oil and gas reserves for current and future wells, DCA has been used as one of the prominent techniques for such a task. Arps divides the well production into two main partitions:


The curve function is summarized as follows:

$$q(t) = \begin{cases} \frac{q\_i}{(1 + bd\_i t)^{\frac{1}{b}}} & t < t\_h \\ q\_h e^{-d\_f t} & t \ge t\_h \end{cases} \tag{1}$$

where $q(t)$ is the oil production rate in barrels/day, $q_i$ is the initial production, $d_i$ is the initial decline in the hyperbolic part of the equation, $t$ is time, and $b$ is the hyperbolic factor controlling the rate of change of the decline. After reaching a certain decline rate $d_f$, the curve is represented by an exponential one using $q_h$, the production reached by time $t_h$.
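A small numpy sketch of Equation (1). The switch time $t_h$ is derived here from the standard hyperbolic decline-rate expression, and the exponential branch is written with the exponent $d_f(t - t_h)$ so that the two branches join continuously at $t_h$; both choices are assumptions, since the paper does not spell them out.

```python
import numpy as np

def arps_rate(t, q_i, d_i, b, d_f):
    """Hyperbolic-to-exponential Arps decline: hyperbolic until the instantaneous
    decline rate d_i / (1 + b*d_i*t) falls to d_f, exponential afterwards."""
    t = np.asarray(t, dtype=float)
    t_h = (d_i / d_f - 1.0) / (b * d_i)              # time at which the decline rate hits d_f
    q_hyp = q_i / (1.0 + b * d_i * t) ** (1.0 / b)
    q_h = q_i / (1.0 + b * d_i * t_h) ** (1.0 / b)   # rate reached at t_h
    q_exp = q_h * np.exp(-d_f * (t - t_h))
    return np.where(t < t_h, q_hyp, q_exp)
```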

Decline curve modeling has been used to predict production data, where the curve is fitted to the data to estimate future points. In [15], different decline models are evaluated, namely the Exponential Decline Model (SEDM) and the Logistic Growth Model (LGM), followed by the Extended Exponential Decline Model (EEDM), the Power Law Exponential Model (PLE), Duong's Model, and the Arps Hyperbolic Decline Model.

In [16], unlike the traditional trend stationarity techniques, a new method was adopted which utilizes the Arps decline curve. The trend found in the oil datasets is removed, utilizing the Arps fitted curve in an attempt to make the series stationary.

In this study, we propose a machine learning model that estimates a prediction interval for a large dataset composed of monthly oil production data of unconventional oil-producing wells. Accurate estimation of prediction intervals can play a critical role in quantifying uncertainty and in supporting investment and divestment decisions. The following sections are organized as follows: Section 2 discusses the experiment details and the setups used. Section 3 describes the data. Section 4 describes the evaluation metrics used to assess the prediction intervals. Section 5 includes the results and their visualization, and presents some insights. We finally conclude in Section 6.

#### **2. Model**

The machine learning model utilized in this study is a sequence-to-sequence (seq2seq) model. seq2seq is an encoder–decoder-based deep learning model. Two LSTM models are used, separated by a repeat vector which repeats its input three times. The model performs multi-step-ahead forecasting, since it forecasts several steps (three) ahead in the future (the future sequence). It is followed by two densely connected layers of 100 units and 1 unit, applied through TimeDistributed layers.

Using the model, we test different setups, which include trend removal and an attention mechanism. Quantile loss is utilized to create the upper and lower bounds of the PIs, as well as the 0.5 quantile (p50).
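A minimal Keras sketch of such a seq2seq model with a pinball (quantile) loss, training one model per quantile to obtain the PI bounds and the p50 forecast. The LSTM width and the optimizer are assumptions; only the RepeatVector/TimeDistributed structure and the dense layer sizes of 100 and 1 units follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

N_IN, N_OUT = 6, 3  # six months of features, three months of targets

def quantile_loss(q):
    """Pinball loss for quantile q."""
    def loss(y_true, y_pred):
        e = y_true - y_pred
        return tf.reduce_mean(tf.maximum(q * e, (q - 1.0) * e))
    return loss

def build_seq2seq(q, units=64):
    inp = layers.Input(shape=(N_IN, 1))
    x = layers.LSTM(units)(inp)                       # encoder LSTM
    x = layers.RepeatVector(N_OUT)(x)                 # repeat the context three times
    x = layers.LSTM(units, return_sequences=True)(x)  # decoder LSTM
    x = layers.TimeDistributed(layers.Dense(100, activation="relu"))(x)
    out = layers.TimeDistributed(layers.Dense(1))(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss=quantile_loss(q))
    return model

lower, median, upper = (build_seq2seq(q) for q in (0.05, 0.5, 0.95))
```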

Trend removal is used to make the series stationary by exploiting the particular trend shape of oil production series. First, the sequence is fitted using a hyperbolic-to-exponential Arps decline curve. Then, trend removal is simply achieved by taking the difference between the original series and the Arps fitted curve, as shown in Figure 1. Regarding the attention mechanism, we implemented a simple attention layer using Keras [17], following the attention mechanism introduced in [12].
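A sketch of this trend-removal step with scipy, reusing the `arps_rate` helper sketched earlier; the initial guesses and parameter bounds are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit

def arps_detrend(t, q):
    """Fit the Arps decline curve to one well's post-peak production and return the
    residual series (production minus fitted curve) together with the fitted parameters."""
    p0 = [q[0], 0.1, 0.9, 0.01]                                   # q_i, d_i, b, d_f
    bounds = ([1e-6, 1e-6, 1e-3, 1e-6], [np.inf, 5.0, 2.0, 1.0])
    params, _ = curve_fit(arps_rate, t, q, p0=p0, bounds=bounds)
    return q - arps_rate(t, *params), params
```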

**Figure 1.** Trend removal by taking the difference between the actual curve and the Arps fitted curve.

#### **3. Data**

Experiments are conducted on a dataset consisting of sequences of production data obtained over successive months for producing wells from all US oil and gas basins. These sequences are represented by the number of oil barrels per day from horizontal oil wells. Only the data after peak production are used, bearing in mind that, in the reservoir engineering domain, the data prior to peak production are typically studied independently of the rest of the data. The total number of wells in our experiment is 60,000, of which 50,000 were used to train the model and 10,000 are withheld for the testing phase. The sliding window technique is leveraged to cover all months and make the input sequence consistent in size: sequences of nine consecutive months are taken, with six months used as features and three as targets. Accordingly, the training set consists of 1,596,240 sequences and the test set consists of 46,386 sequences. We aim to estimate an interval by employing the quantile loss, using the 0.05 quantile for the lower bound and the 0.95 quantile for the upper one to achieve a 90% prediction interval.

#### **4. Evaluation Metrics**

Different metrics are utilized to evaluate the predicted prediction intervals. The most commonly used metrics are Prediction Interval Coverage Probability (PICP) and Prediction Interval Normalized Average Width (PINAW) [2,18]. PICP measures the probability that a specific target lies within the predicted interval. It is defined as follows:

$$PICP = \frac{1}{N} \sum\_{i=1}^{N} \epsilon\_i \tag{2}$$

where *ε<sub>i</sub>* is defined as

$$\epsilon\_i = \begin{cases} 1, & \text{if } x\_i \in [L\_i, U\_i] \\ 0, & \text{if } x\_i \notin [L\_i, U\_i] \end{cases}$$

*Li* and *Ui* represent the lower and upper bound of the prediction interval, respectively; *N* is the number of samples, and *xi* is the target.
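A direct sketch of Equation (2) for arrays of targets and interval bounds (helper name is illustrative):

```python
import numpy as np

def picp(y_true, lower, upper):
    """Prediction Interval Coverage Probability (Equation (2))."""
    inside = (y_true >= lower) & (y_true <= upper)
    return np.mean(inside)
```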

A higher PICP value indicates better coverage. However, the width of the PI also matters: widening the interval by moving the lower and upper bounds further apart captures more targets but reduces the usefulness of the prediction, since decision-makers are left with little information to base their decisions upon. PINAW, also known in the literature as the Normalized Mean Prediction Interval Width (NMPIW), was introduced to address this flaw; it measures the width of the interval and is commonly used in studies of probabilistic forecasting. PINAW is the average width of the predicted PIs normalized by the range of the target, and it is defined as follows:

$$PINAW = \frac{1}{N \cdot R} \sum\_{i=1}^{N} (U\_i - L\_i) \tag{3}$$

where *R* is the range of the target; in other words, it is the maximum minus the minimum target.

Hence, the smaller the value of the PINAW, the better the results are.

It is desirable to have a narrow PI, which can be obtained by targeting closer quantiles. However, this conflicts with covering a large number of target points, which favours a wider PI. The Coverage Width-based Criterion (CWC) was therefore introduced in the literature [19] to balance the two objectives. CWC is defined as

$$\text{CWC} = PINAW \ast \left(1 + \gamma(PICP) \ast e^{-\eta(PICP - \mu)}\right) \tag{4}$$

where *γ*(*PICP*) is defined as follows:

$$\gamma(PICP) = \begin{cases} 1, & \text{if } PICP < \mu \\ 0, & \text{if } PICP \ge \mu \end{cases}$$

The hyper-parameter *μ* is the target PICP value, and *η* is the penalty applied when the PICP falls below the target.

The value of *η* should be large so that insufficiently informative PIs are heavily penalized. Once the target PICP is reached, the CWC equals the PINAW, and in that case it is safe to assume the PI is informative. Conversely, a smaller PICP leads to high CWC values through the penalty *η* in the exponential term of Equation (4). Hence, a small CWC is targeted.
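A sketch of Equations (3) and (4); the defaults mu = 0.90 and eta = 50 mirror the target PICP and penalty used later in the results, and picp() is the helper from the earlier sketch.

```python
import numpy as np

def pinaw(y_true, lower, upper):
    """Prediction Interval Normalized Average Width (Equation (3))."""
    r = np.max(y_true) - np.min(y_true)    # target range R
    return np.mean(upper - lower) / r

def cwc(y_true, lower, upper, mu=0.90, eta=50.0):
    """Coverage Width-based Criterion (Equation (4))."""
    p = picp(y_true, lower, upper)         # assumes picp() from the sketch above
    gamma = 1.0 if p < mu else 0.0         # penalise only insufficient coverage
    return pinaw(y_true, lower, upper) * (1.0 + gamma * np.exp(-eta * (p - mu)))
```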

#### **5. Results and Discussion**

The results are shown in Table 1. With attention only and a 90% PI obtained from the 0.05 and 0.95 quantiles, we obtain 90.5% PICP, 9.4% PINAW, and 0.094 CWC when the expected PICP is set to 90% and the penalty parameter equals 50 [20]. On the other hand, applying trend removal only yields 85.9% PICP, 6.9% PINAW, and 0.605 CWC. Using both trend removal and attention gives 85.4% PICP, 6.7% PINAW, and 0.729 CWC. The CWC value increases with an unsatisfactory PICP: when PICP is below the expected value, CWC is significantly greater than PINAW, whereas CWC equals PINAW when PICP meets or exceeds the expected value. Thus, the smaller the CWC value, the better the prediction. The very slight improvement from the attention mechanism can be attributed to the sequences under investigation, which are not long enough for attention to provide a clear benefit.



Our results regarding Arps differencing confirm those reported in [16], which are reproduced in Table 2. From our results in Table 1, it is clear that Arps differencing yields a narrower width than keeping the trend. We can also see that, with regard to PICP (the coverage probability of the PI), Arps differencing is preferable to simply choosing a narrower PI, even though both reduce the width, as shown in Figure 2.

**Figure 2.** Arps differencing yields a narrower width than keeping the trend, and a higher PI coverage probability than simply choosing a narrower PI.

**Table 2.** Gradient Boosting for regression results in percentage for different setups using PICP and PINAW evaluation metrics in [16]. Each step represents a predicted month.


#### **6. Conclusions and Future Work**

The proposed machine learning model uses monthly production data as features. The sliding window technique is utilized to cover all production months. The model applies the Arps differencing technique to reduce the variation in the statistical properties that make the series non-stationary. Moreover, we investigate the effect of applying the attention mechanism to the model. The seq2seq deep learning model takes the first n production points starting from the peak production and predicts a p50 production curve, as well as a prediction interval in which the next m observed production points will most likely lie. Prediction intervals are estimated alongside the p50 in order to capture uncertainties that deterministic forecasts lack. Further investigation applying different architectures and machine learning models, and observing the outcome after removing other characteristic trends that may accompany other datasets, is warranted.

**Author Contributions:** Conceptualization, H.A., M.E., M.E.S. and M.A.-E.; methodology, H.A., M.E. and M.E.S.; software, H.A.; validation, H.A.; resources, H.A., M.E., M.E.S.; writing original draft preparation, H.A.; writing—review and editing, M.E., M.E.S. and M.A.-E.; visualization, H.A.; supervision, M.E., M.E.S. and M.A.-E.; project administration, M.E., M.E.S. and M.A.-E. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** Production data are publicly available, but a third-party service (Conduit Resources) was used to collect the scattered data from many sources.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Proceeding Paper* **Towards Time-Series Feature Engineering in Automated Machine Learning for Multi-Step-Ahead Forecasting †**

**Can Wang <sup>1,</sup>\*, Mitra Baratchi <sup>1</sup>, Thomas Bäck <sup>1</sup>, Holger H. Hoos <sup>1</sup>, Steffen Limmer <sup>2</sup> and Markus Olhofer <sup>2</sup>**


**Abstract:** Feature engineering is an essential step in the pipelines used for many machine learning tasks, including time-series forecasting. Although existing AutoML approaches partly automate feature engineering, they do not support specialised approaches for applications on time-series data such as multi-step forecasting. Multi-step forecasting is the task of predicting a sequence of values in a time-series. Two kinds of approaches are commonly used for multi-step forecasting. A typical approach is to apply one model to predict the value for the next time step; the model then uses this predicted value as an input to forecast the value for the subsequent time step. Another approach is to use multi-output models to make the predictions for multiple time steps of each time-series directly. In this work, we demonstrate how automated machine learning can be enhanced with feature engineering techniques for multi-step time-series forecasting. Specifically, we combine a state-of-the-art automated machine learning system, auto-sklearn, with tsfresh, a library for feature extraction from time-series. In addition to optimising machine learning pipelines, we propose to optimise the size of the window over which time-series data are used for predicting future time-steps. This is an essential hyperparameter in time-series forecasting. We propose and compare (i) auto-sklearn with automated window size selection, (ii) auto-sklearn with tsfresh features, and (iii) auto-sklearn with automated window size selection and tsfresh features. We evaluate these approaches against statistical techniques, machine learning techniques and state-of-the-art automated machine learning techniques, on a diverse set of benchmarks for multi-step time-series forecasting, covering 20 synthetic and real-world problems. Our empirical results indicate a significant potential for improving the accuracy of multi-step time-series forecasting by using automated machine learning in combination with automatically optimised feature extraction techniques.

**Keywords:** automated machine learning; machine learning; time-series forecasting

#### **1. Introduction**

Time-series (TS) data, such as electrocardiograms, music, exchange rates, and energy consumption, are ubiquitous in our daily life and in the business world. Multi-step-ahead forecasting is an important task in time-series modeling with many industrial applications, such as crude oil price forecasting [1] and flood forecasting [2]. Both ML models [3,4] and statistical models [5,6] are used for this purpose. Creating an ML pipeline for such TS data analysis is, however, difficult for many domain experts with limited machine learning expertise, due to the complexity of the data sets and the ML models.

To reduce the complexity of creating machine learning pipelines, automated machine learning (AutoML) research has recently focused on developing algorithms that

**Citation:** Wang, C.; Baratchi, M.; Bäck, T.; Hoos, H.H.; Limmer, S.; Olhofer, M. Towards Time-Series Feature Engineering in Automated Machine Learning for Multi-Step-Ahead Forecasting. *Eng. Proc.* **2022**, *18*, 17. https://doi.org/10.3390/ engproc2022018017

Academic Editors: Ignacio Rojas, Hector Pomares, Olga Valenzuela, Fernando Rojas and Luis Javier Herrera

Published: 21 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

can automatically design ML pipelines optimised for a given data set without human input [7,8]. The components in a pipeline may include a data cleaning component, feature engineering component, ML model component, and ensemble construction component. While there has been work proposed in the past that used AutoML for TS analysis [9], the current AutoML systems do not support specialised techniques used for designing machine learning pipelines for TS data, such as TS feature engineering. As demonstrated by Christ et al. [10], including such specialized techniques for extracting TS features or feature importance filtering could significantly improve the accuracy of ML models. Therefore, our goal in this paper is to study if extending AutoML systems by including such techniques can improve the quality of automatically generated machine learning pipelines for multi-step forecasting.

This paper presents a study of AutoML for multi-step TS forecasting implemented through (i) multi-output modeling and (ii) recursive modeling. We combine the state-of-the-art AutoML system auto-sklearn with tsfresh, a well-known library for feature extraction from TS. We implement three AutoML variants and state-of-the-art baseline models, including auto-sklearn, SVM, GBM, N-BEATS, and Auto-Keras. More specifically, our contributions are as follows:


The remainder of this paper is structured as follows: Section 2 covers related work on TS forecasting, TS feature engineering and AutoML. Section 3 introduces the problem statement of AutoML for multi-step TS forecasting. In Section 4, the methodology of our newly proposed models is explained. Section 5 presents the results of the empirical performance comparison of different ML models for multi-step TS forecasting. A summary of our work and directions for future work are given in Section 6.

#### **2. Related Work**

**Multi-step TS Forecasting:** Different ML methods and statistical methods have been used for both single-step and multi-step TS analysis in the past few decades [11]. These include artificial neural networks [3], support vector machines [4], gradient boosting machines [12], random forests [13], auto-regressive moving average models [5], and exponential smoothing models [6]. To use these models for multi-step forecasting, two major approaches are used: direct and recursive strategies [14]. The first of these uses multi-output regression models to predict multiple TS steps into the future directly, or uses multiple models (one for each time step) to produce multi-step forecasts. Multi-output models show good performance and require fewer computational resources than training multiple models to realise multi-step forecasting [15]. The second approach uses a single model recursively, using the predicted values as an additional input to forecast the next step. In this case, the error in the prediction may accumulate [14]. Recursive strategies only require one model, which saves significant computational time [14]. Other methods based on these two approaches are also used, such as the DirRec strategy [16], which combines the recursive and direct strategies.

**TS feature engineering**: Feature engineering is an essential component in ML pipelines. Feature extraction methods have also been used for TS analysis tasks [17,18]. There are a number of TS feature extraction libraries that are widely used for TS analysis, including tsfresh [19], Catch22 [20], and hctsa [21]. Catch22 extracts a selected list of the 22 most useful of hctsa's 4791 features from a TS. Extracting features with Catch22 is more computationally efficient than hctsa, with only a 7% reduction in accuracy on average. In the following, we use tsfresh, which extracts more than 700 TS features in parallel and has previously shown strong performance [19]; it does not require the large computational resources of hctsa, nor does it suffer from the accuracy reduction of Catch22.

**AutoML**: AutoML systems have been used in many domains, such as image classification [22], language processing [23] and energy consumption forecasting [9]. However, TS features are not well-studied in AutoML systems. In our earlier work [9], we studied AutoML for short-term load forecasting, demonstrating the competitive performance of AutoML systems. However, we did not investigate the use of TS features. In another work [24], we studied AutoML with TS features for single-step TS forecasting. Our experimental results indicate that AutoML with TS features can further improve the accuracy of AutoML systems. In this work, we focus on studying how to improve the performance of AutoML systems on multi-step forecasting tasks by using TS features. Many AutoML systems have been developed recently, including Auto-sklearn [25], AutoGluon [8] and Auto-Keras [26]. Auto-sklearn is used in our experiments, since it supports various algorithms (e.g., SVMs) that are not available in the other systems, and it is easy to extend. Auto-Keras is used to create a deep learning baseline in our experiments.

#### **3. Problem Statement**

Given a univariate TS **x** = [*x*1, ... , *xi*] composed of *i* observations, we are interested in predicting the next *k* values [*xi*+1, ... , *xi*+*k*], where *k* > 1 denotes the forecasting window. Usually, not all data points have the same influence on the predictions of [*xi*+1, ... , *xi*+*k*]; more recent data points tend to be more important. Specifically, given a TS segment [*xi*−*w*+1, ... , *xi*], we are interested in forecasting [*xi*+1, ... , *xi*+*k*]; the window size *w* indicates how much historical data are used to make the prediction. In our previous work [24], we found that the window size *w* plays an essential role in single-step TS forecasting; specifically, if an ML model receives too much or too little information, its performance may suffer. Here, we extend the automated window size selection technique to multi-step forecasting. Besides this, we use tsfresh to automatically extract features from the TS data within these windows.

#### *AutoML for Multi-Step TS Forecasting*

We define the *Combined Algorithm Selection and Hyperparameter optimisation (CASH)* problem [27] for multi-output TS forecasting as the following joint optimisation problem: Given a TS data set **x** = [*x*1, ... , *xn*] that is split into **x***train* and **x***valid*, we are interested in building an optimised model using **x***train* by minimising loss on **x***valid*. Formally, we define the **automated TS forecasting problem** as:

*Let* A = {*A*<sup>(1)</sup>, ... , *A*<sup>(*k*)</sup>} *be a set of algorithms with associated hyperparameter spaces* Λ<sup>(1)</sup>, ... , Λ<sup>(*k*)</sup>*. Let* **w** = {*w*<sup>(1)</sup>, ... , *w*<sup>(*l*)</sup>} *be the set of possible window sizes. Furthermore, let* **x***train be a training set and* **x***valid a validation set. Finally, let* L(*A*<sup>(*i*)</sup><sub>*λ*</sub>, *w*<sup>(*j*)</sup>, **x***train*, **x***valid*) *denote the loss that algorithm A*<sup>(*i*)</sup> *achieves on* **x***valid when trained on* **x***train with hyperparameters λ* ∈ Λ<sup>(*i*)</sup> *and window size w*<sup>(*j*)</sup>*. Then the automated TS forecasting problem is to find the window size, algorithm, and hyperparameter setting that minimise this loss, i.e., to determine:*

$$(A^\*, \lambda^\*, w^\*) \in \operatorname\*{arg\,min}\_{A \in \mathcal{A},\, \lambda \in \Lambda,\, w \in \mathbf{w}} \mathcal{L}(A\_{\lambda}, w, \mathbf{x}\_{\text{train}}, \mathbf{x}\_{\text{valid}}) \tag{1}$$
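A minimal sketch of this joint search over window sizes is given below, using a plain scikit-learn multi-output regressor in place of auto-sklearn's pipeline optimisation; the candidate window sizes, the choice of model, and the validation handling are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error

def windowed(series, w, k):
    """Turn a 1-D series into (window of size w, next k values) training pairs."""
    X = np.array([series[i - w:i] for i in range(w, len(series) - k + 1)])
    y = np.array([series[i:i + k] for i in range(w, len(series) - k + 1)])
    return X, y

def select_window(train, valid, k, candidate_w=(5, 10, 20, 40)):
    """Pick the window size (and fitted model) with the lowest validation RMSE."""
    best = None
    for w in candidate_w:
        X_tr, y_tr = windowed(train, w, k)
        # prepend the last w training points so early validation windows have history
        X_va, y_va = windowed(np.concatenate([train[-w:], valid]), w, k)
        model = MultiOutputRegressor(GradientBoostingRegressor()).fit(X_tr, y_tr)
        rmse = np.sqrt(mean_squared_error(y_va, model.predict(X_va)))
        if best is None or rmse < best[0]:
            best = (rmse, w, model)
    return best  # (validation RMSE, chosen window size w*, fitted model)
```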

#### **4. Methodology**

This section presents the two multi-step forecasting techniques and the AutoML technique enhanced with TS features we use later in our experiments.

#### *4.1. Multi-Step Forecasting*

**Recursive strategy:** In this strategy, given a univariate TS **x** = [*x*1, ... , *xn*] composed of *n* observations, a model *f* is trained to perform a single-step ahead forecast:

$$\hat{x}\_{i+1} = f(x\_{i-w+1}, \ldots, x\_{i}), \quad i \in \{w, \ldots, n-1\}$$

This prediction is then fed back as an additional input to predict the next value:

$$\hat{x}\_{i+2} = f(x\_{i-w+2}, \ldots, x\_{i}, \hat{x}\_{i+1})$$

We continue recursively, making new predictions in this manner, until the forecast for the *k*-th step ahead is produced.
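A sketch of this recursive loop is shown below, assuming a fitted one-step scikit-learn-style regressor trained on windows of size *w*; the direct multi-output strategy described next instead predicts all *k* values in a single call.

```python
import numpy as np

def recursive_forecast(model, history, w, k):
    """Recursive strategy: apply a one-step model k times, feeding each
    prediction back in as the most recent input value."""
    window = list(history[-w:])            # last w observed values
    preds = []
    for _ in range(k):
        x = np.asarray(window[-w:]).reshape(1, -1)
        y_hat = float(model.predict(x)[0])
        preds.append(y_hat)
        window.append(y_hat)               # predicted value re-enters the input
    return np.asarray(preds)
```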

**Direct Multi-Output strategy:** A multi-output strategy has been proposed by Taieb and Bontempi [28] to solve multi-step TS forecasting tasks. In this strategy, a single multi-output model *f* is learned:

$$[\hat{x}\_{i+1}, \ldots, \hat{x}\_{i+k}] = f(x\_{i-w+1}, \ldots, x\_{i}), \quad i \in \{w, \ldots, n-k\}$$

#### *4.2. Auto-Sklearn with TS Feature Engineering*

In this work, we study how we can extend auto-sklearn [25] to perform automatic feature extraction on TS data. Originally, the pipelines constructed by auto-sklearn include a preprocessor, feature preprocessor and ML components. The ensemble construction used in auto-sklearn uses a greedy algorithm to build the ensembles. The workflow of auto-sklearn is illustrated in Figure 1. Auto-sklearn has a powerful feature preprocessor component. However, it does not support any specialised TS feature extractors. In our work, we use our newly proposed TS feature extractors in the search space instead of the feature extractor in auto-sklearn. Automated feature extraction in this case considers both the selection of the window size and extraction of relevant features. Therefore, we propose three variants of auto-sklearn that are specially designed for TS forecasting tasks by replacing the feature extractors of auto-sklearn with one of the following.

**Figure 1.** Workflow of auto-sklearn. Auto-sklearn uses Bayesian optimisation to search the space of ML pipelines, including a preprocessor, feature preprocessor and ML components.
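As an illustration of the TS feature-extraction step, the sketch below converts raw windows into tsfresh feature vectors that could then be handed to auto-sklearn; the long-format column names and the helper function are illustrative, not the exact integration used in our variants.

```python
import pandas as pd
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute

def window_features(X_windows):
    """Convert raw windows of shape (n_windows, w) into tsfresh feature vectors."""
    long = pd.DataFrame({
        "id":    [i for i, win in enumerate(X_windows) for _ in win],
        "time":  [t for win in X_windows for t in range(len(win))],
        "value": [v for win in X_windows for v in win],
    })
    feats = extract_features(long, column_id="id", column_sort="time",
                             column_value="value", disable_progressbar=True)
    return impute(feats)  # replace NaN/inf from features undefined on short windows
```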


#### **5. Experimental Results**

Our key empirical results are based on aggregate performance over 20 data sets and 8 models. More detailed descriptions of the data sets and models are provided in the following subsections.

#### *5.1. Data Sets*

The open-source datasets from CompEngine [30] are used in our experiments. CompEngine is a TS data engine containing 197 types of data sets, comprising 29,514 time-series in total. These data sets include both real-world and synthetic data sets. We chose ten real-world and ten synthetic data sets from different categories. The 20 chosen data sets span the following categories: Audio: Animal sounds, Human speech, Music; Ecology: Zooplankton growth; Economics: Macroeconomics, Microeconomics; Finance: Crude oil prices, Exchange rate, Gas prices; Medical: Electrocardiography (ECG); Flow: Driven pendulum with dissipation, Duffing-van der Pol oscillator, Driven van der Pol oscillator, Duffing two-well oscillator, Diffusionless Lorenz attractor; Stochastic process: Autoregressive with noise, Correlated noise, Moving average process, Nonstationary autoregressive, Random walk.

Since each category usually contains more than one data set, we chose the first data set in each category. We split every data set into a 67% training and 33% test set, based on temporal order, since the data sets are TS.

#### *5.2. Experimental Setup*

All the experiments were executed on 8 cores of an Intel Xeon E5-2683 CPU (2.10 GHz) with 10 GB RAM. In the experiments, version 0.8.0 of auto-sklearn and version 0.16.1 of tsfresh were used. To evaluate the quality of an ML pipeline, we used *RMSE* as the performance metric during optimisation. The maximum evaluation time for one ML pipeline was set to 20 min wall-clock time. The time budget for every AutoML optimisation on each data set was set to 3 h wall-clock time. In these experiments, we used hold-out validation (training:validation = 67:33), the default validation technique in auto-sklearn. The split was carried out only on the training data, such that the optimisation process never sees the test data. However, we did not shuffle the data set, in order to preserve the temporal structure of the TS data. All remaining choices were left at their default settings. Since the experiments are very time-consuming, we used bootstrapping to create distributions of performance results in order to investigate their variability. Every experiment was run 25 times. We then randomly sampled 5 out of the 25 results, selected the model with the lowest *RMSE* on the training set out of these five models, and reported its *RMSE* on the test set. We repeated this process 100 times per model and data set. The distributions shown are based on these 100 values.
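A sketch of the bootstrapping procedure just described, assuming the 25 per-run training and test RMSE values are available as arrays; whether the 5 runs are drawn with or without replacement is not stated, so sampling without replacement is an assumption here.

```python
import numpy as np

def bootstrap_test_rmse(train_rmse, test_rmse, n_repeats=100, sample_size=5, seed=0):
    """For each repeat, draw `sample_size` runs, keep the one with the lowest
    training RMSE, and record its test RMSE."""
    rng = np.random.default_rng(seed)
    train_rmse = np.asarray(train_rmse)    # one entry per run (25 in the paper)
    test_rmse = np.asarray(test_rmse)
    picked = []
    for _ in range(n_repeats):
        idx = rng.choice(len(train_rmse), size=sample_size, replace=False)
        best = idx[np.argmin(train_rmse[idx])]
        picked.append(test_rmse[best])
    return np.asarray(picked)              # 100 values per model and data set
```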

We compared the AutoML methods, including Auto-Keras, auto-sklearn and our proposed variants, with traditional ML baselines and N-BEATS. Both recursive and multi-output techniques are used in the ML baselines (GBM and SVM). All other models use the multi-output approach.

#### *5.3. Baselines*


**Figure 2.** Workflow of our customised search space of Auto-Keras for TS forecasting. The input data flows through the RNN Block. Hyperparameters, such as the number of layers and learning rate, will be optimised during the search. The Regression Head then generates output based on the information from the RNN Block.

#### *5.4. Our Methods*


Tables 1 and 2 compare the performance achieved by different methods in terms of *RMSE* on the test set. Table 1 shows the results for the traditional ML baseline models, while Table 2 presents the results for the AutoML techniques and N-BEATS. To present the results in these tables, we assessed the statistical significance of the results by the non-parametric Mann–Whitney U-test [37] with a standard significance level of 0.05. The bold-faced entries show the lowest mean *RMSE* achieved on a given data set, and ∗ indicates that the *RMSE* is statistically significantly the best.


**Table 1.** *RMSE* on the test set acquired from traditional ML baselines. GBM-recursive, GBM-multioutput, SVM-recursive, and SVM-multioutput win on 6, 6, 0, and 8 out of 20 data sets, respectively.

**Table 2.** *RMSE* on test set acquired from different AutoML methods including vanilla auto-sklearn (VA), our proposed variants (W, T, and WT), Auto-Keras and the state-of-the-art method N-BEATS. The accuracy of N-BEATS, Auto-Keras, VA, W, T, WT are statistically significant on 5, 0, 2, 8, 3, and 3 out of 20 data sets, respectively.




#### *5.5. Research Questions*

#### **Q1: How do recursive and multi-output techniques compare in terms of accuracy?**

To answer this question, we compared the recursive and multi-output versions of the GBM and SVM algorithms. Among the baselines we consider, N-BEATS as described in the original work is not a recursive model; therefore, we do not include it in this analysis. Looking at Table 1, we observe that GBM-multioutput performs better than GBM-recursive on 12 out of 20 data sets, while SVM-multioutput outperforms SVM-recursive on 17 out of 20 data sets in terms of *RMSE*. Multi-output models thus tend to perform better, which is in line with the results from [14]; therefore, we only use the multi-output technique in our subsequent experiments.

#### **Q2: To what extent can AutoML techniques (Auto-Keras, auto-sklearn, and our variants) beat the traditional baselines (GBM, SVM)?**

Looking at Tables 1 and 2, one can compare the performance achieved by different methods in terms of *RMSE* on the test set. We observe that Auto-Keras beats all the traditional ML baseline models (GBM-recursive, GBM-multioutput, SVM-recursive, and SVM-multioutput) on 4 out of 20 data sets. Vanilla auto-sklearn outperforms all the traditional ML baselines on 8 out of 20 data sets. Our three variants W, T, and WT show lower error than all the traditional ML baselines on 10, 5, and 5 out of 20 data sets, respectively. The best AutoML variant (W) outperforms the best traditional ML baseline (SVM-multioutput) on 14 out of 20 data sets.

#### **Q3: To what extent can AutoML techniques beat N-BEATS?**

Looking at Table 2, we observe that the best AutoML variant (W) outperforms N-BEATS on 14 out of 20 data sets. For the other AutoML techniques, we observe that Auto-Keras, VA, T, and WT beat N-BEATS on 5, 12, 11, and 10 out of 20 data sets, respectively. Thus, AutoML methods based on standard machine learning models outperform this neural-network-based model on most data sets.

#### **6. Conclusions**

In this paper, we extend AutoML for multi-step TS forecasting with TS features. We found that AutoML can achieve significantly higher accuracy than the traditional ML baselines on 14 out of 20 data sets in terms of *RMSE*. Although N-BEATS performs better than Auto-Keras and vanilla auto-sklearn on many data sets, our AutoML TS variants still managed to beat it on 14 out of 20 data sets. We also found that, for multi-step TS forecasting tasks, the multi-output technique tends to perform better than the recursive technique under the same time budget. Overall, these results clearly demonstrate that the use of AutoML techniques and multi-output strategies for multi-step TS forecasting is promising.

Interesting avenues for future work include AutoML for online learning and TS classification. Most AutoML systems focus on static data sets, but the characteristics of TS data might change over time and, consequently, the best configuration may also vary over time. We see potential value in extending AutoML to evolving data streams. Furthermore, TS classification has typically been studied with a fixed-size sliding window; however, the best window size might not be easily determined. Our automated window size selection technique may help to improve the performance of such classification tasks.

**Author Contributions:** Conceptualization, C.W., M.B. and H.H.H.; methodology, C.W., M.B. and H.H.H.; software, C.W.; validation, C.W.; formal analysis, C.W.; investigation, C.W., M.B. and H.H.H.; resources, C.W., M.B. and H.H.H.; data curation, C.W.; writing—original draft preparation, C.W.; writing—review and editing, M.B., T.B., H.H.H., S.L. and M.O.; visualization, C.W.; supervision, M.B., T.B., H.H.H., S.L. and M.O.; project administration, T.B., H.H.H., S.L. and M.O.; funding acquisition, T.B., H.H.H., S.L. and M.O. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is part of the research programme C2D–Horizontal Data Science for Evolving Content with project name DACCOMPLI and project number 628.011.002, which is (partly) financed by the Netherlands Organisation for Scientific Research (NWO).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data sets used in our experiments are available at https://github.com/wangcan04/AutoML-multistep-forecasting, accessed on 1 June 2022.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

#### **Abbreviations**

The following abbreviations are used in this manuscript:

TS Time-series
ML Machine Learning
AutoML Automated Machine Learning

#### **References**

