# **Feature Papers of Forecasting 2021**

Edited by Sonia Leva

Printed Edition of the Special Issue Published in *Forecasting*

www.mdpi.com/journal/forecasting

Editor

**Sonia Leva**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editor* Sonia Leva, Department of Energy, Politecnico di Milano, Italy

*Editorial Office* MDPI, St. Alban-Anlage 66, 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Forecasting* (ISSN 2571-9394) (available at: https://www.mdpi.com/journal/forecasting/special_issues/FP_2021).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-5571-3 (Hbk)**

**ISBN 978-3-0365-5572-0 (PDF)**

© 2022 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

### **About the Editor**

### **Sonia Leva**

Sonia Leva is a Full Professor of "Elettrotecnica" (Electrical Engineering-Circuit Theory) at Politecnico di Milano (Italy). She received her Ph.D. degree in Electrical Engineering from the Faculty of Engineering, Politecnico di Milano, Italy, in 2001. She was a Research Associate and later an Associate Professor of Electrical Engineering at Politecnico di Milano, and she took up her current position as Full Professor on 6 January 2016.

She has been an IEEE member since 2000 and a senior member since 2013. She served as a session chairperson at an international conference organized by the Institute of Electrical and Electronics Engineers. She is the author of about 300 papers, mainly published in international and national journals or conference proceedings. She has served as Editor-in-Chief of *Forecasting* since 2019.

Sonia Leva is the founder and coordinator of SolartechLab and MultyGood Microgrid Lab at the Department of Energy, Politecnico di Milano.

### *Editorial* **Editorial for Special Issue: "Feature Papers of Forecasting 2021"**

**Sonia Leva**

Department of Energy, Politecnico di Milano, 20156 Milano, Italy; sonia.leva@polimi.it

The human capability to react or adapt to upcoming changes strongly relies on the ability to forecast them. Forecasting and its applications are increasingly important because they improve decision-making processes by providing useful insights about the future. Scientific research is giving unprecedented attention to forecasting methods and applications, with a continuously growing number of articles about novel forecasting approaches being published.

In this Special Issue, as in the one published in 2020 [1], high-quality papers from *Forecasting* have been selected and published, covering topics such as power and energy forecasting, forecasting in economics and management, forecasting in computer science, weather and forecasting, and environmental forecasting. In particular, this Special Issue collects the most recent high-quality research on forecasting. Eleven papers were selected to represent a wide range of research fields in which forecasting applications play a crucial role.

Nikolaidis et al. [2] propose a dynamical forecaster capable of estimating the required spinning reserves on the basis of a real-time load forecast. A neural network is trained via non-linear regression to accurately predict the load ahead starting from eight predictors, divided into constant and variable inputs by exploiting model predictive control. The results demonstrate that adopting the proposed dynamical forecaster allows for significant decreases in operating reserve requirements: based on real-time updates, the load forecasting can achieve lower costs while system security is preserved.

Ramos et al. [3] present a methodology designed for office buildings and aimed at improving the accuracy in electricity consumption forecasting on a 5-min time interval, providing proper support to decisions related to energy management towards higher efficiency. The prediction, based on data measured by different devices including presence, temperature, consumption and humidity, is carried out by means of two different forecasting algorithms, namely, Artificial Neural Network (ANN) and K-Nearest Neighbor (KNN) algorithms. The present research demonstrated that in order to achieve the maximum forecast accuracy in different periods of the day, hence in different contexts regarding consumption patterns, different forecasting algorithms must be used.

Chaiton et al. [4] present the outcomes of simulations forecasting the impact of five possible Tobacco Endgame policies on smoking prevalence and on tax revenues in Ontario by 2035. The Ontario SimSmoke simulation is exploited for modeling the expected effect of the first four strategies, namely: plain packaging, free cessation services, decreasing the number of tobacco outlets and increasing tobacco taxes. On the other hand, different models are involved in the evaluation of the impact of increasing the minimum required age to legally purchase tobacco to 21 years. Simulations predict that an increase in tobacco taxes will determine the greatest decrease in smoking prevalence, and that reducing smoking prevalence to "less than 5 by 35" by combining non-tax interventions and excise tax increase will result in a minimal impact on tax revenues.

Petropoulos et al. [5] focus on univariate time series forecasting and provide an overview of five different approaches allowing an improvement in the performances achievable with standard extrapolation methods. In further detail, the Theta method (manipulation of local curvatures), Multiple Temporal Aggregation (MTA), bootstrapping, Forecasting with Sub-seasonal Series (FOSS) and forecasting with multiple starting points are discussed and compared in terms of how information is extracted from data, the computational cost and the performance. Moreover, the concept of the "wisdom of the data" is presented, explaining how a proper data manipulation can translate into improved forecast accuracy by combining forecasts carried out from different perspectives on the same data.

**Citation:** Leva, S. Editorial for Special Issue: "Feature Papers of Forecasting 2021". *Forecasting* **2022**, *4*, 335–337. https://doi.org/10.3390/forecast4010018

Received: 24 January 2022 Accepted: 4 February 2022 Published: 3 March 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Watson et al. [6] investigate how the quality of weather data derived from thunderstorm simulations influences the outcome of power outage models. A comparative analysis is conducted using two different Numerical Weather Prediction (NWP) systems with various levels of data assimilation, determining how outage models trained on these different sets of weather data differ in terms of performance. It is demonstrated that erroneous estimations in weather simulations propagate into the outage models in specific and quantifiable ways, suggesting how improved weather representations can possibly improve the quality of the power outage insights obtained.

Nespoli et al. [7] propose a preliminary forecast procedure with the objective of predicting a family of batteries that is suitable, from both a technical and a financial point of view, for coupling with a certain PV plant configuration. The procedure is applied to hypothetical plants aimed at fulfilling the energy requirements of a commercial and an industrial load. The amount of energy produced by the PV system is estimated on the basis of a performance analysis carried out on real plants with similar characteristics, while the battery operations are determined by two distinct control logics regulating charge and discharge, respectively. Finally, an unsupervised clustering based on the k-means algorithm applied to all possible PV+BESS (Battery Energy Storage System) configurations allowed the researchers to identify the family of feasible solutions which, as expected, was characterized by a low payback time and a low number of residual cycles.

Boudhaouia et al. [8] describe a novel web-oriented data analysis platform capable of forecasting water consumption in real-time by exploiting Machine Learning techniques. The prediction is carried out with no prior and contextual information, relying only on past water consumption data recorded by smart meters as unevenly spaced time series with high-resolution and based on two different algorithms, namely, a Long Short-Term Memory (LSTM) and a Back-Propagation Neural Network (BPNN). The two models are tested on forecasting the water consumption in a private building: By evaluating their performance, it is observed that LSTM outperforms BPNN, providing more accurate predictions. According to the authors, the developed model can even be generalized to different types of consumption, such as electricity and gas.

Bas et al. [9] introduce a novel time series forecasting approach based on the Holt method modified by using time-varying smoothing parameters instead of fixed ones. Holt's smoothing parameters are obtained for each observation exploiting first-order autoregressive models whose parameters, in turn, are assessed through a Harmony Search Algorithm (HSA). The proposed method is tested on Istanbul Stock Exchange datasets covering the years between 2000 and 2017: The forecasts are obtained with a subsampling bootstrap approach, and different test lengths are considered during this analysis.

Wu et al. [10] deal with the topic of forecasting volatility from econometric datasets, a crucial task in finance. First, they assess the robustness of state-of-the-art Normalizing and Variance-Stabilizing (NoVaS) methods for long-term time-aggregated predictions, addressing the lack of experimental results in current NoVaS-related studies. Then, they develop a novel model-free method that, after an extensive analysis, demonstrated improved and more stable performance with respect to state-of-the-art NoVaS and standard GARCH-type methods in both the short and long term, regardless of whether simulation or real-world data are used.

Ali et al. [11] propose a novel approach aimed at predicting ocean currents by means of deep learning. In detail, an LSTM model is applied to the prediction of the three-dimensional tensors representing water column velocity. The proposed method is tested on estimating the Loop Current (LC) measured in the Gulf of Mexico between 2009 and 2011 at multiple spatial and temporal scales, where an RMSE (Root Mean Square Error) lower than 0.05 cm/s and a correlation coefficient of 0.6 were reported. Moreover, the model presented a useful forecast period, i.e., the time interval after which the forecast significantly diverges from the observed motion field, longer than 4 days.

Vega et al. [12] face the challenge of forecasting the number of new COVID-19 infections in the short and medium term by proposing the SIMLR model, which incorporates Machine Learning (ML) into the epidemiological SIR model. By combining these two components, it is possible to substantially reduce the amount of data required by Machine Learning in order to produce accurate predictions and to estimate the time-varying parameters of an SIR model to produce forecasts one to four weeks in advance. The proposed SIMLR model is applied to case studies from Canada and the United States, demonstrating state-of-the-art forecasting performance with the additional advantage of providing probabilistic and interpretable outcomes. The authors expect this approach to be useful not only in COVID-19 modeling but also for other infectious diseases.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


### *Article* **SIMLR: Machine Learning inside the SIR Model for COVID-19 Forecasting**

**Roberto Vega <sup>1,2,\*</sup>, Leonardo Flores <sup>3</sup> and Russell Greiner <sup>1,2</sup>**


**Abstract:** Accurate forecasts of the number of newly infected people during an epidemic are critical for making effective timely decisions. This paper addresses this challenge using the SIMLR model, which incorporates machine learning (ML) into the epidemiological SIR model. For each region, SIMLR tracks the changes in the policies implemented at the government level, which it uses to estimate the time-varying parameters of an SIR model for forecasting the number of new infections one to four weeks in advance. It also forecasts the probability of changes in those government policies at each of these future times, which is essential for the longer-range forecasts. We applied SIMLR to data from Canada and the United States, and show that its mean absolute percentage error is as good as that of state-of-the-art forecasting models, with the added advantage of being an interpretable model. We expect that this approach will be useful not only for forecasting COVID-19 infections, but also in predicting the evolution of other infectious diseases.

**Keywords:** COVID-19; probabilistic graphical models; interpretable machine learning

### **1. Introduction**

Since its identification in December 2019, COVID-19 has posed critical challenges for the public health and economies of essentially every country in the world [1–3]. Government officials have taken a wide range of measures in an effort to contain this pandemic, including closing schools and workplaces, setting restrictions on air travel, and establishing stay-at-home requirements [4]. Accurately forecasting the number of newly infected people in the short and medium term is critical for timely decisions about policies and for the proper allocation of medical resources [5,6].

There are three basic approaches for predicting the dynamics of an epidemic: compartmental models, statistical methods, and ML-based methods [5,7]. Compartmental models subdivide a population into mutually exclusive categories, with a set of dynamical equations that explain the transitions among categories [8]. The Susceptible-Infected-Removed (SIR) model [9] is a common choice for the modelling of infectious diseases. Statistical methods extract general statistics from the data to fit mathematical models that explain the evolution of the epidemic [6]. Finally, ML-based methods use machine learning algorithms to analyze historical data and find patterns that lead to accurate predictions of the number of newly infected people [7,10].

Arguably, when any approach is used to make high-stakes decisions, it is important that it be not just accurate, but also interpretable: it should give the decision-maker enough information to justify the recommendation [11]. Here, we propose SIMLR, which is an interpretable probabilistic graphical model (PGM) that combines compartmental models and ML-based methods. As its name suggests, it incorporates machine learning (ML) within an SIR model. This combines the strength of curve-fitting models, which allow accurate short-term predictions involving many features, with that of mechanistic models, which allow the range of predictions to be extended to the medium and long terms [12].

**Citation:** Vega, R.; Flores, L.; Greiner, R. SIMLR: Machine Learning inside the SIR Model for COVID-19 Forecasting. *Forecasting* **2022**, *4*, 72–94. https://doi.org/10.3390/forecast4010005

Academic Editor: Sonia Leva

Received: 9 December 2021 Accepted: 10 January 2022 Published: 13 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

<sup>1</sup> Department of Computing Science, University of Alberta, Edmonton, AB T6G 2R3, Canada; rgreiner@ualberta.ca

SIMLR uses a mixture of experts approach [13], where the contribution of each expert to the final forecast depends on the changes in the government policies implemented at various earlier time points. When there is no recent change in policies (two to four weeks before the week to be predicted), SIMLR relies on an SIR model with time-varying parameters that are fitted using machine learning methods. When a change in policy occurs, SIMLR instead relies on a simpler model that predicts that the number of new infections will remain constant. Note that forecasting the number of new infections one and two weeks in advance ($\Delta I_1$ and $\Delta I_2$) is relatively easy as SIMLR knows, at the time of the prediction, whether the policy has changed recently. However, for three- or four-week forecasts ($\Delta I_3$ and $\Delta I_4$), our model needs to estimate the likelihood of a future change of policy. SIMLR incorporates prior domain knowledge to estimate such policy-change probabilities.
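The blending of the two experts described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the final forecast is a convex combination weighted by the probability of a relevant policy change; the function name `mixture_forecast` and all numbers are hypothetical, and the actual SIMLR weighting is defined by the graphical model in Section 2.

```python
def mixture_forecast(sir_forecast, last_week_infections, p_policy_change):
    """Blend two experts by the probability of a relevant policy change.

    Assumption for illustration only: a convex combination of the SIR-based
    forecast and the "infections stay constant" forecast.
    """
    if not 0.0 <= p_policy_change <= 1.0:
        raise ValueError("p_policy_change must be a probability")
    constant_forecast = last_week_infections  # expert used after a policy change
    return (p_policy_change * constant_forecast
            + (1.0 - p_policy_change) * sir_forecast)


# Hypothetical numbers: the SIR expert predicts 1200 new infections, last week
# saw 1000, and a policy change is believed to have occurred with probability 0.25.
blended = mixture_forecast(1200.0, 1000.0, 0.25)
```

For one- and two-week forecasts the probability would be 0 or 1, since the policy history is already known; for longer horizons it has to be estimated.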

The use of such prior models (here, epidemiological models) is particularly important when the available data are scarce [14]. Machine learning models also need to acknowledge that the reported data on COVID-19 are imperfect [15,16]. The use of probabilistic graphical models allows SIMLR to account for this uncertainty in the data. In addition, the probability tables associated with this graphical model can be manually modified to adapt SIMLR to the specific characteristics of a region.

This work makes three important contributions. (1) It empirically shows that an SIR model with time-varying parameters can describe the complex dynamics of COVID-19. (2) It describes an interpretable model that predicts the number of new infections one to four weeks in advance, achieving state-of-the-art results, in terms of mean absolute percentage error (MAPE), on data from Canada and the United States. (3) It presents a machine learning model that incorporates the uncertainty of the input data and can be tailored to the specific situations of a particular region.
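For reference, the MAPE metric mentioned above can be computed as in the following sketch. It uses the standard textbook definition, which is assumed here to match the one used in the experiments; the weekly counts are made-up values.

```python
import numpy as np


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (standard definition)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))


# Hypothetical weekly counts of new infections and their forecasts.
weekly_cases = [1000, 1200, 900]
predicted = [1100, 1140, 990]
error = mape(weekly_cases, predicted)  # relative errors are 10%, 5% and 10%
```

Note that this definition is undefined when an observed count is zero, so such weeks would need separate handling.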

The rest of Section 1 describes the related work and the basics of the SIR compartmental model. Section 2 then describes in detail our proposed SIMLR approach. Section 3 shows the results of predicting the number of new infections in the United States and provinces of Canada. Finally, Section 4 presents our final remarks.

### *1.1. Basic SIR Model*

The Susceptible-Infected-Removed (SIR) compartmental model [9] is a mathematical model of infectious disease dynamics that divides the population into three disjoint groups [8]. Susceptible (S) refers to the set of people who have never been infected but can acquire the disease. Infected (I) refers to the set of people who have and can transmit the infection. Removed (R) refers to the people who have either recovered or died from the infection and cannot transmit the disease anymore. This model is defined by the differential equations:

$$\frac{dS}{dt} = -\frac{\beta S(t)I(t)}{N}, \quad \frac{dI}{dt} = \frac{\beta S(t)I(t)}{N} - \gamma I(t), \quad \frac{dR}{dt} = \gamma I(t) \tag{1}$$

SIR assumes a homogeneous and constant population, and it is fully defined by the parameters *β* (transmission rate) and *γ* (recovery rate). The intuition behind this model is that every infected patient comes into contact with *β* people. Since only the susceptible people can become infected, the chance of interacting with a susceptible person is simply the proportion of susceptible people in the entire population, *N* = *S* + *I* + *R*. Likewise, at every time point, a proportion *γ* of the infected people is removed from the system. Figure 1a depicts the general behaviour of an SIR model.

**Figure 1.** (**a**) General behaviour of the SIR model. (**b**) The number of infections predicted by the SIR model with fixed parameters, fitted to the US data for 1 week prediction. (**c**) Similar to (**b**), but with time-varying parameters.

### *1.2. Related Work*

The main idea behind combining compartmental models with machine learning is to replace the fixed parameters of the former with time-varying parameters that can be learned from data [6,17–19]. However, most of these approaches focus on finding the parameters that explain past data, not on predicting the number of newly infected people. Although those approaches are useful for obtaining insight into the dynamics of the disease, parameters that explain the past will not necessarily predict future behaviour accurately.

Particularly relevant to our approach is the work by Arik et al. [5], who used latent variables and autoencoders to model extra compartments in an extended Susceptible-Exposed-Infected-Removed (SEIR) model. Those additional compartments bring further insight into how the disease impacts the population [20,21]; however, our experiments suggest that they are not needed for an accurate prediction of the number of new infections. One limitation of their model is a decrease in performance when the trend in the number of new infections changes. We hypothesize that those changes in trend are related to the government policies that are in place at a specific point in time. SIMLR is able to capture those changes by tracking the policies implemented at the government level.

A different line of work replaces epidemiological models with machine learning methods to directly predict the number of new infections [22–25]. Importantly, Yeung et al. [26] added non-pharmaceutical interventions (policies) as features in their models; however, their approach is limited to making predictions up to two weeks in advance, since information about the policies that will be implemented in the future is not available at inference time. Our SIMLR approach differs by being interpretable and also by forecasting policy changes, which allows it to extend the horizon of the Δ*I* predictions.

There are many models that attempt to predict the evolution of the COVID-19 epidemic. The Centers for Disease Control and Prevention (CDC) in the United States allows different research teams across the globe to submit their forecasts of the number of cases and deaths 1 to 8 weeks in advance [27]. More than 100 teams have submitted at least one prediction to this competition. We compare SIMLR with all of the models that made predictions 1 to 4 weeks in advance in the same time span as our study.

### **2. Materials and Methods**

We view SIMLR as a probabilistic graphical model that uses a mixture of experts approach to forecast the number of new COVID-19 infections, 1 to 4 weeks in advance. Figure 2 shows the intuition behind SIMLR. Changes in the government policies are likely to modify the trend of the number of new infections. We assume that stronger policies are likely to decrease the number of new infections, while the opposite effect is likely to occur when relaxing the policies. These changes are reflected as a change in the parameters of the SIR model. Using those parameters, we can then predict the number of new infections, then use that to compute the likelihood of observing other new policy changes in the short term.

While Figure 2 is a schematic diagram used for pedagogical purposes, Figure 3 depicts the formal probabilistic graphical model, as a plate model, that we use to estimate the parameters of the SIR model, the number of new infections, and the likelihood of observing changes in policies 1 to 4 weeks in advance. The blue nodes are estimated at every time point, while the values of the green nodes are either known as part of the historical data or inferred at a previous time point. The random variables are assumed to have the following distributions:

$$\begin{array}{lcl} \text{CT}_{t+1} \mid \{\text{CP}_{t-\tau}\}_{\tau\in\{1,2,3\}} & \sim & \text{Cat}_{K\in\{-1,0,1\}}(\theta_{\text{CT}})\\ \beta_{t+1} \mid \{\beta_{t-\tau}\}_{\tau\in\{0,1,2\}}, \text{CT}_{t+1} & \sim & \mathcal{N}(\mu_{\beta}, \Sigma_{\beta})\\ \gamma_{t+1} \mid \{\gamma_{t-\tau}\}_{\tau\in\{0,1,2\}}, \text{CT}_{t+1} & \sim & \mathcal{N}(\mu_{\gamma}, \Sigma_{\gamma})\\ \text{SIR}_{t+1} \mid \beta_{t+1}, \gamma_{t+1} & \sim & \mathcal{N}(\mu_{SIR}, \Sigma_{SIR})\\ U_{t} \mid \{\text{SIR}_{t-\tau}\}_{\tau\in\{0,1,2\}} & \sim & \text{Cat}_{K\in\{-1,0,1\}}(\theta_{U})\\ O_{t} \mid W_{t} & \sim & \text{Cat}_{K\in\{0,1\}}(\theta_{O})\\ \text{CP}_{t+1} \mid O_{t}, U_{t} & \sim & \text{Cat}_{K\in\{-1,0,1\}}(\theta_{\text{CP}}) \end{array} \tag{2}$$

where $t$ indexes the current week, $\text{SIR}_t = [S_t, I_t, R_t]$, $\mu_{SIR} \in \mathbb{R}^3$ is given below by Equation (3), and $\mu_\beta = \alpha_{0,\text{CT}_{t+1}} + \alpha_{1,\text{CT}_{t+1}}\beta_{t} + \alpha_{2,\text{CT}_{t+1}}\beta_{t-1} + \alpha_{3,\text{CT}_{t+1}}\beta_{t-2}$ and $\mu_\gamma = \omega_{0,\text{CT}_{t+1}} + \omega_{1,\text{CT}_{t+1}}\gamma_{t} + \omega_{2,\text{CT}_{t+1}}\gamma_{t-1} + \omega_{3,\text{CT}_{t+1}}\gamma_{t-2}$ are linear combinations of the three previous values of $\beta$ and $\gamma$, respectively. The coefficients of those linear combinations depend on the value of the random variable $\text{CT}_{t+1}$. We did not specify a distribution for the node $\text{New\_infections}_{t+1}$ because its value is deterministically computed as $S_t - S_{t+1}$.

**Figure 2.** Intuition behind SIMLR. The policies currently in place determine the value of the parameters needed to infer the next values, using an SIR model. Those predictions are then used to estimate how the policies might change in the future.

Informally, the assignment $\text{CT}_t = -1$ means that we expect a change in trend from an increasing number of infections to a decreasing one. The opposite happens when $\text{CT}_t = 1$, while $\text{CT}_t = 0$ means that we expect the population to follow the current trend (either increasing or decreasing). We assume these changes in trend depend on changes in the government policies 2 to 4 weeks prior to the week of our forecast. For example, we use $\{\text{CP}_{t-3}, \text{CP}_{t-2}, \text{CP}_{t-1}\}$ when predicting the number of new infections at $t+1$, $\Delta I_{t+1}$, and we need $\{\text{CP}_{t}, \text{CP}_{t+1}, \text{CP}_{t+2}\}$ when predicting $\Delta I_{t+4}$. Note that, at time $t$, we will not know $\text{CP}_{t+1}$ nor $\text{CP}_{t+2}$. We chose this interval based on the assumption that the incubation period of the virus is 2 weeks.

The status of $\text{CT}_{t+1}$ defines the coefficients that relate $\beta_{t+1}$ and $\gamma_{t+1}$ with their three previous values $\beta_t, \beta_{t-1}, \beta_{t-2}$ and $\gamma_t, \gamma_{t-1}, \gamma_{t-2}$, respectively. Since $\beta_{t+1}$ and $\gamma_{t+1}$ fully parameterize the SIR model in Equation (1), we can estimate the number of newly infected people, $\Delta I_{t+1}$, from these parameters (as well as the SIR values at time $t$).

The random variables $U_t \in \{-1, 0, 1\}$ and $O_t \in \{0, 1\}$ are auxiliary variables designed to predict the probability of observing a change in policy at time $t + 1$. Intuitively, $U_t$ represents the "urgency" of modifying a policy. As the number of cases per 100K inhabitants and the rate of change between the number of cases at two consecutive time points increase, the urgency to set stricter government policies increases. As the number (and rate of change) of cases decreases, the urgency to relax the policies increases. Finally, $O_t$ models the "willingness" to execute a change in government policies. As the number of time points without a change increases, so does this "willingness".

**Figure 3.** Modeling SIMLR as a PGM for forecasting new cases of COVID-19. The blue nodes are estimated at each time point, while the green ones are either based on past information or were estimated in a previous iteration.

### *2.1. SIR with Time-Varying Parameters*

We can approximate an SIR model by transforming the differential equations (1) into the difference equations:

$$\begin{aligned} S_t &= -\beta \frac{S_{t-1} I_{t-1}}{N} + S_{t-1} \\ I_t &= \beta \frac{S_{t-1} I_{t-1}}{N} - \gamma I_{t-1} + I_{t-1} \\ R_t &= \gamma I_{t-1} + R_{t-1} \end{aligned} \tag{3}$$

where $S_t$, $I_t$, $R_t$ are the numbers of individuals in the Susceptible, Infected and Removed groups, respectively, at time $t$. Similarly, $S_{t-1}$, $I_{t-1}$, $R_{t-1}$ represent the number of individuals in each group at time $t - 1$. $\beta$ is the transmission rate, and $\gamma$ is the recovery rate.
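As a minimal sketch, the difference equations (3) can be iterated directly. The function names `sir_step` and `simulate`, the initial state, and the parameter values below are illustrative assumptions and are not taken from the experiments.

```python
import numpy as np


def sir_step(state, beta, gamma):
    """Advance (S, I, R) by one time step using the difference equations (3)."""
    s, i, r = state
    n = s + i + r                       # constant population size N
    new_infections = beta * s * i / n   # people moving from S to I
    new_removed = gamma * i             # people moving from I to R
    return (s - new_infections,
            i + new_infections - new_removed,
            r + new_removed)


def simulate(state, beta, gamma, steps):
    """Return an array of shape (steps + 1, 3) holding the S, I, R trajectory."""
    trajectory = [state]
    for _ in range(steps):
        state = sir_step(state, beta, gamma)
        trajectory.append(state)
    return np.array(trajectory)


# Illustrative values only.
trajectory = simulate((990.0, 10.0, 0.0), beta=0.3, gamma=0.1, steps=100)
new_infections = trajectory[:-1, 0] - trajectory[1:, 0]  # S_t - S_{t+1}
```

The last line recovers the number of new infections per step as the decrease in the susceptible group, which is how the New_infections node is computed in the graphical model.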

While the SIR model is non-linear with respect to the states (S, I, R), it is linear with respect to the parameters $\beta$ and $\gamma$. Therefore, under the assumption of constant and known population size (i.e., $N = S_t + I_t + R_t$), we can rewrite Equation (3) as:

$$
\begin{aligned}
\begin{bmatrix} S_t \\ I_t \end{bmatrix} &= \begin{bmatrix} -\frac{S_{t-1}I_{t-1}}{N} & 0 \\ \frac{S_{t-1}I_{t-1}}{N} & -I_{t-1} \end{bmatrix} \begin{bmatrix} \beta \\ \gamma \end{bmatrix} + \begin{bmatrix} S_{t-1} \\ I_{t-1} \end{bmatrix} \\
R_t &= N - S_t - I_t
\end{aligned} \tag{4}
$$

Given a sequence of states $x_1, \ldots, x_n$, where $x_t = [S_t \; I_t]^T$, it is possible to estimate the optimal parameters of the SIR model as:

$$(\beta^*, \gamma^*) = \underset{\beta, \gamma}{\arg\min} \sum_{i=1}^n \|x_i - \hat{x}_i\|^2 + \lambda_1(\beta - \beta_0)^2 + \lambda_2(\gamma - \gamma_0)^2 \tag{5}$$

where $\hat{x}_i$ is computed using Equation (4), and $\lambda_1$ and $\lambda_2$ are optional regularization parameters that allow the incorporation of the priors $\beta_0$ and $\gamma_0$. For the case of Gaussian priors, i.e., $\beta \sim \mathcal{N}(\beta_0, \sigma_\beta^2)$ and $\gamma \sim \mathcal{N}(\gamma_0, \sigma_\gamma^2)$, we use $\lambda_1 = \frac{1}{2\sigma_\beta^2}$ and $\lambda_2 = \frac{1}{2\sigma_\gamma^2}$ [28]. Intuitively, Equation (5) computes the transmission rate ($\beta^*$) and the recovery rate ($\gamma^*$) that best explain the number of new infections, deaths, and recovered people in a fixed time frame. If we know a standard transmission rate and recovery rate a priori ($\beta_0$, $\gamma_0$), it is possible to incorporate them into Equation (5) through the regularization terms. The weights $\lambda_1$ and $\lambda_2$ control how much to weight those prior parameters. Small weights mean that we essentially use the parameters learned from the data, and large weights place more emphasis on the prior information.
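Because Equation (4) is linear in $(\beta, \gamma)$, the minimization in Equation (5) is a regularized linear least-squares problem with a closed-form solution. The sketch below solves its normal equations with NumPy; the function name `fit_sir_parameters` and the synthetic data are illustrative assumptions, not the implementation used in the experiments.

```python
import numpy as np


def fit_sir_parameters(S, I, N, beta0=0.0, gamma0=0.0, lam1=0.0, lam2=0.0):
    """Estimate (beta, gamma) from consecutive S and I observations.

    Minimizes the objective of Equation (5) in closed form, using the fact
    that Equation (4) is linear in (beta, gamma).
    """
    S = np.asarray(S, dtype=float)
    I = np.asarray(I, dtype=float)
    contact = S[:-1] * I[:-1] / N                # S_{t-1} I_{t-1} / N
    zeros = np.zeros_like(contact)
    # Stack the S rows and the I rows of Equation (4) into one design matrix.
    A = np.vstack([np.column_stack([-contact, zeros]),
                   np.column_stack([contact, -I[:-1]])])
    b = np.concatenate([S[1:] - S[:-1], I[1:] - I[:-1]])
    prior = np.array([beta0, gamma0])
    Lam = np.diag([lam1, lam2])
    # Normal equations of the regularized least-squares problem.
    theta = np.linalg.solve(A.T @ A + Lam, A.T @ b + Lam @ prior)
    return theta[0], theta[1]


# Synthetic, noise-free data generated with beta = 0.3 and gamma = 0.1
# (illustrative values only).
N = 1000.0
S, I = [990.0], [10.0]
for _ in range(30):
    infections = 0.3 * S[-1] * I[-1] / N
    S.append(S[-1] - infections)
    I.append(I[-1] + infections - 0.1 * I[-1])

beta_hat, gamma_hat = fit_sir_parameters(S, I, N)
```

With `lam1 = lam2 = 0` this reduces to ordinary least squares; increasing the weights pulls the estimates towards `beta0` and `gamma0`.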

In the traditional SIR model, we set $\lambda_1 = \lambda_2 = 0$ and fit a single $\beta$ and $\gamma$ to the entire time series. However, as shown in Figure 1a, an SIR model with fixed parameters is unable to accurately model several waves of infections. As an illustration, Figure 1b shows the predictions produced by fitting an SIR model with fixed parameters (Equation (5)) to the US data from 29 March 2020 to 3 May 2021, and then using those parameters to make predictions one week in advance, over this same interval. That is, using this learned ($\beta$, $\gamma$), and the number of people in the $S$, $I$, and $R$ compartments on 28 March 2020, we predicted the number of observed cases during the week of 29 March 2020 to 4 April 2020. We repeated the same procedure for the entire time series. Note that even though the parameters $\beta$ and $\gamma$ were found using the entire time series (i.e., using information that was not available at the time of prediction), the resulting model still does a poor job fitting the reported data.

Figure 1c, on the other hand, was created by allowing *β* and *γ* to change every week. Here, we first found the parameters that fit the data from 29 March 2020 to 4 April 2020 (call them *β*<sub>1</sub> and *γ*<sub>1</sub>), then used those parameters along with the SIR state on 28 March 2020 to predict the number of new infections one week ahead, i.e., for that same week of 29 March 2020 to 4 April 2020. By repeating this procedure over the entire time series we obtained an almost perfect fit to the data. Of course, these are also not "legal" predictions since they too use information that is not available at prediction time: they used the number of reported infections during a given week to find the parameters, which were then used to estimate the number of cases over that same week. However, this "cheating" example shows that an SIR model with the optimal time-varying parameters can model the complex dynamics of COVID-19. Recall from Figure 1b that this is not the case for the SIR model with fixed parameters, which cannot even properly fit the training data.
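To make the weekly fitting procedure concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the standard discrete-time SIR update (the paper's Equations (3) to (5) are not restated here), uses an unregularized squared error, and replaces whatever optimizer was actually used with a coarse grid search. The names `sir_step`, `simulate`, and `fit_week` are hypothetical.

```python
import numpy as np

def sir_step(S, I, R, beta, gamma, N):
    """One day of the standard discrete-time SIR update (assumed form)."""
    new_infections = beta * S * I / N
    new_removed = gamma * I
    return S - new_infections, I + new_infections - new_removed, R + new_removed

def simulate(S0, I0, R0, beta, gamma, N, days=7):
    """Roll the SIR state forward and return an array of shape (days, 3)."""
    states = [(S0, I0, R0)]
    for _ in range(days):
        states.append(sir_step(*states[-1], beta, gamma, N))
    return np.array(states[1:])

def fit_week(observed, S0, I0, R0, N, betas, gammas):
    """Grid search for the (beta, gamma) pair that best reproduces one week.

    `observed` holds the daily (S, I, R) values for that week; the squared
    error stands in for the data-fit term, with no prior (lambda_1 = lambda_2 = 0).
    """
    best = None
    for b in betas:
        for g in gammas:
            sim = simulate(S0, I0, R0, b, g, N, days=len(observed))
            err = float(np.sum((sim - observed) ** 2))
            if best is None or err < best[0]:
                best = (err, b, g)
    return best[1], best[2]
```

Calling `fit_week` once per week, each time starting from the state at the end of the previous week, gives one parameter pair per week, which is the kind of sequence the time-varying model works with.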

### *2.2. Estimating βt*+<sup>1</sup> *and γt*+<sup>1</sup>

Naturally, the challenge is "legally" computing the appropriate values of *β*<sub>*t*+1</sub> and *γ*<sub>*t*+1</sub>, for each week, using only the data that are known at time *t*. Figure 3 shows that computing *β*<sub>*t*+1</sub> and *γ*<sub>*t*+1</sub> depends on the status of the random variable CT<sub>*t*+1</sub>. When CT<sub>*t*+1</sub> = 0, i.e., there is no change in the current trend, we assume that:

$$\begin{aligned} \beta\_{t+1} &\sim \mathcal{N}(\boldsymbol{\alpha}\_{0} + \boldsymbol{\alpha}\_{1}\beta\_{t} + \boldsymbol{\alpha}\_{2}\beta\_{t-1} + \boldsymbol{\alpha}\_{3}\beta\_{t-2}, \sigma\_{\beta}^{2}) \\ \gamma\_{t+1} &\sim \mathcal{N}(\boldsymbol{\omega}\_{0} + \boldsymbol{\omega}\_{1}\gamma\_{t} + \boldsymbol{\omega}\_{2}\gamma\_{t-1} + \boldsymbol{\omega}\_{3}\gamma\_{t-2}, \sigma\_{\gamma}^{2}) \end{aligned} \tag{6}$$

At time *t*, we can use the historical daily data *x*<sub>1</sub>, *x*<sub>2</sub>, ..., *x*<sub>*t*</sub> to find the weekly parameters *β*<sub>1</sub>, *β*<sub>2</sub>, ..., *β*<sub>*t*/7</sub> and *γ*<sub>1</sub>, *γ*<sub>2</sub>, ..., *γ*<sub>*t*/7</sub>. Note that there is just one value for each week, so if there are 140 days, there are 140/7 = 20 weeks. The first weekly pair (*β*<sub>1</sub>, *γ*<sub>1</sub>) is found by fitting Equation (5) to *x*<sub>1</sub>, ..., *x*<sub>7</sub>; (*β*<sub>2</sub>, *γ*<sub>2</sub>) to *x*<sub>8</sub>, ..., *x*<sub>14</sub>; and so on. Finally, we find the parameters *α* and *ω* in Equation (6) by maximizing the likelihood of the computed pairs. After finding those parameters, it is straightforward to infer (*β*<sub>*t*+1</sub>, *γ*<sub>*t*+1</sub>). Note that this approach is the probabilistic version of linear regression. To estimate the parameters *σ*<sub>*β*</sub><sup>2</sup> and *σ*<sub>*γ*</sub><sup>2</sup> we can simply estimate the variance of the residuals. An advantage of also computing these variances is that it is possible to obtain confidence intervals by sampling from the distribution in Equation (6) and then using those samples along with Equation (3) to estimate the distribution of the number of newly infected people.
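Because the noise in Equation (6) is Gaussian, maximizing the likelihood of the mean coefficients reduces to ordinary least squares on the three lagged values. The sketch below illustrates this under that reading; the function names (`fit_ar3`, `predict_next`, `sample_next`) are hypothetical, and the degrees-of-freedom correction in the variance estimate is a choice made here, not something stated in the text.

```python
import numpy as np

def fit_ar3(series):
    """Least-squares fit of y_t = a0 + a1*y_{t-1} + a2*y_{t-2} + a3*y_{t-3}.

    Returns the coefficient vector (a0, a1, a2, a3) and the residual variance.
    """
    y = np.asarray(series, dtype=float)
    X = np.column_stack([np.ones(len(y) - 3), y[2:-1], y[1:-2], y[:-3]])
    target = y[3:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    dof = max(len(target) - X.shape[1], 1)  # assumed correction for fitted parameters
    return coef, float(resid @ resid / dof)

def predict_next(series, coef):
    """Mean of the next value given the three most recent ones."""
    y = np.asarray(series, dtype=float)
    return float(coef @ np.array([1.0, y[-1], y[-2], y[-3]]))

def sample_next(series, coef, sigma2, n, rng):
    """Draw n samples of the next value from the fitted Gaussian."""
    return rng.normal(predict_next(series, coef), np.sqrt(sigma2), size=n)
```

Passing each sample from `sample_next` through the SIR update yields a set of forecasts whose spread can be summarized as an interval.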

We estimated *β*<sub>*t*+1</sub> and *γ*<sub>*t*+1</sub> as a function of the three previous values of those parameters since this allows the estimates to incorporate the velocity and acceleration at which the parameters change. We computed the velocity of *β* as *v*<sub>*β*,*t*</sub> = *β*<sub>*t*</sub> − *β*<sub>*t*−1</sub> and its acceleration as *a*<sub>*β*,*t*</sub> = *v*<sub>*β*,*t*</sub> − *v*<sub>*β*,*t*−1</sub>. Then, estimating *β*<sub>*t*</sub> = *θ*<sub>0</sub> + *θ*<sub>1</sub>*β*<sub>*t*−1</sub> + *θ*<sub>2</sub>*v*<sub>*β*,*t*−1</sub> + *θ*<sub>3</sub>*a*<sub>*β*,*t*−1</sub> is equivalent to the model in Equation (6). The same reasoning applies to the computation of *γ*<sub>*t*</sub>. We call this approach the "trend-following varying-time parameters SIR", tf-v-SIR.
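The equivalence can be verified by substituting the definitions of velocity and acceleration, using only the symbols defined above:

$$\begin{aligned} \beta_t &= \theta_0 + \theta_1\beta_{t-1} + \theta_2(\beta_{t-1}-\beta_{t-2}) + \theta_3(\beta_{t-1}-2\beta_{t-2}+\beta_{t-3}) \\ &= \theta_0 + (\theta_1+\theta_2+\theta_3)\,\beta_{t-1} - (\theta_2+2\theta_3)\,\beta_{t-2} + \theta_3\,\beta_{t-3} \end{aligned}$$

Matching terms with the mean in Equation (6) gives $\alpha_0=\theta_0$, $\alpha_1=\theta_1+\theta_2+\theta_3$, $\alpha_2=-(\theta_2+2\theta_3)$, and $\alpha_3=\theta_3$. The map is invertible ($\theta_3=\alpha_3$, $\theta_2=-\alpha_2-2\alpha_3$, $\theta_1=\alpha_1+\alpha_2+\alpha_3$), so the two parameterizations describe the same family of models.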

When CT<sub>*t*</sub> = −1 or CT<sub>*t*</sub> = 1 (which represents a change in trend from an increasing number of infections to a decreasing number, or vice versa), we set *β*<sub>*t*+1</sub> and *γ*<sub>*t*+1</sub> to values such that the predicted number of new cases at week *t* + 1 is identical to the one at week *t*. We call this the "Same as the Last Observed Week" (SLOW) model. As shown in Section 3, SLOW is a baseline with very good performance despite its simplicity. Given that the pandemic is a physical phenomenon that changes relatively slowly from one week to the next, assuming that the number of new cases will remain constant is a reasonable prediction.

### *2.3. Estimating CTt*+1, *CPt*+1, *Ot*

The random variables CT<sub>*t*+1</sub>, CP<sub>*t*+1</sub> and O<sub>*t*</sub> in Figure 3 are all discrete nodes with discrete parents, meaning their probability mass functions are fully defined by conditional probability tables (CPTs). Learning the parameters of such CPTs from data is challenging due to the scarcity of historical information. The random variable CT<sub>*t*+1</sub> depends on the changes in policy (CP) at times *t* − 1, *t* − 2, and *t* − 3; however, there are very few changes in policy in a given region, meaning it is difficult to accurately estimate those probabilities from data. For the random variable O, which represents the "willingness" of the government to implement a change in policy, there is no observable data at all. We therefore relied on prior expert knowledge to set the parameters of the conditional probability tables for these random variables. Figure 4 shows the CPTs for the random variables CT<sub>*t*+1</sub>, CP<sub>*t*+1</sub>, and O<sub>*t*</sub>. The intuition used to generate the CPTs is as follows:

First, we considered that a change in trend in the current week depends on changes in policies during the previous three weeks. We chose three weeks under the hypothesis that the incubation period for the virus is 2 weeks, so the effects of a policy will be reflected approximately 2 weeks after a change. We also decided to track one week after and one week before this period, which results in tracking CP<sub>*t*−3</sub> to CP<sub>*t*−1</sub>. Second, we assumed that whenever we observe a change of policy that would move the trend from going up to going down, that change in trend will most likely happen. This is why most of the probability mass is located in a single column. For example, if we observe that the policies are relaxed at any point during the weeks *t* − 3, *t* − 2, or *t* − 1, then we assume that we will observe a change in trend with 99.9% probability.

The rationale for the CPT *P*(O<sub>*t*</sub> | W<sub>*t*</sub>) is that the government becomes more open to implementing changes after long periods of "inactivity". For example, if it implements a change in policy this week (W<sub>*t*</sub> = 0), then the probability of considering a second change of policy during the same week is very small (0.01%). We are assuming that, after a change in policy, the government will wait to see the effect of that change before taking further action. If 4 weeks have passed since the last change in policy, we estimated the probability of considering a change in the policy as 50%, while if more than 7 weeks have passed, then the government is fully open to the possibility of implementing a new change.

*P*(O<sub>*t*</sub> | W<sub>*t*</sub>) estimates the probability of considering a change in the policy. The probability of actually implementing a change, *P*(CP<sub>*t*+1</sub> | O<sub>*t*</sub>, U<sub>*t*</sub>), depends not only on how willing the government is, but also on how urgent it is to make a change. In general, if the government is open to implementing a change, and the urgency is "high", then the probability of changing a policy is high. We also considered that the government "prefers" to either not make changes in policy or relax the policies, rather than to implement stricter policies.




**Figure 4.** Conditional probability tables used by SIMLR. The names of the variables refer to the nodes that appear in Figure 2 in the main text.

### *2.4. Estimating Ut*

For modelling the random variable U*t*, which represents the "Urgency to change the trend", we use an NN-CPD (neural-network conditional probability distribution), which is a modified version of the multinomial logistic conditional probability distribution [29].

**Definition 1** (NN-CPD)**.** *Let Y* ∈ {1, ..., *m*} *be an m-valued random variable with k parents X*<sub>1</sub>, ..., *X*<sub>*k*</sub> *that each take on numerical values. The conditional probability distribution P*(*Y* | *X*<sub>1</sub>, ..., *X*<sub>*k*</sub>) *is an NN-CPD if there is a function z* = *f*<sub>*θ*</sub>(*X*<sub>1</sub>, ..., *X*<sub>*k*</sub>) ∈ ℝ<sup>*m*</sup>*, represented as a neural network with parameters θ, such that p*(*Y* = *i* | *x*<sub>1</sub>, ..., *x*<sub>*k*</sub>) = exp(*z*<sub>*i*</sub>)/∑<sub>*j*</sub> exp(*z*<sub>*j*</sub>)*, where z*<sub>*i*</sub> *represents the i-th entry of z.*
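As an illustration of Definition 1, the sketch below evaluates an NN-CPD with a one-hidden-layer network followed by a softmax. The hidden activation (`tanh`), the random initialization, and the names `softmax`, `init_params`, and `nn_cpd` are assumptions made for this example; the definition itself only requires some neural network whose *m* outputs are passed through a softmax.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_params(n_parents, n_hidden, m, rng):
    """Small random weights for a network with one hidden layer (illustrative)."""
    W1 = rng.normal(0.0, 0.1, size=(n_parents, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, size=(n_hidden, m))
    b2 = np.zeros(m)
    return W1, b1, W2, b2

def nn_cpd(x, params):
    """Return P(Y = i | x) for each row of x, i = 1..m.

    z = f_theta(x) is the network output; the softmax turns it into a
    probability vector, as in Definition 1.
    """
    W1, b1, W2, b2 = params
    hidden = np.tanh(x @ W1 + b1)  # activation choice is an assumption
    z = hidden @ W2 + b2
    return softmax(z)
```

For example, `nn_cpd(np.array([[120.0, 15.0]]), init_params(2, 64, 3, np.random.default_rng(0)))` returns one row of three non-negative values that sum to one, one probability per value of the child variable.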

Note that U<sub>*t*</sub> is a latent variable, so there is no observable data for it at all. We again rely on domain knowledge to estimate its probabilities. To compute *P*(U<sub>*t*</sub> | SIR<sub>*t*−2</sub>, SIR<sub>*t*−1</sub>, SIR<sub>*t*</sub>), we extract two features: *c*<sub>*t*</sub> = 10 × 10<sup>5</sup>(*S*<sub>*t*−1</sub> − *S*<sub>*t*</sub>)/*N*, which represents the number of new reported infections per 100K inhabitants; and *v*<sub>*t*</sub> = *c*<sub>*t*</sub> − *c*<sub>*t*−1</sub>, which estimates the rate of change of *c*<sub>*t*</sub>. We then define *P*(U<sub>*t*</sub> | SIR<sub>*t*−2</sub>, SIR<sub>*t*−1</sub>, SIR<sub>*t*</sub>) = *P*(U<sub>*t*</sub> | *c*<sub>*t*</sub>, *v*<sub>*t*</sub>).

To learn the parameters *θ* we created the dataset shown in Figure 5. Note that the targets in this dataset are probabilities. We relied on the probabilistic labels approach proposed by Vega et al. [30] to use a dataset with few training instances along with their probabilities to learn the parameters of a neural network more efficiently. We trained a simple neural network with a single hidden layer of 64 units and 3 output units with softmax activation.
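When the targets are probability vectors rather than class indices, one common training objective is the cross-entropy between the target distribution and the softmax output. Whether [30] uses exactly this loss is not stated here, so the sketch below should be read as an assumption; `soft_cross_entropy` and `grad_wrt_logits` are hypothetical names.

```python
import numpy as np

def _log_softmax(logits):
    """Row-wise log-softmax, shifted for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def soft_cross_entropy(logits, target_probs):
    """Mean cross-entropy between target distributions and softmax(logits)."""
    return float(-(target_probs * _log_softmax(logits)).sum(axis=1).mean())

def grad_wrt_logits(logits, target_probs):
    """Gradient of the mean loss with respect to the logits.

    For targets that sum to one per row this is (softmax(logits) - target) / n.
    """
    probs = np.exp(_log_softmax(logits))
    return (probs - target_probs) / len(logits)
```

The loss is smallest when the predicted distribution equals the target distribution, at which point the gradient with respect to the logits vanishes.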

The random variables U<sub>*t*</sub> ∈ {−1, 0, 1} and O<sub>*t*</sub> ∈ {0, 1} are auxiliary variables designed to predict the probability of observing a change in policy at time *t* + 1. Intuitively, U<sub>*t*</sub> represents the "urgency" of modifying a policy. As the number of cases per 100K inhabitants and the rate of change between the number of cases in two consecutive time points increase, the urgency to set stricter government policies increases. As the number (and rate of change) of cases decreases, the urgency to relax the policies increases. Most of the parameters in both NN-CPD tables are similar for the US and Canada; the difference arises from a perceived preference for not setting very strict policies in the US during the first year of the pandemic.



**Figure 5.** Dataset used to create the NN-CPD for the variable *Ut* and its visualization. Values closer to 1 (yellow) increase *p*(*Ut* = 1 | *Ct*, *Vt*). Values closer to 0 (green) increase *p*(*Ut* = 0 | *Ct*, *Vt*). Values closer to −1 (blue) increase *p*(*Ut* = −1 | *Ct*, *Vt*).

### *2.5. Evaluation*

We evaluated the performance of SIMLR, in terms of the mean absolute percentage error (MAPE) and mean absolute error (MAE), for forecasting the number of new infections one to four weeks in advance, on data from the United States (as a country and individually for every state) and the six biggest provinces of Canada: Alberta (AB), British Columbia (BC), Manitoba (MN), Ontario (ON), Quebec (QB), and Saskatchewan (SK). For each of the regions, the predictions are made on a weekly basis, over the 39 weeks from 26 July 2020 to 1 May 2021. This time span captures different waves of infections. Equation (7) shows the computation of the metrics used for evaluating our approach.

$$\begin{aligned} \text{MAPE} &= \frac{1}{n} \sum\_{t=1}^{n} \left| \frac{y\_t - \hat{y}\_t}{y\_t} \right| \\ \text{MAE} &= \frac{1}{n} \sum\_{t=1}^{n} |y\_t - \hat{y}\_t| \end{aligned} \tag{7}$$
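A direct transcription of Equation (7) is shown below as a small sketch; the function names are chosen here for illustration. Note that, as written in Equation (7), MAPE is a fraction rather than a value multiplied by 100, and it is undefined whenever an observed value is zero.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction, following Equation (7).

    Undefined (division by zero) if any observed value is zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

def mae(y_true, y_pred):
    """Mean absolute error, following Equation (7)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))
```

For observed weekly counts of 100, 200, and 400 with forecasts of 110, 180, and 400, the absolute errors are 10, 20, and 0, so the MAE is 10 and the MAPE is (0.1 + 0.1 + 0)/3.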

At the end of every week, we fitted the SIMLR parameters using the data that was available until that week. For example, on 25 July 2020, we used all the data available from 1 January 2020 to 25 July 2020 to fit the parameters of SIMLR. Then, we made the predictions for the number of new infections during the weeks: 26 July 2020–1 August 2020 (one week in advance), 2 August 2020–8 August 2020 (two weeks in advance), 9 August 2020–15 August 2020 (three weeks in advance), and 16 August 2020–22 August 2020 (four weeks in advance). After this, we then fitted the parameters with data up to 1 August 2020 and repeated the same process, for 38 more iterations, until we covered the entire range of predictions.
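The rolling evaluation schedule described above can be sketched with the standard library alone. The helper names `forecast_windows` and `next_origin` are hypothetical; the sketch only reproduces the date arithmetic, not the model fitting.

```python
from datetime import date, timedelta

def forecast_windows(last_training_day, horizons=4):
    """Return (start, end) dates of the weeks forecast 1..horizons weeks ahead.

    Each window is seven days long and the first one starts the day after
    the last day of training data.
    """
    windows = []
    for h in range(1, horizons + 1):
        start = last_training_day + timedelta(days=7 * (h - 1) + 1)
        windows.append((start, start + timedelta(days=6)))
    return windows

def next_origin(last_training_day):
    """Move the end of the training data forward by one week."""
    return last_training_day + timedelta(days=7)
```

With `date(2020, 7, 25)` as the last training day, `forecast_windows` returns the four weeks listed in the text, and `next_origin` gives 1 August 2020 as the next refitting date.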

We compared the performance of SIMLR with the SIR compartmental model with time-varying parameters learned using Equation (6) but no other random variable (tf-v-SIR), and with the simple model that forecasts that the number of cases one to four weeks in advance is the "Same as the Last Observed Week" (SLOW). For the United States data, we also compared the performance of SIMLR against the publicly available predictions at the COVID-19 Forecast Hub, which are the predictions submitted to the Centers for Disease Control and Prevention (CDC) [31].

For training, we used the publicly available dataset OxCGRT [4], which contains the policies implemented by different regions, as well as the time period over which they were implemented. We limited our analysis to three policy decisions. For Canada, these were Workplace closing, Stay at home requirements, and Cancellation of public events. For the United States, we used Restrictions on gatherings, Vaccination policy, and Cancellation of public events. For information about the new number of reported cases and deaths, we used the publicly available COVID-19 Data Repository by the Center for Systems Science and Engineering at Johns Hopkins University [1]. The code for reproducing the results presented here is discussed in Appendix A.

### **3. Results**

### *3.1. Data Preprocessing*

Before inputting the time-series data to SIMLR, we performed some basic preprocessing during the training phase, and exclusively on the training data. We evaluated our models by comparing their predictions with the results reported by the different health agencies, i.e., we did not fill in the data on the test sets:

1. The original data contains the cumulative number of reported infections/deaths on a daily basis. We trivially transformed this time-series into the number of new daily infections/deaths.


In step 5, we assumed that everyone in a given region was susceptible at the start time, i.e., *S*<sub>0</sub> = *N*. At each new time point, we transferred the number of new infections from *S* to *I*, and the number of new deaths and recovered from *I* to *R*. If the number of new recovered people was not reported, we used the surveillance definition of recovered used by Canadian health agencies. This definition is based on the assumption that a recovered person is one who is not hospitalized and is 14 days past the day when they tested positive [32,33]:

"Active and recovered status is a surveillance definition to try to understand the number of active cases in the population. It is not related to clinical management of cases. It is based on the assumption that a case is recovered 14 days after a particular date..."

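The bookkeeping described above (cumulative counts turned into daily counts, then transfers between compartments) can be sketched as follows. This is a simplified illustration, not the authors' code: it ignores hospitalization status, treats every case as recovered exactly `recovery_days` after it was reported, and clips removals so that the infected compartment never becomes negative. The name `build_sir` is hypothetical.

```python
import numpy as np

def build_sir(cum_infections, cum_deaths, population, recovery_days=14):
    """Build daily S, I, R series from cumulative infections and deaths."""
    # Cumulative counts -> new daily counts (the first day keeps its own value).
    new_inf = np.diff(np.asarray(cum_infections, dtype=float), prepend=0.0)
    new_dead = np.diff(np.asarray(cum_deaths, dtype=float), prepend=0.0)

    # Surveillance-style recovery: cases reported `recovery_days` ago recover today.
    new_rec = np.zeros_like(new_inf)
    new_rec[recovery_days:] = new_inf[:-recovery_days]

    S = np.empty_like(new_inf)
    I = np.empty_like(new_inf)
    R = np.empty_like(new_inf)
    s, i, r = float(population), 0.0, 0.0  # everyone starts susceptible
    for t in range(len(new_inf)):
        s -= new_inf[t]
        i += new_inf[t]
        removed = min(i, new_dead[t] + new_rec[t])  # never remove more than are infected
        i -= removed
        r += removed
        S[t], I[t], R[t] = s, i, r
    return S, I, R
```

By construction, `S + I + R` equals the population at every time step.
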
### *3.2. MAPE and MAE*

Figure 6 shows the MAPE of the one- to four-week forecasts for the United States as a country and the six biggest provinces of Canada. Note that SIMLR has a consistently lower MAPE than tf-v-SIR and SLOW. Figure 7 shows a similar result in terms of MAE. Tables 1 and 2 show the means and standard deviations of the metrics corresponding to Figures 6 and 7. In addition, Table 3 shows the correlation coefficient between the time series of the reported new infections every week and the predictions made by the different models.

**Figure 6.** Comparison of SIMLR, SIR model with time-varying parameters, and SLOW. Table 1 contains the numerical information.

**Figure 7.** Comparison of SIMLR, SIR model with time-varying parameters, and SLOW in terms of MAE. To make the numbers comparable, the figures each show the US MAE values divided by 100.

**Table 1.** MAPE for the six biggest provinces of Canada and the United States as a country, one to four weeks in advance. The number in parentheses is the standard deviation.



**Table 2.** MAE for the six biggest provinces of Canada and the United States as a country, one to four weeks in advance. The number in parentheses is the standard deviation. For the US, the number of cases was divided by 100.

**Table 3.** Pearson correlation coefficient between the ground truth and the predictions for the six biggest provinces of Canada and the United States as a country, one to four weeks in advance.


Figure 8c shows how our proposed SIMLR approach compares with the 18 models that submitted predictions at the country level to the CDC during the same span of time (results at the state level are included in Appendix B). Note that SIMLR and the model *LNQ-ens1* are the best performing models, with no statistically significant difference (*p* > 0.05 on a paired *t*-test) with respect to MAPE.

**Figure 8.** (**a**) 1-week forecasts SIMLR, tf-v-SIR, and SLOW, for Alberta, Canada. (**b**) 2-week forecasts, of the same models, for US data. (**c**) Comparison of SIMLR versus models submitted to the CDC (on US data).

### **4. Discussion**

Figure 8 illustrates the actual predictions of SIMLR one week in advance for the province of Alberta, Canada, and two weeks in advance for the US as a country. These two cases exemplify the behaviour of SIMLR. As noted above, there is a 2- to 4-week lag after a policy changes before we see the effects. This means the task of making 1-week forecasts is relatively simple, as the relevant policy (at times *t* − 3 to *t* − 1) is fully observable. This allows SIMLR to directly compute CT<sub>*t*+1</sub>, which then determines whether to keep using the SIR model with time-varying parameters (if no policy changed at time *t* − 1, *t* − 2, or *t* − 3) or to use the SLOW predictor (if a policy changed).

Figure 8a shows a change in the trend of reported new cases at week 22. However, just by looking at the evolution of the number of new infections before week 22, there is no way to predict this change, which is why tf-v-SIR predicts that the number of new infections will continue growing. Since SIMLR observed a change in the government policies at week 20, it realized it could no longer rely on its estimation of parameters and so switched to the SLOW model, which is why it was more accurate here. A similar behaviour occurs in week 34, when the third wave of cases in Alberta started. Due to a relaxation in the policies in week 31, SIMLR (at week 31) correctly predicted a change of trend around weeks 33–35.

This behavior is not exclusive to the Alberta data, and it explains why the performance of SIMLR is consistently better than that of the baselines used for comparison in Figures 6 and 8c. A striking result is how hard it is to beat the simple SLOW model (COVIDhub-baseline). Out of the 19 models considered here, only five (including SIMLR) do better than this simple baseline when predicting three to four weeks ahead. This brings some insight into the challenge of making accurate predictions in the medium term, probably due to the need to predict, then use, policy change information. Tables A1–A4 in Appendix B show a comparison between our proposed SIMLR and tf-v-SIR against the models submitted to the CDC for all the states in the US. SIMLR consistently ranks among the best performers, with the advantage of being an interpretable model.

A deeper analysis of Tables A1–A4 shows that, in some states, the performance of SIMLR degrades for longer range predictions. This occurs because we are monitoring only the same three policies for all the states; however, different states might have implemented different policies and reacted differently to them. For example, closing schools might be a relevant policy in a state where there is an outbreak that involves children, but not as relevant if most of the cases are in older people.

Tracking irrelevant policies might degrade the performance of SIMLR. If the status of an irrelevant policy changes, then the dynamics of the disease will not be affected. The model, however, will assume that the change in the policy will cause a change of trend, and it will rely on the SLOW model instead of the more accurate tf-v-SIR. Although SIMLR can be adapted to track different policies, the policies that are relevant for a given state must be given as an input. So while we think our overall approach applies in general, our specific model (tracking these specific policies, etc.) might not produce accurate predictions across all the regions. This is also a strength, in that it is trivial to adapt our specific model to track the policies of interest within a given region.

Predictions at the country level are more complicated, since most of the time policies are implemented at the state (or province) level instead of nationally. For making predictions for an entire country, as well as predictions three or four weeks in advance, SIMLR first predicts, then uses, the likelihood of observing a change in trend, at every week. In these cases, the random variable *CTt*+<sup>1</sup> no longer acts like a "switch", but instead it mixes the predictions of the tf-v-SIR and SLOW models, according to the probability of observing a change in the trend.
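One simple way to realize this mixing is a probability-weighted average of the two forecasts, with the hard "switch" recovered when the probability is exactly 0 or 1. The linear weighting and the names `slow_forecast` and `mix_forecasts` are assumptions made for this sketch; the text states only that the predictions are mixed according to the probability of a change in trend.

```python
def slow_forecast(last_week_cases, horizons=4):
    """SLOW baseline: repeat the last observed weekly count for every horizon."""
    return [float(last_week_cases)] * horizons

def mix_forecasts(p_change, tfv_sir_cases, slow_cases):
    """Blend two forecasts for the same week.

    p_change is the probability of a change in trend (CT != 0): with
    p_change = 0 the tf-v-SIR forecast is returned unchanged, and with
    p_change = 1 the SLOW forecast is returned unchanged.
    """
    if not 0.0 <= p_change <= 1.0:
        raise ValueError("p_change must be a probability")
    return (1.0 - p_change) * tfv_sir_cases + p_change * slow_cases
```

For instance, with a 25% probability of a change in trend, a tf-v-SIR forecast of 1000 cases and a SLOW forecast of 800 cases blend to 0.75 × 1000 + 0.25 × 800 = 950 cases.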

Figure 8b shows that whenever there is a stable trend in the number of new reported infections—which suggests there have been no recent policy changes—SIMLR relies on the predictions of the tf-v-SIR model; however, as the number (and rate of change) of new infections increases, so does the probability of observing a change in the policy. Therefore, SIMLR starts giving more weight to the predictions of the SLOW model. Note this behavior in the same figure during weeks 13–20.

One limitation of SIMLR is that it relies on conditional probabilities that are hard to learn due to lack of data, which forced us to build them based on domain knowledge. If this prior knowledge is inaccurate, then the predictions might also be misleading. Also, different regions might have different "thresholds" for taking action. Despite this limitation, SIMLR produced state-of-the-art results in forecasting both for the US as a country and at the provincial level in Canada, as well as very competitive results in predictions at the state level in the US.

Note that modelling SIMLR as a PGM does not imply causality. Although changes in the observed policy influence changes in the trend of new reported cases, the opposite is also true in reality. However, using probabilistic graphical models does make it interpretable. It also allows us to incorporate domain knowledge that compensates for the relatively scarce data. SIMLR's excellent performance, comparable to state-of-the-art systems in this competitive task, shows that it is possible to design interpretable machine learning models without sacrificing performance.

### **5. Conclusions**

Forecasting the number of new COVID-19 infections is a very challenging task. Many factors play a role in how the disease spreads, including the government policies and the adherence of citizens to such policies. These elements are difficult to model mathematically; however, the collected data (number of new infections and deaths, for example) are a reflection of all those complex interactions.

Machine learning, on the other hand, excels at learning patterns directly from the data. Unfortunately, training many models from scratch can require a great deal of data, especially to learn complex patterns, such as the evolution of a pandemic.

We proposed SIMLR, a methodology that uses machine learning (ML) techniques to learn a model that can set, and adjust, the parameters of a mathematical model for epidemiology (SIR). SIMLR augments that SIR model by incorporating expert knowledge in the form of a probabilistic graphical model. In this way, human experts can incorporate their beliefs about the likelihood that a policy will change, and when. By combining both components we substantially reduce the data that machine learning usually requires to produce models that can make accurate predictions.

Importantly, besides providing state-of-the-art predictions in terms of MAPE in the short and medium term, the resulting SIMLR model is interpretable and probabilistic. The first means that we can justify the predictions given by the algorithm—e.g., "SIMLR predicts 1000 cases for the next week due to a change in the government policies that will decrease the transmission rate". The second means we can produce probabilistic values—so instead of predicting a single value, it can predict the entire probability distribution—e.g., the probability of 100 cases next week, or of 200 cases or of 1000, etc.

This paper demonstrated that a model that explicitly models and incorporates government policy decisions can accurately produce one- to four-week forecasts of the number of COVID-19 infections. This involved showing that an SIR model with time-varying parameters is enough to describe the complex dynamics of this pandemic, including the different waves of infections. We expect that this approach will be useful not only for modelling COVID-19, but other infectious diseases as well. We also hope that its interpretability will lead to its adoption by researchers, and users, in epidemiology and other non-ML fields.

**Author Contributions:** Conceptualization, R.V., L.F. and R.G.; methodology, R.V., L.F. and R.G.; software, R.V.; validation, R.V.; formal analysis, R.V., L.F. and R.G.; investigation, R.V., L.F. and R.G.; resources, R.G.; data curation, R.V.; writing—original draft preparation, R.V.; writing—review and editing, R.V., L.F. and R.G.; visualization, R.V.; supervision, R.G.; project administration, R.G.; funding acquisition, R.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Alberta Machine Intelligence Institute.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** All the datasets used for this manuscript are publicly available. For information about the new number of reported cases and deaths, we used the publicly available COVID-19 Data Repository by the Center for Systems Science and Engineering at Johns Hopkins University [1] https://github.com/CSSEGISandData/COVID-19, accessed on 1 September 2020. For policy tracking we used the OxCGRT [4] https://github.com/OxCGRT/covid-policy-tracker, accessed on 1 September 2020. For comparing our approach with other models we used the publicly available predictions at the COVID-19 Forecast Hub [31] https://github.com/reichlab/covid19-forecast-hub, accessed on 1 September 2020.

**Acknowledgments:** We thank the Google Cloud Research Credits program and Compute Canada for providing computational support. We also benefited from our many meetings with our colleagues of the greater University of Alberta Covid-Team.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **Appendix A. Code Availability**

The code for reproducing the main results of this manuscript is publicly available at: https://github.com/rvegaml/SIMLR, accessed on 7 December 2021.

There are six Jupyter notebooks in that repository. All the experiments were run using an e2-standard-4 (4 vCPUs, 16 GB memory) computer in the Google Cloud Platform.


In addition, the provided repository contains the in-house developed Python library *MLib*. This library contains custom code for inference in probabilistic graphical models.

### **Appendix B. Additional Tables**

**Table A1.** Comparison of MAPE between different models across all the states in the US 1 week in advance. The number in parentheses represents the standard deviation of the MAPE.




**Table A2.** Comparison of MAPE between different models across all the states in the US 2 weeks in advance. The number in parentheses represents the standard deviation of the MAPE.




**Table A3.** Comparison of MAPE between different models across all the states in the US 3 weeks in advance. The number in parentheses represents the standard deviation of the MAPE.





**Table A4.** Comparison of MAPE between different models across all the states in the US 4 weeks in advance. The number in parentheses represents the standard deviation of the MAPE.



### **References**


### *Article* **A Deep Learning Model for Forecasting Velocity Structures of the Loop Current System in the Gulf of Mexico**

**Ali Muhamed Ali <sup>1,\*</sup>, Hanqi Zhuang <sup>1</sup>, James VanZwieten <sup>1</sup>, Ali K. Ibrahim <sup>2</sup> and Laurent Chérubin <sup>2</sup>**


**\*** Correspondence: amuhamedali2014@fau.edu

**Abstract:** Despite the large efforts made by the ocean modeling community, such as the GODAE (Global Ocean Data Assimilation Experiment), which started in 1997 and was renamed OceanPredict in 2019, the prediction of ocean currents has remained a challenge until the present day, particularly in ocean regions that are characterized by rapid changes in their circulation due to changes in atmospheric forcing or due to the release of available potential energy through the development of instabilities. Ocean numerical models' useful forecast window is no longer than two days over a given area with the best initialization possible. Predictions quickly diverge from the observational field throughout the water and become unreliable, despite the fact that they can simulate the observed dynamics through other variables such as temperature, salinity and sea surface height. Numerical methods such as harmonic analysis are used to predict both short- and long-term tidal currents with significant accuracy. However, they are limited to the areas where the tide was measured. In this study, a new approach to ocean current prediction based on deep learning is proposed. This method is evaluated on the measured energetic currents of the Gulf of Mexico circulation dominated by the Loop Current (LC) at multiple spatial and temporal scales. The approach taken herein consists of dividing the velocity tensor into planes perpendicular to each of the three Cartesian coordinate system directions. A Long Short-Term Memory Recurrent Neural Network, which is best suited to handling long-term dependencies in the data, was thus used to predict the evolution of the velocity field in each plane, along each of the three directions. The predicted tensors, made of the planes perpendicular to each Cartesian direction, revealed that the model's prediction skills were best for the flow field in the planes perpendicular to the direction of prediction. Furthermore, the fusion of all three predicted tensors significantly increased the overall skills of the flow prediction over the individual model's predictions. The useful forecast period of this new model was greater than 4 days with a root mean square error less than 0.05 cm·s<sup>−1</sup> and a correlation coefficient of 0.6.

**Keywords:** deep learning; Loop Current; ocean current forecasting; LSTM; ocean measurements

### **1. Introduction**

Sustained large efforts in the ocean modeling community, such as the GODAE (Global Ocean Data Assimilation Experiment), which started in 1997 [1,2] and was renamed OceanPredict in 2019 [3], have been made to promote and coordinate the approach to ocean forecasting among the international community. This large effort has seen many achievements in terms of predictive capabilities for ocean features such as temperature, salinity and sea surface height (SSH), which are evaluated through a standard set of metrics [4]. However, the prediction of ocean currents has remained a challenge to this day, particularly in ocean regions that are characterized by rapid changes in their circulation due to changes in atmospheric forcing or due to the release of available potential energy through the development of dynamical instabilities. Predictions of ocean currents in the California current system can be found in [5], as well as other studies. That study shows a correlation coefficient less than 0.3 after two days with a root mean square error (RMSE) of 7 cm·s<sup>−1</sup> for the vertically integrated velocity component. Using mooring measurements in the same oceanographic region as that studied in Chao et al. [5], Shulman and Paduan [6] showed a significant decrease in the correlation coefficient and RMSE with depth between the observation and the model analyses while assimilating the 33 h filtered high-frequency (HF) radar surface current data. Ocean numerical models' useful forecast window is no longer than two days over a given area with the best initialization possible, as shown by [7] in a dynamically active current system, such as the Loop Current (LC) in the Gulf of Mexico (GoM). The RMSE was 10 cm·s<sup>−1</sup> and the correlation coefficient was 0.63 for the daily surface averaged predicted current. Ocean numerical model predictions quickly diverge from the observational field throughout the water column and become unreliable, despite the fact that they can simulate the observed dynamics through other variables such as temperature, salinity and SSH. Numerical methods such as harmonic analysis are used to predict both short- and long-term tidal currents with significant accuracy. However, they are limited to the areas where the tide was measured.

**Citation:** Muhamed Ali, A.; Zhuang, H.; VanZwieten, J.; Ibrahim, A.K.; Chérubin, L. A Deep Learning Model for Forecasting Velocity Structures of the Loop Current System in the Gulf of Mexico. *Forecasting* **2021**, *3*, 934–953. https://doi.org/10.3390/forecast3040056

Academic Editor: Roberto Henriques

Received: 8 November 2021 Accepted: 6 December 2021 Published: 14 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Today's full-water column predictions primarily rely on the use of finite-difference, finite-volume and finite-element methods to solve the primitive equations of motion in numerical models used to simulate ocean dynamics. The outputs of these models consist of the temporal prediction of three-dimensional fields of ocean state variables including both components of the horizontal velocity field, namely *u* and *v* along the *x* and *y* axes of the Cartesian coordinate system, respectively. In this study, we evaluate the application of a deep learning (DL) [8] model to predict the three-dimensional velocity field from in-situ data. We demonstrate that the water column current velocity patterns can be learned by a DL model, which can then be used to predict the layered structure of the flow field. To this end, we show that the DL model is capable of accurately predicting the water column velocities more than four days in advance, doubling the current state-of-the-art prediction window for in-situ currents. In this study, we propose a Recurrent Neural Network (RNN) Long Short-Term Memory (LSTM) model [9] to perform predictions of ocean currents' speed and direction, as described in Section 2. LSTM networks have outperformed fully connected neural networks and other machine learning techniques in natural language processing [10,11], which has many similarities with ocean current prediction, as shown by Immas et al. [12]. RNNs have been the state-of-the-art method in modeling time series data for the last decade. In addition, this type of network has seen an increase in real-life applications, including but not limited to aquaculture [13], wind and solar energy resources management [14], bioscience and medical applications [15] and industrial applications [16].

In a recent study by Wang et al. [17], an LSTM network was used to demonstrate the feasibility of medium-term (3 months) predictions of the GoM's SSH in the LC region. The LSTM model was trained and tested with 18 years of analyzed daily SSH from the Hybrid Coordinate Ocean Model (HYCOM)-GoM at 1/25° horizontal resolution [18], where "analyzed" indicates that the model-calculated SSH was corrected with in-situ and remote sensing observations. The Loop Current (LC) and the mesoscale eddies associated with its nonlinear dynamics are the major drivers of the upper 1000 m water column circulation in the GoM [19]. The nonlinear dynamics of the LC are dominated by the shedding of anticyclonic eddies called Loop Current Eddies (LCE) at irregular time intervals [20–22]. The formation of the latter is primarily caused by the growth of baroclinic instability, which is associated with the formation of deep meanders and eddies [19,23,24]. Using metrics set in the literature for LC forecasting, the deep learning model predicted, overall, the LC system SSH frontal distance from reference points to within 40 km nine weeks in advance. Furthermore, the model also predicted the final separation of two consecutive LC eddies through the SSH evolution, namely the eddies Cameron and Darwin, 8 and 12 weeks in advance, respectively, an improvement over the 5–6-week useful forecast range of state-of-the-art numerical models for the LC dynamics [25].

In this study, the LSTM model is applied to the prediction of water column velocity three-dimensional tensors. The prediction model is implemented on in-situ full water column current measurements collected in the LC region in the GoM between 2009 and 2011. Section 2 describes the measurements and their four-dimensional structure as well as the metrics used to assess the model's skills. Section 3 presents the LSTM prediction model and its implementation on the velocity data. Section 4 presents the model results and concluding remarks are given in Section 5.

### **2. Method**

### *2.1. Dataset*

Long-term time series of 3-dimensional velocity flow fields in the LC region are readily available from various ocean numerical model consortia that provide free online access. Such consortia include, for example, HYCOM (https://www.hycom.org (accessed on 12 September 2021) [26]), Navy Coastal Ocean Model (NCOM) (https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/navoceano-ncom-reg (accessed on 12 September 2021) [27]), and ECCO (http://www.ecco.ucsd.edu (accessed on 12 September 2021) [28]). In comparison, long-term in-situ measurements of the LC system water column are scarce.

A comprehensive observational study of the LC in the eastern GoM, including 9 tall moorings and 7 short moorings, an array of 25 pressure-equipped inverted echo sounders (PIES), and remote sensing, measured the water column velocity for 2.5 years, beginning in April 2009 [29]. This array was located to cover both the east and west sides of the LC between the West Florida Slope and the Mississippi Fan, and was also centered over the zone where LCEs typically separate from the LC. The horizontal separation was around 50–80 km between moorings and around 40–50 km between PIES sensors. These recorded data were used to construct the measurement-based water velocity matrix used in this study.

To create such a matrix, these observations were processed using optimal interpolation, as described in [30,31]. The horizontal resolution of the resulting data array was roughly based on the correlation length scales of recorded data, with the geostrophic velocity profiles based on the gravest empirical mode (GEM) method [30,32]. The resulting measurement-based water velocity matrix comprised 50 depth levels down to 3000 m below the surface, and extended horizontally from 88.5° W to 85° W and from 24.65° N to 27° N with a horizontal resolution between 30 and 50 km (Figure 1). However, in this study, only the first 500 m was selected, corresponding to 26 vertical layers. The time resolution for the velocity data was 12 h, which corresponds to 1810 data frames for each *u* and *v* velocity component. The final matrix dimensions were 1810 × 26 × 29 × 36. The current velocity measurements used in this study encompass the period from May 2009 to November 2011, during which three LCEs, namely Ekman, Franklin, and Hadal, were formed.

**Figure 1.** Loop Current SSH from HYCOM [33] (m) during eddy Ekman separation on 1 July 2009. The red rectangle shows the array boundaries.

### *2.2. 4-Dimensional Tensor Slicing*

The time series of the 3-dimensional gridded velocity forms a tensor whose dimensions must be reduced to one so it can be processed by the DL model. At each time step, a gridded velocity cube can be sliced in layers perpendicular to the three Cartesian coordinate axes (Figure 2). Thus, in the vertical direction (*z*-axis), the volume is split in horizontal layers corresponding to each depth level of the velocity data. Each layer becomes its own time series and can be reduced to a single dimension by EOF decomposition, as was carried out for the SSH field in [17]. For each resulting layer and velocity component, a DL model, trained on its own layer, is used to predict the evolution of that particular layer only. A similar approach can be used for layers perpendicular to the *x* and *y* axes and located at each grid point of the respective axis, as shown in Figure 2. As errors are specific to each layer and because the tensor evolves differently in each of the directions, it is expected that the models' skill will vary with the direction of prediction, as explained in the following section.
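The slicing described above can be illustrated with a minimal sketch in plain Python. The index order `cube[z][x][y]` and the helper names are assumptions made for illustration only; the source gives the grid sizes (26 depth levels, 29 and 36 horizontal grid points) but not the storage layout.

```python
# Sketch: slice one velocity cube into planes normal to each Cartesian axis.
# ASSUMPTION: the cube is indexed as cube[z][x][y]; names are hypothetical.

def make_cube(nz, nx, ny):
    """Build a toy cube whose entries record their own (z, x, y) indices."""
    return [[[(k, i, j) for j in range(ny)] for i in range(nx)]
            for k in range(nz)]

def z_planes(cube):
    """Horizontal layers: one (x, y) plane per depth level."""
    return [[row[:] for row in level] for level in cube]

def x_planes(cube):
    """Planes normal to x: one (z, y) plane per grid point along x."""
    nz, nx = len(cube), len(cube[0])
    return [[cube[k][i][:] for k in range(nz)] for i in range(nx)]

def y_planes(cube):
    """Planes normal to y: one (z, x) plane per grid point along y."""
    nz, nx, ny = len(cube), len(cube[0]), len(cube[0][0])
    return [[[cube[k][i][j] for i in range(nx)] for k in range(nz)]
            for j in range(ny)]

cube = make_cube(nz=26, nx=29, ny=36)
# Number of layers handled by the z-, x- and y-directional slicing.
print(len(z_planes(cube)), len(x_planes(cube)), len(y_planes(cube)))
```

Each plane returned by these helpers would then be followed through time and treated as its own series, one per layer and per velocity component.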

**Figure 2.** Velocity plane field extractions perpendicular to each of the three Cartesian coordinate directions at each depth (*z*) in the vertical direction and at each grid point of the *x* (zonal direction) and *y*-axis (meridional direction), respectively.

### *2.3. Volume Slicing Induced Errors*

Most numerical model solutions are obtained from discretized partial differential equations solved on one or more embedded volumetric grids, such as the Arakawa C grid [34]. To solve these equations, boundary conditions are provided at the grid boundaries, where virtual grid points are added for computational purposes. Specific boundary conditions allow the radiation of features from within the grid to outside of it without losing the integrity of the signal inside the grid as the features exit. This process can be tracked in all three directions. In the case of a deep learning model, only features contained within the grid are available to the model. There is no influence from the boundaries, which in numerical models serve to constrain the solution and limit its drift. Therefore, DL model forecast errors in individual layers may grow significantly over time and ultimately change the integrity of the signal, as shown in Figure 3. This is particularly relevant in the case of perturbation simulations, where the phase of the signal in different layers could be changed by the errors in the individual layered predictions. Figure 3 provides an example of what this would look like in each of the planes normal to each direction. Starting with the *z*-planes (top view), a slight phase shift in the vertical direction will lead to the removal of the red signal in the *x*-plane in the region outlined by the green shaded area and also in the outermost *y*-plane. As each forecast is sequentially reused for the next, the errors become part of the learning base. Additionally, because horizontal motions are much larger than vertical motions in the ocean, the DL model prediction skills will differ according to the direction of layers used for prediction.

**Figure 3.** Layered prediction-induced errors. (**a**) Top view, normal to the *z*-axis. (**b**) Lateral view normal to the *y*-axis. (**c**) Lateral view normal to the *x*-axis. The green shaded area highlights the focus area where errors are displayed. In each subplot, the left (right) image shows the observed (predicted) field. Each color corresponds to a different vertical layer as indicated, layer 1(3) being at the top (bottom).

To evaluate the layered prediction errors, the metrics set by GODAE OceanPredict [4] were applied. They identify two types of errors, namely the single point error and the structural error. These errors are quantified by the calculation of the Peak Signal-to-Noise Ratio (PSNR), including RMSE and correlation coefficient (CC) (see [17] for definitions), and Structural Similarity (SSIM), respectively. The PSNR is based on the mean square error (MSE) [35]. Given an observed plane field *Ob* of size *m* × *n* and its prediction *Pr*, MSE is defined as:

$$\text{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ Ob(i,j) - Pr(i,j) \right]^2 \tag{1}$$

The PSNR (in dB) is defined as:

$$\text{PSNR} = 10 \cdot \log_{10} \left( \frac{Peak^2}{\text{MSE}} \right) \tag{2}$$

where *Peak* is the maximum value of all data points in both *Ob* and *Pr*. In image processing, the PSNR, expressed in decibels, is primarily used alongside the MSE to assess the quality of an image reconstruction.
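As a minimal sketch of Equations (1) and (2), the following plain-Python functions compute the MSE and PSNR of a small observed field and its prediction. The function names and the 2 × 2 toy fields are illustrative assumptions, not values from the study.

```python
import math

def mse(ob, pr):
    """Equation (1): mean square error between two m x n fields."""
    m, n = len(ob), len(ob[0])
    total = sum((ob[i][j] - pr[i][j]) ** 2
                for i in range(m) for j in range(n))
    return total / (m * n)

def psnr(ob, pr):
    """Equation (2): PSNR in dB, with Peak taken over both fields."""
    peak = max(max(max(row) for row in ob), max(max(row) for row in pr))
    err = mse(ob, pr)
    if err == 0:
        return math.inf  # identical fields: no noise at all
    return 10 * math.log10(peak ** 2 / err)

# Toy fields (hypothetical values): only one grid point differs, by 2.
ob = [[1.0, 2.0], [3.0, 4.0]]
pr = [[1.0, 2.0], [3.0, 2.0]]
print(mse(ob, pr))             # 1.0
print(math.sqrt(mse(ob, pr)))  # RMSE is the square root of the MSE: 1.0
print(round(psnr(ob, pr), 2))  # 12.04
```

A higher PSNR indicates a closer point-wise match, while the RMSE keeps the units of the velocity field.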

The SSIM index can be calculated in sub-regions of each layer. It is a measure of similarity between two patterns [35].

$$\text{SSIM}(Ob, Pr) = \frac{(2\mu_{Ob}\mu_{Pr} + c_1)(2\sigma_{ObPr} + c_2)}{(\mu_{Ob}^2 + \mu_{Pr}^2 + c_1)(\sigma_{Ob}^2 + \sigma_{Pr}^2 + c_2)}\tag{3}$$

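where, following the standard definition of the SSIM index, *μ* denotes the mean and *σ*<sup>2</sup> the variance of the field given in the subscript, *σ<sub>ObPr</sub>* is the covariance of *Ob* and *Pr*, and *c*<sub>1</sub> and *c*<sub>2</sub> are small constants that stabilize the division when the denominator is close to zero.
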
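A minimal sketch of Equation (3), evaluated over a whole field rather than over sub-regions, is given below. The default values of `c1` and `c2` are illustrative assumptions; the source does not state the constants that were used.

```python
def ssim(ob, pr, c1=1e-4, c2=9e-4):
    """Equation (3) computed over the full field (no sliding sub-regions).

    ASSUMPTION: c1 and c2 are placeholder stabilizing constants.
    """
    a = [v for row in ob for v in row]
    b = [v for row in pr for v in row]
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((v - mu_a) ** 2 for v in a) / n
    var_b = sum((v - mu_b) ** 2 for v in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator

field = [[1.0, 2.0], [3.0, 4.0]]
flipped = [[4.0, 3.0], [2.0, 1.0]]
print(ssim(field, field))    # identical patterns give a value of 1 (up to rounding)
print(ssim(field, flipped))  # anti-correlated patterns give a negative value
```

Applying the same function to sub-regions of each layer, as the text describes, would yield a map of local similarity instead of a single number.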
### **3. Deep Learning Prediction Model**

Unlike conventional numerical models, which use a set of dynamical equations to describe a physical system, data-driven deep learning methods rely on neural networks to model physical systems. To achieve this, we first reduced the temporal matrix of each layer to one dimension by applying EOFs and then implemented an LSTM network to model the velocity field in each layer, as shown in Figure 4.

**Figure 4.** Single layer forecasting model flow chart.

### *3.1. Empirical Orthogonal Functions*

EOF analysis is a major tool in oceanographic, geophysical and meteorological applications [36–38]. EOFs are used to reduce data dimensions by separating spatial components from temporal components. The principle of this decomposition is to extract the most dominant information with fewer dimensions [39]. It provides a dense description of spatial data and temporal variability in terms of an orthogonal basis (eigenvectors). Each associated eigenvalue provides a measure of the fraction of the total variance explained by the EOF mode. This decomposition provides a statistical description of any dynamical processes by projecting them onto empirical normal modes, rather than the physical or natural modes of the system, which are process specific and therefore unable to encompass all the processes involved in the dynamics of the system being predicted in this study. The projection of the data onto an EOF mode is called a principal component (PC), which indicates the temporal variations of the variance of its associated spatial pattern [37]. EOF decomposition is carried out by Singular Value Decomposition (SVD), which is written as follows:

$$Q = U D W^T \tag{4}$$

where *Q* is an *n* × *p* matrix and *D* is an *n* × *p* rectangular diagonal matrix of non-negative numbers (the singular values of *Q*). *U* is an *n* × *n* matrix, the columns of which are orthogonal unit vectors of length *n*, called the left singular vectors of *Q*, and *W* is a *p* × *p* matrix whose columns are orthogonal unit vectors of length *p*, called the right singular vectors of *Q*. In addition, *UD* contains the time-dependent principal components (*PC*s), and *W<sup>T</sup>* is the spatial pattern matrix whose rows are the so-called EOF modes.
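To make the dimension reduction explicit, Equation (4) can be written as a sum over modes and truncated after the leading *K* modes. The symbols *d<sub>k</sub>*, *u<sub>k</sub>*, *w<sub>k</sub>*, *r* and *K* are notation introduced here for illustration (the *k*-th singular value, the *k*-th columns of *U* and *W*, the number of non-zero singular values, and the number of retained modes); the source does not state how many modes were kept.

$$Q = \sum_{k=1}^{r} d_k \, u_k w_k^T \approx \sum_{k=1}^{K} d_k \, u_k w_k^T, \qquad \frac{d_k^2}{\sum_{j=1}^{r} d_j^2} = \text{fraction of the total variance carried by mode } k$$

The variance interpretation assumes that the time mean has been removed from each column of *Q* before the decomposition. In this form, the vector *d<sub>k</sub>u<sub>k</sub>* is the *k*-th PC time series, which is the one-dimensional signal passed to the prediction network.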

### *3.2. Deep Learning Model: Long Short-Term Memory Network*

The deep learning model selected for the prediction model is a type of Recurrent Neural Network (RNN). RNNs are well suited for time sequence prediction, and work by feeding the output of each neuron, along with a new input, back into itself, forming loops within its architecture [40]. In an RNN network, a simple RNN neuron or hidden unit's output behavior can be modeled by Equation (5), where *x<sub>n</sub>* and *s<sub>n</sub>* are the input and state at time *n*, respectively. Furthermore, *W<sub>gi</sub>* and *W<sub>gs</sub>* represent the input and state (recurrence) weights and *f* an activation function. Note that the output can be obtained from the state whenever it is needed.

$$s_n = f(W_{gi} x_{n-1} + W_{gs} s_{n-1})\tag{5}$$

However, the caveat of the RNN given in Equation (5) is its gradient vanishing problem or memory loss. This occurs because RNNs are typically trained with a stochastic gradient descent algorithm, and gradients may vanish for a multi-layer RNN due to the chain rule in differentiation. LSTM neural networks were designed to solve this problem [41], in which a memory unit *m<sub>n</sub>* was added to avoid the disappearance of gradients. Let *α* and *β* be constants and ⊕ denote an element-wise multiplication; then, the memory unit is updated by the following rule:

$$m_n = \alpha \oplus m_{n-1} + \beta \oplus f(W_{gi} x_{n-1} + W_{gs} s_{n-1}) \tag{6}$$

The state is then related to the memory unit with an activation function. In this way, derivatives will not vanish due to the additive relationship described in Equation (6).
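The recurrences in Equations (5) and (6) can be sketched for a single scalar unit as follows, where element-wise multiplication reduces to ordinary multiplication. The weight values, the choice of tanh as the activation *f*, and the function names are assumptions for illustration; this is the simplified update written in the text, not a full gated LSTM cell.

```python
import math

def rnn_step(x_prev, s_prev, w_gi, w_gs, f=math.tanh):
    """Equation (5): new state from the previous input and state."""
    return f(w_gi * x_prev + w_gs * s_prev)

def memory_step(m_prev, x_prev, s_prev, w_gi, w_gs, alpha, beta, f=math.tanh):
    """Equation (6): additive memory update with constants alpha and beta."""
    return alpha * m_prev + beta * f(w_gi * x_prev + w_gs * s_prev)

def run(inputs, w_gi=0.5, w_gs=0.5, alpha=1.0, beta=0.1):
    """Iterate the memory update; the state is an activation of the memory."""
    s, m = 0.0, 0.0
    history = []
    for x in inputs:  # each x plays the role of x_{n-1} for the next update
        m = memory_step(m, x, s, w_gi, w_gs, alpha, beta)
        s = math.tanh(m)
        history.append((s, m))
    return history

for state, memory in run([1.0, 1.0, 1.0]):
    print(round(state, 4), round(memory, 4))
```

Because the previous memory enters Equation (6) through the additive term scaled by *α*, its contribution to the gradient is not repeatedly squashed by the activation function, which is the property the text relies on.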

### *3.3. Prediction Procedure*

The LSTM network used in this study was previously adopted for the prediction of SSH time series in [17,42]. In this study, two identical networks were designed to model and predict the velocity components, *u* and *v*, respectively. After the EOF decomposition of each velocity component, the PCs were used to train the LSTM model, which in turn was used to predict future PCs. At the beginning of the process, the system was initialized using random weights and then run through all the training data in chronological order, each time adjusting the weights through the gradient descent of the loss function. Each run through the entire training dataset is called an epoch. Repeating this step allows the model to optimize its weights (Equation (6)), at which point the model can reproduce, in hindcast mode, any state learned from the data at any point in time without being given the data.

The MATLAB Neural Network Toolbox was used to implement the LSTM network. The Adaptive Moment estimation (Adam) [43] optimization algorithm, rather than the Stochastic Gradient Descent with Momentum (SGDM) [44] algorithm, was used to update network weights iteratively during the training phase because of its better performance. The hyperparameters of the prediction model were manually tuned to optimize its performance. The resulting hyperparameters were as follows: mini-batch size = 128, initial learning rate = 0.03, number of hidden nodes = 100, and maximum number of epochs = 500. Only one LSTM layer was used because the overall performance of the model, including the training and prediction processing time, as well as prediction skills, degraded when more layers were added. Training and prediction were carried out on a single NVIDIA TITAN X (Pascal) GPU with 12 GB of memory and CUDA toolkit Version 11. The training times for a single layer and for all layers for each direction are provided in Table 1. Once trained, at each prediction time step, the LSTM updates its state in accordance with its own prediction. This allows the LSTM to continue predicting based on both the training data and future predictions.



The prediction procedure can be summarized as follows: (1) The current velocity components' time series from time *t*<sub>1</sub> to *t<sub>n</sub>* are reduced to their respective PCs by EOF decomposition. (2) The PCs are used to train the LSTM model sequentially. (3) All the PCs up to time *t<sub>n</sub>* are then used to predict the PCs of the velocity field at time *t*<sub>*n*+1</sub>. For the next prediction at time *t*<sub>*n*+2</sub>, the predicted PCs at *t*<sub>*n*+1</sub> are used to retrain the LSTM model together with the PCs corresponding to time *t*<sub>1</sub> to *t<sub>n</sub>*, and this is repeated for all subsequent forecasts. In addition, new data can be added at any time to the training dataset, which will then be used to retrain the LSTM model.
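The feedback loop in steps (2) and (3) can be sketched in plain Python. To keep the example self-contained, the LSTM is replaced by a deliberately simple stand-in, a one-coefficient least-squares autoregression; `fit_ar1` and `rolling_forecast` are hypothetical names, and the toy PC series is not from the study.

```python
def fit_ar1(series):
    """Least-squares fit of x[t+1] ~ a * x[t].

    STAND-IN for LSTM training; the study itself uses an LSTM network.
    """
    num = sum(series[t] * series[t + 1] for t in range(len(series) - 1))
    den = sum(series[t] ** 2 for t in range(len(series) - 1))
    return num / den if den else 0.0

def rolling_forecast(pc_history, steps):
    """Predict `steps` values, refitting on history plus earlier predictions."""
    series = list(pc_history)
    forecasts = []
    for _ in range(steps):
        a = fit_ar1(series)    # step (2): (re)train on everything so far
        nxt = a * series[-1]   # step (3): one-step-ahead prediction
        forecasts.append(nxt)
        series.append(nxt)     # the prediction joins the training base
    return forecasts

pcs = [1.0, 0.5, 0.25, 0.125]  # toy principal-component time series
print(rolling_forecast(pcs, steps=3))  # [0.0625, 0.03125, 0.015625]
```

The sketch makes the consequence of the design visible: every forecast is appended to the training series, so any error in an early step is carried into all later steps.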

### *3.4. Layered Prediction Model Approach*

As previously described, the water velocity dataset is a time series consisting of two orthogonal components *u* and *v*, each of which is a four-dimensional tensor. To reduce the computational complexity, at each instant, the corresponding velocity cube (*u* or *v*) is partitioned into a number of layers (or planes). For each layer, a prediction module consisting of EOF and LSTM is trained and then used to predict the velocity field of that particular layer. Collectively, these prediction modules form a layered prediction model. Layered models are implemented for each spatial direction. As a consequence, they are referred to as prediction models *X* (29 layers), *Y* (36 layers), and *Z* (26 layers), respectively.

### **4. Layered Prediction Experiments**

In all three directions, each layered model was trained using 90% of the available time series, with the remaining 10% reserved for prediction validation. Thus, the training used 1629 samples (814.5 days) from 2 March 2009 to 25 May 2011, while the testing period ran from 26 May 2011 to 23 August 2011 (90.5 days). The training and testing periods are illustrated in Figure 5. The model prediction period was set to 7 days, which was also the length of the sliding prediction window. This prediction period was chosen in response to the predictive skill goal set for the LC current speed by the United States' National Academies of Sciences, Engineering, and Medicine (NASEM) [45].

**Figure 5.** Data partitioning for training and prediction experiments. T is the duration of the dataset (1810 time samples = 905 days), F the length of the prediction window (14 samples = 7 days), and the blue line is the testing period which includes 20 forecast sliding windows, each separated by a 12 h period.
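The bookkeeping behind this partitioning can be checked with a few lines of Python. The sample counts come from the text; placing the first window at the first test sample, and shifting each following window by one 12 h sample, is an assumption about how the sliding windows are laid out.

```python
SAMPLES_PER_DAY = 2                    # one sample every 12 h
total = 1810                           # full time series (905 days)
n_train = total * 9 // 10              # 90% for training -> 1629 samples
n_test = total - n_train               # remaining samples -> 181
window = 7 * SAMPLES_PER_DAY           # 7-day forecast -> 14 samples
n_windows = 20

# ASSUMPTION: window k starts k samples (k * 12 h) after the test start.
windows = [(n_train + k, n_train + k + window) for k in range(n_windows)]

print(n_train, n_train / SAMPLES_PER_DAY)   # 1629 814.5
print(n_test, n_test / SAMPLES_PER_DAY)     # 181 90.5
print(windows[0], windows[-1])              # (1629, 1643) (1648, 1662)
```

Under this layout all 20 windows fit within the test period, since the last window ends well before sample 1810.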

### *4.1. Model Z Velocity Predictions*

Model *Z*, or the *Z* directional model, consisted of 26 horizontal layers distributed from the sea surface down to 500 m. The 7-day prediction of the model for the surface layer velocity (layer 1 in the *z*-direction) is shown in Figure 6. It illustrates that the proposed model was able to predict, seven days in advance, the formation of a cyclonic eddy in the region highlighted in Figure 6. The model accurately predicted the center of rotation, direction and strength of the velocity vectors at the surface. However, elsewhere, the model prediction differed more significantly from the observations. To assess the overall performance of the model, we computed the average CC, RMSE, SSIM and PSNR, along with their standard deviations, for both *u* and *v* on each plane of each tensor and over the twenty sliding windows (Figure 7). These quantities quickly deteriorated over 14 time steps (7 days), which indicates the challenge of predicting LC velocity tensors compared to SSH prediction performed using a similar LSTM structure in [17,42,46], despite the fact that the cyclonic eddy was correctly predicted. Indeed, after 7 days, the anticyclone southwest of the predicted cyclone exhibited a weaker circulation than the observed one, and its northern counterpart was sustained for longer than the observed one.


**Figure 6.** Predicted (top) and corresponding observed (bottom) surface velocity. (**a**) Twelve-hour prediction. (**b**) One hundred sixty-eight-hour (7-day) prediction. The red arrows in (**a**) show the region of formation of the cyclonic eddy predicted in the red highlighted areas in (**b**).

**Figure 7.** Model *Z* fourteen time-step (7 days) 20-day sliding window average of CC, SSIM, PSNR, and RMSE of the velocity fields. The unit for the horizontal axis is prediction time steps (one time step is 12 h). The error bar denotes the standard deviation with a 95% confidence interval.

### *4.2. Directional Velocity Structure Prediction Dependency*

We now compare the predictions of all models in each of the three Cartesian directions. In particular, for a given directional model prediction, we compared the other two model predictions along the same direction as the former. Figure 8 shows the comparison in terms of CC, RMSE, SSIM and PSNR for Model *X* prediction and the other two models (Models *Y* and *Z*) in the *x*-direction. All the metrics were calculated for fourteen time steps and averaged over 20-day sliding windows and over all the layers of the directional model. Model *X* prediction in the *x*-direction exhibited a higher CC and similarity index, although the PSNR is very similar between models, especially after the seventh time step. On the other hand, the RMSE is the highest for Model *X* after the sixth time step. Model *Y* prediction is also better than Model *Z*'s prediction in the *x*-direction.

A similar comparison is shown for Model *Y* in Figure 9. The CC and the similarity index are strikingly higher for Model *Y* than for the other two models. The PSNR is also significantly higher and the RMSE much lower than for the other two models. The prediction of Model *Z* in the *y*-direction was also better than that of Model *X*. In the *x*-direction, fewer differences were found between all three models' predictions than in the *y*-direction.

Figure 10 shows the comparison of the three models in the *z*-direction. As expected, Model *Z* is better at predicting in the *z*-direction; however, it shows a better CC than the other two models only after 7 days. The similarity index is much higher while the PSNR is similar to those of the other models, showing no significant improvement. The RMSE becomes lower than for the other two models after the seventh time step. Again, as for the *x*-direction prediction, the differences between the three models are not as large as they were in the *y*-direction. These results indicate that each model's best prediction is associated with its direction of prediction. In addition, in terms of dynamical evolution, the most significant changes were in the *y*-direction (*x*-*z* planes) and were better captured by Model *Y*.

**Figure 8.** Fourteen time-step (7 days) 20-day sliding window average of CC, SSIM, PSNR, and RMSE of the velocity fields in the *x*-direction, predicted by Model *X* (solid line), Model *Y* (dotted line), and Model *Z* (dashed line).

**Figure 9.** Same as Figure 8 but for the velocity fields in the *y*-direction.

**Figure 10.** Same as Figure 8 but for the velocity fields in the *z*-direction.

### *4.3. Vertical Velocity Structure Prediction*

Examples of the 7-day predicted flow field are shown in Figures 11–13 for Models *X*, *Y*, and *Z*, respectively. Figure 11 shows a vertical section of the velocity magnitude in the *x*-direction for all three models at 87° W. It confirms the metrics results and shows the best agreement between the flow structure of Model *X* and the observations in the *x*-direction. Similarly, Figure 12 shows the vertical section of the velocity magnitude in the *y*-direction for all three models at 25° N. It confirms the metrics results and shows the best agreement between the flow structure of Model *Y* and the observations in the *y*-direction. The same consistency between the predicted and observed flow structure is also confirmed for Model *Z* in the *z*-direction (Figure 13). As each prediction model performs best in its corresponding tensor orientation, we propose fusing the predictions of all three models into one tensor.

**Figure 11.** Vertical section of the velocity tensor in the *x*-direction at 87° W on day 7 of the prediction. (**a**) Observations; (**b**) Model *X* prediction; (**c**) Model *Y* prediction; (**d**) Model *Z* prediction.

**Figure 12.** Vertical section of the velocity tensor in the *y*-direction at 25° N.

**Figure 13.** Horizontal section of the velocity tensor at 100 m depth on day 7 of the prediction. (**a**) Observations; (**b**) Model *X* prediction; (**c**) Model *Y* prediction; (**d**) Model *Z* prediction.

### *4.4. Fusion of the Models' Predictions*

As each model can best predict the evolution of the velocity field in its respective layers, we hypothesize that the fusion of the three model predictions would yield an improved prediction of the overall tensor over each individual one. For this purpose, a simple fusion block was added to the prediction system, as shown in Figure 14. Although various methods can be used to fuse all three tensors, such as an unweighted average or a median selection-based average, we chose to apply a three-dimensional Gaussian smoothing procedure [47] as it provides better results than the other two. The results of the fusion process are shown for the 72 h (3-day) and 168 h (7-day) predictions in Figures 15 and 16, respectively. These figures consist of a 3D representation of the normalized relative vorticity of the flow predicted by each of the three individual models and by the fusion method. Despite the noise associated with each model, the fusion approach is able to filter the noise out and deliver a tensor field that is very similar to the observations, even for a 168 h prediction. The significant improvement of the 3D tensor prediction by the fusion process over individual prediction models is further demonstrated by computing the metrics RMSE, PSNR, SSIM and CC of the various predictions (Figure 17). The fusion output showed an overall improvement over individual predictions for all metrics over the 7-day prediction window. In particular, the RMSE was reduced by more than 25% on day 7 of the prediction.
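The two simpler fusion alternatives named above, the unweighted average and the median selection, can be sketched point by point as follows. This is not the three-dimensional Gaussian smoothing that was selected in the study, which is not reproduced here; the function name and the toy values are assumptions for illustration.

```python
from statistics import median

def fuse(pred_x, pred_y, pred_z, how="mean"):
    """Combine three predictions of the same flattened field point by point.

    Illustrates only the unweighted-average and median alternatives.
    """
    if how == "median":
        combine = median
    else:
        combine = lambda vals: sum(vals) / len(vals)
    return [combine([a, b, c]) for a, b, c in zip(pred_x, pred_y, pred_z)]

# Toy predictions of three grid points by Models X, Y and Z (hypothetical).
px = [0.10, 0.20, 0.90]   # Model X carries an outlier at the last point
py = [0.12, 0.18, 0.30]
pz = [0.08, 0.22, 0.30]
print(fuse(px, py, pz, "mean"))
print(fuse(px, py, pz, "median"))  # [0.1, 0.2, 0.3]
```

In this toy case the median discards the outlier from one model entirely, whereas the average is pulled toward it, which illustrates why the choice of fusion rule matters when the three models carry different noise.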

**Figure 14.** Block diagram of the fusion approach to produce a unified volumetric prediction.

**Figure 15.** Three-dimensional normalized relative vorticity field of the observed and 72 h predicted tensors. (**a**) Observed, (**b**) Model *Z*, (**c**) Model *X*, (**d**) Model *Y*, and (**e**) fusion result.

**Figure 16.** Same as Figure 15 but for the 168 h prediction.

**Figure 17.** Fusion results. (**a**) Mean Squared Error (MSE), (**b**) Correlation Coefficient (CC), (**c**) Peak Signal-to-Noise Ratio (PSNR) and (**d**) Structural Similarity index (SSIM) between the observed 3-dimensional fields and the predictions from Models *Z*, *X*, and *Y*.

### **5. Conclusions**

Modeling and predicting the Loop Current System (LCS) subsurface vertical structure in the GoM region is essential to all aspects of life in the region. However, useful forecasts of the flow field by current modeling methods do not exceed two days [5–7]. In this study, we developed a deep learning-based prediction model that was capable of predicting some important features of the 3D velocity fields of the LCS up to seven days in advance in a rectangular region where the LC is most active and commonly sheds eddies (Figure 6). Overall, the fusion model exhibited a CC > 0.5 up to 4.5 days (Figure 17). Subsurface velocity data measured by in-situ sensors for an approximately three-year period [29] were used to train and test a deep learning prediction method. To implement the deep learning model, we reduced the dimensionality of the tensors of each component of the velocity field to one dimension by applying EOF. The obtained PC vectors were used as an input variable to the LSTM model. The prediction model was applied separately to each layer of the tensor. We defined one tensor for each direction of the Cartesian coordinate system, which led to three prediction models, one for each direction. Each model was composed of one individual LSTM model per layer in each tensor and the final prediction consisted of the final tensor made of all the layered predictions for each velocity component. The results of this approach revealed that the prediction models associated with each of the three directions were the best at predicting the flow field in their respective directions. The errors across layers significantly altered the cross-layer structure of the flow. However, the fusion of all three models' solutions with a Gaussian filter delivered an improved prediction field over each individual prediction.

Because the number of layer models necessary to conduct the full three-dimensional prediction is equal to the total number of grid points of the field to be predicted, the implementation of such a model for real-time forecasting seems unrealistic. However, multithreaded and parallel computing allows for the simultaneous computation of the predictions in all the layers in an efficient and timely manner. In addition, such dense observation arrays are rare and spatially and temporally limited, which limits the size and number of the layers to be predicted as well. In any case, when compared to ocean numerical model operations, even though numerical models are much cheaper to operate, they are unable to reliably predict the evolution of the ocean state without being constrained by ocean observations. It is true that observing arrays are ephemeral, but when they do exist they can be used to make forecasts that do not require numerical models, which simplifies the data processing and streamlines the forecasting process since only one variable is used versus the multitude of state and atmospheric variables required by ocean numerical models. Table 1 shows that the computational time for training is less than 800 s for all layers in a given direction. Assuming that each direction can be computed by one thread, the overall prediction time would be less than 800 s, which makes this approach adequate for real-time forecasting, even at an hourly rate. HF radar ocean surface current measurements provide a good test-bed for the application of our method, where in this case, only one layer is predicted. Such measurements are now increasingly used for monitoring coastal circulation in many areas of the coast around the world [48]. The other limitation of the deep learning method is the duration of the measurements. Such methods' accuracy strongly depends on the diversity of events captured by the measurements and therefore their prediction skills can be limited by the duration of the measurements used to create the deep learning models. Ideally, a time series that captures the full extent of the variability in the natural system would yield the best forecast by such methods. However, it is not explicitly clear how prediction improvement is correlated to the duration of the measurements in this tensor prediction method. In a point-wise prediction exercise of ocean current velocity for unmanned underwater vehicle navigation, Immas et al. [12] showed that they could predict with an LSTM model one month of current with one month of training data.

The layered prediction method applied in this study was originally developed by Wang et al. [17] to predict the evolution of the SSH, a two-dimensional field. Predicting the three-dimensional velocity fields with this two-dimensional method has revealed the importance of the relative changes between layers in the accuracy of the predicted tensor. Future work will be focused on the inclusion of the relationship between individual nodes and their surrounding nodes in the domain, in order to account for the relative evolution between nodes. This node's spatio-temporal connectivity could be learned through another DL model ultimately coupled with the prediction model. We anticipate that such multi-model approach could provide longer reliable three-dimensional forecasts than the approach herein.

**Author Contributions:** A.M.A., H.Z. and L.C. conceived the project idea, and designed the experiments; A.M.A. and A.K.I. designed the model; A.M.A. performed the experiments; A.M.A. and H.Z.; A.M.A., H.Z. and L.C. wrote the paper. J.V. edited the paper. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was partially supported by an Understanding Gulf Ocean Systems Phase II grant (#NAS/GRP 2000011052) from the National Academies of Sciences, Engineering, and Medicine (NASEM), and by an MRI grant (#1828181) from National Science Foundation (NSF).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The dataset used in this study is provided by Kathleen Donohue and her team.

**Acknowledgments:** We would like to thank Kathleen Donohue and her team for providing the Dynloop dataset, which made this study possible.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


### *Article* **Model-Free Time-Aggregated Predictions for Econometric Datasets**

**Kejin Wu 1,† and Sayar Karmakar 2,\*,†**


**Abstract:** Forecasting volatility from econometric datasets is a crucial task in finance. To acquire meaningful volatility predictions, various methods were built upon GARCH-type models, but these classical techniques suffer from instability on short and volatile data. Recently, a novel normalizing and variance-stabilizing (NoVaS) method for predicting squared log-returns of financial data was proposed. This model-free method has been shown to possess more accurate and stable prediction performance than GARCH-type methods. However, whether this method can sustain this high performance for long-term prediction is still in doubt. In this article, we first explore the robustness of the existing NoVaS method for long-term time-aggregated predictions. Then, we develop a more parsimonious variant of the existing method. With systematic justification and extensive data analysis, our new method shows better performance than current NoVaS and standard GARCH(1,1) methods on both short- and long-term time-aggregated predictions. The success of our new method is remarkable since efficient predictions with short and volatile data always carry great importance. Additionally, this article opens potential avenues where one can design a model-free prediction structure to meet specific needs.

**Keywords:** ARCH-GARCH; model-free; aggregated forecasting

### **1. Introduction**

Accurate and robust volatility forecasting is a central focus in financial econometrics. This type of forecasting is crucial for practitioners and traders to make decisions in risk management, asset allocation, pricing of derivative instruments and strategic decisions regarding fiscal policies, etc. Standard methods to perform volatility forecasting are typically built upon applying GARCH-type models to predict squared financial log-returns. With the model-free prediction principle, first proposed by Politis [1], a model-free volatility prediction method—NoVaS—has been proposed recently for efficient forecasting without the assumption of normality. Some previous studies have shown that the NoVaS method possesses better predictive performance than GARCH-type models when forecasting squared log-returns, e.g., Gulay and Emec [2] showed that the NoVaS method could overcome GARCH-type models (GARCH, EGARCH and GJR-GARCH) with generalized error distributions by comparing the pseudo-out-of-sample (POOS) forecasting performance on S&P500 and BIST 100 return series (here the pseudo-out-of-sample forecasting analysis means using data up to and including the current time to predict future values). Chen and Politis [3] showed that the "time-varying" NoVaS method is robust against possible non-stationarities in the data. Furthermore, Chen and Politis [4] extended this NoVaS approach to perform multi-step-ahead predictions of squared log-returns.

However, to the best of our knowledge, such methods have not been evaluated for time-aggregated prediction. Time-aggregated prediction here stands for the prediction of $Y_{n+1} + \cdots + Y_{n+h}$ after observing $\{Y_t\}_{t=1}^{n}$. Such predictions remain crucial for strategic decisions implemented by commodity or service providers ([5,6]), trust funds, pension management, insurance companies, portfolio management of specific derivatives ([7]) and assets ([8]). Time-aggregated forecasting is also able to provide some degree of confidence in understanding the general trend in the near future, potentially for the entire following week or months ahead, which is definitely more meaningful than merely understanding what might happen for any single step ahead (predicting $Y_{n+h}$ for one value of $h$) in the time horizon. In fact, the quality of forecasts for econometric data has been evaluated through such time-aggregated metrics in [9,10]. In this article, we continue utilizing these time-aggregated metrics to challenge the ability of the NoVaS method for short- and long-term time-aggregated predictions on squared log-returns series. For exploring such capabilities of the existing NoVaS method, we set up comprehensive data analyses to substantiate the efficiency of the NoVaS method and also address the lack of data experiments in NoVaS studies. Apart from this, we also attempt to improve the existing one further by proposing a more parsimonious model. Based on extensive data analysis, our new method shows more stable performance than the state-of-the-art NoVaS method regardless of whether simulation or real-world data are used. We also find that the state-of-the-art NoVaS method is even surpassed by the standard GARCH(1,1) model sometimes. On the other hand, our new method returns consistently excellent forecasting. Notably, our method achieves a remarkable improvement when the dataset at hand is short and volatile.

**Citation:** Wu, K.; Karmakar, S. Model-Free Time-Aggregated Predictions for Econometric Datasets. *Forecasting* **2021**, *3*, 920–933. https://doi.org/10.3390/forecast3040055

Academic Editor: Alessia Paccagnini

Received: 3 November 2021 Accepted: 6 December 2021 Published: 8 December 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The rest of this article is organized as follows. In Section 2, we firstly introduce the theoretical background and structure of the existing NoVaS method. Then, our new method is proposed and a simple comparison is made to show the stability of our new method. In Section 3, we substantiate our proposal by extensive simulations and data analysis. Moreover, we utilize the CW test to support our parsimonious model. Finally, a summary and discussion are given in Sections 4 and 5, respectively.

### **2. Method**

### *2.1. The Existing NoVaS Method*

The NoVaS method is built on the model-free prediction principle. The main idea lies in applying an invertible transformation $H$ that maps the non-*i.i.d.* vector $\{Y_i\}_{i=1}^{t}$ to a vector $\{\epsilon_i\}_{i=1}^{t}$ that has *i.i.d.* components. This leads to the prediction of $Y_{t+1}$ by inversely transforming the prediction of $\epsilon_{t+1}$ [11]. The starting point to build the transformation of the existing NoVaS method is the ARCH model [12]. Then, Politis [1] made some adjustments to determine the final form of $H$ as:

$$W_t = \frac{Y_t}{\sqrt{\alpha s_{t-1}^2 + \tilde{a}_0 Y_t^2 + \sum_{i=1}^p a_i Y_{t-i}^2}} \text{ for } t = p + 1, \dots, n. \tag{1}$$

In Equation (1), $\{Y_t\}_{t=1}^{n}$ is the log-returns vector in this article; $\{W_t\}_{t=p+1}^{n}$ is the transformed vector, which we hope is *i.i.d.*; $\alpha$ is a fixed scale-invariant constant; $s_{t-1}^2$ is calculated by $(t-1)^{-1} \sum_{i=1}^{t-1} (Y_i - \mu)^2$, with $\mu$ being the mean of $\{Y_i\}_{i=1}^{t-1}$; $\tilde{a}_0$ is the coefficient corresponding with the currently observed value $Y_t^2$. To obtain a qualified transformation function, Equation (2) is required to stabilize the variance.

$$\alpha \in (0, 1),\ \tilde{a}_0 \ge 0,\ a_i \ge 0 \text{ for all } i \ge 1,\ \alpha + \tilde{a}_0 + \sum_{i=1}^p a_i = 1 \tag{2}$$

Then, $\alpha$ and $\tilde{a}_0, a_1, \cdots, a_p$ are finally determined by minimizing $|\text{Kurtosis}(W_t) - 3|$. In practice, the transformed $\{W_t\}$ is usually uncorrelated; see [11] for additional processes for correlated $\{W_t\}$. This method is model-free in the sense that we do not assume any particular distribution for the innovation $\{W_t\}$ except for matching its kurtosis to 3. Once $H$ is found, $H^{-1}$ can be obtained immediately. For example, $H^{-1}$ corresponding with Equation (1) is:

$$Y_t = \sqrt{\frac{W_t^2}{1 - \tilde{a}_0 W_t^2} \Big(\alpha s_{t-1}^2 + \sum_{i=1}^p a_i Y_{t-i}^2\Big)} \text{ for } t = p + 1, \dots, n. \tag{3}$$
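The transformation pair in Equations (1) and (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`novas_scale`, `novas_transform`, `novas_inverse_abs`), the toy return series and the coefficient values are hypothetical and are not taken from the article.

```python
import math

def novas_scale(y, t, alpha, a):
    """Return alpha * s_{t-1}^2 + sum_i a_i * Y_{t-i}^2 for 0-based position t."""
    past = y[:t]                                   # everything observed before position t
    mu = sum(past) / len(past)
    s2 = sum((v - mu) ** 2 for v in past) / len(past)
    lagged = sum(a[i] * y[t - 1 - i] ** 2 for i in range(len(a)))
    return alpha * s2 + lagged

def novas_transform(y, alpha, a0, a):
    """Equation (1): W_t for every position that has p = len(a) lags available."""
    p = len(a)
    return [y[t] / math.sqrt(novas_scale(y, t, alpha, a) + a0 * y[t] ** 2)
            for t in range(p, len(y))]

def novas_inverse_abs(w, scale, a0):
    """Equation (3): recover |Y_t| from W_t (the sign is lost by the square root)."""
    return math.sqrt(w * w / (1.0 - a0 * w * w) * scale)

# Toy series and coefficients (illustrative values only).
y = [0.5, -1.2, 0.3, 2.0, -0.7, 1.1]
alpha, a0, a = 0.4, 0.2, [0.3, 0.1]   # alpha + a0 + sum(a) = 1, as Equation (2) requires
w = novas_transform(y, alpha, a0, a)
recovered = [novas_inverse_abs(w[k], novas_scale(y, k + len(a), alpha, a), a0)
             for k in range(len(w))]
```

In this sketch every transformed value satisfies $|W_t| \le 1/\sqrt{\tilde{a}_0}$, which follows directly from Equation (1), and the inverse returns the magnitude of each original log-return.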

To obtain the prediction of $Y_{n+1}^2$, Politis [11] defined two types of optimal predictors under the $L_1$ (Mean Absolute Deviation) and $L_2$ (Mean Squared Error) criteria after observing the historical information set $\mathcal{F}_n = \{Y_t, 1 \le t \le n\}$:

$L_1$-optimal predictor of $Y_{n+1}^2$:

$$\begin{split} &\text{Median}\left\{Y_{n+1,m}^{2} : m = 1, \cdots, M \,\middle|\, \mathcal{F}_{n} \right\} \\ &= \text{Median}\left\{\frac{W_{n+1,m}^{2}}{1 - \tilde{a}_{0} W_{n+1,m}^{2}} \Big(\alpha s_{n}^{2} + \sum_{i=1}^{p} a_{i} Y_{n+1-i}^{2}\Big) : m = 1, \cdots, M \,\middle|\, \mathcal{F}_{n} \right\} \\ &= \Big(\alpha s_{n}^{2} + \sum_{i=1}^{p} a_{i} Y_{n+1-i}^{2}\Big)\, \text{Median}\left\{\frac{W_{n+1,m}^{2}}{1 - \tilde{a}_{0} W_{n+1,m}^{2}} : m = 1, \cdots, M \right\} \end{split} \tag{4}$$

$L_2$-optimal predictor of $Y_{n+1}^2$:

$$\begin{split} &\text{Mean}\left\{Y_{n+1,m}^{2} : m = 1, \cdots, M \,\middle|\, \mathcal{F}_{n}\right\} \\ &= \text{Mean}\left\{\frac{W_{n+1,m}^{2}}{1 - \tilde{a}_{0} W_{n+1,m}^{2}} \Big(\alpha s_{n}^{2} + \sum_{i=1}^{p} a_{i} Y_{n+1-i}^{2}\Big) : m = 1, \cdots, M \,\middle|\, \mathcal{F}_{n}\right\} \\ &= \Big(\alpha s_{n}^{2} + \sum_{i=1}^{p} a_{i} Y_{n+1-i}^{2}\Big)\, \text{Mean}\left\{\frac{W_{n+1,m}^{2}}{1 - \tilde{a}_{0} W_{n+1,m}^{2}} : m = 1, \cdots, M\right\} \end{split}$$

where $\{W_{n+1,m}\}_{m=1}^{M}$ are generated $M$ times from its empirical distribution or a normal distribution. Here, the normal distribution is an asymptotic limit of the empirical distribution of $\{W_{n+1}\}$. More details about this procedure and multi-step prediction are presented in Section 2.2. $\{Y_{n+1,m}^2\}_{m=1}^{M}$ are given by plugging $\{W_{n+1,m}\}_{m=1}^{M}$ into Equation (3) and setting $t$ as $n + 1$. During the optimization process, different forms of unknown parameters in Equation (2) are applied so that various NoVaS methods are established. Chen [13] pointed out that the Generalized Exponential NoVaS (GE-NoVaS) method with exponentially decayed unknown parameters presented in Equation (5) is superior to other NoVaS-type methods.

$$\alpha \neq 0,\ \tilde{a}_0 = c',\ a_i = c' e^{-ci} \text{ for all } 1 \le i \le p,\ c' = \frac{1-\alpha}{\sum_{i=0}^p e^{-ci}} \tag{5}$$
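As a small illustration of Equation (5), the following Python sketch builds the exponentially decaying GE-NoVaS weights and the kurtosis objective that is minimized when choosing them. The function names and the example values of $\alpha$, $c$ and $p$ are assumptions made here for illustration; the article does not prescribe an implementation.

```python
import math

def ge_novas_coefficients(alpha, c, p):
    """Equation (5): return (a0, [a_1, ..., a_p]) with exponentially decaying weights."""
    c_prime = (1.0 - alpha) / sum(math.exp(-c * i) for i in range(p + 1))
    a0 = c_prime                                    # weight on the current Y_t^2
    a = [c_prime * math.exp(-c * i) for i in range(1, p + 1)]
    return a0, a

def kurtosis(x):
    """Sample kurtosis: fourth central moment divided by the squared variance."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m4 / (m2 ** 2)

def kurtosis_objective(w):
    """The quantity |Kurtosis(W_t) - 3| that the NoVaS fit tries to make small."""
    return abs(kurtosis(w) - 3.0)

# Illustrative values only.
a0, a = ge_novas_coefficients(alpha=0.5, c=0.3, p=5)
```

By construction the weights satisfy the constraint in Equation (2): $\alpha + \tilde{a}_0 + \sum_{i=1}^{p} a_i = 1$. A grid search over $c$ (and $\alpha$) that evaluates `kurtosis_objective` on the transformed series would be one plausible way to carry out the minimization.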

### *2.2. A New Method with Less Parameters*

However, during our investigation, we found that the GE-NoVaS method sometimes returns extremely large predictions under the $L_2$ criterion. The reason for this phenomenon is that the denominator of Equation (3), $1 - \tilde{a}_0 W_t^2$, will be quite small when the magnitude of a generated $W^{\ast}$ (from the empirical or normal distribution) is very close to $1/\sqrt{\tilde{a}_0}$. In this situation, the prediction error will be amplified. Moreover, when long-term-ahead prediction is desired, this amplification will accumulate and the final prediction will be degraded. Therefore, the idea of removing $\tilde{a}_0$ is proposed in this article to avoid such issues. $H$ and $H^{-1}$ of the GE-NoVaS-without-$\tilde{a}_0$ method can be written as below:

$$W_t = \frac{Y_t}{\sqrt{\alpha s_{t-1}^2 + \sum_{i=1}^p a_i Y_{t-i}^2}}; \quad Y_t = \sqrt{W_t^2 \Big(\alpha s_{t-1}^2 + \sum_{i=1}^p a_i Y_{t-i}^2\Big)}; \quad \text{for } t = p + 1, \dots, n. \tag{6}$$

We should notice that even without the $\tilde{a}_0$ term, the causal prediction rule is still satisfied. It is easy to obtain the analytical form of the first-step-ahead $Y_{n+1}$, which can be expressed as below:

$$Y_{n+1} = \sqrt{W_{n+1}^2 \Big(\alpha s_n^2 + \sum_{i=1}^p a_i Y_{n+1-i}^2\Big)}\tag{7}$$

More specifically, when the first-step GE-NoVaS-without-$\tilde{a}_0$ prediction is performed, $\{W^{\ast}_{n+1}\}$ are generated $M$ (i.e., 5000 in this article) times from a standard normal distribution by the Monte Carlo method or bootstrapped from its empirical distribution $\hat{F}_w$, which is calculated from Equation (1). Then, plugging these $\{W^{\ast}_{n+1,m}\}_{m=1}^{M}$ into Equation (7), $M$ pseudo-predictions $\{\hat{Y}^{\ast}_{n+1,m}\}_{m=1}^{M}$ are obtained. According to the strategy implied by Equation (4), we choose the $L_1$ and $L_2$ risk optimal predictors $\hat{Y}^2_{n+1}$ as the sample median and mean of $\{\hat{Y}^{\ast}_{n+1,1}, \cdots, \hat{Y}^{\ast}_{n+1,M}\}$, respectively. We can even predict the general form of $Y_{n+h}$, such as $g(Y_{n+h})$, by adopting the sample mean or median of $\{g(\hat{Y}^{\ast}_{n+1,1}), \cdots, g(\hat{Y}^{\ast}_{n+1,M})\}$. Similarly, the two-steps-ahead $Y_{n+2}$ can be expressed as:

$$Y_{n+2} = \sqrt{W_{n+2}^2 \Big(\alpha s_{n+1}^2 + a_1 Y_{n+1}^2 + \sum_{i=2}^p a_i Y_{n+2-i}^2\Big)}\tag{8}$$

When the prediction of $Y_{n+2}$ is required, $M$ pairs of $\{W^{\ast}_{n+1}, W^{\ast}_{n+2}\}$ are still generated by bootstrapping or the Monte Carlo method from the empirical or standard normal distribution, respectively. $Y^2_{n+1}$ is replaced by the predicted value $\hat{Y}^2_{n+1}$, which is derived from running the first-step GE-NoVaS-without-$\tilde{a}_0$ prediction with simulated $\{W^{\ast}_{n+1,m}\}_{m=1}^{M}$ under the $L_1$ or $L_2$ criterion. Subsequently, we choose the $L_1$ and $L_2$ risk optimal predictors of $Y_{n+2}$ as the sample median and mean of $\{\hat{Y}^{\ast}_{n+2,1}, \cdots, \hat{Y}^{\ast}_{n+2,M}\}$.

Finally, iterating the process described above, we can accomplish multi-step-ahead NoVaS predictions. $Y_{n+h}$, $h \ge 3$, can be expressed as:

$$Y_{n+h} = \sqrt{W_{n+h}^2 \Big(\alpha s_{n+h-1}^2 + \sum_{i=1}^p a_i Y_{n+h-i}^2\Big)}\tag{9}$$

To obtain the prediction of $Y_{n+h}$, we generate $M$ sets of $\{W^{\ast}_{n+1}, \cdots, W^{\ast}_{n+h}\}$ and replace $\{Y_{n+k}\}_{k=1}^{h-1}$ with the NoVaS predicted values $\{\hat{Y}_{n+k}\}_{k=1}^{h-1}$, which are computed iteratively. The $L_1$ and $L_2$ risk optimal predictors of $Y_{n+h}$ are computed by the sample median and mean of $\{\hat{Y}^{\ast}_{n+h,1}, \cdots, \hat{Y}^{\ast}_{n+h,M}\}$. In short, we can summarize that $Y_{n+h}$ is determined by:

$$Y_{n+h} = f_{\text{GE-NoVaS-without-}\tilde{a}_0}(W_{n+1}, \cdots, W_{n+h}; \mathcal{F}_n) \tag{10}$$

Since $\mathcal{F}_n$ is the observed information set, we can simplify the expression of $Y_{n+h}$ as:

$$Y_{n+h} = f_{\text{GE-NoVaS-without-}\tilde{a}_0}(W_{n+1}, \cdots, W_{n+h}) \tag{11}$$

For applying the GE-NoVaS method, we can still build the relationship between $Y_{n+h}$ and $\{W_{n+1}, \cdots, W_{n+h}\}$ as:

$$Y_{n+h} = f_{\text{GE-NoVaS}}(W_{n+1}, \cdots, W_{n+h}) \tag{12}$$

We should notice that the simulated $\{W^{\ast}_{n+1,m}, \cdots, W^{\ast}_{n+h,m}\}_{m=1}^{M}$ for obtaining the GE-NoVaS prediction of $Y_{n+h}$ should be generated by bootstrapping or the Monte Carlo method from the empirical or trimmed standard normal distribution, respectively. The reason for using the trimmed distribution is that $|W_t| \le 1/\sqrt{\tilde{a}_0}$ by Equation (1). Here, we summarize Algorithm 1 to perform $h$-step-ahead time-aggregated prediction using the GE-NoVaS-without-$\tilde{a}_0$ method. The algorithm of GE-NoVaS can be written out similarly.
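The iterative plug-in scheme described above can be sketched in Python as follows. This is a simplified illustration, not the authors' Algorithm 1: the function name `aggregated_forecast` is hypothetical, the $W^{\ast}$ draws come from a standard normal distribution only (no bootstrap), and, as a loudly stated simplification, the running variance is held fixed at $s_n^2$ over the forecast horizon instead of being updated to $s_{n+k-1}^2$.

```python
import random
import statistics

def aggregated_forecast(y, alpha, a, h, M=5000, criterion="L2", seed=0):
    """Sketch of h-step time-aggregated prediction of Y^2 without the a~_0 term.

    Returns (sum of the h single-step predictions, list of single-step predictions).
    """
    rng = random.Random(seed)
    p = len(a)
    mu = sum(y) / len(y)
    s2 = sum((v - mu) ** 2 for v in y) / len(y)    # s_n^2, held fixed (assumption of this sketch)
    sq = [v * v for v in y]                        # observed Y_t^2; predictions get appended
    center = statistics.mean if criterion == "L2" else statistics.median
    preds = []
    for _ in range(h):
        # alpha * s^2 + sum_i a_i * Y_{n+k-i}^2, with future values replaced by predictions
        scale = alpha * s2 + sum(a[i] * sq[-1 - i] for i in range(p))
        draws = [rng.gauss(0.0, 1.0) ** 2 * scale for _ in range(M)]   # pseudo-values of Y^2
        y2_hat = center(draws)                     # L2: sample mean, L1: sample median
        preds.append(y2_hat)
        sq.append(y2_hat)                          # plug the prediction into the next step
    return sum(preds), preds

# Toy usage with illustrative values (alpha + sum(a) = 1).
y = [0.5, -1.2, 0.3, 2.0, -0.7, 1.1, -0.4, 0.9]
total_l2, steps_l2 = aggregated_forecast(y, 0.5, [0.3, 0.2], h=5, M=2000, criterion="L2", seed=1)
total_l1, steps_l1 = aggregated_forecast(y, 0.5, [0.3, 0.2], h=5, M=2000, criterion="L1", seed=1)
```

Because the squared standard normal is right-skewed, the $L_1$ (median) predictor in this sketch is systematically smaller than the $L_2$ (mean) predictor at every step.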

**Remark (The advantage of removing the** $\tilde{a}_0$ **term):** First, after removing the $\tilde{a}_0$ term, the prediction of the NoVaS method under the $L_2$ criterion is more stable. More details will be shown in Section 2.3. Second, removing $\tilde{a}_0$ also reduces the time complexity of our new method. The reason is simple. If we consider the limiting distribution of the $\{W_t\}$ series, $1/\sqrt{\tilde{a}_0}$ is required to be larger than or equal to 3 to ensure that $\{W_t\}$ has a sufficiently large range, i.e., $\tilde{a}_0$ is required to be less than or equal to $1/9 \approx 0.111$ (recall that almost all of the mass of the standard normal distribution lies within $[-3, 3]$). However, the optimal combination of NoVaS coefficients may not render a suitable $\tilde{a}_0$. In this situation, we need to increase the NoVaS transformation order $p$ and repeat the normalizing and variance-stabilizing process until $\tilde{a}_0$ in the optimal combination of coefficients is suitable. This repetition increases the computational workload.
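To make the first point of the remark concrete, the short Python sketch below compares the multiplier $W^2/(1 - \tilde{a}_0 W^2)$ from Equation (3) with the plain $W^2$ multiplier of Equation (6). The value $\tilde{a}_0 = 0.1$ and the helper name `l2_factor` are assumptions chosen purely for illustration.

```python
def l2_factor(w, a0):
    """W^2 / (1 - a0 * W^2): the multiplier in Equation (3) when the a~_0 term is kept."""
    return w * w / (1.0 - a0 * w * w)

a0 = 0.1   # hypothetical value; the bound is then |W| <= 1/sqrt(0.1), roughly 3.162
for w in (1.0, 3.0, 3.1, 3.16):
    # Print the multiplier with and without the a~_0 term for the same draw W.
    print(f"W = {w:4.2f}   with a0: {l2_factor(w, a0):10.2f}   without a0: {w * w:6.2f}")
```

As $|W|$ approaches the bound $1/\sqrt{\tilde{a}_0}$, the multiplier that keeps $\tilde{a}_0$ grows without limit, whereas $W^2$ stays below $1/\tilde{a}_0$. A handful of such draws is enough to dominate a sample mean, which is why the instability shows up under the $L_2$ criterion.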


### *2.3. The Potential Instability of the GE-NoVaS Method*

Next, we provide an illustration to compare the GE-NoVaS and GE-NoVaS-without-$\tilde{a}_0$ methods in predicting the volatility of the Microsoft Corporation (MSFT) daily closing price from 8 January 1998 to 31 December 1999 and show an interesting finding that the long-term time-aggregated predictions of the GE-NoVaS method are unstable under the $L_2$ criterion. Based on the finding of Awartani and Corradi [14], squared log-returns can be used as a proxy for volatility to render a correct ranking of different GARCH models in terms of a quadratic loss function. The log-return series $\{Y_t\}$ can be computed by the equation shown below:

$$Y_t = 100 \times \log(X_{t+1}/X_t) \tag{13}$$

where $\{X_t\}$ is the corresponding MSFT daily closing price series. For achieving a comprehensive comparison, we use 250 financial log-returns as a sliding window to perform POOS 1-step, 5-step and 30-step (long-term) ahead time-aggregated predictions under the $L_2$ criterion. Then, we roll this window through the whole dataset, i.e., we use $\{Y_1, \cdots, Y_{250}\}$ to predict $Y^2_{251}$, $\{Y^2_{251}, \cdots, Y^2_{255}\}$ and $\{Y^2_{251}, \cdots, Y^2_{280}\}$; then, we use $\{Y_2, \cdots, Y_{251}\}$ to predict $Y^2_{252}$, $\{Y^2_{252}, \cdots, Y^2_{256}\}$ and $\{Y^2_{252}, \cdots, Y^2_{281}\}$, for 1-step, 5-step and 30-step aggregated predictions, respectively, and so on. We can define all 1-step, 5-step and 30-step-ahead time-aggregated predictions as $\{\hat{Y}^2_{k,1}\}$, $\{\hat{Y}^2_{i,5}\}$ and $\{\hat{Y}^2_{j,30}\}$, which are presented as below:

Assume that there are a total of $N$ log-return data points:

$$\begin{aligned} \hat{Y}_{k,1}^{2} &= \hat{Y}_{k+1}^{2}, \quad k = 250, 251, \cdots, N - 1\\ \hat{Y}_{i,5}^{2} &= \sum_{m=1}^{5} \hat{Y}_{i+m}^{2}, \quad i = 250, 251, \cdots, N - 5\\ \hat{Y}_{j,30}^{2} &= \sum_{m=1}^{30} \hat{Y}_{j+m}^{2}, \quad j = 250, 251, \cdots, N - 30 \end{aligned} \tag{14}$$

In Equation (14), $\hat{Y}^2_{k+1}$, $\hat{Y}^2_{i+m}$ and $\hat{Y}^2_{j+m}$ are single-step predictions of squared log-returns by the two NoVaS-type methods. To obtain the "Prediction Errors" for the two methods, we can calculate the "loss" by comparing the aggregated prediction results with the realized aggregated values based on Equation (15):

$$L_{p,h} = \sum_{p} \Big(\hat{Y}_{p,h}^2 - \sum_{m=1}^{h} Y_{p+m}^2\Big)^2, \ p \in \{k, i, j\}; \ h \in \{1, 5, 30\} \tag{15}$$

where $\{Y^2_{p+m}\}$ are realized squared log-returns. To show the potential instability of the GE-NoVaS method under the $L_2$ criterion, we take $\alpha$ to be 0.5 to build a toy example. In the algorithm when performing the GE-NoVaS method, $\alpha$ could take an optimal value from a discrete set $\{0.1, \cdots, 0.8\}$ based on the prediction performance.
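The rolling-window evaluation in Equations (13) to (15) can be sketched in Python as below. The function names are hypothetical, and `naive_forecaster` is only a placeholder standing in for a NoVaS or GARCH forecaster; it is not one of the methods compared in the article.

```python
import math

def log_returns(prices):
    """Equation (13): Y_t = 100 * log(X_{t+1} / X_t)."""
    return [100.0 * math.log(prices[t + 1] / prices[t]) for t in range(len(prices) - 1)]

def rolling_aggregated_loss(y, forecaster, window, h):
    """Equation (15): sum of squared gaps between predicted and realized h-step sums of Y^2."""
    loss = 0.0
    for end in range(window, len(y) - h + 1):      # 'end' = number of observations seen so far
        predicted = forecaster(y[end - window:end], h)   # aggregated prediction from the window
        realized = sum(v * v for v in y[end:end + h])    # realized sum of the next h squares
        loss += (predicted - realized) ** 2
    return loss

def naive_forecaster(train, h):
    """Hypothetical placeholder: h times the average squared return in the window."""
    return h * sum(v * v for v in train) / len(train)

# Toy usage: a series whose squared values are constant, so the naive forecast is exact.
flat = [1.0, -1.0] * 20
toy_loss = rolling_aggregated_loss(flat, naive_forecaster, window=10, h=5)
```

With `window = 250` and `h` set to 1, 5 or 30, the loop visits the same window positions as the indices $k$, $i$ and $j$ in Equation (14).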

From Figure 1, we can clearly see that the GE-NoVaS-without-$\tilde{a}_0$ method can better capture different steps' true time-aggregated features. On the other hand, the GE-NoVaS method returns unstable results for 30-step-ahead time-aggregated predictions. Besides, we can see that the 1-step-ahead POOS prediction returned by the GE-NoVaS method is almost a flat curve, which is actually meaningless. Similarly, for the 5-step-ahead time-aggregated prediction case, the POOS prediction of the GE-NoVaS method fails to match the true time-aggregated values.

**Figure 1.** Curves of the true and predicted time-aggregated squared log-returns from GE-NoVaS and GE-NoVaS-without-$\tilde{a}_0$ methods.

### **3. Data Analysis and Results**

To perform extensive data analysis in a bid to validate our method, we deploy POOS predictions using the two NoVaS methods and the standard GARCH(1,1) method with simulated and real-world data. All results are collated in Table 1. The optimal results for each data case are highlighted in bold. For controlling the dependence of the prediction performance on the length of the dataset, we build datasets with two fixed lengths, 250 or 500, to mimic 1-year or 2-year data, respectively. At the same time, we choose the window size for our rollover forecasting analysis to be 100 or 250 for the 1-year or 2-year datasets.

### *3.1. Simulation Study*

We use the same simulation Models 1–4 from [4], shown below, to mimic four 1-year datasets. Recall that one NoVaS method can generate the $L_1$ or $L_2$ predictor and $\{W^{\ast}\}$ can be chosen from a normal distribution or empirical distribution; thus, there are four variants of one specific NoVaS method. We take the best-performing result among four variants of a specific NoVaS method to be its final prediction. Finally, we continue applying the formula in Equation (15) to measure the performance of the different methods, as described in Section 2.3.

**Model 1:** Time-varying GARCH(1,1) with Gaussian errors

$X_t = \sigma_t \epsilon_t$, $\sigma_t^2 = 0.00001 + \beta_{1,t}\sigma_{t-1}^2 + \alpha_{1,t}X_{t-1}^2$, $\{\epsilon_t\} \sim$ *i.i.d.* $N(0, 1)$; $\alpha_{1,t} = 0.1 - 0.05t/n$; $\beta_{1,t} = 0.73 + 0.2t/n$, $n = 250$

**Model 2:** Standard GARCH(1,1) with Gaussian errors

$X_t = \sigma_t \epsilon_t$, $\sigma_t^2 = 0.00001 + 0.73\sigma_{t-1}^2 + 0.1X_{t-1}^2$, $\{\epsilon_t\} \sim$ *i.i.d.* $N(0, 1)$

**Model 3:** (Another) Standard GARCH(1,1) with Gaussian errors

$X_t = \sigma_t \epsilon_t$, $\sigma_t^2 = 0.00001 + 0.8895\sigma_{t-1}^2 + 0.1X_{t-1}^2$, $\{\epsilon_t\} \sim$ *i.i.d.* $N(0, 1)$

**Model 4:** Standard GARCH(1,1) with Student-*t* errors

$X_t = \sigma_t \epsilon_t$, $\sigma_t^2 = 0.00001 + 0.73\sigma_{t-1}^2 + 0.1X_{t-1}^2$, $\{\epsilon_t\} \sim$ *i.i.d.* *t* distribution with five degrees of freedom
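For readers who want to reproduce datasets of this kind, the four models can be simulated with a short Python sketch such as the one below. Several details are assumptions made here rather than specifications from the article: the function name `simulate_garch11`, the initialization of $\sigma_0^2$ and $X_0^2$ at the unconditional variance implied by the first-step coefficients, the absence of a burn-in period, and the use of unscaled Student-*t* draws for Model 4.

```python
import math
import random

def simulate_garch11(n, beta, alpha1, omega=0.00001, df=None, seed=0):
    """Simulate X_t = sigma_t * eps_t with sigma_t^2 = omega + beta*sigma_{t-1}^2 + alpha1*X_{t-1}^2.

    beta and alpha1 may be constants or functions of t (for the time-varying Model 1).
    df=None gives Gaussian errors; df=k gives Student-t errors with k degrees of freedom.
    """
    rng = random.Random(seed)
    b = beta if callable(beta) else (lambda t: beta)
    a = alpha1 if callable(alpha1) else (lambda t: alpha1)
    sigma2 = omega / (1.0 - a(1) - b(1))   # assumed start: unconditional variance
    x_prev2 = sigma2                       # assumed start for X_0^2
    xs = []
    for t in range(1, n + 1):
        sigma2 = omega + b(t) * sigma2 + a(t) * x_prev2
        z = rng.gauss(0.0, 1.0)
        if df is not None:
            # Student-t draw: normal divided by sqrt(chi-square / df); not rescaled to unit variance
            chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
            z = z / math.sqrt(chi2 / df)
        x = math.sqrt(sigma2) * z
        xs.append(x)
        x_prev2 = x * x
    return xs

n = 250
model1 = simulate_garch11(n, beta=lambda t: 0.73 + 0.2 * t / n, alpha1=lambda t: 0.1 - 0.05 * t / n)
model2 = simulate_garch11(n, 0.73, 0.1)
model3 = simulate_garch11(n, 0.8895, 0.1)
model4 = simulate_garch11(n, 0.73, 0.1, df=5)
```

Fixing the seed makes each simulated path reproducible; different seeds would be used to generate replicated datasets.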

**Result analysis:** From the first block of Table 1, we can read that both NoVaS methods are superior to the GARCH(1,1) model. Although these simulated datasets are generated from GARCH(1,1)-type models, the GE-NoVaS-without-$\tilde{a}_0$ method can bring around 66% and 48% improvements compared to the GARCH(1,1) model for 5-step-ahead time-aggregated predictions of Model-4 and Model-1 data, respectively. Notably, GARCH(1,1) brings poor results for the 30-step-ahead time-aggregated predictions of Model-4 simulated data, which implies that such a classical method is impaired by error accumulation problems when long-term predictions are required. On the other hand, the model-free NoVaS method can avoid this issue. Taking a closer look at these results, we can observe that almost all optimal results come from applying the GE-NoVaS-without-$\tilde{a}_0$ method. Moreover, the GE-NoVaS method is surpassed by GARCH(1,1) when forecasting 30-step-ahead time-aggregated Model-2 data. On the other hand, the GE-NoVaS-without-$\tilde{a}_0$ method provides consistently stable results. These results imply that the GE-NoVaS-without-$\tilde{a}_0$ method dominates the GE-NoVaS method when making long-term or short-term time-aggregated predictions. Besides, using the same generated models from the previous study of the NoVaS method [4] ensures fairness. Additionally, with simulation implementations, the ability against model misspecification of NoVaS methods is verified in Appendix A.

### *3.2. A Few Real Datasets*

We also present a variety of real-world datasets of different size and intrinsic behavior:


Taking into account three types of real-world data is necessary to challenge our new method and explore the existing method in different regimes. We also tactically pay more attention to short and volatile data since this is a more challenging task to handle. Equation (13) is continually used to obtain the log-return series of different datasets.

Before comparing in depth the forecasting performance of the NoVaS-type and GARCH methods, we first investigate the properties of the used datasets. From Figure 2, we can see that there were huge variations in the four datasets during 11.2019∼10.2020, which implies the extreme fluctuations in global economics due to the COVID-19 pandemic. We wished to apply such datasets to test whether the NoVaS-type methods can achieve good forecasting performance for such volatile data.

**Figure 2.** Price series of selected 9 datasets.

Besides, it is natural to question whether these datasets are stationary. To be thorough, we choose three statistical tests, namely the Augmented Dickey–Fuller (ADF) test [15], the Phillips–Perron (PP) unit root test [16] and the Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test [17], to check the stationarity of the squared log-returns series of each selected dataset. One aspect that should be noticed is that the number of lags is crucial for the ADF test. If too few lags are included, the remaining serial correlation in the errors will bias the test. If too many are included, the power of the test will suffer. Here, we take the longest lag that is statistically significant. More specifically, we determine this longest lag by observing the last lag that crosses the confidence interval lines of the autocorrelation plot. Besides, we apply a long version of the truncation lag parameter in both the PP and KPSS tests. The results of the three tests are tabulated in Table A4. Combining these results, we can argue that most of the squared log-return series in the normal time period are stationary. However, during the volatile time period, the squared log-returns of IBM, SP500 and Dow Jones are judged to be non-stationary by the ADF test. The KPSS test also returns small *p*-values for these three datasets. These results are consistent with our conjecture that data tend to show non-stationarity during volatile periods. Again, it will be interesting to see if the NoVaS-type methods can offer good forecasting performance for non-stationary data. Recall that Chen and Politis [3] found that the NoVaS methodology generally outperforms the GARCH benchmark on the one-step-ahead point prediction of non-stationary data (involving local stationarity and/or structural breaks). However, they only considered two real-world time series. Here, we extend such empirical study to short- and long-term time-aggregated predictions with sufficient data examples.
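The lag-selection rule described above (take the last lag whose autocorrelation crosses the confidence lines of the autocorrelation plot) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `sample_acf` and `longest_significant_lag`, the search limit `max_lag` and the usual approximate 95% band ±1.96/√n are choices made here, not details taken from the article.

```python
import math


def sample_acf(x, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]


def longest_significant_lag(x, max_lag=20, z=1.96):
    """Largest lag whose autocorrelation lies outside +/- z / sqrt(n).

    Returns 0 when no lag up to max_lag is significant.
    The band z / sqrt(n) is the usual large-sample approximation (an assumption here).
    """
    bound = z / math.sqrt(len(x))
    acf = sample_acf(x, max_lag)
    significant = [k for k, r in enumerate(acf, start=1) if abs(r) > bound]
    return max(significant) if significant else 0
```

In practice, `x` would be the squared log-returns of one dataset, and the returned lag would then be passed to whichever ADF implementation is used.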

**Remark (One ARCH-type model for non-stationary data):** Since our stationarity tests suggest that some series may not be stationary, we can consider applying ARCH-without-intercept, which is a variant of the ARCH model. This variant is non-stationary but stable in the sense that the observed process has a non-degenerate distribution. Moreover, it appears to be an alternative to common stationary but highly persistent GARCH models [18]. Inspired by this ARCH-type model, the NoVaS method may be further improved by removing the corresponding intercept term *αs*<sup>2</sup><sub>*t*−1</sub> in Equations (1) and (6). More empirical experiments could be conducted along this direction.

**Result analysis:** In the last three blocks of Table 1, none of the optimal results comes from the GARCH(1,1) method. When the target data are short and volatile, GARCH(1,1) gives poor results for 30-step-ahead time-aggregated predictions, such as the volatile Djones, CADJPY and IBM cases. Among the two NoVaS methods, the GE-NoVaS-without-*a*˜0 method outperforms the GE-NoVaS method for the three types of real-world data. More specifically, our new method achieves improvements of around 70% and 30% over the existing GE-NoVaS method when forecasting 30-step-ahead time-aggregated volatile Djones and CADJPY data, respectively. We should also notice that the GE-NoVaS method is again surpassed by the GARCH(1,1) model on 30-step-ahead aggregated predictions of 2018∼2019 BAC data. On the other hand, the GE-NoVaS-without-*a*˜0 method performs stably. These comprehensive prediction comparisons address the shortage of empirical analyses of NoVaS methods, and imply that NoVaS-type methods are indeed valid and efficient for real-world short- or long-term predictions of three main types of econometric data. See Appendix A for more results.

### *3.3. Statistical Significance*

However, one may suggest that the victory of our new methods is only specific to these samples. Therefore, we challenge this superiority by testing the statistical significance. Noting that the GE-NoVaS-without-*a*˜0 method is a nested method (taking *a*˜0 = 0 in the larger model) compared with the GE-NoVaS method, we deploy the CW test [19] to ensure that the removing-*a*˜0 idea is also statistically reasonable; see the *p*-value column in Table 1 for the tests' results. The reason for not performing CW tests on the simulation cases is that the prediction performance of each simulation is the average value of 5 replications. These CW test results imply that the null hypothesis should not be rejected for almost all cases under a 5% level of significance, which confirms the equivalence of the new method to the existing one.



*Note:* The values presented in the GE-NoVaS and GE-NoVaS-without-*a*˜0 columns reflect the relative performance compared with the 'standard' GARCH(1,1) method. The null hypothesis of the CW test is that parsimonious and larger models have equal mean squared prediction error (MSPE). The alternative is that the larger model has a smaller MSPE.
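For readers unfamiliar with the CW test, the sketch below computes the adjusted-MSPE statistic in its commonly described form for a parsimonious model nested in a larger one. It is a hedged illustration rather than the authors' code: the function name `cw_test` and its argument names are hypothetical, and the one-sided *p*-value relies on the standard normal approximation.

```python
import math


def cw_test(y, pred_small, pred_large):
    """CW-type adjusted-MSPE test for nested forecasting models.

    y          : realized values
    pred_small : forecasts from the parsimonious (nested) model
    pred_large : forecasts from the larger model
    Returns (statistic, one-sided p-value). A small p-value favors the
    alternative that the larger model has the smaller MSPE.
    """
    p = len(y)
    # Adjusted loss differential: e_small^2 - (e_large^2 - (pred_small - pred_large)^2)
    f = [
        (yt - a) ** 2 - ((yt - b) ** 2 - (a - b) ** 2)
        for yt, a, b in zip(y, pred_small, pred_large)
    ]
    mean_f = sum(f) / p
    var_f = sum((v - mean_f) ** 2 for v in f) / (p - 1)
    stat = mean_f / math.sqrt(var_f / p)
    # Upper-tail probability of a standard normal: 1 - Phi(stat)
    p_value = 0.5 * math.erfc(stat / math.sqrt(2.0))
    return stat, p_value
```

In the setting of this section, `pred_small` would hold the GE-NoVaS-without-*a*˜0 predictions and `pred_large` the GE-NoVaS predictions, so failing to reject the null means there is no evidence that the larger model predicts better.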

### **4. Summary**

In previous studies of NoVaS methods, only a few real-world data analyses were performed [2–4]. Here, we provide extensive data analyses to address the lack of real-world data experiments. Our results are consistent with previous findings and substantiate the effectiveness of the NoVaS method again, i.e., the NoVaS method is more efficient and stable than the classical GARCH method for short-term predictions. Further, we reveal the ability of NoVaS-type methods to perform long-term time-aggregated forecasting. Beyond this, we propose a new NoVaS method that outperforms the state-of-the-art GE-NoVaS method. Our findings in this article are summarized as follows:


### **5. Discussion**

In this article, we explored the GE-NoVaS method toward short and long time-aggregated predictions and proposed a new variant that is based on a parsimonious model, has better empirical performance and yet is statistically reasonable. Although our new method is in a parsimonious form, it still obeys the autoregressive prediction rule and it is more stable for performing predictions under the *L*<sup>2</sup> risk criterion than the current GE-NoVaS method. We should note that the unknown coefficients of both the GE-NoVaS (*a*˜0, *a*1, ··· , *ap*) and GE-NoVaS-without-*a*˜0 (*a*1, ··· , *ap*) methods are in exponential form, which implies that the correlations within the series decay exponentially as the time order increases. However, this specific form is not suitable for predicting all datasets. In other words, we anticipate performing NoVaS prediction without fixing the unknown coefficients in an invariant form, so as to accommodate the variety of real-world econometric datasets. Therefore, building a NoVaS method with a more arbitrary coefficient form can be a future research direction. In addition, we should also note that there is a high demand for efficient forecasting of integer-valued time series data. For example, a relevant topic regarding such integer-valued prediction is forecasting COVID-19 cases. It will be beneficial to develop a variant of NoVaS for integer-valued data. Moreover, in the financial market, stock data move together. Thus, it would be exciting to see if one can perform model-free predictions in a multiple time series scenario. We hope that this article will open up avenues where one can explore other specific transformation structures to improve the existing forecasting frameworks and aid in specific tasks.

From a statistical inference point of view, one can also construct prediction intervals for these predictions using bootstrap. Such prediction intervals are well sought after in the econometrics literature and some results on the asymptotic validity of these can be provided. Additionally, we can also explore dividing the dataset into test and training in some optimal way and see if this can improve the performance of these methods.

In addition, there are some model-free methods based on machine learning to perform prediction tasks. These modern techniques enjoy high accuracy, but are time-consuming and lack statistical inference. On the other hand, our new method and existing NoVaS methods are time-efficient and outperform classical GARCH-type methods significantly. More importantly, NoVaS-type methods can provide concrete statistical inference. Thus, it will be interesting to compare the forecasting accuracy of NoVaS-type methods with that of machine-learning-based methods.

**Author Contributions:** Data curation, K.W. and S.K.; Formal analysis, S.K.; Investigation, K.W. and S.K.; Methodology, S.K.; Software, K.W. and S.K.; Visualization, K.W.; Writing—original draft, K.W.; Writing—review and editing, S.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** The second author's research is partially funded by NSF-DMS 2124222.

**Data Availability Statement:** We collected all data presented here from www.investing.com (accessed on 3 November 2021) manually. Then, we transformed the closing price data into financial log-returns based on Equation (13).

**Acknowledgments:** We are thankful to the two anonymous referees who helped us to improve the paper significantly. The first author is thankful to Professor Politis for the introduction to the topic and useful discussions.

**Conflicts of Interest:** The authors declare no conflicts of interest.

### **Appendix A. Additional Simulation Study and Data Analysis Results**

*Appendix A.1. Additional Simulation Study: Model Misspecification*

In the real world, it is difficult to convincingly state whether the data obey one particular type of GARCH model, so we wish to provide four more GARCH-type models to simulate one-year datasets to see if our methods are satisfactory regardless of the underlying distribution and GARCH-type model. The simulation study results are presented in Table A1, which implies that the NoVaS-type methods are more robust against model misspecification and GE-NoVaS-without-*a*˜0 is the best method.

**Model 5:** Another time-varying GARCH(1,1) with Gaussian errors

$$
X_t = \sigma_t \epsilon_t, \quad \sigma_t^2 = \omega_{0,t} + \beta_{1,t}\sigma_{t-1}^2 + \alpha_{1,t}X_{t-1}^2, \quad \{\epsilon_t\} \sim i.i.d.\ N(0,1)
$$

$$
g_t = t/n; \quad \omega_{0,t} = -4\sin(0.5\pi g_t) + 5; \quad \alpha_{1,t} = -1(g_t - 0.3)^2 + 0.5; \quad \beta_{1,t} = 0.2\sin(0.5\pi g_t) + 0.2, \quad n = 250
$$

**Model 6:** Exponential GARCH(1,1) with Gaussian errors

$$
X_t = \sigma_t \epsilon_t, \quad \log(\sigma_t^2) = 0.00001 + 0.8895\log(\sigma_{t-1}^2) + 0.1\epsilon_{t-1} + 0.3\left(|\epsilon_{t-1}| - E|\epsilon_{t-1}|\right), \quad \{\epsilon_t\} \sim i.i.d.\ N(0,1)
$$

**Model 7:** GJR-GARCH(1,1) with Gaussian errors

$$
X_t = \sigma_t \epsilon_t, \quad \sigma_t^2 = 0.00001 + 0.5\sigma_{t-1}^2 + 0.5X_{t-1}^2 - 0.5I_{t-1}X_{t-1}^2, \quad \{\epsilon_t\} \sim i.i.d.\ N(0,1)
$$

$$
I_t = 1 \text{ if } X_t \le 0; \quad I_t = 0 \text{ otherwise}
$$

**Model 8:** Another GJR-GARCH(1,1) with Gaussian errors

$$
X_t = \sigma_t \epsilon_t, \quad \sigma_t^2 = 0.00001 + 0.73\sigma_{t-1}^2 + 0.1X_{t-1}^2 + 0.3I_{t-1}X_{t-1}^2, \quad \{\epsilon_t\} \sim i.i.d.\ N(0,1)
$$

$$
I_t = 1 \text{ if } X_t \le 0; \quad I_t = 0 \text{ otherwise}
$$
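To make the simulation setup concrete, the following sketch generates a path from a GJR-GARCH(1,1) recursion of the kind used in Models 7 and 8, with the Model 8 coefficients as defaults. It is an illustration only: the function name `simulate_gjr_garch`, the starting values (previous observation set to zero, initial variance set to the unconditional level when it exists) and the seed are assumptions made here and are not stated in the article.

```python
import math
import random


def simulate_gjr_garch(n, omega=0.00001, beta=0.73, alpha=0.1, gamma=0.3, seed=0):
    """Simulate X_t = sigma_t * eps_t with
    sigma_t^2 = omega + beta*sigma_{t-1}^2 + alpha*X_{t-1}^2 + gamma*I_{t-1}*X_{t-1}^2,
    where I_t = 1 if X_t <= 0 and eps_t are i.i.d. N(0, 1).
    Returns the simulated series and the conditional variances.
    """
    rng = random.Random(seed)
    # Assumed start: unconditional variance if the process is covariance-stationary.
    persistence = beta + alpha + gamma / 2.0
    sigma2 = omega / (1.0 - persistence) if persistence < 1.0 else omega
    x_prev = 0.0  # assumed initial observation
    xs, sigma2s = [], []
    for _ in range(n):
        indicator = 1.0 if x_prev <= 0 else 0.0
        sigma2 = omega + beta * sigma2 + (alpha + gamma * indicator) * x_prev ** 2
        x_prev = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        xs.append(x_prev)
        sigma2s.append(sigma2)
    return xs, sigma2s
```

Calling `simulate_gjr_garch(250)` gives one series of one-year length; passing `beta=0.5, alpha=0.5, gamma=-0.5` would correspond to the Model 7 coefficients.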

**Table A1.** Comparisons of different methods' forecasting performance on simulated 1-year data.


*Appendix A.2. Additional Data Analysis: 1-Year Datasets*

To make our data analysis more comprehensive, we present more results of predictions on 1-year real-world datasets in Table A2. One interesting finding is that the GE-NoVaS method is significantly outperformed by the GARCH(1,1) model in some cases, such as the BAC, TSLA and Smallcap datasets. The GE-NoVaS-without-*a*˜0 method still maintains great forecasting performance.

**Table A2.** Comparisons of different methods' forecasting performance on real-world 1-year data.




### *Appendix A.3. Additional Data Analysis: Volatile 1-Year Datasets*

Similarly, we consider more volatile 1-year datasets. All prediction results are tabulated in Table A3. It is clear that both NoVaS-type methods still outperform the GARCH(1,1) model for short- and long-term time-aggregated forecasting. Although the GE-NoVaS method yields optimal performance in some cases, we should note that the GE-NoVaS-without-*a*˜0 method still gives almost the same but slightly worse results. Interestingly, the GE-NoVaS-without-*a*˜0 method can introduce a significant improvement compared with the GE-NoVaS method for 30-step-ahead predictions. This again hints towards the superior robustness of our new method specifically for long-term aggregated predictions.

**Table A3.** Comparisons of different methods' forecasting performance on volatile 1-year data.


### **Appendix B. Stationarity Test Results of Some Real-World Datasets**

**Table A4.** *p*-values of three stationarity tests.


*Note:* The null hypothesis of the ADF and PP tests is that the tested series is non-stationary. Therefore, if the null hypothesis of the ADF or PP test is rejected, the tested series is deemed stationary. On the other hand, the null hypothesis of KPSS is that the series is stationary.

### **References**


### *Article* **Bootstrapped Holt Method with Autoregressive Coefficients Based on Harmony Search Algorithm**

**Eren Bas <sup>1</sup>, Erol Egrioglu <sup>1,</sup>\* and Ufuk Yolcu <sup>2</sup>**


**Abstract:** Exponential smoothing methods are among the classical time series forecasting methods, and they are well known to be powerful. In these methods, the exponential smoothing parameters are fixed over time, and they should be estimated with efficient optimization algorithms. A suitable exponential smoothing method should be chosen according to the components of the time series. The Holt method can produce successful forecasting results for time series that have a trend. In this study, the Holt method is modified by using time-varying smoothing parameters instead of parameters fixed over time. Smoothing parameters are obtained for each observation from first-order autoregressive models. The parameters of the autoregressive models are estimated by using a harmony search algorithm, and the forecasts are obtained with a subsampling bootstrap approach. The main contribution of the paper is to consider time-varying smoothing parameters with autoregressive equations and to use the bootstrap method in an exponential smoothing method. Real-world time series are used to show the forecasting performance of the proposed method.

**Keywords:** Holt method; subsampling bootstrapped; harmony search algorithm; forecasting

### **1. Introduction**

Exponential smoothing methods were published in the late 1950s [1–3], and they are known as some of the most successful forecasting methods in the literature. There are many exponential smoothing methods in the literature, such as the single exponential smoothing method, Holt method, Holt-Winters method, etc. Each exponential smoothing method is used in a different situation. If the data have no trend and no seasonality, the simple exponential smoothing method is used for forecasting. If the data have a linear trend and no seasonality, the Holt method is used. If the data have both trend and seasonality, the Holt-Winters method is used. In later years, the damped trend model was proposed by [4] for data with an over-trend. Exponential smoothing methods are popular in the literature because their forecasting success is superior to that of complicated approaches such as [5–7]. In addition to these methods, [8] proposed a simple modification of the exponential smoothing method named the ATA method, which is effective and simple to use compared with the complex approaches of recent years.

Moreover, refs. [9,10] developed state-of-the-art guidelines for the application of the exponential smoothing methodology. Ref. [11] proposed a uniformly-sampled autoregressive-moving-average model for a second-order linear stochastic system. Ref. [12] introduced the optimal procedure of the Boolean Kalman filter over a finite horizon. Ref. [13] presented a general benchmarking framework applicable to computational intelligence algorithms for solving forecasting problems. Ref. [14] proposed a new enhanced optimization model based on the bagged echo state network and improved by a differential evolution algorithm to estimate energy consumption. Ref. [15] introduced a two-stage Bayesian optimization framework for scalable and efficient inference in state-space models.

**Citation:** Bas, E.; Egrioglu, E.; Yolcu, U. Bootstrapped Holt Method with Autoregressive Coefficients Based on Harmony Search Algorithm. *Forecasting* **2021**, *3*, 839–849. https://doi.org/10.3390/forecast3040050

Academic Editor: Kuo-Ping Lin

Received: 20 September 2021 Accepted: 3 November 2021 Published: 4 November 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The method proposed by [2] is one of the effective exponential smoothing methods for forecasting data with trend. The Holt method has a forecasting equation and two smoothing equations, which are for the level of the series and slope of the trend as given in Equations (1)–(3).

$$
\hat{x}_{n+1} = \hat{l}_n + \hat{b}_n \tag{1}
$$

$$
\hat{l}_n = \lambda_1 x_n + (1 - \lambda_1)\left(\hat{l}_{n-1} + \hat{b}_{n-1}\right) \tag{2}
$$

$$
\hat{b}_n = \lambda_2 \left( \hat{l}_n - \hat{l}_{n-1} \right) + (1 - \lambda_2) \hat{b}_{n-1} \tag{3}
$$

In Equations (1)–(3), *λ*<sub>1</sub> and *λ*<sub>2</sub> are the smoothing parameters of the mean level and slope, respectively, and these parameters take values between zero and one. In these equations, the initial values are obtained by applying simple linear regression to the series. In addition, the trend and level update formulas are based on only a single lag.
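As a small illustration of the classical Holt recursion, the sketch below uses the standard form in which the level is smoothed toward the previous level plus trend, the trend is smoothed toward the latest change in level, and the one-step forecast is their sum. The function names `linear_fit` and `holt_forecast` are hypothetical, and taking the regression intercept and slope as the starting level and trend is one plausible reading of "initial values obtained by simple linear regression", not a detail confirmed by the text.

```python
def linear_fit(y):
    """Ordinary least squares intercept and slope of y regressed on t = 1, ..., n."""
    n = len(y)
    t_mean = (n + 1) / 2
    y_mean = sum(y) / n
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y, start=1))
    sxx = sum((t - t_mean) ** 2 for t in range(1, n + 1))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


def holt_forecast(y, lam1, lam2):
    """One-step-ahead Holt forecast with fixed smoothing parameters lam1, lam2 in (0, 1)."""
    level, trend = linear_fit(y)  # assumed initial level (at t = 0) and trend
    for obs in y:
        prev_level = level
        # Level update: blend the new observation with the previous one-step forecast
        level = lam1 * obs + (1 - lam1) * (prev_level + trend)
        # Trend update: blend the latest level change with the previous trend
        trend = lam2 * (level - prev_level) + (1 - lam2) * trend
    return level + trend
```

For an exactly linear series such as `[2, 4, 6, 8, 10]`, the recursion reproduces the line for any choice of smoothing parameters, so the forecast is 12.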

In this study, the Holt method is modified by using time-varying smoothing parameters instead of parameters fixed over time, and the smoothing parameters of the mean level and slope are obtained for each observation with first-order autoregressive models. The parameters of the autoregressive models are estimated by using the harmony search algorithm (HSA). With these contributions, the proposed method eliminates the initial parameter determination problem. Moreover, the forecasts for the proposed method are obtained from sampling distributions of forecasts.

The proposed method is applied to Istanbul Stock Exchange data sets between the years 2000 and 2017 with different test lengths. The obtained results are compared with many methods in the literature. Brief information on HSA is given in Section 2. The proposed method is introduced in Section 3, and the implementation results are given in Section 4. The final section presents the conclusion and discussion.

### **2. Harmony Search Algorithm**

The HSA was proposed by [16]. HSA is a heuristic algorithm that simulates the notes of musicians. The principle of HSA is that the musicians in an orchestra play the best melody harmonically with the notes they play. Just as a chromosome in the genetic algorithm or a particle in particle swarm optimization represents a solution, a harmony in a harmony memory represents a solution in the harmony search algorithm. In HSA, each musician has a decision variable and each note in the memory of each musician corresponds to a different solution of that decision variable. Each harmony consists of different notes and each note corresponds to the decision variable. HSA aims to investigate whether the obtained solution vector is better than the worst solution in memory. The steps of HSA are given in Algorithm 1.

**Algorithm 1** The algorithm of HSA

Step 1. Determination of parameters to be used in HSA:

• XHM: Harmony memory;



Step 2. Creating of the harmony memory.

HM for HSA is generated as in Equation (4).


Here, *x*<sub>*ij*</sub>, *i* = 1, 2, …, *HMS*; *j* = 1, 2, …, *n* is expressed as a note value and is generated randomly.

In HSA, each solution vector is denoted by *x*′<sub>*i*</sub>, *i* = 1, 2, …, *HMS*. In HSA, there are HMS solution vectors. The representation of the first solution vector is given in Equation (5).

$$x'_1 = [x_{11}, x_{12}, \dots, x_{1n}] \tag{5}$$

Step 3. Calculation of objective function values.

The objective function values are calculated for each solution vector generated randomly as given in Equation (6).

$$
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{HMS,1} & x_{HMS,2} & \cdots & x_{HMS,n}
\end{bmatrix}
=
\begin{bmatrix}
x'_1 \\
x'_2 \\
\vdots \\
x'_{HMS}
\end{bmatrix}
\Rightarrow
\begin{bmatrix}
f(x'_1) \\
f(x'_2) \\
\vdots \\
f(x'_{HMS})
\end{bmatrix}
\tag{6}
$$

Step 4. Improvement of a new harmony.

*HMCR*, which takes a value between 0 and 1, is the probability of selecting a value from the existing values in the HM, while (1 − *HMCR*) is the probability of selecting a random value from the range of possible values. The new harmony is obtained with the help of Equation (7).

$$x_{ij}^{new} = \begin{cases} x_{ij}^{new} \in \left\{ x_{ij};\ i = 1, 2, \dots, HMS \right\} & \text{if } rnd < HMCR \\ x_{ij}^{new} \in \left[ \min_{i}(x_{ij}), \max_{i}(x_{ij}) \right],\ i = 1, 2, \dots, HMS & \text{otherwise} \end{cases} \tag{7}$$

The *PAR* parameter determines whether the pitch adjustment (toning) process is applied to each decision variable selected with probability *HMCR*, as given in Equation (8).

$$\text{Pitch adjustment for } x_{ij}^{new} = \begin{cases} \text{Yes} & \text{if } rnd < PAR \\ \text{No} & \text{otherwise} \end{cases} \tag{8}$$

In Equation (8), *rnd* is a random number drawn from *U*(0, 1). If this random number is smaller than the *PAR* value, the selected value is replaced by a neighboring value. If the pitch adjustment is applied to a decision variable *x*<sub>*ij*</sub><sup>*new*</sup> whose value is the *k*th element of the vector of possible values of that variable, the new value is *x*<sub>*ij*</sub>(*k* + *m*), where *m* ∈ {…, −2, −1, 1, 2, …} is the neighboring index.

Step 5. Updating the harmony memory.

If the new harmony vector is better than the worst vector in the *HM*, the worst vector is removed from the memory, and the new harmony vector is included in the *HM* instead of the removed vector.

Step 6. Stop condition check.

Steps 4–6 are repeated until the termination criteria are met. Typical values for HMCR and PAR in the literature lie in the ranges 0.7–0.95 and 0.05–0.7, respectively [17].
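The steps of Algorithm 1 can be sketched compactly for a continuous minimization problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `harmony_search`, the default parameter values, the fixed iteration count used as the stopping rule, and the continuous pitch adjustment (a uniform perturbation scaled by a bandwidth `bw`, in place of the neighboring-index rule for discrete values) are all choices made here.

```python
import random


def harmony_search(f, bounds, hms=10, hmcr=0.9, par=0.3, bw=0.05, iters=2000, seed=0):
    """Minimize f over a box given by bounds = [(low, high), ...]."""
    rng = random.Random(seed)
    # Step 2: create the harmony memory with random note values
    memory = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(hms)]
    # Step 3: objective function value of every harmony
    fitness = [f(h) for h in memory]
    for _ in range(iters):
        # Step 4: improvise a new harmony, one decision variable at a time
        new = []
        for j, (lo, hi) in enumerate(bounds):
            if rng.random() < hmcr:
                value = memory[rng.randrange(hms)][j]  # memory consideration
                if rng.random() < par:
                    # Pitch adjustment (assumed continuous form), clipped to the box
                    value += rng.uniform(-bw, bw) * (hi - lo)
                    value = min(max(value, lo), hi)
            else:
                value = rng.uniform(lo, hi)  # random selection
            new.append(value)
        # Step 5: replace the worst harmony if the new one is better
        new_fit = f(new)
        worst = max(range(hms), key=fitness.__getitem__)
        if new_fit < fitness[worst]:
            memory[worst], fitness[worst] = new, new_fit
    # Step 6 is the fixed iteration budget; return the best harmony found
    best = min(range(hms), key=fitness.__getitem__)
    return memory[best], fitness[best]
```

Because only the worst harmony is ever replaced, and only by a better one, the best objective value in memory can never get worse from one iteration to the next.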

### **3. Proposed Method**

Although the Holt method is used as an efficient forecasting method, it has some evident problems that need to be resolved. The first of these problems is the determination of the initial trend and level values. The second problem of the Holt method is that the trend and level update formulas are based on only a single lag. To avoid these problems and increase the forecasting performance of the Holt method, the advantages and innovations of the proposed method are given step by step as below:


The algorithm of the proposed method is also given in Algorithm 2.

**Algorithm 2** The algorithm of the proposed method

Step 1. Determine the parameters of the training process:


Step 2. Select bootstrap samples from the training set randomly.

Steps 2.1 to 2.2 are repeated *nbst* times. *x*<sup>∗</sup><sub>*t*,*j*</sub> represents the *j*th bootstrap time series.

Step 2.1. Select a starting point of the block (*spb*) as an integer from a discrete uniform distribution with parameters [1, *ntrain* − *bss* + 1].

Step 2.2. Create bootstrap time series as given in Equation (9).

$$x_{t,j}^{*} = \left\{ x_{spb}, x_{spb+1}, \dots, x_{spb+bss-1} \right\}, \qquad j = 1, 2, \dots, nbst \tag{9}$$

Step 3. Apply regression analysis to determine the initial bounds for level (*L*(0)) and trend (*B*(0)) parameters by using the *x*<sup>∗</sup><sub>*t*,*j*</sub> bootstrap time series as the training set by using Equations (10)–(12).

$$X = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 2 & \cdots & bss \end{bmatrix}'_{bss \times 2} \tag{10}$$

$$x^{*} = x_{t,j}^{*} = \begin{bmatrix} x_{spb}, x_{spb+1}, \dots, x_{spb+bss-1} \end{bmatrix}' \tag{11}$$

$$\hat{\beta} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} = (X'X)^{-1}X'x^{*} \tag{12}$$

The resulting bounds are *L*(0) ∈ (*β̂*<sub>0</sub>/2, 2*β̂*<sub>0</sub>) for the level and *B*(0) ∈ (*β̂*<sub>1</sub>/2, 2*β̂*<sub>1</sub>) for the trend.

Step 4. HSA is used to obtain the optimal parameters of the Holt method with autoregressive coefficients for each bootstrap time series. Steps 4.1 to 4.4 are repeated for each bootstrap time series.

Step 4.1. Generate the initial positions of HSA. The positions of harmony are *L*(0), *B*(0), *λ*<sub>1</sub>(0), *λ*<sub>2</sub>(0), *φ*<sub>11</sub>, *φ*<sub>12</sub>, *φ*<sub>21</sub> and *φ*<sub>22</sub>.

*L*(0) and *B*(0) are generated from *U*(*β̂*<sub>0</sub>/2, 2*β̂*<sub>0</sub>) and *U*(*β̂*<sub>1</sub>/2, 2*β̂*<sub>1</sub>), respectively. *λ*<sub>1</sub>(0), *λ*<sub>2</sub>(0), *φ*<sub>11</sub> and *φ*<sub>21</sub> are generated from *U*(0, 1). *φ*<sub>12</sub> and *φ*<sub>22</sub> are generated from *U*(−1, 1). The creation of the harmony memory for the proposed method is given in Equation (13), and the parameters that correspond to the *k*th harmony are given in Table 2.

$$HM = \begin{bmatrix} x_1^1 & x_2^1 & x_3^1 & \cdots & x_8^1 \\ x_1^2 & x_2^2 & x_3^2 & \cdots & x_8^2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_1^{HMS} & x_2^{HMS} & x_3^{HMS} & \cdots & x_8^{HMS} \end{bmatrix} \tag{13}$$


Step 4.2. The fitness function is calculated for the initial positions of each harmony. The root mean square error (RMSE) is used as the fitness function and is calculated as given in Equation (14).

$$f_i = RMSE_i = \sqrt{\frac{1}{bss} \sum_{t=1}^{bss} \left( x_{t,j}^{*} - \hat{x}_{t,j}^{*} \right)^2}, \quad i = 1, 2, \dots, HMS \tag{14}$$

In Equation (14), *x̂*<sup>∗</sup><sub>*t*,*j*</sub> is the output for the *j*th bootstrap time series data and *k*th harmony. *x̂*<sup>∗</sup><sub>*t*,*j*</sub> is obtained by using Equations (15)–(19).

$$
\lambda_1(t) = \phi_{11} + \phi_{12}\lambda_1(t-1) \tag{15}
$$

$$
\lambda_2(t) = \phi_{21} + \phi_{22}\lambda_2(t-1) \tag{16}
$$

$$L(t) = \lambda_1(t)x_{t,j}^{*} + (1 - \lambda_1(t))(L(t-1) + B(t-1)) \tag{17}$$

$$B(t) = \lambda_2(t)(L(t) - L(t-1)) + (1 - \lambda_2(t))B(t-1) \tag{18}$$

$$\hat{x}_{t+1,j}^{*} = L(t) + B(t) \tag{19}$$

Obtain RMSE values for each harmony, and save the best harmony which has the smallest RMSE.

Step 4.3. Improve new harmony.

*HMCR* shows the probability that the value of a decision variable is selected from the current harmony memory. (1 − *HMCR*) represents the random selection of the new decision variable from the existing solution space. *x*′<sub>*i*</sub> shows the new harmony, obtained as in Equation (20).

$$x'_{i} = \begin{cases} x'_{i} \in \left\{ x^{1}_{i}, x^{2}_{i}, \dots, x^{HMS}_{i} \right\} & \text{if } rnd < HMCR \\ x'_{i} \in X_{i} & \text{otherwise} \end{cases} \tag{20}$$

After this step, each decision variable is evaluated to determine whether a pitch (tone) adjustment is necessary. This is determined by the PAR parameter, which is the tone adjustment ratio. The new harmony vector is produced according to the randomly selected tones in the harmony memory as given in Equation (21). Whether the variables are selected from the harmony memory is determined by the HMCR ratio, which is between 0 and 1.

$$x'_{i} = \begin{cases} x'_{i} + rnd(0,1) \cdot bw & \text{if } rnd < PAR \\ x'_{i} & \text{otherwise} \end{cases} \tag{21}$$

*bw* is a bandwidth selected randomly; *rnd*(0, 1) represents a random number generated between 0 and 1.

Step 4.4. Harmony memory update.

In this step, the comparison between the newly created harmonies and the worst harmonies in the memory is made in terms of the values of the objective functions. If the newly created harmony vector is better than the worst harmony, the worst harmony vector is removed from the memory, and the new harmony vector is substituted for it.

Calculate RMSE values for the *j*th bootstrap time series data and *k*th harmony. Find the best harmony, which has the minimum RMSE value for the *j*th bootstrap time series data.

Step 5. Calculate the forecasts for test data by using the best harmony for each bootstrap sample and their statistics. The forecast obtained from the updated equations for the *j*th bootstrap time series at time *t* is represented by *F*<sub>*t*</sub><sup>*i*</sup>. Forecasts and their statistics are calculated just as in Table 1. In addition, the flowchart of the proposed method is given in Figure 1.
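Two building blocks of Algorithm 2 can be sketched briefly: drawing one subsampling bootstrap block (Steps 2.1 and 2.2) and evaluating the fitness of one harmony with the autoregressive smoothing parameters (Step 4.2). The sketch is illustrative and rests on assumptions: the function names `bootstrap_block` and `ar_holt_rmse` are hypothetical, indices are 0-based, the one-step forecast of each observation is taken as the previous level plus the previous trend, and the smoothing parameters are not clipped to [0, 1] after the autoregressive update because the text does not say whether they are.

```python
import math
import random


def bootstrap_block(x_train, bss, rng):
    """Contiguous block of length bss whose start is drawn uniformly.

    0-based counterpart of drawing spb from {1, ..., ntrain - bss + 1}.
    """
    spb = rng.randint(0, len(x_train) - bss)
    return x_train[spb:spb + bss]


def ar_holt_rmse(block, params):
    """RMSE fitness of one harmony and the forecast for the next time point.

    params = (L0, B0, lam1_0, lam2_0, phi11, phi12, phi21, phi22).
    """
    level, trend, lam1, lam2, phi11, phi12, phi21, phi22 = params
    sq_err = 0.0
    for obs in block:
        forecast = level + trend            # forecast made before seeing obs
        sq_err += (obs - forecast) ** 2
        lam1 = phi11 + phi12 * lam1         # AR(1) update of the level parameter
        lam2 = phi21 + phi22 * lam2         # AR(1) update of the trend parameter
        prev_level = level
        level = lam1 * obs + (1 - lam1) * (prev_level + trend)
        trend = lam2 * (level - prev_level) + (1 - lam2) * trend
    return math.sqrt(sq_err / len(block)), level + trend
```

Any optimizer over the eight parameters, such as a harmony search, can then minimize the first returned value for each bootstrap block, and the second returned value gives that block's forecast.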


**Table 1.** Forecasts for bootstrap samples.

**Figure 1.** The flowchart of the proposed method.

**Table 2.** The parameters corresponding to *k*th harmony.


### **4. Applications**

To evaluate its performance, the proposed method is applied to the Istanbul Stock Exchange (BIST) data sets observed daily between the years 2000 and 2017, with test lengths of 10 and 20. The proposed method is compared with the ATA method proposed by [8], the Holt method, the fuzzy regression functions approach (FF) proposed by [18], random walk (RW), multilayer perceptron artificial neural networks (MLP-ANN) and the adaptive neural-fuzzy inference systems (ANFIS) method proposed by [19]. For a fair comparison of the methods, we used both statistical and computational intelligence forecasting methods. While the random walk was used as a simple forecasting method, the Holt and ATA methods were used as statistical forecasting methods. Moreover, the MLP-ANN, ANFIS, and FF methods were used as computational intelligence forecasting methods. In the analysis process, the number of bootstrap samples and the bootstrap sample size are both set to 100 for each data set. The RMSE and MAPE criteria were used for the comparison of the methods. The mean absolute percentage error (MAPE) is one of the most widely used measures of forecast accuracy, due to its advantages of scale-independency and interpretability [20]. The use of RMSE is very common, and it is considered an excellent general-purpose error metric for numerical predictions [21]. Table 3 gives all analysis results for each data set for the RMSE criterion when the length of the test set is 10.


**Table 3.** All analysis results for each data set for RMSE criterion when the length of the test set is 10.

In Table 3, the proposed method has a 59% success rate compared with the other methods in terms of the RMSE criterion when the length of the test set is 10. To see the actual comparison results of the proposed method with other methods, we compare the rank values of each method and obtain the average rank values. For this purpose, we rank the methods according to their success for each time series analyzed. In such a ranking, the method with the lowest RMSE value is named the best method and takes the rank value 1. All methods were ranked in this way for the RMSE criterion when the length of the test set is 10, and the average rank values were obtained as in Figure 2.

**Figure 2.** The average rank values of each method for RMSE criterion when the length of the test set is 10.
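The ranking procedure behind the average rank values can be sketched as follows. This is an illustration only: the function name `average_ranks` and the input layout are hypothetical, and ties are given the average of the tied positions, which is an assumption since the text does not say how ties are handled.

```python
def average_ranks(errors):
    """Average rank of each method across data sets (rank 1 = lowest error).

    errors maps a method name to a list of error values, one per data set,
    with all lists in the same data set order.
    """
    methods = list(errors)
    n_sets = len(errors[methods[0]])
    totals = {m: 0.0 for m in methods}
    for d in range(n_sets):
        values = sorted(errors[m][d] for m in methods)
        for m in methods:
            v = errors[m][d]
            first = values.index(v) + 1   # best position occupied by this value
            count = values.count(v)       # number of methods sharing this value
            totals[m] += first + (count - 1) / 2  # mid-rank for ties (assumed)
    return {m: totals[m] / n_sets for m in methods}
```

For example, with purely illustrative numbers, `average_ranks({"A": [1.0, 2.0], "B": [2.0, 1.0]})` assigns both methods an average rank of 1.5.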

From Figure 2, it is seen that the proposed method has the lowest average rank value among all methods, and the proposed method is the best method for the RMSE criterion when the length of the test set is 10. In addition, Table 4 gives all analysis results for each data set for the MAPE criterion given in Equation (22) when the length of the test set is 10.

$$MAPE = \frac{1}{ntest} \sum_{t=1}^{ntest} \left| \frac{X_t - \hat{X}_t}{X_t} \right| \tag{22}$$
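The two comparison criteria are straightforward to compute. In the sketch below, the function names `rmse` and `mape` are hypothetical, and MAPE is returned as a fraction, as Equation (22) is written, rather than multiplied by 100.

```python
import math


def rmse(actual, forecast):
    """Root mean square error over the test set."""
    n = len(actual)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n)


def mape(actual, forecast):
    """Mean absolute percentage error as a fraction; actual values must be nonzero."""
    n = len(actual)
    return sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / n
```

With actual values `[100, 200]` and forecasts `[110, 180]`, each relative error is 0.1, so `mape` gives 0.1, while `rmse` gives the square root of 250.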

**Table 4.** All-analysis results for each data set for MAPE criterion when the length of the test set is 10.


In Table 4, the proposed method has 39% success compared with the other methods in terms of the MAPE criterion when the test set is 10. Looking at the rank evaluation results for the MAPE criterion when the test set length is 10 given in Figure 3, it is seen that the proposed method is in third place among all methods.

**Figure 3.** The average rank values of each method for MAPE criterion when the length of the test set is 10.

Table 5 also gives the all-analysis results for each data set for the RMSE criterion when the length of the test set is 20. In Table 5, the proposed method has a 61% success rate. Considering the situations where the proposed method is not the best, it stands out as the second-best method in many time-series analyses. Moreover, the rank evaluation results for all methods for the RMSE criterion when the length of the test set is 20 are given in Figure 4. In addition, Table 6 gives the all-analysis results for each data set for the MAPE criterion when the length of the test set is 20.


**Table 5.** All-analysis results for each data set for RMSE criterion when the length of the test set is 20.

**Figure 4.** The average rank values of each method for RMSE criterion when the length of the test set is 20.

When the analysis results given in Table 6 are examined, even in the analyses in which the proposed method is not the best method, the proposed method often appears to be either the second-best or third-best method. We examine rank values to verify and highlight these results given in Figure 5.

Considering the average rank obtained from all methods, it can be said that the proposed method for the MAPE criterion has more successful results than other methods. As a final comment, when all analysis results are examined, it can be said from both average rank results and analysis results that the proposed method is a more successful method than other methods used in the comparison.


**Table 6.** All-analysis results for each data set for MAPE criterion when the length of the test set is 20.

**Figure 5.** The average rank values of each method for MAPE criterion when the length of the test set is 20.

### **5. Conclusions and Discussion**

Although the Holt method is used as a traditional time series forecasting method, it is known that it has some problems, such as the determination of the initial trend and level values and determining the trend and level update formulas. In this study, to overcome these problems, the parameters of the Holt method are optimized by using HSA, the smoothing parameters are varied by using first-order autoregressive equations, and the forecasting performance is improved by using the subsample bootstrap method.

When comparing the classical Holt method and the proposed method, it is clear that time-varying smoothing parameters and HSA provide important improvements in the forecasting results. The proposed method produces smaller RMSE values than the classical Holt method by about 70% in all analyses. If we compare the computation time of the proposed method with the classical Holt method, the proposed method needs more computation time because of using bootstrap and HSA algorithms, as expected. However, the computation time of the proposed method is very close to computational intelligence forecasting methods, and the computation time is not a problem for today's personal computers. For the BIST series, the computation time is about three minutes.

In future studies, different artificial intelligence optimization techniques can be used to determine the optimal parameters of the Holt method, or the forecasts can be obtained by different bootstrap methods.

**Author Contributions:** Conceptualisation, E.E. and E.B.; methodology, E.E., U.Y. and E.B.; software, E.E. and E.B.; validation, E.E., U.Y. and E.B.; formal analysis, E.B.; investigation, E.E., U.Y. and E.B.; writing—original draft preparation, E.E. and E.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** The data set is available at https://datastore.borsaistanbul.com/. The access date is 1 November 2020.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


### *Article* **A Real-Time Data Analysis Platform for Short-Term Water Consumption Forecasting with Machine Learning**

**Aida Boudhaouia and Patrice Wira \***

IRIMAS Laboratory, University of Haute Alsace, 61 rue Albert Camus, 68093 Mulhouse, France; aida.miled@uha.fr

**\*** Correspondence: patrice.wira@uha.fr

**Abstract:** This article presents a real-time data analysis platform to forecast water consumption with Machine-Learning (ML) techniques. The strategy fully relies on a web-oriented architecture to ensure better management and optimized monitoring of water consumption. This monitoring is carried out through a communicating system for collecting data in the form of unevenly spaced time series. The platform is completed by learning capabilities to analyze and forecast water consumption. The analysis consists of checking the data for integrity and inconsistencies, looking for missing data, and detecting abnormal consumption. Forecasting is based on the Long Short-Term Memory (LSTM) and the Back-Propagation Neural Network (BPNN). After evaluation, results show that the ML approaches can predict water consumption without prior knowledge about the data and the users. The LSTM approach, by capturing the long-term dependencies between time steps of water consumption, allows the prediction of the amount of water consumed in the next hour with an error of a few liters and of the instants of the next 5 consumed liters with an error of a few milliseconds.

**Keywords:** load curve; unevenly spaced time series; long short-term memory (LSTM); backpropagation neural network (BPNN); machine learning; water consumption

### **1. Introduction**

Water consumption analysis is crucial as it helps building managers and operators adopt better strategies to plan usage [1]. Forecasting is an important part of continuous monitoring and efficient management of consumption [2]. Furthermore, accurate forecasting of consumption is essential for efficiently detecting and avoiding water leakage and waste in distribution networks and installations [3]. Various methods to predict near-real-time water consumption and demand have been investigated. A complete literature review has been proposed in [4]. Among them, statistical methods, filtering and signal processing techniques, fuzzy logic, intelligent techniques and combinations of several models have shown varying degrees of success. More recently, innovative models such as Machine-Learning (ML) techniques showed superior results when compared with classical models. Specifically, deep neural networks have emerged as efficient forecasting approaches. Regardless of the method, the robustness of the forecasting performance depends not only on the past water demand data but also on contextual and environmental information (weather conditions, well-identified user profiles, knowledge about the architecture of the water distribution system, etc.), on redundancy of measurements and on the short-, medium- and long-term planning decisions to be addressed. Water demand forecasting remains a major research problem when no information is available beyond the consumption of a single water meter.

This article presents a real-time data analysis platform to forecast water consumption with ML techniques based only on past water consumption, i.e., with no prior or contextual information. The strategy fully relies on a web-oriented architecture to ensure better management and optimized monitoring of water consumption [5]. It is a complete Advanced Metering Infrastructure (AMI) based on integrated Internet of Things (IoT) technologies [6] that offers the possibility of collecting, analyzing and monitoring daily water consumption [7]. To predict water consumption, we also propose a framework based on ML algorithms such as the Long Short-Term Memory (LSTM) [8] and the Back-Propagation Neural Network (BPNN). The water consumption data are stored as unevenly spaced time series constructed from the data collected from distributed smart meters. The time series are then handled in two different ways, with explicit and with implicit sampling [9]. With explicitly sampled time series, the ML approaches predict the quantity of water consumed in the coming hours [10]. With implicitly sampled time series, the ML approaches predict the instants when the next liters will be consumed. Both cases are handled using the LSTM [8,11] and the BPNN [12]. The accuracy and usability of the forecasts are evaluated and compared. This study can be generalized to other types of consumption, such as electricity and gas.

**Citation:** Boudhaouia, A.; Wira, P. A Real-Time Data Analysis Platform for Short-Term Water Consumption Forecasting with Machine Learning. *Forecasting* **2021**, *3*, 682–694. https://doi.org/10.3390/forecast3040042

Academic Editor: Sonia Leva

Received: 3 August 2021 Accepted: 22 September 2021 Published: 26 September 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The rest of this article is organized as follows: Section 2 briefly presents appropriate ML approaches for analyzing consumption data with different forecasting horizons. Section 3 details the architecture of the AMI for collecting data. Consumption data are presented in terms of water volumes, indexes and dates of events. In other words, these data are considered to be unevenly spaced time series or Load Curves (LC). A preprocessing strategy is also developed in this section to handle and compensate for missing and abnormal water consumption. The two ML strategies, for forecasting the number of liters consumed in the next hour and the instants of the future consumed liters, are presented in Section 4. This section also includes some experimental tests and evaluations. Finally, concluding remarks are provided in Section 5.

### **2. Machine-Learning Algorithms for Water Consumption Forecasting**

### *2.1. Forecasting with Machine-Learning Algorithms*

Short-term forecasts, whether in water [2,13], in electricity [14,15] or even in gas [16], have been reported in the literature with a variety of approaches and with different horizons. However, very few of them have treated individual customers in domestic buildings [8] with high resolution. The approach proposed in [17] is based on a model of nonhomogeneous Markov chains that captures the dynamics of water consumption. This model can predict behaviors of daily consumption based on other parameters such as exogenous factors represented by the climate [18], the day type, etc. Another study [19] deals with water demand forecasting on weekly and hourly scales with an autoregressive model based on a periodic component of the time series data to refine daily demand values and hours. This prediction uses a multitude of period models. Most of these studies focus on forecasting consumption by introducing other parameters and using different predictive models depending on the nature of the input data and the objectives sought. Indeed, we note that the forecast horizon mainly depends on the input databases of the models. These databases generally have annual, seasonal, monthly, weekly, daily or hourly resolutions. Most of the work, even when based on intelligent techniques, relies on additional information. For example, the study in [20] uses support vector machines with monthly water demands, number of users, and total water consumption bills. Ref. [21] discusses residential water demand management based on pricing, restriction policies, climate, weather and demographic characteristics. For now, there is no study based on learning architectures such as direct or recurrent BPNN, Hopfield networks or LSTM to predict the water demand based on historical data from a single measurement point.

In contrast, we propose more precise forecasts using high-resolution data from smart meters and no additional contextual information. In this paper, we focus on forecasting the water consumption of a private building without any knowledge about the appliances using water or the number of inhabitants.

### *2.2. Forecasting Framework Based on LSTM*

The LSTM [8] is a special type of recurrent neural network [8]. It is a sequential learning model which can establish temporal correlations between a previous instant *t* − 1 and a current instant *t*. Consequently, the LSTM seems the most suitable model for forecasting consumption processes, given its ability to infer the residents' intrinsic daily consumption routines. The LSTM uses the Back-Propagation Through Time (BPTT) learning algorithm [8] to calculate the weights. It is made up of units called memory blocks. Each memory block contains an "input gate", an "output gate" and a "forget gate", as shown in Figure 1.

**Figure 1.** The LSTM unit architecture.

The behavior of each gate is represented by an equation. The input gate *i*(*t*) given in (1) consists of transmitting the output *h* at the previous instant *t* − 1 and the input *x* at instant *t* through a sigmoid function *σ*(*x*) = 1/(1 + e<sup>−*x*</sup>):

$$i(t) = \sigma\left(W_i \cdot [h(t-1), x(t)] + b_i\right) \tag{1}$$

A hyperbolic tangent function is applied to the input and the output data from the previous step to create a vector of new values *C̃*(*t*) that serves as a candidate internal state. It is computed through:

$$\tilde{C}(t) = \tanh\left(W_c \cdot [h(t-1), x(t)] + b_c\right) \tag{2}$$

The forget gate *f*(*t*) is calculated with another sigmoid function that takes for its input the output *h*(*t* − 1) and the input *x*(*t*):

$$f(t) = \sigma\left(W_f \cdot [h(t-1), x(t)] + b_f\right) \tag{3}$$

The internal state *C*(*t*) is then updated by (4), which adds the previous state weighted by the forget gate to the new candidate state (a hyperbolic tangent) weighted by the input gate (a sigmoid). Finally, the output gate *O*(*t*) is given by (5):

$$C(t) = f(t) \times C(t-1) + i(t) \times \tilde{C}(t) \tag{4}$$

$$O(t) = \sigma\left(W_o \cdot [h(t-1), x(t)] + b_o\right) \tag{5}$$

*W<sub>i</sub>*, *W<sub>c</sub>*, *W<sub>f</sub>*, *W<sub>o</sub>* and *b<sub>i</sub>*, *b<sub>c</sub>*, *b<sub>f</sub>*, *b<sub>o</sub>* represent respectively the weights and the biases at the different levels in the LSTM memory block. They are adjusted iteratively with the BPTT learning algorithm [8] until convergence. At each step of the learning process, the performance of the LSTM can be evaluated by an error such as the Root Mean Square Error (*RMSE*) [22], where *y<sub>i</sub>*, *ỹ<sub>i</sub>* and *n* are respectively the reference, the estimated value and the number of data:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \tilde{y}_i)^2} \tag{6}$$

This learning approach will be used in the following to forecast short-term water consumption.
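To make Equations (1) to (5) concrete, the following NumPy sketch computes one forward step of a memory block. It is an illustration under stated assumptions, not the Matlab implementation used in this study: the function and variable names, the layer sizes and the random weights are hypothetical, and the block output *h*(*t*) = *O*(*t*) × tanh(*C*(*t*)) is the standard LSTM output, which is not written out in the text above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One forward step of an LSTM memory block.

    W, b: dicts with keys 'i', 'c', 'f', 'o'; each W[k] has shape
    (n_hidden, n_hidden + n_input) and each b[k] has shape (n_hidden,).
    """
    z = np.concatenate([h_prev, x_t])        # [h(t-1), x(t)]
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate, Eq. (1)
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # candidate state, Eq. (2)
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate, Eq. (3)
    c_t = f_t * c_prev + i_t * c_tilde       # state update, Eq. (4)
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate, Eq. (5)
    h_t = o_t * np.tanh(c_t)                 # standard block output (assumption)
    return h_t, c_t

# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(0)
n_in, n_hid = 1, 4
W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for k in "icfo"}
b = {k: np.zeros(n_hid) for k in "icfo"}

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in [0.2, 0.0, 0.7]:                    # a short made-up input sequence
    h, c = lstm_step(np.array([x]), h, c, W, b)
print(h)
```

Training (BPTT with an optimizer) is deliberately left out; the sketch only shows how the gates combine the previous output, the current input and the internal state.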

### **3. Proposed Architecture and ML Framework to Collect and Analyze Water Consumption Data**

### *3.1. Data Collecting with Smart Meters*

All the data used in this study are collected in an online database from smart water meters. Smart meters are IoT devices that are appropriate for building a sustainable and advanced consumption data system [23]. Most water distributors collect data from smart meters with a resolution of several minutes, for example every 15, 30 or 60 min, or even once a day [5]. This implies that the capacities of smart water meters are clearly not fully exploited [7] and that the resolution of the consumption data is low. We use smart meters with the communication strategy proposed and developed in [24] to compress and transmit the data with a very high resolution [7] according to industrial specifications. This strategy allows each consumed liter to be dated on the server side and reduces the energy consumption on the meter side. Indeed, the emission duration, which consumes a lot of energy in the smart meters, has been greatly reduced. The strategy is embedded in the smart meters and transmits data in the form of frames with an interval *T<sub>max</sub>* that does not exceed 5 min. This interval is completely adaptive and related to the amount of consumed water [7]: higher water consumption results in more data frames. To guarantee the reception of frames with no missing data, a sliding window consisting of *RE* = 6 packages is proposed. These packages are numbered and can be considered to be independent broadcasts in the transmitted frame. This ensures the redundancy of the data through successive frames. This principle is illustrated in Figure 2. The maximum length of a frame, *L<sub>f</sub>* = *RE* × *l<sub>p</sub>* with *l<sub>p</sub>* the length of a package, is set depending on the radio technology and frequency that are used. In our AMI, we chose a maximum value of 120 bytes for *L<sub>f</sub>*, which is the limit of the frame size.

**Figure 2.** Operating principle of the sliding window for ensuring the redundancy of transmitted data from smart water meters through successive frames [7].
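The redundancy principle can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the protocol of [24]: it assumes that each frame introduces exactly one new numbered package and repeats the previous *RE* − 1 ones, and the function names and payloads are hypothetical. Under these assumptions, with *L<sub>f</sub>* = 120 bytes and *RE* = 6, each package could be at most 20 bytes long.

```python
from collections import deque

RE = 6  # number of packages per frame, as given in the text

def build_frames(packages, window=RE):
    """Each frame carries the newest package plus the previous window - 1 ones."""
    recent = deque(maxlen=window)
    frames = []
    for seq, payload in enumerate(packages):
        recent.append((seq, payload))   # packages are numbered
        frames.append(list(recent))
    return frames

def receive(frames):
    """Rebuild the package stream from the frames that actually arrived."""
    store = {}
    for frame in frames:
        for seq, payload in frame:
            store.setdefault(seq, payload)  # duplicates are ignored
    return store

def missing_after_loss(n_packages, lost_frames):
    """Package numbers that cannot be recovered when some frames are lost."""
    frames = build_frames([f"liter-{k}" for k in range(n_packages)])
    arrived = [f for k, f in enumerate(frames) if k not in lost_frames]
    got = receive(arrived)
    return [k for k in range(n_packages) if k not in got]

print(missing_after_loss(20, set(range(7, 12))))  # 5 consecutive frames lost
print(missing_after_loss(20, set(range(7, 13))))  # 6 consecutive frames lost
```

In this simplified model, up to *RE* − 1 = 5 consecutive lost frames leave no gap in the data, whereas losing 6 consecutive frames makes one package unrecoverable.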

A web server receives all the transmitted frames from several smart meters. Here, a script receives, decompresses, and retrieves the data from the frames for storage in an SQL database [10]. This process has been running continuously since 2014 and completes the database in real time and under real operating conditions. The database contains raw data for each individual smart meter, i.e., the index, which represents the volume of consumed water in liters, and the instant, in milliseconds, when each liter has been consumed. This instant is called a pulse or an event [7]. The data collected and stored by this platform therefore have a high resolution and precisely represent consumption habits.

At any time, another script can extract from the database the data related to a given smart meter by specifying the beginning and the end of a period. This is called a set of raw data.

### *3.2. Data Description*

The collected data are of great value and must be analyzed. For this, the raw data must be interpreted and therefore associated with some theoretical concepts and models. Among them are unevenly spaced time series or Load Curves (LC).

### 3.2.1. Water Consumption Time Series

A time series is a sequence of temporal data [25]. The time stamp of the series can be explicit, such that a date is given for each data value, or controlled by the appearance of the data, represented by perfectly dated events. The latter is referred to as an unevenly spaced time series [9], defined by *S* in (7). In the context of water consumption, an event corresponds to each consumed liter and *S* is thus a sequence of scalar values of an incremented variable *Y*<sub>*i*+1</sub> = *Y<sub>i</sub>* + 1. *S* therefore corresponds to the raw data extracted from the previously described platform for one smart meter and is the result of a process observed during a period *T*. The platform and AMI proposed by [7] offer the possibility of recording the instants of consumption of each liter.

$$S = \left[ Y_1(t_1), Y_2(t_2), \dots, Y_i(t_i), \dots, Y_T(t_T) \right] \tag{7}$$

3.2.2. Cumulated Water Consumption: The Index and the Load Curve

Each *Y<sub>i</sub>* represents the index of a smart meter, which is the cumulated volume of consumed water at each instant *t<sub>i</sub>*. The time between two instants *t<sub>i</sub>* and *t*<sub>*i*−1</sub> is not constant. The evolution of *Y<sub>i</sub>* during a period *T* is called a cumulative LC. An example is provided in Figure 3; it is an alternative representation of *S*. LC are very useful for analyzing and comparing consumption over days, weeks and months. We then speak of daily LC, weekly LC or monthly LC.

### 3.2.3. Sampled Water Consumption Data Series

The data collected from the platform are unevenly spaced in time. Since each consumed liter represents an event, the process of water consumption can also be seen as a process generating dated events. To make the data compatible with most popular data analysis tools and concepts, a sampling is proposed to make the series evenly spaced in time. The sampling can be made in minutes or in hours and results in a sequence of 1440 data per day or 24 data per day, respectively.

**Figure 3.** Example of a cumulative load curve (LC), which results from the sequence of events corresponding to each consumed liter. The red dots show the raw data, unevenly spaced in time, as recorded and transmitted by a smart sensor (the black curve is an interpolation).

We also chose to differentiate the cumulative LC in order to work with sequences of *n* data that represent the number of liters consumed in each minute or hour. Consequently, a natural order of appearance constitutes an implicitly sampled chronological time series such as:

$$C = \left[ y_1, y_2, \dots, y_i, \dots, y_n \right] \tag{8}$$
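The passage from dated events to an evenly sampled series can be sketched as follows. This is a minimal illustration with hypothetical event times and a hypothetical function name; it simply counts the events (one per consumed liter) falling in consecutive bins of fixed duration, and the cumulative sum of the counts gives the cumulative LC sampled on the same grid.

```python
import numpy as np

def events_to_counts(event_ms, start_ms, end_ms, step_ms):
    """Number of events (consumed liters) in consecutive bins of step_ms milliseconds."""
    n_bins = int(np.ceil((end_ms - start_ms) / step_ms))
    edges = start_ms + step_ms * np.arange(n_bins + 1)
    counts, _ = np.histogram(event_ms, bins=edges)
    return counts

# Hypothetical event instants in milliseconds (one event per consumed liter)
events = [5_000, 20_000, 59_999, 60_000, 130_000, 170_000, 175_000]

# Sampling in minutes over a 4 min period
per_minute = events_to_counts(events, 0, 240_000, 60_000)
cumulative = np.cumsum(per_minute)
print(per_minute.tolist())   # liters consumed in each minute
print(cumulative.tolist())   # cumulative load curve on the same grid
```

With `step_ms=60_000` a full day yields 1440 values, and with `step_ms=3_600_000` it yields 24 values, matching the two resolutions mentioned above.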

### *3.3. Data Integrity Checking and Interpolation*

Under real operating conditions, the integrity of the data must be checked. Indeed, failures or malfunctions can lead to missing raw measurements in the database. We therefore propose a preprocessing step to verify the raw data and to fill in any missing data by interpolation. The whole preprocessing strategy is represented in Figure 4. The raw time series is extracted from the database for each day. Since a forecast of water consumption is targeted with an accuracy of one hour, the data are sampled with a resolution of one minute (i.e., 1440 min per day). This preprocessing is carried out separately for each day. Then, periods without consumed liters, i.e., without events, are identified and corrected by interpolation.

Data analysis and forecasting with ML algorithms must be carried out with no missing or inconsistent values. It is thus necessary to identify and separate abnormal consumption (such as water leakage or occasional consumption), which can influence water consumption, from normal and usual consumption. Abnormal water consumption is always due to an unusual and occasional behavior from the users [25]. The detection of abnormal water consumption is achieved as follows. A reference cumulative LC is calculated for each day of the week. This reference LC is completed with a minimum LC and a maximum LC for each day. Generally, a load profile for one day *j* is strongly correlated [26] with that of the previous day *j* − 1 and with that of the same day of the previous week (*j* − 7). The reference cumulative LC is calculated with:

$$C_j(t_i) = \operatorname{avg}\left(C_{j-7}(t_i), C_{j-1}(t_i)\right) \tag{9}$$

**Figure 4.** Global architecture of the water consumption LC preprocessing.

The minimum and maximum LC for each day are calculated in the same way by replacing the *avg*() function in (9) with the *min*() and *max*() functions. The detection of abnormal consumption is based on the criterion given by:

$$\operatorname{abs}\left[y_j(t) - \operatorname{avg}\left(\sum_{i=1}^{n} y_j(t)\right)\right] \ge \alpha \times \operatorname{std}\left(\sum_{i=1}^{n} y_j(t)\right) \tag{10}$$

where *std*() is the standard deviation of the values of the LC and *α* is a numerical variable chosen empirically, in our case *α* = 5. Additional tests can be carried out to check whether the instantaneous consumption is out of the range defined by the minimum and maximum LC for the same day of the week, which allows the detection of any additional consumption that deviates significantly from the "normal consumption" [10]. Note that the detection of abnormal and unusual consumption is based only on water consumption data and some statistical indicators [10]. Abnormal and unusual consumption is corrected by interpolation over its duration and is not taken into account in the learning processes. At the end, we obtain a time series *C̄<sub>j</sub>* sampled in minutes, which corresponds to the LC *C<sub>j</sub>* without loss of data and without abnormal and unusual consumption.
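A possible reading of this preprocessing step is sketched below. It is an interpretation, not the authors' code: the function names and the test series are hypothetical, the criterion (10) is read as "a sample is abnormal when it deviates from the mean of the day's samples by at least *α* times their standard deviation", and linear interpolation is assumed for the correction since the text does not name the interpolation method.

```python
import numpy as np

ALPHA = 5  # empirical threshold given in the text

def reference_curve(c_prev_day, c_prev_week):
    """Reference cumulative LC of day j, Eq. (9): average of days j-1 and j-7."""
    return np.mean([c_prev_week, c_prev_day], axis=0)

def flag_abnormal(y, alpha=ALPHA):
    """Boolean mask of samples meeting criterion (10), as read in this sketch."""
    y = np.asarray(y, dtype=float)
    spread = y.std()
    if spread == 0.0:            # constant series: nothing to flag
        return np.zeros(y.shape, dtype=bool)
    return np.abs(y - y.mean()) >= alpha * spread

def interpolate_flagged(y, mask):
    """Replace flagged samples by linear interpolation between normal neighbors."""
    y = np.asarray(y, dtype=float)
    idx = np.arange(len(y))
    out = y.copy()
    out[mask] = np.interp(idx[mask], idx[~mask], y[~mask])
    return out

# Hypothetical per-minute consumption with one abnormal spike
y = np.ones(60)
y[30] = 200.0
mask = flag_abnormal(y)
clean = interpolate_flagged(y, mask)
print(int(mask.sum()), clean[30])
```

The minimum and maximum LC can be obtained from `reference_curve` by replacing `np.mean` with `np.min` and `np.max`, mirroring the substitution described for (9).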

### **4. Water Consumption Forecasting**

To evaluate the efficiency of the platform and the ML techniques, we focused on the water consumption of a private building. The water consumption is collected from a smart meter which is a single measurement point for the whole building. These are the only data available from the building and the users. The objective consists of forecasting the number of liters of consumed water with a horizon of one hour and to predict the instant of the next consumed liter by different ML approaches.

All the algorithms have been developed with the Matlab R2018b environment on a desktop computer with 4 cores (Intel i7 processors at 3.6 GHz) and 16 GB of memory. Experiments and tests have been carried out under the same conditions to find the values of the learning parameters by trial and error (learning rate value, number of neurons, number of hidden layers, type of activation function) to provide the smallest error.

### *4.1. Hourly Water Consumption Forecasting*

A three-month database (from October 2018 to December 2018) has been chosen to forecast the number of consumed water liters in the next coming hour. The data sequence is resampled with a resolution of one hour and is represented by Figure 5. This consumption has been recorded in a domestic house in France occupied by two people who consume on average 194 L per day (l/d). Household information will not be used by the ML approaches.

**Figure 5.** Water consumption time series: (**a**) LC from 1 October to 31 December 2018, (**b**) close-up view of the same time series for the first 24 h, (**c**) cumulative water LC over the whole period, (**d**) number of liters consumed per day.

Two ML approaches, the LSTM and the BPNN, have been implemented for one-hour-ahead water consumption forecasting. In this case, the series shown in Figure 5c is the input of the forecasting approaches. With the LSTM, the input *x*(*t*) in Equations (1), (2), (3) and (5) is the preprocessed cumulative LC *C<sub>j</sub>*. We use the Adam algorithm, a stochastic gradient descent optimization method for training deep learning approaches [27], because it is suitable for data with a lot of noise. We chose a learning rate of 10<sup>−4</sup> for the LSTM and 10<sup>−5</sup> for the BPNN, and training ends after a maximum of 100 epochs.

The forecast performance of the two learning approaches is evaluated with the RMSE, and the results are presented in Table 1. The LSTM forecasts the water consumption in the next hour with a precision of 6 L, while the BPNN predicts the future consumption with a precision of 24 L (the consumption range is approximately 1 to 50 L per day).


**Table 1.** Hourly prediction of water consumption in liters with the LSTM and BPNN.

### *4.2. Forecasting Events of Water Consumption in Milliseconds*

We also forecast the coming events, i.e., the instants when the next liters will be consumed. For this purpose, we chose a dataset composed of 4321 events dated in milliseconds, each represented by the time difference between two consecutive liters. This dataset provides more detailed information about the water consumption than in the previous experiment. The dataset was recorded between 2 and 20 December 2018 and is shown in Figure 6. For the learning of the LSTM and the BPNN, the dataset is divided into training, validation and test subsets, which respectively represent 60%, 0.3% and 40% of the dataset. The parameters of the two learning approaches are summarized in Table 2. Their input vector *x*(*t*) is composed of the time differences between two successive consumed liters, i.e., the values of *δ<sub>i</sub>* shown in Figure 3. Because the data are noisy, the Adam algorithm is also used here to optimize the learning of the LSTM and the BPNN, with the same parameters as in the previous experiment. The learning rate is lr = 10<sup>−4</sup>. Training ends when the maximum number of epochs, 100 in our case, has been reached.

The results of the two learning approaches are provided in Table 2. The instant of the next consumed liter of water is predicted with an error (test RMSE) of 13 ms with the LSTM and 48 ms with the BPNN. In addition, the forecasts for the next 5 liters have been calculated and are respectively estimated at 450,925, 450,800, 451,200, 451,500 and 451,300 milliseconds. In other words, the next consumed liters have been correctly predicted on 21 December 2018 at 00:08:07.487, 00:15:38.287, 00:23:09.487, 00:30:40.987 and 00:38:12.287. The accuracy objective of the predicted instants is justified by industrial specifications.
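The link between the forecast values in milliseconds and the clock times can be checked with a short calculation. In this sketch, the five values are read as successive time gaps between consumed liters, and the instant of the preceding liter, 00:00:36.562 on 21 December 2018, is an assumption inferred here by subtracting the first gap from the first listed time; it is not stated in the text. The function name is hypothetical.

```python
from datetime import datetime, timedelta

def instants_from_gaps(last_event, gaps_ms):
    """Accumulate successive time gaps (in ms) from the last known event."""
    instants = []
    t = last_event
    for gap in gaps_ms:
        t = t + timedelta(milliseconds=gap)
        instants.append(t)
    return instants

# Assumed instant of the preceding liter (inferred, not given in the text)
last_event = datetime(2018, 12, 21, 0, 0, 36, 562000)
# Forecast gaps in milliseconds, as listed in the text
gaps_ms = [450_925, 450_800, 451_200, 451_500, 451_300]

stamps = [t.strftime("%H:%M:%S.%f")[:-3]
          for t in instants_from_gaps(last_event, gaps_ms)]
print(stamps)
```

Under this reading, the accumulated gaps reproduce the five clock times listed above, each gap being about 7.5 min.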


**Table 2.** Event prediction of water consumption in ms with the LSTM and BPNN.

**Figure 6.** Time representation of the water consumption, (**a**) Time gap between 4321 events (i.e., consumed liters) from 02/12/2018 09:11:21.750 until 20/12/2018 23:23:40.625, (**b**) Cumulated duration *δ* as a function of consumed liters.

### *4.3. Discussion on the Hourly and Events Water Consumption Forecasting*

Two forecasting tests, hourly and event forecasting, have been carried out with the proposed water consumption collecting platform. The first case consists of predicting the amount of water consumed during the first hour that follows the period of the collected dataset. In the second case, the instants of the next consumed liters are predicted. In both cases, an LSTM and a BPNN architecture have been designed. Their performance has been evaluated under the same conditions and compared in terms of precision, computational resources and execution time. With very similar resources and approximately the same execution time, the forecasting error obtained with the LSTM is 3 times lower than with the BPNN. In both experiments, the LSTM is more appropriate than the BPNN to capture the temporality of the data. This is because of its ability to selectively remember patterns in time series over long durations. Another reason is that the LSTM can better take into account the time-dependent structure of the data, i.e., the non-stationarity of the water data. The LSTM is therefore well suited to handle precise datasets over long periods of time, such as water consumption.

### **5. Conclusions**

In this study, we presented a web-oriented platform to collect in real-time water consumption data and to predict them with machine-learning approaches. The data are issued from smart meters and are transmitted to a server to be handled as unevenly spaced time series with high resolution, i.e., in milliseconds. Data sets are then extracted, preprocessed and eventually sampled to be used by machine-learning algorithms to predict the next consumptions. The preprocessing of the data consists of detecting missing values and in identifying abnormal consumption using a reference load curve for each day of the week. Then, machine-learning approaches such as the LSTM and BPNN have been implemented to forecast the next consumption. Two tests have been experimented for hourly and event water consumption forecasting in a private building. The first case consists of predicting the amount of water consumed during the hour that follows the period of the collected data. In the second case, the instants of the next consumed liters are predicted. By evaluating the performance of the LSTM and BPNN, it can be seen that the LSTM is more accurate than the BPNN. Indeed, the LSTM can predict the amount of consumed water in the next coming hour with an error of less than 6 L and is able to predict the instants of the 5 next consumed liters with an error of less than 15 ms. This can be considered to be very accurate prediction in the context of water consumption measurement and forecasting. This web-oriented platform endowed by its learning capabilities is generic and can be extended to other additional smart meters to measure and predict other variables such as power or gas consumptions.

**Author Contributions:** Conceptualization, A.B.; Funding acquisition, P.W.; Methodology, A.B., P.W.; Supervision, P.W.; Validation, P.W.; Visualization, P.W.; Writing— original draft preparation, A.B.; and Writing—review and editing, P.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data were obtained from our IUT in Mulhouse, France.

**Conflicts of Interest:** The authors declare no conflict of interest.



### *Article* **Battery Sizing for Different Loads and RES Production Scenarios through Unsupervised Clustering Methods**

**Alfredo Nespoli <sup>1</sup>, Andrea Matteri <sup>1,\*</sup>, Silvia Pretto <sup>2</sup>, Luca De Ciechi <sup>3</sup> and Emanuele Ogliari <sup>1,\*</sup>**

**\*** Correspondence: andrea.matteri@polimi.it (A.M.); emanuelegiovanni.ogliari@polimi.it (E.O.)

**Abstract:** The increasing penetration of Renewable Energy Sources (RESs) in the energy mix is leading to an energy scenario characterized by decentralized power production. Among RES power generation technologies, solar PhotoVoltaic (PV) systems constitute a very promising option, but their production is not programmable due to the intermittent nature of solar energy. Coupling a PV facility with a Battery Energy Storage System (BESS) allows greater flexibility in power generation. However, the design phase of a PV+BESS hybrid plant is challenging due to the large number of possible configurations. The present paper proposes a preliminary procedure aimed at predicting a family of batteries which is suitable to be coupled with a given PV plant configuration. The proposed procedure is applied to new hypothetical plants built to fulfill the energy requirements of a commercial and an industrial load. The energy produced by the PV system is estimated on the basis of a performance analysis carried out on similar real plants. The battery operations are established through two decision-tree-like structures regulating charge and discharge, respectively. Finally, an unsupervised clustering is applied to all the possible PV+BESS configurations in order to identify the family of feasible solutions.

**Keywords:** battery energy storage system; battery sizing; photovoltaic power production; performance ratio; electrical load; decision tree; k-means clustering

### **1. Introduction**

The rising penetration of Renewable Energy Sources (RESs), together with the progressive digitization of grids, is leading to an energy scenario where power production is increasingly decentralized [1] and those who were once only energy consumers become producers themselves and are called "prosumers" [2,3].

Nowadays, RESs are widely connected to distribution grids thanks to the advantages they offer: clean energy and additional generation to address the ever-increasing electricity demand [4]. Among RES power generation technologies, solar PhotoVoltaic (PV) systems are a promising option offering significant potential for providing energy in a sustainable way [5], generating it directly onsite [6]. However, solar energy is, by nature, intermittent and not programmable [7]. For this reason, energy storage systems, equipped with proper management software, are needed [8].

Among all possible storage systems, the electrochemical ones represent an attractive option [9]. Electrochemical technologies store energy through specific chemical components. Since they are available in modules, the desired voltages and currents can be achieved by connecting single modules in series and/or in parallel [10]. Currently, a growing fraction of installed utility-scale PV systems incorporates Battery Energy Storage Systems (BESS) [11,12]. This improves the flexibility of power generation by shifting production from the peak of non-programmable solar energy towards hours of large consumption [13,14].

**Citation:** Nespoli, A.; Matteri, A.; Pretto, S.; De Ciechi, L.; Ogliari, E. Battery Sizing for Different Loads and RES Production Scenarios through Unsupervised Clustering Methods. *Forecasting* **2021**, *3*, 663–681. https://doi.org/10.3390/forecast3040041

Academic Editor: Cong Feng

Received: 6 August 2021 Accepted: 21 September 2021 Published: 24 September 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

When coupling a BESS with a PV power production system, a key design consideration is the choice between DC- and AC-coupling. AC-coupled systems have largely independent PV and batteries, each using its own inverter, and the coupling is located on the AC side of the inverters. By contrast, DC-coupled systems, where the PV field and the battery share a common inverter, have the advantage of potentially reducing costs through shared components [15,16].

In general, the design phase of PV+BESS hybrid systems requires a large number of decisions due to the large number of possible configurations in terms of overall system architecture as well as the sizing of various components [17]. Before constructing a new PV power production facility, feasibility studies are needed to assess its viability from both financial and technical perspectives [18]. In detail, simulations are carried out to assess the energy production permitted by a given plant configuration in a given geographical position [19] and to evaluate the expected investment costs [20].

The main objective of the present work is to provide a preliminary forecast that identifies a family of batteries which is suitable, from both a technical and a financial point of view, for a given scenario. Techno-economic simulations are carried out for new grid-connected PV+BESS hybrid power production plants. Several scenarios are considered in terms of PV plant configuration, load curves and battery technologies available on the market.

### **2. Case Study and Procedure**

In this paper, a procedure is proposed to forecast a family of batteries which are suitable to be coupled with a given PV plant configuration.

The proposed procedure is applied to new hypothetical PV facilities installed on the rooftops of two different buildings: a single-brand point of sale and a ceramics factory. According to the analyzed buildings, two different load types are considered, namely a commercial and an industrial load curve. The energy production is simulated on the basis of an analysis carried out on real PV plants and of irradiance databases available online. The battery operation is managed by means of a specific control logic defined in decision-tree-like diagrams considering all possible operating conditions for both charge and discharge. Several PV+BESS configurations are simulated and, for each one, a set of performance and economic indicators is computed. In the end, an unsupervised clustering algorithm is applied to all the analyzed PV+BESS configurations, aimed at detecting the family of battery solutions which is the most suitable for the considered scenario.

In the following sections, all aspects of the proposed procedure are thoroughly discussed: Section 2.1 analyzes the performance of several real PV plants in order to compute a proper value of Performance Ratio to be used in the subsequent power production simulations; Section 2.2 describes the load curves corresponding to the industrial and the commercial buildings involved in the analysis; Section 2.3 explains how to simulate the PV power production; Section 2.4 provides a list of all the battery technologies considered for coupling with the PV plant; Section 2.5 displays and discusses two decision-tree-like structures describing the control logic of batteries during both charge and discharge; Section 2.6 describes a set of parameters used to evaluate the technical and economic viability of the considered PV+BESS configurations; Section 2.7 discusses how to apply a clustering method to all the possible PV+BESS configurations in order to find a group of batteries that are suitable for coupling with a given PV plant.

### *2.1. Plant Monitoring and Performance Ratio Calculation*

The first part of the present study takes into account 22 monitored PV plants distributed all over the Italian territory with a total peak power installed of about 7 MW. These facilities can be divided based on four different types of installation:

• Fixed tilt: the solar field presents a fixed tilt angle. In general, the modules are installed either on concrete ballasts or metal structures placed on flat roofs and convection is allowed on their back surface.


A single plant can be composed of multiple sections with different tilt, azimuth or type of installation, that are considered independently.

Figure 1 reports the location of the considered PV facilities on the Italian territory considering five different regions: North-West, North-East, Center, South and Islands. Moreover, the chart highlights the fraction of plants corresponding to each installation type.

**Figure 1.** Fraction of plants located in a certain region (**a**) and with a given configuration (**b**).

For each plant, the following characteristics are known: the nominal power, the peak power of all plant sections, the tilt and azimuth angles of the modules, the temperature coefficient of the module (accounting for temperature-related power losses) and the degradation factor. Monitoring campaigns carried out for each of the considered facilities made it possible to collect hourly data on the active energy produced on the Alternating Current (AC) side, the solar irradiation on the module plane, the cell temperature on the back side of the module and the ambient temperature. For plant sections with different exposures, the monitored parameters are recorded independently for each section.

Different plants started their operation in different years. However, the start of operation period does not always correspond with the starting date of monitoring: for instance, the oldest facility started to produce in August 2012, while its monitoring started in 2018.

Data from each PV facility are cleaned of inconsistent and unreliable samples caused by erroneous measurements, such as negative values of produced energy, values of produced energy exceeding the corresponding value of irradiation, values of produced energy larger than the maximum feasible ones (computed on the basis of the plant nominal power increased by 5% to account for inverter overpower) and values of solar irradiation lower than lunar irradiation (4 W/m<sup>2</sup>) or larger than 1200 W/m<sup>2</sup>.
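The filtering rules listed above can be sketched as a single validity check. This is only an illustration under stated assumptions, not the authors' implementation: the `HourlySample` record, the function name and its parameters are hypothetical, and the rule "produced energy exceeding the corresponding value of irradiation" is read here as a specific yield (kWh/kWp) larger than the irradiation expressed in kWh/m<sup>2</sup>.

```python
from dataclasses import dataclass


@dataclass
class HourlySample:
    energy_kwh: float       # AC energy produced during the hour
    irradiance_w_m2: float  # mean in-plane irradiance during the hour


def is_valid(sample, nominal_power_kw, peak_power_kwp):
    """Return True if an hourly sample passes all the consistency checks."""
    # Negative production is a measurement error.
    if sample.energy_kwh < 0:
        return False
    # Maximum feasible hourly energy: nominal power plus 5% inverter overpower.
    if sample.energy_kwh > 1.05 * nominal_power_kw:
        return False
    # Irradiance must lie between the lunar level (4 W/m2) and 1200 W/m2.
    if sample.irradiance_w_m2 < 4 or sample.irradiance_w_m2 > 1200:
        return False
    # Assumed reading: specific yield cannot exceed the reference yield.
    if sample.energy_kwh / peak_power_kwp > sample.irradiance_w_m2 / 1000.0:
        return False
    return True


samples = [HourlySample(300.0, 800.0), HourlySample(-1.0, 800.0)]
clean = [s for s in samples if is_valid(s, 500.0, 500.0)]
```

Keeping every rule as a separate early return makes it easy to count how many samples each rule discards, which is useful when checking the quality of a monitoring campaign.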

The available data make it possible to compute a performance index which is crucial for further analyses: the Performance Ratio (PR), which allows the performance of PV facilities with different configurations and geographical locations to be compared [21]. PR represents the overall effect of losses on the array's rated output, due to array temperature, incomplete utilization of the irradiation (soiling and shading losses) and system component inefficiencies and failures [22]:

$$\text{PR} = \frac{Y_{f,t}}{Y_{r,t}} \tag{1}$$

In the equation: $Y_{f,t}$ represents the final PV system yield in the time interval $t$, hence the portion of net energy output of the entire PV plant which was supplied by the array per kW installed; $Y_{r,t}$ corresponds to the reference yield in the time interval $t$, hence the ratio between total in-plane irradiation and module's reference in-plane irradiance [23].

Starting from the historical data available, PRs are computed for each of the analyzed plants, first on a daily basis and then on a yearly basis (starting from the daily values). Then, the average value of both daily and yearly PR is computed for all the plants sharing the same type of installation. In the present work, the yearly PR values will be useful to provide some considerations about the performances of different types of plant, while the averaged daily PR values are crucial in estimating the power production of new plants.
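As a rough illustration of Equation (1) and of the averaging over installation types, the sketch below computes a PR from energy, irradiation and peak power, then averages PR values per type. The function names, the input layout and the reference irradiance of 1 kW/m<sup>2</sup> are assumptions made for this example and are not taken from the paper's software.

```python
from collections import defaultdict
from statistics import mean

G_REF_KW_M2 = 1.0  # assumed reference in-plane irradiance (1000 W/m2)


def performance_ratio(energy_kwh, irradiation_kwh_m2, peak_power_kwp):
    """PR over a time interval, following Equation (1)."""
    final_yield = energy_kwh / peak_power_kwp            # Y_f, in kWh/kWp
    reference_yield = irradiation_kwh_m2 / G_REF_KW_M2   # Y_r, in hours
    return final_yield / reference_yield


def mean_pr_by_installation(plants):
    """Average PR over all plants sharing the same installation type.

    plants: iterable of (installation_type, pr) pairs.
    """
    groups = defaultdict(list)
    for kind, pr in plants:
        groups[kind].append(pr)
    return {kind: mean(values) for kind, values in groups.items()}


# Hypothetical day: 2000 kWh produced by a 500 kWp plant with 5 kWh/m2.
pr_day = performance_ratio(2000.0, 5.0, 500.0)
averages = mean_pr_by_installation(
    [("fixed tilt", 0.80), ("fixed tilt", 0.90), ("East-West", 0.82)]
)
```

The same averaging function can be applied once per day of the reference year to obtain the averaged daily PR profile of each installation type.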

### *2.2. Load Curves*

In the present work, two different types of building are chosen to hypothetically install a new PV+BESS facility on their rooftop: one dedicated to commercial activities and the other devoted to industrial production. The power requirements of the two structures, given their different purposes, are described by distinct load curves.

The commercial load curve considered corresponds to a single-brand point of sale, whose building covers an area of about 6100 m<sup>2</sup>. It is located in Italy, in the region of Piemonte, in climatic zone E, where the heating system start-up is allowed from 15 October to 15 April. The annual consumption of electric energy in 2019 (chosen as reference year) is equal to 828 MWh. The hourly consumption is visualized in Figure 2, in the form of a heat map covering all the hours and all the days of the reference year. Moreover, the seasonal loads during a typical week are plotted in Figure 3.

**Figure 2.** Heat map corresponding to commercial loads.

As shown in the chart, the building is closed on New Year's Day, at Easter, on the 1st of May, in mid-August and at Christmas. During these periods, the photovoltaic energy self-consumed onsite is expected to be very low because it is only related to security equipment and perimeter lights. The maximum power absorbed is about 320 kW in summer due to chiller operation. In general, across the seasons, the PV production fits the load well: the peaks in both energy production and consumption are expected during summer, while the lowest values are registered in winter. The daily load curves present a peak in the late afternoon. During autumn and winter, another peak is also observed in the early morning, due to the start-up of HVAC machines.

**Figure 3.** Seasonal commercial loads during a typical week.

The industrial load curve considered corresponds to a ceramics factory, whose structure covers an area of about 17,300 m<sup>2</sup>. It is located in Italy, in the region of Emilia-Romagna, in climatic zone F. The industrial process covers the entire day and the corresponding consumption is much larger than the one related to conditioning and lighting systems. The annual consumption of electric energy in 2019 (chosen once again as reference year) is 7.5 GWh. The hourly consumption is visualized in Figure 4, in the form of a heat map covering all the hours and all the days of the reference year. Moreover, the seasonal loads during a typical week are plotted in Figure 5.

**Figure 4.** Heat map corresponding to industrial loads.

The power consumption ranges from 0 to 1280 kW. Saturdays and Sundays correspond to the yellow lines, representing a power absorption of about 650 kW. Production is stopped during some periods in April, May, August and December. The load curve is constant across weeks and the electric consumption is generally constant across all the working days. The PV power production does not fit this type of load curve as well as the commercial one.

**Figure 5.** Seasonal industrial loads during a typical week.

### *2.3. PV Energy Production Simulation*

A preliminary study on new PV plants is needed in order to estimate their potential energy production. In this analysis, the input variables are: the plant geographical coordinates, the peak power installable on a roof or on a specific area, the type of installation, the tilt and the azimuth of the roof. Notice that, in the case of a PV facility where different sections present different exposures, the last four variables are considered independently for each exposure. Different sections may also differ in the type of installation and, consequently, in the mean daily PR. Finally, hourly irradiation data from the first to the last day of the considered reference year are acquired from the SoDa HelioClim database for each section of the new plant, using the information about the geographic coordinates, the tilt angle and the azimuth angle.

The energy production is calculated hour by hour using the solar irradiation data and the performance ratio:

$$E_{PV,i} = \frac{H_i}{1000} \cdot PR_{daily,i} \cdot P_i \tag{2}$$

In the equation: $i$ stands for a generic plant section; $H_i$ is the hourly solar irradiation on the surface of the modules in a given section; $PR_{daily,i}$ is the daily Performance Ratio derived from the monitoring of real PV plants; $P_i$ is the total peak power installed for a given section. The total plant production in each hour is given by the sum of the energy produced by each section.
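Equation (2) and the sum over sections can be illustrated with a few lines of Python. This is a minimal sketch: the function names and the tuple layout are hypothetical, and the units (irradiation in Wh/m<sup>2</sup>, peak power in kWp, energy in kWh) are assumed from the division by 1000 in the equation.

```python
def section_energy_kwh(irradiation_wh_m2, pr_daily, peak_power_kwp):
    """Hourly energy of one plant section, following Equation (2)."""
    return irradiation_wh_m2 / 1000.0 * pr_daily * peak_power_kwp


def plant_energy_kwh(sections):
    """Total hourly plant energy as the sum over all sections.

    sections: iterable of (irradiation_wh_m2, pr_daily, peak_power_kwp).
    """
    return sum(section_energy_kwh(h, pr, p) for h, pr, p in sections)


# Hypothetical East-West plant: two 250 kWp sections with different exposure.
east = (600.0, 0.82, 250.0)
west = (400.0, 0.82, 250.0)
total = plant_energy_kwh([east, west])
```

Treating each exposure as its own tuple mirrors the statement that irradiation, PR and peak power are handled independently for each section.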

The simulations are performed under the assumption of ideal rooftop, where either fixed tilt, flush mount or East-West installations are possible. A total of six cases are considered, one for each combination between the three different PV plant configurations and the two possible load curves.

The tilt angle, the azimuth angle and the exposure are set for each configuration and thus they are independent from the load curve. In detail:


The peak power of the plant is fixed: in case of commercial load, the peak power is 500 kWp, while in the case of industrial load the peak power is 2 MWp. The characteristics of each configuration are summarized in Table 1.


**Table 1.** New PV plants characteristics: (a) fixed tilt; (b) flush-mount; (c) East-West.

### *2.4. Battery Energy Storage System Models*

A list of the battery models to be analyzed is obtained by choosing among the products available on the market: different brands, sizes and technologies are adopted and compared in the simulations. All the batteries considered can be recharged from the grid. All the relevant battery parameters are retrieved from catalogs. Two main technologies are considered: LiFePo and Li-ion NMC batteries.

The maximum volume of the technical room where batteries are installed is arbitrarily set at 50 m<sup>3</sup>: this constitutes an upper limit to the maximum number of battery modules installable. The volume occupied by each battery pack accounts for the dimensions of the battery and the minimum space necessary for heat dissipation, as reported in the data sheets. The weight of the system is also taken into account.

The batteries that are simulated in combination with the PV system are listed in Table 2.

A total of 14 different battery models are chosen, and their corresponding 242 feasible configurations are simulated in combination with each considered PV facility. Since 3 types of PV installation and 2 types of load are considered, a total of 1452 PV+BESS systems are evaluated.


**Table 2.** List of battery models considered and corresponding characteristics (extracted from data sheets).

### *2.5. Battery Energy Storage System Control Logic*

In the simulations, batteries are evaluated in terms of model and number, assuming that multiple packs of the same model can be connected in series or in parallel. The battery simulation starts from a single pack of the first battery model and ends at the maximum number of packs of the last model. A BESS configuration is simulated only if its volume is lower than the maximum volume of the technical room.
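The enumeration described above can be sketched as a double loop over battery models and pack counts, filtered by the room volume. The model names, pack volumes and pack limits below are invented for illustration; only the 50 m<sup>3</sup> limit and the final count check come from the text.

```python
def feasible_configurations(models, max_volume_m3=50.0):
    """List (model, number_of_packs) pairs that fit in the technical room.

    models: list of (name, pack_volume_m3, max_packs) tuples, where the pack
    volume already includes the clearance needed for heat dissipation.
    """
    configs = []
    for name, pack_volume, max_packs in models:
        for n_packs in range(1, max_packs + 1):
            # A configuration is kept only if its volume is below the limit.
            if n_packs * pack_volume < max_volume_m3:
                configs.append((name, n_packs))
    return configs


# Hypothetical catalogue with two battery models.
catalogue = [("model A", 10.0, 10), ("model B", 20.0, 3)]
configs = feasible_configurations(catalogue)

# Each feasible BESS configuration is paired with every PV installation
# type and every load type: 242 * 3 * 2 = 1452 systems in the paper.
n_systems = 242 * 3 * 2
```

With the hypothetical catalogue, model A contributes four configurations and model B two, since five packs of A or three packs of B would reach or exceed the limit.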

The batteries are connected to the grid; therefore, the convenience of recharging the battery when the price of energy is lower is evaluated. To avoid having the battery fully charged on the morning of a sunny day, the maximum state of charge achievable in the F3 band is limited to the monthly difference between load and PV production divided by the number of days in that month. Moreover, the batteries are assumed to be AC-coupled with the PV system.
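The limit on grid charging in the F3 band can be written as a one-line formula. The sketch below is an interpretation rather than the authors' code: the function name and arguments are hypothetical, and clamping the result between zero and the battery capacity is an added assumption.

```python
import calendar


def f3_charge_cap_kwh(load_month_kwh, pv_month_kwh, year, month, capacity_kwh):
    """Maximum energy the battery may hold after charging in the F3 band.

    The cap is the monthly difference between load and PV production
    divided by the number of days in the month.
    """
    days = calendar.monthrange(year, month)[1]
    daily_deficit = (load_month_kwh - pv_month_kwh) / days
    # Assumption: the cap cannot be negative or exceed the battery capacity.
    return min(max(daily_deficit, 0.0), capacity_kwh)


# Hypothetical June: 70 MWh of load, 64 MWh of PV, 500 kWh battery.
cap = f3_charge_cap_kwh(70000.0, 64000.0, 2019, 6, 500.0)
```

In months where PV production exceeds the load, the cap drops to zero under this reading, so the battery is left empty for the next day's solar surplus.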

Real charge/discharge operations are always constrained by technical limits. However, in a preliminary battery assessment like the one proposed here, there is no need to account for these constraints. In real applications, a group of batteries suitable for a given application needs to be identified as soon as possible. Only then is a specific battery model chosen among the possible ones (the choice is most often constrained by the availability of the different models) and further detailed analyses are carried out by means of specific software.

The battery operation is based on a precise control logic, capable of optimally managing the system. Decision-tree-like structures are constructed to visually represent the BESS control logic adopted. In Table 3, the terms adopted in the decision-tree-like structures are listed and explained.

The control logic of BESS charge is defined in the decision-tree-like structure reported in Figure 6.

In particular, the charge is permitted in three different modes:



**Table 3.** Notation adopted in decision-tree-like structures defining BESS charge and discharge control logic.

**Figure 6.** BESS charge control logic.

The control logic of BESS discharge is defined in the decision-tree-like structure reported in Figure 7.

**Figure 7.** BESS discharge control logic.

The battery discharge takes place in four different modes:


load is taken from the grid. This discharge mode is allowed evaluating if the battery is connected to the grid or not, as discussed in the first type of discharge.

### *2.6. Characteristic Features for PV+BESS Configurations*

The prediction of the feasible BESS configurations accounts for some key indicators: the PayBack Time (PBT) of the battery capital expenditure, to be minimized; the number of residual cycles at end of life, to be minimized; the self-consumption, the coverage and the on-site self-production, to be maximized.

The Self-Consumption (SC) is defined as [24]:

$$\text{SC} = \frac{E_{PV \to load}}{E_{PV,y}} \tag{3}$$

In the equation: $E_{PV \to load}$ is the PV energy consumed by the load; $E_{PV,y}$ is the total annual PV production.

The coverage, sometimes also called self-sufficiency, is defined as [25]:

$$cov = \frac{E_{PV \to load}}{E_{load,y}} \tag{4}$$

In the equation: $E_{PV \to load}$ is the PV energy consumed by the load; $E_{load,y}$ is the total annual energy consumption.

The Self-Production (SP) is defined as [24]:

$$\text{SP} = \frac{E_{PV,y}}{E_{load,y}} \tag{5}$$

In the equation: $E_{PV,y}$ is the annual PV production; $E_{load,y}$ is the total annual energy consumption.

The PayBack Time (PBT) is computed as [26]:

$$\text{PBT} = \frac{\text{BESS investment cost}}{\text{Annual economic saving}} \tag{6}$$

The annual economic saving is the amount of money saved thanks to the presence of the BESS with respect to the same facility without any energy storage. In order to calculate it, a database with hundreds of electricity bills is used. The bills are divided according to zone, voltage (medium or low) and type of contract (peak/off-peak, single-rate, fixed multi-hourly and variable multi-hourly). Then, economic savings are calculated on the basis of the mean bill expenditure as a function of energy.

Notice that the computed values of PBT refer only to the storage system and not to the entire power generation facility, including the PV plant. The OPEX (OPerating EXpense) related to the storage system consists of battery O&M (Operation & Maintenance) costs (for instance, those related to maintenance interventions, remote monitoring, etc.) and insurance costs. However, considering the purpose of the current preliminary analysis, all those factors can be neglected: they would be estimated equally for all the considered battery models and therefore they would not have any influence on the identification of the optimal capacity.
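The four indicators of Equations (3) to (6) reduce to simple ratios, as the following sketch shows. The function names are hypothetical, and apart from the 828 MWh annual commercial consumption reported in Section 2.2, all numbers are invented for illustration.

```python
def self_consumption(e_pv_to_load, e_pv_year):
    """SC, Equation (3): share of the annual PV production used by the load."""
    return e_pv_to_load / e_pv_year


def coverage(e_pv_to_load, e_load_year):
    """cov, Equation (4): share of the annual load covered by PV energy."""
    return e_pv_to_load / e_load_year


def self_production(e_pv_year, e_load_year):
    """SP, Equation (5): annual PV production relative to the annual load."""
    return e_pv_year / e_load_year


def payback_time(bess_investment_cost, annual_saving):
    """PBT, Equation (6), in years."""
    return bess_investment_cost / annual_saving


e_load = 828.0        # MWh, annual commercial consumption (Section 2.2)
e_pv = 600.0          # MWh, hypothetical annual PV production
e_pv_to_load = 450.0  # MWh, hypothetical PV energy consumed by the load

sc = self_consumption(e_pv_to_load, e_pv)
cov = coverage(e_pv_to_load, e_load)
sp = self_production(e_pv, e_load)
pbt = payback_time(150000.0, 10000.0)  # hypothetical cost and saving
```

Since the three energy indicators share the same quantities, the definitions imply $cov = \text{SC} \cdot \text{SP}$, which offers a quick consistency check on simulation outputs.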

### *2.7. Battery Sizing Optimization by Means of Unsupervised Clustering*

Finally, an unsupervised clustering based on the k-means algorithm [27] is applied to all the analyzed BESS configurations. This final step aims at identifying a family containing all the feasible BESS solutions. K-means divides the dataset into a fixed number (*k*) of clusters according to some feature variables describing each sample. In this analysis, each sample corresponds to a possible BESS configuration. The feature variables chosen for this task are: the total photovoltaic energy stored in the battery within one year, the self-consumption, the number of residual cycles and the payback time.

In order to properly choose the number of clusters *k*, the Silhouette index [28] is used. This index provides a measure of how similar each sample is to samples in its own cluster compared to samples in other clusters, and thus constitutes a tool to evaluate the quality of several possible partitions of the available dataset. In practical terms, the Silhouette index is computed as a function of the number of clusters *k*, and then the *k* corresponding to the highest Silhouette value is selected as the number of clusters to be identified with k-means clustering.

The Silhouette plot for the BESS configurations coupled with the commercial load, representing the Silhouette value as a function of the number of clusters *k*, is displayed in Figure 8. The number of clusters to be identified by the k-means algorithm is equal to 2, coinciding with the largest Silhouette value.

**Figure 8.** Silhouette plot for the commercial load scenario.

The Silhouette plot for the BESS configurations coupled with the industrial load, representing the Silhouette value as a function of the number of clusters *k*, is displayed in Figure 9. The number of clusters to be identified by the k-means algorithm is equal to 3, coinciding with the largest Silhouette value.

**Figure 9.** Silhouette plot for the industrial load scenario.

### **3. Results and Discussion**

All the considered plants, their geographical position, their type of installation and their annual PR value are listed in Table 4. The observed annual PR ranges from 0.69 to 0.91. The two types of installation generally showing better performances are the fixed tilt and the East-West configurations, except for some outliers. The case of carport installation has little relevance in the current analysis: data are available only for one plant and the PR is calculated over a time period of only eight months.


**Table 4.** Location, type of installation and annual Performance Ratio for each of the considered PV facilities.

As already described, the PR value for single plants is averaged over all the plants characterized by a specific type of installation. The result of this operation is reported in Figure 10. The box plot shows, for both fixed tilt and East-West configurations, an average annual PR higher than 0.80. However, the variability of the performances observed with fixed tilt PV facilities is much larger than that of East-West PV plants.

Finally, Figure 11 displays a heat map representing the daily average values of PR in the reference year as a function of the plant configuration, computed by averaging the daily PR values of single plants.

The simulations of the new PV+BESS hybrid plants return a forecast of the total amount of energy self-consumed, sold to the grid, stored in the battery or acquired from the grid in order to balance the demand. It is then possible to discuss the results in terms of the PBT of the battery. As expected, when the number of battery packs in series, and thus the capacity of the storage system, increases, the energy self-consumed by the load grows but the PBT also increases significantly.

The results reported in Table 5 identify the BESS configuration that minimizes the PBT for each PV system configuration.

Most of the configurations identified result in a PBT approximately equal to the lifetime of the battery, i.e., 15 years. In the last two cases, the PBT is even larger than the battery lifetime. The cost of energy storage technologies is still too high to conclude that installing a BESS for large buildings is currently convenient. However, if the investment cost per kWh of capacity decreases, it will be possible to install larger capacities and to achieve a significant advantage also in terms of additional self-consumption. When the PV installation type is changed for the commercial load, the choice of battery model remains unchanged, as does the PBT. For the industrial load, the same battery model with the same number of modules shows a decrease in PBT for the fixed tilt configuration thanks to higher annual economic savings.

**Figure 10.** Annual PR averaged in function of the PV plant configuration.

**Figure 11.** Daily PR (computed on the chosen reference year) averaged over all the plants characterized by a given configuration.



The results reported in Table 6 identify the BESS configuration that minimizes the number of residual cycles for each PV system configuration.

Most of the battery models optimized in terms of number of residual cycles are different from the ones optimized in terms of PBT. Focusing on the industrial load case, the fixed tilt configuration with battery storage could be an interesting solution if the investment cost of batteries decreases, because it has the minimum PBT among the batteries with the optimal value of residual cycles. The last configuration has a PBT which is far too high for the investment to be feasible.


The results obtained from k-means clustering application are reported in the following. As already discussed, the clustering procedure exploits the total photovoltaic energy stored in the battery within one year, the self-consumption, the number of residual cycles and the payback time as relevant features to characterize each possible BESS configuration.

Figure 12 represents all the 242 possible BESS configurations for the case of commercial load with a fixed tilt PV installation, divided into two clusters. The clustering results do not show significant differences for other types of installation. As discussed before, the number of clusters is chosen on the basis of the Silhouette index and is equal to 2. The feature space is represented from three different points of view: on the PBT/residual cycles plane, on the self-consumption/residual cycles plane and on the self-consumption/PBT plane. The last diagram represents the overlap between clusters in terms of PBT. All values on the axes are standardized in the range between −1 and 1.
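The text does not state how the axes are standardized between −1 and 1. One common choice, assumed here purely for illustration, is a min-max rescaling of each feature; the function name and sample values are hypothetical.

```python
def scale_to_unit_range(values):
    """Min-max rescaling of a feature to the interval [-1, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in values]
    return [2.0 * (v - lowest) / (highest - lowest) - 1.0 for v in values]


# Hypothetical payback times (years) of three BESS configurations.
pbt_scaled = scale_to_unit_range([5.0, 10.0, 15.0])
```

Rescaling every feature to the same range prevents quantities with large numerical values, such as the yearly stored energy, from dominating the Euclidean distances used by k-means.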

The purple cluster represents the family of BESS that are best suited to be coupled with the analyzed PV facility configuration. The trend of self-consumption over the PBT confirms what was stated before: as the capacity of the battery increases, the self-consumption increases but, as a drawback, the PBT increases as well.

Figure 13 represents all the 242 possible BESS configurations for the case of industrial load with a fixed tilt PV installation, divided into three clusters. When k-means is applied to the industrial scenario, the results are again similar among different PV installation types, as observed for the commercial case. The number of clusters is chosen on the basis of the Silhouette index and is equal to 3. The feature space is represented in the same way as in the commercial case and all values on the axes are standardized in the range between −1 and 1.

The large number of batteries with high capacity (and consequently high PBT) and low number of residual cycles, in the top left region of the upper diagrams, is related to the high electric consumption typical of an industrial load. Once again, the purple cluster represents the family of BESS that are best suited to be coupled with the analyzed PV facility configuration. The green cluster corresponds to BESS configurations with a low number of residual cycles and high PBT, while the orange cluster represents high-capacity batteries that are strongly oversized and thus not suited for the considered PV facility.

**Figure 12.** Possible BESS configurations divided in clusters (commercial load).

**Figure 13.** Possible BESS configurations divided in clusters (industrial load).

### **4. Conclusions**

In recent years, the technological development and the increasing market competitiveness of RES-based power production systems have created favorable conditions for switching from electricity generation in large centralized facilities to small decentralized energy systems.

In this scenario, PV facilities are profitable for grid-connected users when the produced energy is self-consumed. Due to the intermittent and stochastic nature of the solar source, PV plants require the addition of an energy storage system to compensate for fluctuations and to meet the energy demand even during night hours.

In this paper, a procedure is developed to forecast a family of batteries which is suitable to be coupled with a given PV plant configuration and is applied to some new PV facilities.

The PV+BESS hybrid plant energy production simulation is possible by:


Two different types of load curve are considered in the current work, namely:


The battery operations are managed by means of a control logic defined in decision-tree-like diagrams. The two diagrams, provided in the current work, consider all possible operating conditions during both charge and discharge. The main strategies behind the defined control logic are:


For each possible PV+BESS configuration, performance features, such as the number of residual cycles at the end of lifetime and the self-consumption, and economic features, such as the payback time (PBT), are computed. The self-consumption is defined as the ratio between the PV energy consumed by the load and the total annual PV production. The PBT, on the other hand, is based on the annual economic savings allowed by the presence of an energy storage system compared to the case of a PV plant without a battery.
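A minimal Python sketch of these two indicators follows. The self-consumption ratio is taken directly from the definition above; the payback time is written as a simple undiscounted payback (investment divided by the extra annual savings), which is an assumption, since the exact PBT formula is not restated in this section. All names and numbers are hypothetical.

```python
import math


def self_consumption(pv_energy_to_load_kwh, pv_production_kwh):
    """Share of the annual PV production that is consumed by the load."""
    return pv_energy_to_load_kwh / pv_production_kwh


def simple_payback_time(bess_cost, savings_with_bess, savings_pv_only):
    """Years needed for the extra annual savings to repay the battery cost.

    Assumption: undiscounted payback with constant yearly savings.
    """
    extra_savings = savings_with_bess - savings_pv_only
    if extra_savings <= 0:
        return math.inf  # the battery never pays back
    return bess_cost / extra_savings


# Hypothetical example values (not taken from the paper).
sc = self_consumption(pv_energy_to_load_kwh=600.0, pv_production_kwh=1000.0)
pbt = simple_payback_time(bess_cost=5000.0, savings_with_bess=1500.0,
                          savings_pv_only=1000.0)
```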

The following observations are derived from the analysis performed:


Finally, a clustering algorithm based on k-means is applied to all the considered PV+BESS configurations, with the aim of detecting the family of battery solutions that is the most suitable for the scenario considered. The number of clusters to be identified is established by means of the Silhouette index. As expected, the cluster of the best solutions contains all those configurations characterized by low PBT and number of residual cycles.

Possible future developments of the present work consist in adopting different clustering criteria and different features to possibly improve the identification of the family of batteries that are suitable for a given application.

**Author Contributions:** Data curation, S.P. and L.D.C.; Formal analysis, A.N., A.M., S.P., L.D.C. and E.O.; Investigation, A.N., A.M., S.P., L.D.C. and E.O.; Methodology, A.N., A.M., S.P., L.D.C. and E.O.; Software, L.D.C.; Writing—original draft, A.M. and L.D.C.; Writing—review & editing, A.N., S.P. and E.O. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **Abbreviations**

The following abbreviations are used in this manuscript:


SP Self-Production

### **References**


### *Article* **Influence of the Characteristics of Weather Information in a Thunderstorm-Related Power Outage Prediction System**

**Peter L. Watson \*, Marika Koukoula and Emmanouil Anagnostou**

Department of Civil & Environmental Engineering, University of Connecticut, Storrs, CT 06269, USA; MARIKA.KOUKOULA@uconn.edu (M.K.); emmanouil.anagnostou@uconn.edu (E.A.) **\*** Correspondence: peter.watson@uconn.edu

**Abstract:** Thunderstorms are one of the most damaging weather phenomena in the United States, but they are also one of the least predictable. This unpredictable nature can make it especially challenging for emergency responders, infrastructure managers, and power utilities to be able to prepare and react to these types of events when they occur. Predictive analytical methods could be used to help power utilities adapt to these types of storms, but there are uncertainties inherent in the predictability of convective storms that pose a challenge to the accurate prediction of storm-related outages. Describing the strength and localized effects of thunderstorms remains a major technical challenge for meteorologists and weather modelers, and any predictive system for storm impacts will be limited by the quality of the data used to create it. We investigate how the quality of thunderstorm simulations affects power outage models by conducting a comparative analysis, using two different numerical weather prediction systems with different levels of data assimilation. We find that limitations in the weather simulations propagate into the outage model in specific and quantifiable ways, which has implications for how convective storms should be represented to these types of data-driven impact models in the future.

**Keywords:** power outages; machine learning; thunderstorms; numerical weather prediction

### **1. Introduction**

Weather-related power outages, and the severe weather events that cause them, pose a persistent threat to the functioning of the infrastructure and economy of the United States. These types of power outages affect millions of people and cost the U.S. economy tens of billions of dollars every year; moreover, the rate at which they occur appears to be increasing [1]. Anticipating the damages that storms can cause is a critical step in electrical utility managers' storm outage management process. They need reliable information before a storm to be able to stage repair crews and effectively prepare for the damages that the storm will cause [2]. As such, there has been a recent surge in research and development activity into methods to predict storm damages and weather-related power outages.

Arguably, the most destructive types of storms in the United States are thunderstorms, including the associated convective phenomena (tornadoes, microbursts, hail, etc.). While hurricanes often receive special attention because they are larger and more dramatic, thunderstorms are more common and cause more damage to the electrical infrastructure every year than any other type of weather. Indeed, investigations of major outage events reported to the Department of Energy have found that convective storms are responsible for the majority of weather-related outage events, the greatest number of customer outages, and the most outage hours [3,4]. Additionally, there is every indication that the severity of thunderstorms is going to increase in the future. Changes in the climatic patterns of thunderstorms can already be seen in a time series analysis [5], and long-term climate projections suggest that, because of climate change, thunderstorms are likely to become stronger, more frequent, and more damaging [6,7].

**Citation:** Watson, P.L.; Koukoula, M.; Anagnostou, E. Influence of the Characteristics of Weather Information in a Thunderstorm-Related Power Outage Prediction System. *Forecasting* **2021**, *3*, 541–560. https://doi.org/10.3390/forecast3030034

Academic Editor: Sonia Leva

Received: 27 June 2021 Accepted: 3 August 2021 Published: 5 August 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Despite the demonstrated risk that thunderstorms present to the electrical infrastructure, they have not received much attention in the recent research on modeling weather-related power outages. While there are some outage modeling approaches that are generalized to a range of types of weather [8–11], much of the research in this field has been focused on other types of storms. The vast majority of the work has focused on tropical storms and hurricanes, which can have particularly dramatic impacts [12–16], but several mature modeling approaches, specifically for extratropical storms [17–19], have also been developed.

In the existing general outage models, thunderstorms are sometimes included in the analysis [9,10,20,21], but the weather characteristics of these storms are treated in a similar fashion to other, more structured types of weather. There are also some studies that infer a focus on thunderstorms by including information about lightning strikes [11,22,23], but do not have an explicit focus on thunderstorms because they also include other types of weather events in their analysis.

This lack of focus on thunderstorms may be a result of the technical difficulty associated with describing and simulating them. Convective storms are particularly challenging for established numerical weather prediction (NWP) models and meteorological forecasts. While the increased horizontal resolution of convective-allowing configurations can lead to improved simulations, even with state-of-the-art high-resolution NWP models, reliable deterministic forecasts of thunderstorms longer than several hours are elusive [24–26]. As Yano et al. describe, there may be limitations to modern NWP models' ability to simulate convective storms because of the wide-spread use of assumptions and parameterizations that are reasonable for synoptic-scale weather patterns but are much less applicable to more complex convective phenomena [26]. These potential limitations of NWP simulations are long standing, and multiple strategies for mitigating them have emerged. Assimilating radar or even lightning observations into initial conditions of simulations can be used to improve short-term predictions [27,28]; forecasting systems that leverage this type of data assimilation for rapidly-updating nowcasts are currently operational [29]. In addition, for forecasts longer than several hours, stochastic predictions from convective-allowing ensembles have shown an improved forecasting skill by being able to capture the range of potential outcomes, instead of one deterministic scenario [30–32].

Similar approaches and findings can be seen in the few studies in the literature that specifically focus on predicting thunderstorm-related power outages. In Alpay et al., the authors take a rapid-refresh nowcasting approach to modeling thunderstorm-related outages, using an LSTM neural network trained on data from a rapidly updating radar-ingesting weather model from NOAA [33]. The works of Shield and Kabir et al. both describe a thunderstorm outage prediction system trained on weather data from the National Digital Forecast Database for an area in Alabama [34,35]. Shield investigates the limitations of the model he develops and identifies that it has better skill at a synoptic scale, which illustrates the difficulty of forecasting thunderstorms [34]. Kabir et al. take a more stochastic approach and develop a quantile regression model, which allows the communication of the significant uncertainties associated with predicting the impacts of thunderstorms [35].

While this previous work attempts to manage the known limitations of weather simulations of thunderstorms, how these limitations propagate from weather simulations into machine-learning based impact models remains poorly described. The problem of poor inputs for a computational algorithm has been recognized since the dawn of computation [36], but the effects in this context are not fully understood. In this paper, we attempt to shed light on this matter by analyzing the quality of the weather data from two different weather simulation systems with differing amounts of data assimilation, determining how outage models trained on these different sets of weather data differ in skill and accuracy, and identifying what information the outage models learn from. This knowledge is critical to build an understanding of the limitations of the data used to build impact models for thunderstorms and to suggest how improved representations of weather will improve the quality of the insights that can be derived from them.

### **2. Materials and Methods**

This study involved the creation and comparison of two separate machine-learning models designed to predict thunderstorm-related power outages, using data from NWP-based weather simulations and a wide range of other data sources in a region covering three states (Connecticut, Massachusetts, and New Hampshire) and five distinct electrical utility service territories: Eversource Connecticut (CT), Eversource Western Massachusetts (WMA), Eversource Eastern Massachusetts (EMA), Eversource New Hampshire (NH), and AVANGRID United Illuminating (UI). For geographical details of the modeling domain, refer to Figure 1.

**Figure 1.** The location of the outage model grid cells by service territory as well as the location of the airport weather stations used in the meteorological analysis.

### *2.1. Data*

The outage models developed in this analysis use data describing 372 thunderstorm events that occurred in the utility service territories from 2016 to 2020, together with a range of environmental characteristics, such as vegetation and drought status, and proprietary outage and infrastructure data provided by the power utilities, aggregated to the grid cells of the weather simulations. We included as many thunderstorm events as could be observed in weather station reports from each utility service territory, and aggregated the data to the RTMA grid cells of each service territory for each thunderstorm event. For details about the amount of data used from each territory, see Table 1.


**Table 1.** The amount of data available for training the thunderstorm-related outage models.

### 2.1.1. Weather

The core of the analysis centers around datasets produced by two separate NWP gridded weather simulation systems: a hybrid NOAA analysis system, and a WRF 2 km simulation system. The NOAA analysis dataset is a combination of data from the Real-Time Mesoscale Analysis (RTMA) [37] and Stage IV Quantitative Precipitation Estimates (Stage IV) [38]. RTMA is a weather analysis product that produces a gridded estimate of weather conditions by statistically downscaling a 1 h short-term forecast and adjusting it with weather station observations. It produces a high-resolution, near real-time estimate of temperature, humidity, dew point, wind speed and direction, wind gusts, and surface pressure for the entire United States. The RTMA data were sourced from the archive hosted on the Google Earth Engine [39]. Stage IV is a Quantitative Precipitation Estimate (QPE) dataset created by the National Weather Service and the National Centers for Environmental Prediction (NWS, NCEP), using a blend of NEXRAD radar and the NWS River Forecast Center precipitation processing system [40]. It takes gridded precipitation estimates derived from radar scans, adjusts the values based on rain gauge data, and aggregates the data to produce gridded hourly estimates of precipitation for the continental United States. It is popular for analytical purposes and is often used as a reference to analyze the accuracy of satellite and other precipitation estimates [38]. By using a blend of RTMA and Stage IV, we are able to have a reasonable estimate of the average hourly weather conditions in each grid cell during each storm used in this analysis. For the sake of brevity, this dataset will sometimes be referred to as the "RTMA" system.

We compare this hybrid NOAA analysis dataset with another weather dataset developed from a configuration of the Weather Research and Forecasting Model (WRF), similar to one that was used in several outage prediction models [17,18], but with an increased horizontal resolution to potentially help resolve convection. This model, which has 2 km horizontal grid spacing with one 6 km external domain, is initialized with the North American Mesoscale Forecast System analysis [41]. For configuration details, please see Table 2. These WRF simulations use a different projection than the RTMA system, so the results were resampled with bilinear interpolation to match the spatial characteristics of the RTMA analysis product.


**Table 2.** Details of the WRF simulation configuration.

For outage modeling purposes, 24 h time series of a common set of weather variables generated from both weather simulation systems were processed to generate descriptive data features for each thunderstorm in this analysis. The weather variables considered are dew point temperature, specific humidity, air temperature, surface pressure, wind speed, wind gust speed, wind direction, and hourly precipitation rate. Established weather parameters that directly describe convective potential, such as CAPE and CIN, were unfortunately not available for this study because they are not published in RTMA, which is primarily a surface analysis product. For each of the included variables, the mean, maximum, minimum, standard deviation, 4 h mean during peak winds, and total were calculated for each storm, except for wind direction, for which we took the median value. The median was taken to limit its sensitivity to outliers. Several additional features were calculated: the number of hours of winds above various wind speeds, calculated using various thresholds applied to wind speeds and gusts; the typical wind direction, obtained by taking the mean of the median wind direction of included storms; and the difference between the typical wind direction and the median wind direction for that storm. To preserve its characteristics, all computation and analysis of wind direction was performed via the circular library in R [48]. We also included a set of features describing the time series of wind stress exerted on the trees by taking the product of the leaf area index (see below) and the square of the wind speed. For details, please see Appendix A, which contains a detailed table of all data features used for modeling.
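The feature generation described above can be sketched as follows. This is an illustrative Python version using only the standard library (the study itself used R and the circular library), with hypothetical function names and toy data. Two points are assumptions: the "4 h mean during peak winds" is interpreted as the mean over the 4 h window with the highest total wind speed, and the circular mean is computed from the summed unit vectors of the directions.

```python
import math
import statistics


def summarize(series, wind_speed, window=4):
    """Storm-level summary features for one hourly weather variable."""
    # Assumed interpretation: the peak window is the 4 h span with the
    # highest total wind speed.
    starts = range(len(wind_speed) - window + 1)
    peak = max(starts, key=lambda i: sum(wind_speed[i:i + window]))
    return {
        "mean": statistics.fmean(series),
        "max": max(series),
        "min": min(series),
        "sd": statistics.stdev(series),
        "peak4h_mean": statistics.fmean(series[peak:peak + window]),
        "total": sum(series),
    }


def hours_above(series, threshold):
    """Number of hours in which the variable exceeds the threshold."""
    return sum(1 for v in series if v > threshold)


def wind_stress(leaf_area_index, wind_speed):
    """Hourly wind stress proxy: leaf area index times wind speed squared."""
    return [leaf_area_index * w ** 2 for w in wind_speed]


def circular_mean_deg(directions_deg):
    """Mean direction in degrees, computed on the unit circle."""
    s = sum(math.sin(math.radians(d)) for d in directions_deg)
    c = sum(math.cos(math.radians(d)) for d in directions_deg)
    return math.degrees(math.atan2(s, c)) % 360.0


def angular_difference(a_deg, b_deg):
    """Smallest absolute angle between two directions, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


# Toy 24 h series (hypothetical values).
wind = [2.0] * 10 + [6.0, 8.0, 10.0, 8.0] + [3.0] * 10
precip = [0.0] * 10 + [1.0, 4.0, 5.0, 2.0] + [0.0] * 10
precip_features = summarize(precip, wind)
```

Working on the unit circle avoids the wrap-around problem of ordinary averages: the directions 350° and 10° average to north rather than to 180°.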

### 2.1.2. Infrastructure and Outage Data

Proprietary data describing the infrastructure and historical outages were made available for this study for the five utility service territories. Using rgdal and rgeos [49,50], we calculated, for the area within each outage model grid cell, the length of overhead power lines, the number of utility poles, the number of fuses and cutouts, and the number of circuit reclosers.

The historic outage data describes the time and location of where damage occurred to the power distribution grid for a period of five years (2016 to 2020). Based on this information, we were able to calculate the number of damage locations within each outage model grid cell associated with each storm. A damage location is a physical location where repair crews are dispatched to repair damage after a storm. In the vast majority of cases, this meant counting the damage locations that were identified in the 24 h storm period, but in several cases, additional "nested" storm-related outages were recognized by utility operators after the storm period, so a longer window was sometimes used. These damage data were extracted from the utility outage management system, which is a software tool used by most large utilities to identify outages and dispatch repair crews.

### 2.1.3. Environmental Data

Because weather-related power outages are the result of interactions between the weather, the infrastructure and the environment, a range of environmental information was considered for this analysis. We processed the environmental data in several different ways depending on spatial resolution. When working with datasets with a resolution higher than the 2.5 km RTMA grid, for each grid cell, the raster data were sampled from a 60 m buffer around the overhead lines in that cell, and we calculated the representative percentage for the categorical data, or the average and standard deviation for the numerical data. We applied this process to a range of datasets, including the following: categorical land cover from the 2016 National Land Cover Database (NLCD) [51], 2016 NLCD Tree Canopy Coverage [52], vegetation height estimates from the Global Ecosystem Dynamics Investigation (GEDI) lidar instrument on the International Space Station [53], USGS 3DEP DEM elevation [54], and several other datasets, which required special processing. For example, we sampled the soils dataset developed by Watson et al. [18] from the USDA SSURGO database [55] to describe the soil characteristics (density, porosity, hydraulic conductivity, composition, and saturation). Additionally, because that previous work suggests that systemic biases caused by differences in the elevations of the weather predictions and the infrastructure may be present, we used the difference between those two elevations as an additional feature, elvDiff.

As seen in other outage modeling work [15,16], high-resolution data from the Individual Tree Species Parameter Maps (developed to support the USDA National Insect and Disease Risk Map) were used to calculate information about the density of the forest and the presence of various tree species [56]. However, because these data contain information about 264 individual tree species, we aggregated the basal area and stand density index of the species data by wood type (hardwood or softwood). Additionally, we were able to calculate the mean and standard deviation of the basal area (BA), stand density index (SDI), quadratic mean diameter (DQ), total frequency (TF), and trees per acre (TPA) for all trees, and generate statistics for the area around the infrastructure as described in the previous paragraph.

Data at coarser resolutions were handled more simply by sampling the data using the centroid of the grid cell. This included data describing the climatological leaf area index generated by Cerrai et al. [9], and a collection of drought indices published by the West Wide Drought Tracker [57]. While drought data were used in outage modeling in the past [12,15], we included more information, including the 1, 3 and 12 month Standardized Precipitation Index (SPI) of the month of the storm, as well as the 12 month SPI from 1 to 5 years before the storm occurred. This information was included to capture not only the immediate drought conditions, but also any lingering effects of long-term drought stress on the vegetation.

### *2.2. Outage Modeling*

To generate a robust outage prediction system based on the 131 data features generated via the processes described in the previous section, additional steps were taken to confirm each variable's importance for outage modeling, tune the model's hyperparameters, and test the system's performance via cross-validation. All modeling processes were coded in R [58], using a range of support libraries.

Variable importance for modeling was initially confirmed via a Boruta variable selection process. This process involves calculating the variable importance in a random forest model, and comparing each variable's importance against the importance of a randomized variable with the same distribution of values. Over many iterations, this process can confirm the importance of each variable in a dataset in comparison to random noise [59]. This was implemented via the Boruta R library [60].

Based on experience and the previous literature [9,10,18], we chose the Bayesian Additive Regression Tree (BART) model for this analysis [61], implemented via the BART R library [62]. While this is a quantile regression algorithm, we simplified outputs to deterministic predictions for each storm by taking the mean of the outputs of the model. The hyperparameters used by the BART algorithm (sparse parameters a and b, shrinkage parameter k, the number of trees, the number of posterior draws, and the number of iterations used to initialize the Monte-Carlo Markov Chains) were tuned for this dataset via differential evolution [63] implemented via the DEoptim library [64]. It was used to find the optimal configuration of the BART algorithm based on the mean root mean square logarithmic error (RMSLE) of a fixed 5-fold cross-validation of the RTMA system dataset. To maintain comparability, these optimized hyperparameter values were consistently applied to all models and experiments in this analysis. RMSLE was chosen because it is less sensitive to extreme errors.
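As a small illustration of why RMSLE is less sensitive to extreme errors, the sketch below compares it with the ordinary RMSE on toy data. It is written in Python for illustration (the study used R) and assumes the common form of the metric based on log(1 + x), which also accepts zero counts; the values are hypothetical.

```python
import math


def rmsle(predicted, actual):
    """Root mean squared logarithmic error, using log(1 + x)."""
    squared = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared) / len(squared))


def rmse(predicted, actual):
    """Ordinary root mean squared error, for comparison."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical outage counts: one storm is badly under-predicted.
actual = [0.0, 5.0, 10.0, 500.0]
predicted = [0.0, 6.0, 9.0, 250.0]
error_log = rmsle(predicted, actual)
error_raw = rmse(predicted, actual)
```

Here the single extreme miss dominates the RMSE, while its contribution to the RMSLE depends only on the ratio between the predicted and the actual value.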

### *2.3. Analysis*

To understand the differences between the hybrid NOAA analysis dataset, the WRF simulation dataset, and the outage prediction models built on them, we evaluated each weather simulation's ability to represent the local weather conditions by comparing its predictions against weather station observations. Then, to understand the different qualities of the two outage models, as well as evaluate the importance of individual and groups of variables in the outage models, we compared the cross-validation results, using traditional and spatial error metrics.

More specifically, to evaluate the two gridded weather simulations, data were collected from METAR and SPECI reports via the Integrated Surface Data archive maintained by the National Centers for Environmental Information [65]. Any data flagged with quality issues were removed, and all observations reported were averaged for every hour to produce a 24 h time series. Any station or variable with more than two hours of missing data was removed from the analysis. Then, the same summary statistics used to generate the outage model features (mean, minimum, maximum, standard deviation, total, 4 h mean during peak winds) were calculated based on the weather station observations. Any mean or maximum gust values reported as zero by the weather stations were also removed from consideration.

For this analysis, all weather stations in the proximity of the outage prediction service territories were considered, with the exception of Northern New Hampshire. We removed that area from consideration because it is dominated by the White Mountains, and the complex topography would cause biased results. See Figure 1 for the detailed weather station location information used in this analysis. While additional data cleaning steps are common when this process is used for weather model evaluation, we determined that this would not be appropriate because the localized differences between the weather station observations and gridded NWP data are of interest.

The outage model performance was evaluated using leave-one-date-out cross-validation. This validation process simulates the operational predictability of the outages caused by each weather event by iteratively isolating the information of each storm event, and testing the model's ability to predict it. More specifically, for each storm date and time present in the database of storms, we reserved the data from that date and time, trained the outage model on the remaining data, and tested that trained model on the reserved data. This way, we had a comprehensive evaluation of all storms in our database, but prevented any spatial or temporal correlations in the weather data from influencing the model performance. While 372 thunderstorm events were considered in this analysis, because of overlapping times, each outage model was only trained and tested 226 times for this cross-validation. To evaluate the overall cross-validation results, we calculated the median absolute percent error (MdAPE), mean absolute percent error (MAPE), centered root mean squared error (CRMSE), correlation coefficient (R2), and the Nash–Sutcliffe efficiency (NSE) [66]. For definitions of these error metrics, please see Appendix C.
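The splitting scheme and some of the error metrics can be sketched in a few lines. The Python code below is illustrative only: the function names and toy data are hypothetical, and the metric formulas are the commonly used textbook forms, assumed here rather than taken from Appendix C.

```python
import math
import statistics


def leave_one_date_out(storm_times):
    """Yield (held_out_time, train_indices, test_indices) per unique time."""
    for t in sorted(set(storm_times)):
        test = [i for i, s in enumerate(storm_times) if s == t]
        train = [i for i, s in enumerate(storm_times) if s != t]
        yield t, train, test


def ape(pred, obs):
    """Absolute percent errors (observed values must be non-zero)."""
    return [100.0 * abs(p - o) / o for p, o in zip(pred, obs)]


def mape(pred, obs):
    return statistics.fmean(ape(pred, obs))


def mdape(pred, obs):
    return statistics.median(ape(pred, obs))


def crmse(pred, obs):
    """Centered RMSE: RMSE after removing the mean of each series."""
    mp, mo = statistics.fmean(pred), statistics.fmean(obs)
    sq = [((p - mp) - (o - mo)) ** 2 for p, o in zip(pred, obs)]
    return math.sqrt(sum(sq) / len(sq))


def nse(pred, obs):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the mean."""
    mo = statistics.fmean(obs)
    num = sum((o - p) ** 2 for p, o in zip(pred, obs))
    den = sum((o - mo) ** 2 for o in obs)
    return 1.0 - num / den


# Two territories hit at the same time share one fold (hypothetical data).
storm_times = ["2018-05-15 12:00", "2018-05-15 12:00", "2019-07-06 18:00"]
folds = list(leave_one_date_out(storm_times))
observed = [100.0, 100.0, 40.0]
predicted = [110.0, 90.0, 50.0]
```

Grouping by date and time is what makes the number of folds smaller than the number of events: events in different territories that overlap in time are held out together.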

Because the spatial predictability of thunderstorm outages is also of interest, we also applied the fraction skill score (FSS) to evaluate the spatial skill of the outage models. FSS uses a threshold, or a series of thresholds, to generate binary rasters of predictions and actual values, and compares the two within a series of neighborhoods [67]. A skillful model is able to predict a similar fraction of values above the threshold as the actual values in a small area. This metric is becoming a widely accepted method to evaluate the spatial skill of precipitation forecasts, especially in the U.S. [68]. Under ideal conditions, an FSS value greater than 0.5 indicates a "useful" skill, but depending on the conditions of the baseline performance (FSS*uniform*), it is subject to change as defined by the following equation:

$$\text{FSS}_{uniform} = 0.5 + \frac{\text{FSS}_{random}}{2} \tag{1}$$

where FSS*random* is the total of the derived binary raster, divided by the number of cells in the domain [67]. For precipitation, the threshold tends to increase with smaller domains and as the prevalence of precipitation increases [69]. For this analysis, we calculated the FSS between upscaled outage predictions and actual outages for each storm by service territory, for a range of scales (3 × 3 to 21 × 21 cells) and outage thresholds, via the validation library [70]. Upscaling the predicted and actual values for the FSS calculation was important because the resolution of our model and the frequency of actual damages are such that the actual values are extremely zero-inflated and very sparse (96.3% zeros, and a mean of 0.048 damages per grid cell). The outage model predictions, however, tend to be small (median of 0.0292 and 0.0314 for the RTMA and WRF systems, respectively) and are more evenly distributed. This difference in spatial distribution was minimized by applying boxcar smoothing to a small 3 × 3 neighborhood on both the actual and predicted outages for each event and territory via the SpatialVx library [71]. While this process effectively degrades the precision of the analysis, it generates more continuously distributed values that are more comparable, while not affecting the total number of damage locations for each event.
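To make the FSS calculation concrete, the following Python sketch computes it for one threshold and one neighborhood size on a small grid. It is illustrative only and does not reproduce the R libraries used in the study: the function names are hypothetical, values at or above the threshold are counted as events (an assumption), and cells outside the domain are treated as zeros.

```python
def neighborhood_fractions(binary, n):
    """Fraction of event cells in the n x n window around each cell."""
    rows, cols = len(binary), len(binary[0])
    half = n // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0
            for rr in range(max(0, r - half), min(rows, r + half + 1)):
                for cc in range(max(0, c - half), min(cols, c + half + 1)):
                    total += binary[rr][cc]
            out[r][c] = total / (n * n)  # cells outside the grid count as 0
    return out


def fss(pred, obs, threshold, n):
    """Fraction skill score for one threshold and neighborhood size n."""
    bp = [[1 if v >= threshold else 0 for v in row] for row in pred]
    bo = [[1 if v >= threshold else 0 for v in row] for row in obs]
    fp = neighborhood_fractions(bp, n)
    fo = neighborhood_fractions(bo, n)
    pairs = [(a, b) for ra, rb in zip(fp, fo) for a, b in zip(ra, rb)]
    num = sum((a - b) ** 2 for a, b in pairs)
    den = sum(a ** 2 for a, _ in pairs) + sum(b ** 2 for _, b in pairs)
    return 1.0 - num / den if den > 0 else float("nan")


def fss_uniform(obs, threshold):
    """Baseline of Equation (1): 0.5 plus half the observed event fraction."""
    cells = [v for row in obs for v in row]
    f_random = sum(1 for v in cells if v >= threshold) / len(cells)
    return 0.5 + f_random / 2.0


# Toy 5 x 5 example: the prediction is displaced by one cell to the east.
obs = [[0] * 5 for _ in range(5)]
pred = [[0] * 5 for _ in range(5)]
obs[2][2] = 2
pred[2][3] = 2
score_cell = fss(pred, obs, threshold=1, n=1)
score_3x3 = fss(pred, obs, threshold=1, n=3)
baseline = fss_uniform(obs, threshold=1)
```

The displaced prediction has no skill at the grid-cell scale but exceeds the baseline once a 3 × 3 neighborhood is considered, which is the behavior that motivates evaluating a range of scales.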

To measure the variable importance of each outage model, we applied the variable permutation technique described by Fisher et al. [72] via the DALEX library in R [73]. This technique is model agnostic and uses a loss function to measure model performance as the input variables are perturbed. This allows for a quantitative understanding of each variable's influence on the model performance. Doing this evaluation via cross-validation would be prohibitively complex and computationally expensive, so to evaluate the variable importance within the outage models, all available data were used to train the models before variable importance was measured. In addition, because there is a significant random component in this analysis, we calculated this variable importance over ten iterations for both outage models, and calculated the confidence intervals. The loss metric used to evaluate variable importance, root mean squared logarithmic error (RMSLE), was chosen because it is robust to the inclusion of zeros and is less sensitive to rare cases of extreme errors, which can be present because of the statistical distribution of actual outages as described above. However, because it is a logarithmic error metric, differences in RMSLE can often appear small, despite being significant.
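The idea behind permutation-based variable importance can be sketched independently of any particular library. The Python code below is a simplified illustration, not the DALEX implementation: the function names, the toy model, and the data are hypothetical, and importance is reported as the mean increase in RMSLE over ten shuffles of each input column.

```python
import math
import random


def rmsle(predicted, actual):
    """Root mean squared logarithmic error, using log(1 + x)."""
    sq = [(math.log1p(p) - math.log1p(a)) ** 2
          for p, a in zip(predicted, actual)]
    return math.sqrt(sum(sq) / len(sq))


def permutation_importance(predict, rows, y, loss, n_repeats=10, seed=0):
    """Mean loss increase when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = loss(predict(rows), y)
    importance = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)
            # Rebuild the rows with only feature j permuted.
            permuted = [row[:j] + [v] + row[j + 1:]
                        for row, v in zip(rows, column)]
            increases.append(loss(predict(permuted), y) - baseline)
        importance.append(sum(increases) / n_repeats)
    return importance


def toy_model(rows):
    """Hypothetical fitted model: uses feature 0 and ignores feature 1."""
    return [2.0 * row[0] for row in rows]


rows = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [2.0 * i for i in range(20)]
importance = permutation_importance(toy_model, rows, y, rmsle)
```

Because the model is treated as a black box, the same procedure applies to any predictor; a feature the model ignores receives an importance of zero, while shuffling an influential feature degrades the loss.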

### **3. Results**

### *3.1. Weather Analysis*

As demonstrated in Figure 2, the NOAA analysis dataset represents almost all weather parameters used in the outage models more accurately than the WRF simulation dataset. Very significant differences are seen between the quality of the precipitation parameters, as well as several wind and gust features. Both systems are able to represent parameters associated with synoptic scale processes, such as temperature, humidity, and surface pressure dynamics, much more accurately than mesoscale and microscale processes, such as wind and precipitation. Some surface pressure parameters appear to be poorly captured, but this is likely due to differences in elevation between the NWP data and weather station data, which are not accounted for in this evaluation. In general, these results are quite consistent with what we would expect from the state of the art of NWP of a deterministic 24 h simulation of thunderstorms. For detailed metrics, see Appendix B.

**Figure 2.** Point-to-point comparison of the NOAA analysis parameters (RTMA, (**Top**)) and the WRF simulation parameters (WRF, (**Bottom**)) versus weather station observations for select variables, describing 24 h thunderstorm events.

*3.2. The Outage Models*

The RTMA-based outage model performs slightly better than the WRF-based model based on all metrics used in our analysis as seen in Table 3 and Figure 3.

**Table 3.** Error metrics of the event-level performance of the cross-validation of the outage prediction systems.


**Figure 3.** Scatterplots of cross-validation predictions versus actual outages for all thunderstorm events for RTMA- (red, (**left**)) and WRF (blue, (**right**))-based outage prediction systems.

While a direct comparison is not particularly fair because of the differences in the events used in the analysis and the domains of the models, both outage models presented here perform reasonably well in comparison to other outage prediction models of a similar architecture. Wanik et al. [10] describe a warm weather outage model that has a slightly higher MdAPE (35.1 to 38.7%). In Cerrai et al. [9], the best overall outage model has an overall MdAPE of 43%, a MAPE of 59% and an NSE of 0.53. In Yang et al. [17], their conditional outage prediction system designed for severe events has a MdAPE of 38%, MAPE of 46%, and NSE of 0.79. In Watson et al. [18], their best performing rain/wind storm model has a MdAPE of 38%, MAPE of 57%, and an NSE of 43%. The thunderstorm outage models described here have competitive APE metrics, but have a comparatively low NSE, in part because of one under-predicted extreme event.

Overall, the cross-validation results indicate that the models presented here are sensitive to the overall severity of the different thunderstorms. The model has a good dynamic range, especially if one considers that the median daily outages for CT, WMA, EMA, NH, and UI are 35, 6, 20, 22, and 25, respectively. The models shown here demonstrate a dynamic range of around 10 times the typical daily outage level for each service territory, depending on storm severity.

### 3.2.1. Spatial Skill

As seen in Figure 4, the RTMA-based outage model has slightly better spatial performance than the WRF-based model, but the differences between the outage models are small in comparison to the differences between the events and territories. While many thresholds were evaluated, we show the results for a threshold of 0.111 damage locations, which corresponds to having one damage location smoothed out in a 3 × 3 pixel area (approximately 7.5 km × 7.5 km).
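To make the 0.111 threshold concrete, the following Python sketch smooths a single damage location over a 3 × 3 neighborhood (giving 1/9 ≈ 0.111) and then computes a fractions skill score in its common textbook form. This is a sketch under stated assumptions, not the processing chain used in the study: the grid size, the zero padding at the domain edge, the order of smoothing and thresholding, and all function names are hypothetical.

```python
def neighborhood_mean(grid, size):
    """Mean over each cell's size x size neighborhood; cells outside the grid count as zero."""
    half = size // 2
    n_rows, n_cols = len(grid), len(grid[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            total = 0.0
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < n_rows and 0 <= jj < n_cols:
                        total += grid[ii][jj]
            out[i][j] = total / (size * size)
    return out


def fractions_skill_score(predicted, observed, threshold, size):
    """FSS = 1 - MSE / reference MSE, computed on neighborhood fractions of exceedance."""
    def exceed(grid):
        return [[1.0 if v >= threshold else 0.0 for v in row] for row in grid]

    pf = neighborhood_mean(exceed(predicted), size)
    of = neighborhood_mean(exceed(observed), size)
    mse = sum((p - o) ** 2 for p_row, o_row in zip(pf, of) for p, o in zip(p_row, o_row))
    ref = sum(p ** 2 + o ** 2 for p_row, o_row in zip(pf, of) for p, o in zip(p_row, o_row))
    return 1.0 - mse / ref if ref > 0 else float("nan")


def single_damage_grid(n, row, col):
    """An n x n grid with exactly one damage location."""
    grid = [[0.0] * n for _ in range(n)]
    grid[row][col] = 1.0
    return grid


THRESHOLD = 0.111

# One observed damage location, and a prediction displaced by one grid cell.
observed = neighborhood_mean(single_damage_grid(9, 4, 4), 3)
predicted = neighborhood_mean(single_damage_grid(9, 4, 5), 3)

center_value = observed[4][4]  # one damage location spread over nine cells
fss_offset = fractions_skill_score(predicted, observed, THRESHOLD, 3)
fss_perfect = fractions_skill_score(observed, observed, THRESHOLD, 3)
```

A prediction that is displaced by one cell still earns partial credit, which is why a neighborhood score of this kind is preferred over a cell-by-cell comparison for spatially noisy fields.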

**Figure 4.** FSS for all events by territory for the RTMA- and WRF-based outage models with a moderate outage risk threshold (0.111 damage locations), plotted for neighborhood sizes 3 × 3 to 21 × 21 grid cells. The colored lines are FSS values for each event; the black line indicates the average FSS over all events; and the horizontal dark grey line indicates the average FSS<sub>*uniform*</sub>.

### 3.2.2. Outage Model Variable Importance

The grouped variable importance analysis of the outage models in Figure 5 shows that, while infrastructure-related variables are the most important by far, there are differences between the two models as to which weather parameters contribute the most to the models. While the RTMA-based system finds precipitation information to be very useful, the WRF-based system has a stronger preference for winds, temperature, and humidity than the RTMA model. The WRF model also appears to fit more on such environmental variables as land cover, vegetation, and elevation, which do not vary storm-by-storm in a given service territory. The results of an individual variable importance analysis are displayed in Appendix A. Although the importance of any one variable to the model is relatively small, given the large number of variables used, and the logarithmic error metric used to measure the dropout loss only makes the differences appear smaller, there are some interesting differences between the two models. Most notably, the maximum precipitation rate is one of the least important variables in the WRF model but is the second most important variable in the RTMA model.

**Figure 5.** Grouped variable importance as measured by dropout loss (RMSLE) over 10 iterations of permuted groups of variables. The 95% confidence intervals are also shown for both the RTMA-based outage model (red, (**left**)), and the WRF-based outage model (blue, (**right**)).

### **4. Discussion**

Based on these results, several conclusions can be made about the predictability of thunderstorm-related power outages. Firstly, while the NOAA analysis data represent local weather conditions more accurately than the WRF simulation, many weather features used in the outage prediction models have significant errors in both systems. Rather than these errors being simulation or forecasting errors, because of the amount of observational data assimilated into the NOAA analysis system, they are likely due to the representativeness error caused by depicting complex and locally variable phenomena as deterministic and uniform in the 2.5 km × 2.5 km area. This type of error has been documented in the literature for precipitation and winds [74–77], and the errors in the RTMA data for winds and the Stage IV are comparable to the magnitude of representativeness error found in these works.

Secondly, because the NOAA analysis data have higher quality weather data than the WRF simulations, it is unsurprising that the RTMA outage model is more accurate than the WRF-based one. However, what is surprising is how modest the performance differences between these outage models are. Even with the large amount of observational data incorporated into the RTMA and Stage IV analysis products, which have far fewer simulation errors than the WRF simulations, the RTMA-based outage model is unable to predict thunderstorm-related outages with substantially greater accuracy.

This suggests that the randomness of storm damages is quite significant, and more precise outage predictions may require significantly more precise information. One possibility is that additional factors that are not considered in this study, such as the age of the infrastructure, limit the outage model. However, there are also differences between the two models that suggest other possibilities. As described above, the spatial resolution of the representation of the weather data is a readily apparent source of imprecision in our data. Although all data used in these models, including the environmental and infrastructure information, may suffer from similar representativeness errors, we can see that some weather variables are better represented at 2.5 km × 2.5 km than others. How the precision of the weather data affects the outage models can be understood with a more detailed analysis of the variable importance.

By comparing the R<sup>2</sup> values of the weather feature evaluation and the importance of the weather variables in the outage models, we find that there is a weak but real correlation between the two (0.23 ± 0.07 for RTMA, 0.29 ± 0.07 for WRF). This indicates that the outage models have a preference for precise and accurate weather information. This may be obvious, but this preference also appears regardless of whether or not the weather phenomena directly cause power outages. Both RTMA and WRF outage systems find temperature and humidity to be somewhat important to their predictions, although these variables are not direct causes of outages in thunderstorms. They are more indicators of convective potential and are, thus, indirectly related to power outages, but because of their accurate representation, the machine learning algorithms of the outage models find them useful for understanding the risk of weather-related damage.

At the same time, there is also a distinct preference for variables that have a more direct causative relationship with weather-related outages. This can best be seen in how the RTMA system has a strong preference for precipitation variables. maxPREC is the 2nd most important variable of all for that model, despite it having only a moderate correlation with local conditions (R<sup>2</sup> of 0.5298). It can also be seen in how both models find useful information in wind and gust variables, despite the most precisely predicted variable in that group, avgWIND, only having a moderate correlation with local conditions (R<sup>2</sup> of 0.6346 and 0.5879 for RTMA and WRF, respectively). This is because both wind and precipitation are good indicators of the location and intensity of a convective storm, and more direct indicators of the risk of weather-related damages. Indeed, in the case of the RTMA system, the strong preference for precipitation information comes with a comparatively weaker preference for most other variable groups.

This suggests that if the precision of the precipitation and wind information could be increased further, we can expect corresponding increases in the accuracy of outage prediction models for thunderstorms. Additionally, if we consider how the apparent lack of precision in these data is likely from representativeness error, as described above, future directions for research become apparent.

Lastly, the spatial skill of the outage prediction system appears to vary significantly from storm-to-storm as well as territory-by-territory. It is beyond the scope of this paper to speculate about the storm-to-storm variability in the FSS scores, which may also be a function of the accuracy and precision of the weather simulations, but the distinct differences in spatial predictability of outages in different service territories are suggestive of distinct differences between them. It has been documented for precipitation that the FSS calculations change significantly depending on the size of the domain. However, in the case of outages, this effect is likely only moderate because the average value of FSS<sub>*uniform*</sub> does not vary much between territories. The most apparent and potentially impactful difference for outage models between the territories is the density of the infrastructure. As seen in Figure 5 and Appendix A, infrastructure is a very influential variable for outage modeling, and while all the service territories included in this study contain some urban areas, some are much more consistently urbanized than others. As such, the mean density of overhead lines for each territory varies widely, with a minimum of 8.5 km per grid cell in WMA and a maximum of 27.5 km per grid cell in UI. If the mean density of the overhead lines and mean FSS as shown in Figure 4 for each territory are compared, we see that the Pearson correlations between the two are 0.927 for RTMA, and 0.946 for WRF: a very strong correlation between the overall spatial predictability of outages and the density of the infrastructure in the region. This is a clear indication of the influence that the infrastructure density has over the spatial predictability of power outages. However, this also may be an indication of over-fitting on the infrastructure features. Infrastructure is by far the most important variable group in this analysis, but in the case of the RTMA outage model, better spatial skill comes with a corresponding lower importance of infrastructure.

### **5. Conclusions**

While the two thunderstorm-related outage models shown here are acceptably skilled at predicting the total number of damages for each storm event, they have difficulty predicting the location of storm impacts. Both the models based on the NOAA analysis dataset and the WRF simulation dataset appear to fit strongly on the amount of infrastructure present in an area and a combination of weather variables that are either directly related to storm damages but imprecisely represented (precipitation, winds), or are more general indicators of convective potential but more precisely represented (temperature, humidity).

Because predictions of the weather conditions and power outages appear to have similar limitations for thunderstorms, there are established analytical methods that could be readily applied to improve the modeling of power outages and other impacts associated with thunderstorms. Just as weather ensembles allow meteorologists to predict the potential intensity of thunderstorms beyond the capabilities of deterministic forecasts, an outage model coupled to a weather ensemble may allow us to predict the potential impacts in a similar way. Because of the high uncertainties, rapidly-refreshing outage models, such as that described in Alpay et al. [33], may be more useful in an operational decision-making context for thunderstorm preparedness.

If one considers how strong convective storms are an increasing threat, globally, there is an implicit call to accelerate investment in global weather prediction and the observation infrastructure. The impact models presented here, even with their limitations, are only possible because of the availability of high-resolution nowcasting products in the United States. While recent developments in global convective-allowing NWP systems are encouraging [78], for this type of impact modeling to be applied in other countries, more work in this space is needed.

Based on our findings, we can expect that as better representations of local weather conditions during thunderstorms are developed both in the United States and globally, outage model accuracy, overall as well as spatially, will improve; the outage models will learn more and more of the phenomena directly linked to weather-related power outages, such as strong winds and extreme precipitation, instead of the synoptic patterns that are correlated to them. To progress along that path, a more granular understanding of the weather conditions that cause damage in convective storms and how they can be represented is needed. Further research involving an analysis or modeling of storm impacts based on microscale numerical weather prediction, large eddy simulations, or even observations from radar or lidar instruments could be very informative about how weather information can be generated in a way that improves our ability to understand and anticipate the impacts of convective storms.

**Author Contributions:** Conceptualization, P.L.W.; methodology, P.L.W.; software, P.L.W. and M.K.; validation, P.L.W.; formal analysis, P.L.W.; investigation, P.L.W. and E.A.; resources, E.A.; data curation, P.L.W. and M.K.; writing—original draft preparation, P.L.W.; writing—review and editing, E.A. and M.K.; visualization, P.L.W.; supervision, E.A.; project administration, E.A.; funding acquisition, E.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Eversource Energy.

**Institutional Review Board Statement:** Not Applicable.

**Informed Consent Statement:** Not Applicable.

**Data Availability Statement:** Restrictions apply to the availability of these data. Data were obtained from Eversource Energy and United Illuminating. They are available from the authors with the permission of Eversource Energy and United Illuminating.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

### **Abbreviations**

The following abbreviations are used in this manuscript:



### **Appendix A. Data Features**

**Table A1.** Description of variables used in outage prediction models. The dropout loss of the top ten variables for each model are in bold. Higher dropout loss indicates greater importance.



### **Appendix B. Weather Correlations**

**Table A2.** Correlation between RTMA and WRF weather datasets, and METAR and SPECI observations.




<sup>1</sup> Not enough variance to compute.

### **Appendix C. Error Metrics**

$$\text{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|P_i - A_i|}{A_i} \times 100 \tag{A1}$$

$$\text{CRMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left[(P_i - \bar{P}) - (A_i - \bar{A})\right]^2} \tag{A2}$$

$$\text{NSE} = 1 - \frac{\sum_{i=1}^{N} (P_i - A_i)^2}{\sum_{i=1}^{N} (A_i - \bar{A})^2} \tag{A3}$$

$$\mathrm{R}^2 = \left(\frac{\sum_{i=1}^{N} (P_i - \bar{P})(A_i - \bar{A})}{\sqrt{\sum_{i=1}^{N} (P_i - \bar{P})^2 \sum_{i=1}^{N} (A_i - \bar{A})^2}}\right)^2 \tag{A4}$$

$$\text{RMSLE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\log(P_i + 1) - \log(A_i + 1))^2} \tag{A5}$$
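For readers who want to reproduce these metrics, a minimal Python sketch follows. It assumes that P denotes predicted and A actual outage counts, that the logarithm in Equation (A5) is the natural logarithm, and that NSE takes its usual form with the variance of the actual values in the denominator. The log-based error of Equation (A5) is named `rmsle` to match the term used in the main text. The sample values are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def mape(predicted, actual):
    """Mean absolute percentage error (A1); undefined if any actual value is zero."""
    return 100.0 * mean([abs(p - a) / a for p, a in zip(predicted, actual)])


def crmse(predicted, actual):
    """Centered root mean square error (A2): RMSE after removing each series' mean."""
    p_bar, a_bar = mean(predicted), mean(actual)
    return math.sqrt(mean([((p - p_bar) - (a - a_bar)) ** 2
                           for p, a in zip(predicted, actual)]))


def nse(predicted, actual):
    """Nash-Sutcliffe efficiency (A3): 1 is perfect, 0 matches predicting the mean."""
    a_bar = mean(actual)
    numerator = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    denominator = sum((a - a_bar) ** 2 for a in actual)
    return 1.0 - numerator / denominator


def r_squared(predicted, actual):
    """Squared Pearson correlation between predictions and actual values (A4)."""
    p_bar, a_bar = mean(predicted), mean(actual)
    covariance = sum((p - p_bar) * (a - a_bar) for p, a in zip(predicted, actual))
    p_ss = sum((p - p_bar) ** 2 for p in predicted)
    a_ss = sum((a - a_bar) ** 2 for a in actual)
    return (covariance / math.sqrt(p_ss * a_ss)) ** 2


def rmsle(predicted, actual):
    """Root mean squared logarithmic error (A5); the +1 keeps zero counts finite."""
    return math.sqrt(mean([(math.log(p + 1) - math.log(a + 1)) ** 2
                           for p, a in zip(predicted, actual)]))


# Hypothetical event totals, for illustration only.
predicted = [10.0, 20.0, 30.0]
actual = [10.0, 25.0, 40.0]

metrics = {
    "MAPE": mape(predicted, actual),
    "CRMSE": crmse(predicted, actual),
    "NSE": nse(predicted, actual),
    "R2": r_squared(predicted, actual),
    "RMSLE": rmsle(predicted, actual),
}
```

In this example the predictions are perfectly correlated with the actual values but biased low, so R<sup>2</sup> equals 1 while NSE is below 1, which illustrates why the two metrics are reported separately.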

### **References**


### *Article* **Tobacco Endgame Simulation Modelling: Assessing the Impact of Policy Changes on Smoking Prevalence in 2035**

**Michael Chaiton <sup>1,2,</sup>\*, Jolene Dubray <sup>1</sup>, G. Emmanuel Guindon <sup>3</sup> and Robert Schwartz <sup>1,2</sup>**


**\*** Correspondence: michael.chaiton@camh.ca; Tel.: +1-416-978-7096

**Abstract:** Smoking causes a substantial amount of mortality and morbidity. This article presents the findings from simulation models that projected the impact of five potential Tobacco Endgame strategies on smoking prevalence in Ontario by 2035 and the expected impact of achieving a smoking prevalence of "less than 5 by 35" on tax revenue. We used the Ontario SimSmoke simulation for modelling the expected impact of four strategies: plain packaging, free cessation services, decreasing the number of tobacco outlets, and increasing tobacco taxes. Separate models were used to project the impact of increasing the minimum age to legally purchase tobacco to 21 years on smoking prevalence and the impact of the price and tax increases needed to achieve "less than 5 by 35" on taxation revenue. The combined effect of the four strategies in the Ontario SimSmoke model is expected to reduce smoking prevalence by 8.5% in 2035. Increasing tobacco taxes had the greatest independent predicted decrease in smoking prevalence (2.8%), followed by raising the minimum age for legal purchase to 21 years (2.4%), decreasing tobacco outlets (1.5%), free cessation services (0.7%), and plain packaging (0.6%). Increasing tobacco excise tax and prices is projected to have minimal impact on taxation revenue, with a decrease from 1.5 billion to 1.2 billion in annual tax receipts.

**Keywords:** tobacco endgame; policy; simulation model; tobacco tax revenue

### **1. Introduction**

Great strides have been made in tobacco control in Canada and globally over the past few decades through implementation of various measures, including those endorsed by the international Framework Convention on Tobacco Control [FCTC] [1]. Nevertheless, smoking prevalence remains substantial: 18.1% of Canadians over 12 years of age, representing 5.4 million Canadians, were current smokers in the year 2014 [2]. The overall burden of smoking-related illness and death from cancer and from respiratory and cardiovascular diseases continues to be devastating. In 2002, 37,000 Canadians died from tobacco-associated illnesses, the equivalent of a small town being wiped off the map each year [3]. Canadians lose an estimated 515,607 person years of life every year as a result of premature mortality from tobacco smoking [3]. The idea of a "Tobacco Endgame" is based on the perspective that "control" of tobacco will never be enough to deal with the epidemic of tobacco-related diseases and that the focus must be shifted to develop strategies to reach a future that is free of commercial tobacco. This notion of "endgame" is qualitatively different from tobacco control strategies currently in place. This recognition is becoming more widespread and is increasingly leading to the view that a strategy for an "endgame" for commercial tobacco is required.

**Citation:** Chaiton, M.; Dubray, J.; Guindon, G.E.; Schwartz, R. Tobacco Endgame Simulation Modelling: Assessing the Impact of Policy Changes on Smoking Prevalence in 2035. *Forecasting* **2021**, *3*, 267–275. https://doi.org/10.3390/forecast3020017

Academic Editor: Konstantinos Nikolopoulos

Received: 17 March 2021 Accepted: 8 April 2021 Published: 13 April 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In October 2016, a Tobacco Endgame for Canada Summit was convened with over 80 experts, researchers, government officials, advocates, and health professionals in attendance to discuss possible strategies to reach the target goal "less than 5 by 35"; that is, to achieve less than 5% smoking prevalence by 2035. In this report, we describe the findings from simulation models that assessed the impact in Ontario of five potential Tobacco Endgame strategies [4]. They include:

- plain packaging;
- free cessation services;
- decreasing the number of tobacco outlets;
- increasing tobacco taxes; and
- increasing the minimum age to legally purchase tobacco to 21 years.


In addition, we also modeled the impact of tax and price increase to achieve "less than 5 by 35" on government taxation revenue. Cigarette taxes bring in significant revenue to governments at the national and provincial level. Apart from sales taxes, in 2014–2015 Canadian Federal and Provincial governments received \$8.2 billion from the sale of tobacco [5]. There is concern expressed by those opposed to tobacco elimination that reducing the number of smokers would decrease government revenue and that this would be of such a magnitude that it could not happen. However, there is overwhelming Canadian and international evidence that increases in tobacco taxes can reduce tobacco use and increase government tax revenue [6–13]. At current taxation and tobacco use rates, taxes on tobacco products have the dual effect of decreasing the demand for tobacco and increasing government revenue. In fiscal year 2014–2015, the federal government collected more than \$3 billion in cigarette taxes [14]. In Ontario and Québec, Canada's largest provinces, the provincial governments collected more than \$1 billion each.

If Canada achieves "less than 5 by 35" through non-tax interventions, total taxes collected on the sale of tobacco products would dwindle substantially. Given that in 2014, 18.1% of Canadians aged 12 and older smoked either daily or occasionally [2], it could be expected that annual tobacco tax receipts decrease by as much as 75% from 2035. Moreover, during the period of transitioning from 18% to 5% smoking prevalence, the cumulative amount of tax losses year over year would be far from negligible. Achieving "less than 5 by 35", however, need not be achieved solely on the back of non-tax interventions. In the case, albeit extreme, that "less than 5 by 35" is achieved solely through tax and price increases, the cumulative tax revenue gains during the transition period could be considerable. Irrespective of the substantial cost savings gained from reductions in health care spending and reductions in indirect costs to society detailed above, there might be minimal changes in government revenue during the period of transition to "less than 5", if increased tax rates are a component of an endgame strategy.

The purpose of this paper is to evaluate the expected impact of endgame policies and understand the expected tax revenue impact of reducing smoking prevalence to less than 5%.

### **2. Materials and Methods**

### *2.1. Ontario SimSmoke Model*

Four of the Tobacco Endgame strategies were modelled using the Ontario SimSmoke simulation model. The Ontario SimSmoke model is adapted from the SimSmoke simulation model of tobacco control policies, previously developed for the U.S. and other countries [15–17]. The model uses population, smoking rates, and tobacco control policy data for Ontario. It assesses, individually and in combination, the effect of seven types of tobacco control policies (taxes, clean air, mass media, advertising bans, warning labels, cessation treatment, and youth access policies) on smoking prevalence and associated future premature mortality [18]. Each policy parameter in the model is accorded an effect size developed for the SimSmoke model based on literature reviews and an expert panel. These existing parameters were then either maximized to represent full implementation of the intervention or the parameter effect sizes themselves were adapted according to the new assumptions. Modifications were made to the Ontario SimSmoke policy levels or policy effect sizes to assess the impact of each Tobacco Endgame strategy on smoking prevalence in Ontario between 2019 and 2035. The following represent the changes in the SimSmoke model to represent the effect of the endgame scenarios.

To simulate the impact of plain packaging, the comprehensive marketing ban (both direct and indirect) policy level in Ontario SimSmoke was increased to 90% (up from 25%) as a proxy measure for plain packaging, in which the package itself was assumed to be the primary method of direct consumer marketing in Ontario.

Free cessation services were modeled by adapting two parameters in Ontario SimSmoke. The first parameter incorporated free cessation services (pharmacotherapy and behavioural therapy) in all primary care and hospital settings. The second parameter expanded the number of settings offering free cessation to also include offices of health professionals, community, and 'other.' Free cessation services are currently limited in Ontario.

Analyses conducted by Chaiton, Mecredy, and Cohen [19] identified an increased risk of relapse among smokers who resided within 500 m from a tobacco outlet (Hazard ratio: 1.41) compared to those who lived further away. As a proxy measure for decreasing the number of outlets selling tobacco products, the policy effect sizes in Ontario SimSmoke for the five cessation treatment policies (treatment availability, treatment access, quitlines, quitlines with treatment access, and brief interventions) were increased by a value of 1.41.

Price elasticities were doubled in the Ontario SimSmoke model to assess the impact of increased tobacco taxes on smoking prevalence. Specifically, the policy effects were increased to −0.6 for youth less than 18 years (60% reduction in smoking), −0.4 for young adults aged 18 to 24 (40% reduction in smoking), −0.3 for adults aged 25 to 34 years (30% reduction in smoking) and −0.2 for adults aged 35 years or more (20% reduction in smoking).

### *2.2. Ontario Population Model*

Our final endgame strategy, increasing the minimum age of legal purchase to 21 years, and the tax revenue analysis were modelled separately from the SimSmoke model. We simulated the impact of minimum age laws by using a population program in which the baseline status quo rate of change in smoking prevalence was estimated to be 1.1% per year. We adjusted our model for effects in the age group younger than 19 and eliminated the effect of cessation in our model. The same population model was also used to evaluate the effect of taxation, through a separate model that simulates the impact of the tax and price increases required to achieve "less than 5 by 35".

Based on the analyses conducted by Callaghan et al. [20], it was assumed that the rate of onset for new smokers aged 20–22 would be 2.7 percentage points lower than it would have been under the standard projection for each year if the minimum age ban took effect immediately. No changes in prevalence were modelled for older ages at the time of the implied onset of the law; however, the effect was carried through as the cohort aged. Additionally, it was assumed that the increased age of onset would be associated with increased cessation in this cohort (natural rate of decrease adjusted from 0.011 to 0.022). No adjustment was made for any effects in youth younger than 19 who might be affected by reduced access to tobacco. No adjustment was made for any additional social normative effects.

This model obtained smoking prevalence from the 2014 Canadian Community Health Survey (CCHS) [2]. We used the Statistics Canada medium-growth population projection scenario (M1: medium-growth, 1991/1992 to 2010/2011 trend, CANSIM Table 052-0005) [21]. The number of people aged 20–22 was obtained from the Ontario Ministry of Finance for years 2018–2035 [22].

Smoking prevalence and daily number of cigarettes consumed per smoker, by age: We used the most recent cycle (2014) of a large national survey, the CCHS, and obtained point estimates for smoking prevalence and intensity.

Excise tax rate and revenue: We obtained current tobacco excise tax rates and more recent estimates of tobacco excise tax revenue from provincial Ministries of Finance.

Total cigarette tax-paid sales: As a measure of tax-paid sales, we used cigarette wholesale data as reported by tobacco manufacturers to Health Canada.

Underlying trend: Smoking prevalence in Canada has steadily decreased since the mid-1960s. In 1965 about half of all Canadians aged 15 and above smoked. By the early 2010s, only about 20% did [23]. This steady decline was due to many factors such as information on the harmful effects of active smoking and secondhand smoke, tobacco control policies such as smoke free policies, advertising bans and taxation, and changes in anti-smoking sentiment. Although it is difficult to disentangle the effects of each of these factors, it seems reasonable to assume that the downward trend in smoking prevalence observed between the early 2000s and the present would not abruptly end in the near future. In the last decade for which data are available, smoking prevalence, on average, declined annually by about 2% to 3% depending on the province. We assumed an underlying trend of 2.5% in annual decrease in both smoking prevalence and daily number of cigarettes consumed per smoker.
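The effect of an underlying trend of this kind can be sketched in a few lines of Python. The sketch assumes the 2.5% figure is a relative decline that compounds annually, which the text does not state explicitly, and it starts from the 18.1% national prevalence quoted in the Introduction purely for illustration rather than from the Ontario inputs. The function name is hypothetical.

```python
def project_with_trend(start_value, annual_decline, start_year, end_year):
    """Apply a constant relative annual decline, compounding from start_year to end_year."""
    values = {start_year: start_value}
    current = start_value
    for year in range(start_year + 1, end_year + 1):
        current *= 1.0 - annual_decline
        values[year] = current
    return values


# Illustrative inputs: 18.1% prevalence in 2014, 2.5% relative decline per year.
prevalence = project_with_trend(18.1, 0.025, 2014, 2035)
prevalence_2035 = prevalence[2035]
```

Under these assumptions the trend alone roughly halves prevalence over two decades but leaves it well above the 5% target in 2035, which is why the additional policy scenarios are of interest.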

### *2.3. Tax Revenue*

This model simulates the impact of the tax and price increases required to achieve "less than 5 by 35" by examining the impact on taxation revenues under three different scenarios: (1) excise taxes are increased only to keep up with inflation; (2) "less than 5 by 35" is achieved solely through excise tax increases; and (3) "less than 5 by 35" is achieved through non-tax interventions and excise tax increases that raise prices by 5% in real terms annually. We used accepted parameters of elasticity for changes in tobacco prices for adults (−0.4) and twice that for youth [13]. The model accounts for population growth, inflation, and tax evasion. We used data for the province of Ontario to simulate the impact of the tax and price increases required to achieve "less than 5 by 35" on tax revenue. At the current tax rates, it is expected that Ontario will collect about \$1.5 billion in 2016. All monetary figures below are in constant 2016 dollars. To estimate the changes in tax revenue, we made the following baseline model parameters and assumptions.

Own-price elasticity: There is overwhelming evidence that individuals respond to changes in tobacco prices. In high-income countries such as Canada and the United States, it is generally accepted that a 10% increase in prices would reduce total consumption by about 4%; and that half of the reduction comes from a reduction in the number of smokers and half from a reduction in consumption among continuing smokers [13]. It is also generally accepted that youth respond more to changes in prices—about twice as much as older adults [13]. Consequently, as a baseline assumption for own-price elasticity for cigarettes, we used −0.4 for adults (20 years of age and above) (−0.2 for own-price prevalence elasticity and −0.2 for own-price consumption elasticity), and twice that for youth (12 to 19 years of age).
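The elasticity assumption above translates into a simple calculation, sketched here in Python. The sketch uses a linear approximation (change equals elasticity times relative price change) and assumes that the half-and-half split between prevalence and consumption per smoker also applies to youth, which the text does not specify. The function name and dictionary keys are hypothetical.

```python
def price_response(price_increase, total_elasticity=-0.4, prevalence_share=0.5):
    """Relative changes implied by a relative price increase (linear approximation).

    `prevalence_share` is the fraction of the total response that comes from
    people quitting or not starting; the rest comes from continuing smokers
    consuming fewer cigarettes.
    """
    prevalence_elasticity = total_elasticity * prevalence_share
    intensity_elasticity = total_elasticity * (1.0 - prevalence_share)
    return {
        "prevalence_change": prevalence_elasticity * price_increase,
        "intensity_change": intensity_elasticity * price_increase,
        "consumption_change": total_elasticity * price_increase,
    }


# A 10% price increase for adults (elasticity -0.4) and youth (twice that).
adult = price_response(0.10)
youth = price_response(0.10, total_elasticity=-0.8)
```

For adults this reproduces the rule of thumb quoted above: a 10% price increase lowers total consumption by about 4%, of which about 2 percentage points come from fewer smokers.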

Pass-through rate: Tax changes do not necessarily lead to price changes as manufacturers are rarely required to pass on the full extent of tax increases to consumers. Manufacturers often under- or over-shift tax changes. In mature cigarette markets such as Canada, manufacturers typically over-shift tax increases [24]. As a baseline assumption, we assumed that tobacco manufacturers over-shift tax increases by 10%.

Prices: In order to estimate the effect of tax changes on smoking, it is necessary to first estimate the effect of tax changes on current prices. As the current price, we used \$0.40 per cigarette stick.

Expected inflation: As a measure of expected inflation, we used 2% annual increases to reflect the Bank of Canada's 2% inflation-control target [25].

Cigarette tax evasion: Although cigarette tax evasion has many causes, high taxes undeniably create an incentive for tobacco users and manufacturers to devise ways to evade tobacco taxes. While the illegal nature of cigarette tax evasion makes it intrinsically difficult to measure accurately, cigarette tax evasion in some Canadian regions such as southern Ontario is not negligible [26]. Our model allows for a portion of the effect of tax and price increases on tobacco use and consumption to be directed towards contraband cigarettes.

### **3. Results**

*3.1. Smoking Prevalence Modelling*

Results from the Ontario SimSmoke simulation model indicate that each of the Tobacco Endgame strategies predicts a greater reduction in smoking prevalence by 2035 compared to the status quo scenario (Table 1 and Figure 1).

**Table 1.** SimSmoke Model Predicted Smoking Prevalence, for Both Sexes, Ages 15–85, With and Without Tobacco Endgame Policies, Ontario, 2018–2035.


<sup>a</sup> Status quo represents the policy levels prior to the first projection year (2019). Source: Ontario SimSmoke.

**Figure 1.** SimSmoke Model Predicted Smoking Prevalence, for Both Sexes, Ages 15–85, With and Without Tobacco Endgame Policies, Ontario, 2018–2035. Status quo represents the policy levels prior to the first projection year (2019). Note: The full data table for this graph is provided in Appendix A (Table A1). Source: Ontario SimSmoke.

Increased taxation had the greatest independent impact on smoking prevalence. By 2035, smoking prevalence is projected to reach 10.1% with increased tobacco taxes, while the status quo prevalence is projected to be 12.9% in 2035 (a 2.8 percentage point reduction).

Decreased tobacco availability is projected to reduce smoking prevalence by 1.5 percentage points in 2035, from 12.9% with the status quo scenario to 11.4% with fewer tobacco outlets.

Offering free cessation services in primary care and hospital settings (i.e., the Ottawa Model of Smoking Cessation) is projected to reduce smoking prevalence to 12.2% in 2035, while free cessation services offered in primary care, hospitals, offices of health professionals, community and 'other' settings are projected to further reduce smoking prevalence to 12.1% in 2035. Both cessation policy models project lower smoking prevalence in 2035 compared to the status quo scenario (12.9% in 2035; a 0.61 and 0.78 percentage point reduction, respectively).

Plain packaging is projected to reduce smoking prevalence by 0.6 percentage points in 2035, from 12.9% with the status quo scenario to 12.3% with plain packaging.

The combined effect of all four Tobacco Endgame strategies modelled in Ontario SimSmoke is projected to reduce smoking prevalence to 8.5% in 2035, a 4.4 percentage point reduction compared to the status quo scenario (12.9% in 2035).

In the model assessing the impact of a higher minimum age for legal purchase, population smoking prevalence was expected to decline 3.7 percentage points by 2035 to 13.2% from an imputed value of 16.9% under the baseline status quo scenario. Increasing the minimum legal purchase age to 21 would be expected to reduce smoking prevalence to 10.5% (8.0% among the 20–34 year olds; 2.7 and 5.2 percentage point decrease, respectively). Eliminating the effect on cessation in the model would predict a 2035 prevalence of 11.2% (10.8% among the 20–34 year olds; 2.0 and 2.4 percentage point decrease, respectively) (Figure 2).

**Figure 2.** Model Predicted Smoking Prevalence, for Both Sexes, With and Without Increased Minimum Age Tobacco Purchasing Law, Ontario, 2018–2035.

### *3.2. Taxation Revenue Models*

Among the 5% expected to still be smoking by 2035, average consumption was expected to be 4.0 cigarettes per day, down from 13.3 cigarettes per day in 2014.

Scenario 1. "Less than 5 by 35" achieved through non-tax interventions (excise taxes assumed to keep up with inflation):


Scenario 2. "Less than 5 by 35" achieved solely through excise tax increases (assuming an underlying annual downward trend in smoking prevalence and consumption of 2.5%). Note that such a scenario requires that taxes increase annually by more than 20%:


Scenario 3. "Less than 5 by 35" achieved through non-tax interventions and excise tax increases that raise prices by 5% in real terms, annually:


### **4. Discussion**

The modelling results presented in this report highlight the effects of five key Tobacco Endgame strategies to reduce smoking prevalence in Ontario by the year 2035. Increasing tobacco taxes had the greatest independent predicted decrease in smoking prevalence by the year 2035 (2.8 percentage points), followed by increasing the minimum age for legal purchase to 21 years (2.4 percentage points) and decreasing the number of tobacco outlets (1.5 percentage points). Offering free cessation services and introducing plain packaging on all tobacco products each reduced smoking prevalence by less than 1 percentage point compared to the status quo. Notably, none of the Tobacco Endgame strategies (either independently or combined) projected a smoking prevalence that was less than 5% by 2035.

Regarding the impact of tax interventions on government revenue, our model shows that if Canada achieves "less than 5 by 35" through non-tax interventions, annual tobacco tax receipts would decrease from about \$1.5 billion to about \$160 million in 2035. However, if tax rates increase such that prices increase by 5% annually in excess of inflation (a policy pursued by France from 1991 to the early 2000s), average annual tax revenue would amount to about \$1.2 billion and the cumulative taxes collected between 2016 and 2035 would near \$25 billion.

The scenario 2 model showing the potential prices needed to achieve "less than 5 by 35" through taxation alone demonstrates the need for a comprehensive policy for the Tobacco Endgame that relies on both tax and non-tax interventions. Allowing for a portion of the effect of tax and price increases on tobacco use and consumption to be directed towards contraband cigarettes, as expected, reduces tax receipts, but does not invalidate any of the key findings. Similarly, our results are not sensitive to the use of a more conservative own-price elasticity estimate of −0.3. Taxation revenue should not be a barrier to the endgame. The analysis shows that with a sensible taxation policy, the fiscal cost over the period of implementation is minimal compared to the health care and social costs of tobacco, which are currently estimated at \$16.2 billion per year [27]. Ultimately, however, it is important to recognize that the massive health and mortality burden due to tobacco is not worth sustaining for any amount of profit or revenue.

Caution should be taken when interpreting the projections presented in this report as they depend on the reliability of the data, and the estimated parameters and assumptions used in the models. A reduction in smoking prevalence and consumption in excess of current trends would inevitably lead to future populations that are larger than projected by Statistics Canada's medium growth population projections. There is strong evidence that higher incomes increase the demand for tobacco products [13]. However, income growth in Canada is projected to be relatively low [28]. Consequently, income effects are unlikely to affect the above results. Our approach examines the effect of changes in tobacco excise rates on tobacco excise revenue and not on harmonized sales tax (HST), which is a non-tobacco-specific tax applicable to any taxable supplies in Canada, as ex-smokers and continuing smokers that reduce their consumption will very likely divert their spending towards goods and services that are also subject to HST. Our approach does not address the issue of tax avoidance such as brand switching, because governments in Canada rely entirely on tobacco-specific excise taxes and not on ad valorem taxes, which differ between brands of tobacco products. More broadly, the potential endgame interventions considered here are only a possible subset of innovative strategies that could change the landscape of tobacco control. For instance, this study does not consider the role of e-cigarettes, reduced nicotine, or structural changes to the tobacco industry. These other interventions may have a greater impact on smoking prevalence or health burden than the intervention set considered here.

### **5. Conclusions**

Simulation models project that increasing tobacco taxes would result in the greatest decrease in smoking prevalence, and that reducing smoking prevalence to "less than 5 by 35" through both non-tax interventions and excise tax increases would result in minimal impact on government tax revenue. However, despite a significant projected decrease in smoking prevalence, achieving "less than 5 by 35" might not be possible through the five key Tobacco Endgame strategies, either independently or combined.

**Author Contributions:** Conceptualization, M.C., G.E.G., R.S.; methodology, G.E.G., M.C.; formal analysis, G.E.G., J.D.; writing – original draft preparation, G.E.G.; writing – review and editing, M.C., J.D., G.E.G., R.S.; funding acquisition, R.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** Health Canada Substance Use and Addiction Program.

**Data Availability Statement:** Model for forecasting tax simulation available on request.

**Acknowledgments:** The Ontario Population model was developed for the Tobacco Endgame Summit and we appreciate the contributions of the steering committee and summit attendees.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

### **Appendix A**

**Table A1.** SimSmoke Model Predicted Smoking Prevalence, for Both Sexes, Ages 15–85, With and Without Tobacco Endgame Policies, Ontario, 2018–2035.


<sup>a</sup> Status quo represents the policy levels prior to the first projection year (2019). Note: Data table is for Figure 1.

### **References**

1. World Health Organization. *Report on implementation of the Framework Convention on Tobacco Control*; World Health Organization: Geneva, Switzerland, 2014.

2. Statistics Canada. Canadian Community Health Survey, 2014. Cansim Table 105-0501. Available online: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1310045101 (accessed on 13 March 2021).


### *Article* **Load Forecasting in an Office Building with Different Data Structure and Learning Parameters**

**Daniel Ramos 1,2, Mahsa Khorram 1,2, Pedro Faria 1,2 and Zita Vale 2,\***


**Abstract:** Energy efficiency topics have been covered by several energy management approaches in the literature, including participation in demand response programs where the consumers provide load reduction upon request or price signals. In such approaches, it is very important to know in advance the electricity consumption for the future to adequately perform the energy management. In the present paper, a load forecasting service designed for office buildings is implemented. In the building, using several available sensors, different learning parameters and structures are tested for artificial neural networks and the K-nearest neighbor algorithm. Deep focus is given to the individual period errors. In the case study, the forecasting of one week of electricity consumption is tested. It has been concluded that it is impossible to identify a single combination of learning parameters as different parts of the day have different consumption patterns.

**Keywords:** building energy management; forecast; neural network; SCADA; user comfort

### **1. Introduction**

Energy consumption forecasting is very important in the context of energy consumption management towards improved energy efficiency. The forecast's accuracy may be improved by retraining with a fixed-size training set, discarding older information while retaining new information. The selection of sensors from smart technologies is another aspect that provides more training data that are expected to decrease the forecast errors [1].

The electricity markets face possible generation costs caused by environmental issues [2,3]. Smart grids are implemented in many of these markets, supporting efficient energy use [4]. Solutions involving smart grids consist of an adequate consumer schedule aimed at reducing electricity consumption in particular periods [5]. These solutions become relevant when markets launch demand response programs that adjust the consumption schedule to reduce the electricity costs associated with consumption peaks [6].

Smart buildings play an important role in the electricity sector to satisfy occupants' electric needs and exploit operational flexibilities. Therefore, the launch of model optimization evidences the need to control the microgrids' power flows [7]. Dealing with this situation requires solutions from demand response programs, which reduce energy costs by using smart grid opportunities to readapt consumption, and which therefore play an important role in load management and energy efficiency [8].

The optimization of electrical energy is possible with data monitored from a measurement system that captures real-time data and automatic forecasting [9,10]. With regard to forecasting, several machine learning algorithms can be used [11–14]. An artificial neural network (ANN) is described by layers containing neurons with weighted connections starting in an input layer, at least one hidden layer, and an output layer [15]. An alternative technique, K-nearest neighbor (KNN), performs data searches and associations in a large resource space with non-linear mapping support [16].

**Citation:** Ramos, D.; Khorram, M.; Faria, P.; Vale, Z. Load Forecasting in an Office Building with Different Data Structure and Learning Parameters. *Forecasting* **2021**, *3*, 242–255. https://doi.org/10.3390/forecast3010015

Received: 30 January 2021 Accepted: 17 March 2021 Published: 20 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Various types of time-scaled forecast data may be found in the field of energy, with Short-Term Load Forecasting (STLF) being a good option. ANN is recommended for many short-term applications, including the prediction of daily peaks by using training data drawn from past years [17]. KNN is suggested for both classification and regression tasks, and in the suggested approach, it is used for regression problems that involve energy predictions. The reduction of data complexity is a relevant aspect of the algorithm, made possible by the nearest neighbors' readaptation to several subsets of data [18]. An even more innovative algorithm is suggested in [19], featuring a KNN-ANN model that uses the K-nearest neighbors process while adding a backpropagation function known to be a particular aspect of an artificial neural network (ANN). The application of the KNN-ANN model is suggested for a stock price prediction problem. The 2015 edition of the NPower Forecasting Challenge, described in [20], challenges the participants to perform daily energy predictions of a customer group. Several algorithms, including artificial neural network and Random Forest, are suggested. In another study, students' classification in algorithms like artificial neural network and Support Vector Machines is analyzed and their limitations are studied [21].

A research area of high interest is the energy efficiency of buildings, more specifically the power distribution network that connects the equipment to end-users. Energy efficiency is highlighted in several worldwide applications, including Supervisory Control And Data Acquisition (SCADA) and IoT systems [22]. These technologies allow the monitoring and management of consumption data in all types of buildings, from the residential to the commercial level. These data are relevant for the forecasting of data in the field of energy that are associated with electricity markets and policy formulations [23].

The forecasting of energy consumption with daily profile data usually improves consumers' financial outcome, since monthly electricity bills fall when the energy peaks detected in particular periods are reduced. The accuracy of energy forecasting algorithms depends on infrastructure and planning [24]. There are three ways to model an energy forecasting system mentioned in [25], including physics-based, data-driven, and hybrid models. While each has pros and cons, the data-driven method has been proven to be the best option for integrating buildings into smart grids. An additional factor that may improve forecast reliability is the use of sensor data, with each device, such as a smart meter, providing its own measurements [26,27]. The validation of forecasting models is another factor that should be taken into account in several smart buildings [28]. Real-time automatic energy forecasts with access to electric energy are recommended to be performed with data monitored in a building to achieve energy management optimization [29]. In [30], the component estimation technique is used for electricity consumption forecasts; historic consumption data were used. In [31], the impact of data quality on the electricity consumption forecast is discussed. The main focus is given to the dataset cleaning.

This paper provides a methodology to improve electricity consumption forecasting accuracy with sensor data measured by different devices, including presence, temperature, consumption, and humidity. The forecasting algorithms, namely ANN [32] and KNN [33], are implemented as a service and are the recommended options for the decision-making approaches to be used in the present paper. The innovative scientific aspect relies on the specific manipulations of data to overcome anomalies in data, including missing and excess occurrences. Second, the systematic analysis of different learning parameters is implemented to define the most relevant parameters in different periods of the day. This major aspect is usually treated in the literature by analyzing overall average forecasting errors without looking in detail at particular periods [34]. This aspect refers to a limitation in the recent literature, including the one published by the authors of the present paper in [1]. The forecasts are done for intervals (referred to as periods) of 5 min.

After this introduction, the proposed method is explained in Section 2, describing what is done at each stage. Proceeding to Section 3, the results of using the method are presented. The discussion is made in Section 4, and the main conclusions are presented in Section 5.

### **2. Materials and Methods**

This section illustrates and explains the different phases of a method. The parameterization definition, the data reduction, the training and forecasting tasks, and the error calculation are parts of the tasks presented in Figure 1. The presented method is very important to support a building's participation, namely an office building, in demand response programs [35]. Addressing consumer comfort, a SCADA system can make autonomous decisions for participation in demand response programs issued by the distribution network operator [36].

**Figure 1.** Proposed methodology diagram.

The innovative aspect of the present method is highlighted in green in Figure 1. As can be seen in the green arrow, the forecasting provides feedback to the training service regarding the accuracy of different learning parameters in different periods of the day. The test service is adapted to accommodate the fact that different periods of the day are related to different consumption patterns, so the test service must be run for each period. Different time frames are considered in the "Test service for different periods", namely: weekly Symmetric Mean Absolute Percentage Error (SMAPE) accuracy; daily SMAPE accuracy; period of day SMAPE accuracy; specific period accuracy. SMAPE is defined in Equation (3). The periods in a day for SMAPE in this paper are considered to be three periods: 00:00 to 08:00; 08:00 to 17:00; and 17:00 to 24:00.
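As an illustration of the three periods of the day used for SMAPE, the sketch below maps each 5 min period index within a day to its block. Assigning a period to a block by its start time, and the function name `day_part`, are assumptions of this sketch.

```python
from collections import Counter

PERIOD_MINUTES = 5
PERIODS_PER_DAY = 24 * 60 // PERIOD_MINUTES  # 288 periods of 5 min


def day_part(period_index: int) -> str:
    """Return the block of the day for a 5 min period index (0..287).

    A period is assigned by its start time (an assumption of this sketch).
    """
    if not 0 <= period_index < PERIODS_PER_DAY:
        raise ValueError("period_index must be in 0..287")
    start_minute = period_index * PERIOD_MINUTES
    if start_minute < 8 * 60:
        return "00:00-08:00"
    if start_minute < 17 * 60:
        return "08:00-17:00"
    return "17:00-24:00"


# Number of 5 min periods that fall in each block of the day.
counts = Counter(day_part(i) for i in range(PERIODS_PER_DAY))
print(dict(counts))
```

With this split, the three blocks contain 96, 108, and 84 periods, respectively.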

The tuning process performs parametrization of data required for later use on forecasting tasks with the support of analysis, studies, optimizations, and data manipulations. Two main aspects describe this process. The first one evaluates the data content analyzing the best possible forecasting technique that should provide better results in that specific situation. The second one performs data transformations to the initial dataset reducing the original version of data to a more accurate version fed by the forecasting technique that should provide more accurate forecasts. There is a balance between the completion and simplicity of data to avoid wrong interpretations. Therefore, data structure and reliability are two main aspects to improve the accuracy of the algorithm.

The real-time data consist of all monitored and persistent data that the building technologies track in the system more concretely with consumption and sensors data. The correlation process has the goal of analyzing which sensors are more associated with consumption. Both the tasks of providing a sample and the correlation study influence the participation towards reducing the dataset.

Despite reducing the dataset to the entire historic series, the same rules apply for real-time data. The forecasting methodology studies which technique is better for the sampling of data. Both the reduced version of the dataset and the forecasting method are sent to the training service.

The cleaning operation makes data more accurate for further use on forecasting tasks. It goes through several phases, starting with reorganizing all data in a unique spreadsheet with data split into several fields, including year, month, day of the month, days of the week, hours, and minutes. The criterion applied for missing information is to make sequential copies of previous records.
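The missing-data criterion described above (copying the previous record forward) could be sketched as follows. Representing a missing reading as `None` and leaving leading gaps unfilled are assumptions of this sketch, not details given in the paper.

```python
from typing import List, Optional, Sequence


def forward_fill(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing value (None) with the most recent known value.

    Leading missing values have no previous record and are left as None.
    """
    filled: List[Optional[float]] = []
    last_known: Optional[float] = None
    for value in values:
        if value is None:
            filled.append(last_known)
        else:
            filled.append(value)
            last_known = value
    return filled


readings = [None, 1200.0, None, None, 950.0]
print(forward_fill(readings))
```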

Outlier treatments are applied to detect erroneous readings made by the monitoring devices. Outlier detection relies on the mean and the standard deviation, as seen in Equations (1) and (2). A point is treated as an outlier when it lies outside the interval bounded by the average minus and plus the product of the error factor and the standard deviation. In the present paper, consumptions above 4800 W or below 300 W are considered outliers. These values have been established according to the authors' knowledge about building consumption.

$$A = \frac{\sum_{t=n-F}^{n} P(t)}{F} \tag{1}$$


$$S = \sqrt{\frac{1}{F} \times \sum_{t=n-F}^{n} \left(P(t) - A\right)^2} \tag{2}$$
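A minimal sketch of the outlier checks follows, using the mean $A$ and standard deviation $S$ from Equations (1) and (2). It assumes the window passed in holds exactly the $F$ readings being averaged; the error factor value of 2.0 and the function names are hypothetical. Only the 300 W and 4800 W limits come from the text.

```python
import math
from typing import Sequence, Tuple


def mean_and_std(window: Sequence[float]) -> Tuple[float, float]:
    """Mean A and population standard deviation S over a window of F readings."""
    f = len(window)
    a = sum(window) / f
    s = math.sqrt(sum((p - a) ** 2 for p in window) / f)
    return a, s


def outside_band(value: float, window: Sequence[float], error_factor: float = 2.0) -> bool:
    """True if value lies outside [A - e*S, A + e*S]; e = 2.0 is a hypothetical choice."""
    a, s = mean_and_std(window)
    return value < a - error_factor * s or value > a + error_factor * s


def outside_fixed_limits(value: float, low: float = 300.0, high: float = 4800.0) -> bool:
    """True if value is below 300 W or above 4800 W (limits stated in the text)."""
    return value < low or value > high


window = [1000.0, 1100.0, 900.0, 1000.0]
print(outside_band(1200.0, window), outside_fixed_limits(5000.0))
```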


The service ends by extracting the cleaned data into a suitable structure that is understandable by the forecasting technique.

The forecast service is triggered the first time after the end of the training service. There are alternative triggers, including testing requests or the scheduling of a new iteration after the error calculation process. The forecasting service reads the test parameters that are synchronous with each iteration with the support of a schedule that forecasts different contexts according to the forecasting technique [11–16] determined in the tuning service representing the total target consumptions. The test service is triggered the first time by default after the forecasting service ends. The goal of this service is to calculate the forecasting errors in each context, which express how distant the actual value is from its forecast counterpart. The errors are calculated based on three possible metrics: Weighted Absolute Percentage Error (WAPE), Symmetric Mean Absolute Percentage Error (SMAPE), and Root Mean Square Percentage Error (RMSPE). This paper highlights the use of SMAPE, as seen in Equation (3), as it has been identified as the adequate one for this application [37].

$$SMAPE = \frac{1}{F} \ast \sum_{t=n-F}^{n} \frac{|PF(t) - P(t)|}{(P(t) + PF(t))/2} \tag{3}$$
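Equation (3) can be sketched in code as follows, with `actual` standing for $P(t)$ and `forecast` for $PF(t)$. The sketch averages over the number of points supplied, which is assumed to correspond to $F$, and it does not guard against a zero denominator, which cannot occur for strictly positive consumption values.

```python
from typing import Sequence


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Symmetric Mean Absolute Percentage Error, returned as a fraction.

    Each term is |PF(t) - P(t)| divided by the mean of P(t) and PF(t).
    """
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and of equal length")
    total = 0.0
    for p, pf in zip(actual, forecast):
        total += abs(pf - p) / ((p + pf) / 2.0)
    return total / len(actual)


# Illustrative values only, not data from the case study.
error = smape([100.0, 200.0], [110.0, 180.0])
print(f"SMAPE = {error:.4f}")
```

Applying the same function to the subsets of periods that fall in each block of the day, each day, or the whole week gives the period-of-day, daily, and weekly SMAPE values discussed in the text.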


Following this, a trigger is activated, sending a new retrain request [1] to rerun the training service with more updated information, discarding the oldest data while retaining the new data collected up to the trigger point, so that the size of the training data stays the same. In the present paper, artificial neural network (ANN) and K-nearest neighbor (KNN) forecasting algorithms are used [23]. ANN features a set of artificial neurons connected and structured in layers with a learning process that resembles the biological brain. The layer structure consists of an input and an output layer separated by a hidden layer that performs calculations iteratively, learning a logic that associates the input with the output data. The neurons transmit data to other neurons with signals according to the edges and layers' structures. The data received from the neurons are propagated afterward to other neurons following a process where the output of each neuron is computed through a non-linear function of the sum of inputs. All the combinations composed of neurons and edges are associated with a weight that adjusts during the learning process [15]. An alternative technique, K-nearest neighbor (KNN), performs data searches and associations in a large resource space with the support of non-linear mapping. This alternative is a method used both for classification and regression applications. In both cases, the input consists of different subsets named neighbors described by the historical data's closest examples.
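The fixed-size retraining window mentioned at the start of the paragraph above can be sketched with a bounded queue: appending new records automatically discards the oldest ones, so the training set keeps the same size. The class name and the tiny window size are hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Iterable, List


class FixedSizeTrainingSet:
    """Training records kept in a sliding window of constant maximum size."""

    def __init__(self, history: Iterable[float], size: int) -> None:
        # deque with maxlen keeps only the most recent `size` records.
        self._rows = deque(history, maxlen=size)

    def add(self, new_rows: Iterable[float]) -> None:
        """Append new records; the oldest ones drop out once the window is full."""
        self._rows.extend(new_rows)

    def rows(self) -> List[float]:
        return list(self._rows)


training = FixedSizeTrainingSet([1.0, 2.0, 3.0, 4.0], size=4)
training.add([5.0, 6.0])   # records arriving before the retrain trigger
print(training.rows())
```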

The output differs from the classification and regression applications following different logics. For classification, the output consists of a class component that associates the nearest neighbor with the most common features. For regression, the output consists of a property of an object value calculated through the average of the set of nearest neighbors [16]. In [1] and [15], the authors have explored using different algorithms in the forecasting of office building consumption, namely ANN, KNN, Random Forest, and SVM. It has been concluded that ANN and KNN are adequate for the specific application under study in this paper. Other deep learning and ensemble learning algorithms can be explored in future work. Nonetheless, the present paper's main idea is to show that different algorithms can be more advantageous in different periods of the day or the week.
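The regression use of KNN described above, where the output is the average of the nearest neighbors, could look like the following sketch. The Euclidean distance, the value of `k`, and the toy data are assumptions for illustration; the paper does not specify them here.

```python
import math
from typing import List, Sequence


def knn_regress(train_x: Sequence[Sequence[float]],
                train_y: Sequence[float],
                query: Sequence[float],
                k: int = 3) -> float:
    """Predict by averaging the targets of the k closest training examples."""
    if not 0 < k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training examples")
    # Pair each Euclidean distance with its target, then keep the k smallest.
    by_distance: List = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k


# Toy data: one feature per example.
xs = [(0.0,), (1.0,), (2.0,), (10.0,)]
ys = [0.0, 10.0, 20.0, 100.0]
print(knn_regress(xs, ys, (1.1,), k=2))
```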

### **3. Results**

This section presents the case study, including scenarios and the respective results. The building's historical data have been used as input data, and the building has been divided into three zones [1]. In Figure 2, the topology of the building can be seen, with the respective three zones and the nine rooms (R1 to R9). The detail of Zone 1 is shown in the bottom-right of Figure 2. The zones of the building have been defined according to the sub-metering installed in the building, which matches the electrical switchboard coverage zones. In this way, the sensor data and consumption data are aggregated according to these zones. For this case study, the historical data of Zone 1 are selected. The selected historical data span the period from 22 May 2017 to 17 November 2019 with 5 min time intervals. It should be noted that the building is equipped with energy meters to record the consumption data and PV generation data as well. Additionally, there are different building sensors such as seven light power indicators, four movement sensors, three door status indicators, one air quality sensor, one temperature sensor, one humidity sensor, and one CO2 sensor.

The input data are a matrix structure composed of twelve columns holding attributes associated with specific five-minute periods. A total of 262,060 rows, the number of observations from 22 May 2017 to 17 November 2019, are separated by five-minute intervals. The historic dataset, from 22 May 2017 to 8 November 2019, contains 260,054 rows, while the target week, from 11 to 17 November, contains 2006 rows. The initial ten columns identify consumption values, while the remaining two identify additional values obtained from enhanced sensor data, more specifically CO2 and light intensity. The ten consumption inputs are the five-minute values that precede the output, corresponding to a period of fifty minutes. The CO2 and light intensity inputs are each a single value taken from the five minutes preceding the output consumption. This dataset has been organized by week, so the time period under study includes 130 weeks. Figures 3–5 show the building's input data over the 130 weeks for power consumption, CO2 concentration, and light intensity, respectively. Each line represents the consumption data of one specific week in 2016 periods (5 min time interval).
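The twelve-column input structure described above (ten preceding consumption values plus the CO2 and light intensity values from the preceding five-minute period) could be assembled as in the sketch below. The function name, the column order, and the toy series are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def build_rows(consumption: Sequence[float],
               co2: Sequence[float],
               light: Sequence[float],
               n_lags: int = 10) -> List[Tuple[List[float], float]]:
    """Build (features, target) pairs from aligned five-minute series.

    Features: the n_lags consumption values preceding the target period,
    followed by the CO2 and light values of the immediately preceding period.
    """
    if not (len(consumption) == len(co2) == len(light)):
        raise ValueError("all series must have the same length")
    rows = []
    for t in range(n_lags, len(consumption)):
        features = list(consumption[t - n_lags:t]) + [co2[t - 1], light[t - 1]]
        rows.append((features, consumption[t]))
    return rows


# Toy series of 12 periods, so two rows can be formed.
cons = [float(i) for i in range(12)]
co2_series = [400.0 + i for i in range(12)]
light_series = [50.0 + i for i in range(12)]
rows = build_rows(cons, co2_series, light_series)
print(len(rows), len(rows[0][0]))
```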

**Figure 2.** Building zones.

**Figure 3.** Power consumption of building from 22 May 2017 to 17 November 2019 is categorized based on the weeks.

**Figure 4.** CO2 concentration data from 22 May 2017 to 17 November 2019 are categorized based on the weeks.

**Figure 5.** Light intensity data from 22 May 2017 to 17 November 2019 are categorized based on the weeks.

Several other environment data and parameters, such as the weather data, can impact the forecasting model's accuracy; the authors have discussed this in [1]. It has been concluded that, for the office building under study, as the researchers have a very specific routine, weather data do not contribute to improving the accuracy of the forecasting. This case study's main purpose is to forecast the consumption of 7 days based on the proposed training dataset. Additionally, 60 scenarios have been tested on different parameters such as number of entries, learning rate, number of neurons, clipping ratio, epochs, early stopping, and validation split on the forecasting results. Figure 6 shows the real consumption of 7 days of the test dataset. It should be noted that each day includes 288 periods (5 min interval), and each color represents one day.

**Figure 6.** Actual power consumption of 7 days of the week with 5-minute time intervals.

The CO2 concentration and light intensity are presented in Figures 7 and 8, respectively, to show the real data for the last week.

Table 1 introduces the characteristics of 60 scenarios with different parameters. Additionally, the calculated error of each forecast can be seen on the right side of the table for the ANN and KNN approaches. As shown in Table 1, the calculated errors are ranked by color from dark to bright, so that dark green cells show the lower errors and white cells show the higher errors. To present the details of these error calculations, three scenarios (A, B, and C) have been selected to be illustrated by figures. The characteristics of these three cases can be seen in Table 1. The characteristics of scenarios A and C are equal. However, the applied techniques for the forecast are different.

**Figure 7.** CO2 concentration data of 7 days of the week with 5 min time intervals.

**Figure 8.** Light intensity data of 7 days of the week with 5 min time intervals.

**Table 1.** Error calculation based on artificial neural network (ANN) and K-nearest neighbour (KNN) approaches for 60 different scenarios.


\* Scenario A; \*\* Scenario B; \*\*\* Scenario C.

Each scenario covers seven days and is shown in three figures, one per time window. Figure 9 covers the 96 periods from 00:00 to 08:00 (5 min time interval), Figure 10 covers the 108 periods from 08:00 to 17:00, and Figure 11 covers the 84 periods from 17:00 to 24:00. These three figures relate to scenario A. Appendix A presents the corresponding figures for scenario B (Figures A1–A3) and scenario C (Figures A4–A6). The values selected for each parameter were defined by the authors based on experiments over the ranges of each parameter that affect the forecasting results. Additionally, the authors wanted to determine whether using the day-of-the-week information as input data improves the accuracy.

Figure 9 presents the calculated SMAPE of scenario A in the first part of the day: 96 periods of 5 min are presented, related to the period between 00:00 and 08:00.

Each period of 5 min includes seven points in the graph, corresponding to the consumption for seven days of the week. Figure 10 presents the calculated SMAPE of scenario A in the second part of the day (from 08:00 to 17:00). Figure 11 presents the calculated SMAPE of scenario A in the third part of the day.
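The split of a 288-period day into the three windows used in Figures 9–11 can be sketched as follows. This is only an illustration: the SMAPE definition used here (absolute error divided by the mean of the absolute actual and forecast values, in percent) is an assumed common form and may differ from the one defined earlier in the paper, and the function names and the synthetic day are hypothetical.

```python
PERIODS_PER_DAY = 288  # 24 h at a 5 min resolution

# Day-part windows as half-open [start, end) period indices.
WINDOWS = {
    "00:00-08:00": (0, 96),     # 96 periods
    "08:00-17:00": (96, 204),   # 108 periods
    "17:00-24:00": (204, 288),  # 84 periods
}


def smape(actual, forecast):
    """SMAPE in percent for one (actual, forecast) pair (assumed definition)."""
    denom = (abs(actual) + abs(forecast)) / 2.0
    if denom == 0.0:
        return 0.0  # both values are zero, so there is no error
    return 100.0 * abs(forecast - actual) / denom


def window_means(actual_day, forecast_day):
    """Mean SMAPE of one 288-period day inside each day-part window."""
    if len(actual_day) != PERIODS_PER_DAY or len(forecast_day) != PERIODS_PER_DAY:
        raise ValueError("expected 288 periods per day")
    errors = [smape(a, f) for a, f in zip(actual_day, forecast_day)]
    return {name: sum(errors[s:e]) / (e - s) for name, (s, e) in WINDOWS.items()}


# Synthetic placeholder day: flat consumption, with a forecast that is
# 10% too high during office hours only.
actual_day = [1000.0] * PERIODS_PER_DAY
forecast_day = [1100.0 if 96 <= p < 204 else 1000.0 for p in range(PERIODS_PER_DAY)]
means = window_means(actual_day, forecast_day)
for name, value in means.items():
    print(name, round(value, 3))
```

Applying `window_means` to each of the seven test days gives the seven points per period that the figures plot.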

**Figure 9.** Forecast errors based on ANN approach in scenario A from 00:00 to 08:00.

**Figure 10.** Forecast errors based on ANN approach in scenario A from 08:00 to 17:00.

**Figure 11.** Forecast errors based on ANN approach in scenario A from 17:00 to 24:00.

The discussion of the results obtained will be presented in Section 4, focusing on the results already presented and Appendix A.

Regarding the error analysis for each day, Table 2 presents the SMAPE errors for each method. The data used in Table 2 relate to the following configuration: ten entries, learning rate (0.005), number of neurons in intermediate layers (64), clipping ratio (5.0), number of epochs (500), early stopping (20), and validation split (0.2). The day of the week is not considered.

**Table 2.** SMAPE of ANN and KNN methods for each day.


It can be seen that, for every single day, ANN provides the more accurate forecast. However, as can be seen in the period-by-period analysis, KNN can have better accuracy in specific periods of the day or week.

### **4. Discussion**

Looking at Figures 9–11 and Figures A1–A6, it is possible to see that no single method with the same parameters is the most accurate for all periods. Focusing on the first period of the day, from 00:00 to 08:00, it can be seen that scenario C is the one with the highest dispersion of SMAPE for each period. Looking at Table 1, scenario C is the one with the highest SMAPE among the three scenarios. However, for the period between 08:00 and 17:00, scenario C's results are not the worst ones, mainly compared with scenario A (Figures 9 and A2). Finally, regarding the third part of the day, from 17:00 to 24:00, scenario C is the worst one. Scenario B behaves regularly throughout this period. However, scenario A is the best one in the last third of this period. Comparing ANN and KNN, it can be seen that it is impossible to decide on the best one, as scenario C is very accurate in a specific period of the day.

It has been found that, generally, the number of entries should be 10, as increasing the number of entries does not provide better results. Regarding the learning rate, it has been found that lower learning rates were more accurate in the results. The same comment applies to the number of neurons. Regarding the clipping ratio and the epochs, the early stopping, the validation split, and the days of the week, it is not possible to make a selection, as both values provide good results in different scenarios.

These results and discussion lead us to conclude that the definition of the ANN and KNN features must be done contextually, as different contexts bring different consumption patterns, and therefore, deserve different configurations in algorithms.

### **5. Conclusions**

This paper has presented a forecasting service used in an office building aiming to support decisions regarding energy management towards efficiency. Two algorithms for forecasting have been used, namely artificial neural network and K-nearest neighbor, testing different algorithms and data features. It has been found that, for different periods of the day, which means different contexts regarding consumption patterns, different algorithm parameters can have higher accuracy levels. This means that it is not possible to say that a single algorithm is more accurate for the office building under study. In other words, one should select KNN for some periods of the day and ANN for other periods of the day, as discussed in Section 4.

**Author Contributions:** Conceptualization, P.F., and Z.V.; methodology, P.F., Z.V.; software, D.R., M.K.; validation, D.R., P.F., Z.V.; formal analysis, D.R.; investigation, D.R., M.K., P.F., Z.V.; resources, P.F., Z.V.; data curation, D.R., M.K., P.F., Z.V.; writing—original draft preparation, D.R., M.K., P.F., Z.V.; writing—review and editing, D.R., M.K., P.F., Z.V.; visualization, D.R., M.K., P.F.; supervision, P.F., Z.V.; project administration, P.F., Z.V.; funding acquisition, P.F., Z.V. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work has received funding from FEDER Funds through COMPETE program and from National Funds through (FCT) under the projects UIDB/00760/2020, MAS-Society (PTDC/EEI-EEE/28954/2017) and CEECIND/02887/2017.

**Data Availability Statement:** The data used in this study are available in [1].

**Conflicts of Interest:** The authors declare no conflict of interest.

### **Appendix A**

This appendix presents six figures that are added to the results.

**Figure A2.** Forecast errors based on ANN approach in scenario B from 08:00 to 17:00.

**Figure A3.** Forecast errors based on ANN approach in scenario B from 17:00 to 24:00.

**Figure A4.** Forecast errors based on the KNN approach in scenario C from 00:00 to 08:00.

**Figure A5.** Forecast errors based on the KNN approach in scenario C from 08:00 to 17:00.

**Figure A6.** Forecast errors based on the KNN approach in scenario C from 17:00 to 24:00.

### **References**


### *Article* **A Model Predictive Control for the Dynamical Forecast of Operating Reserves in Frequency Regulation Services**

**Pavlos Nikolaidis \* and Harris Partaourides**

Department of Electrical Engineering, Cyprus University of Technology, P.O. Box 50329, 3603 Limassol, Cyprus; c.partaourides@cut.ac.cy

**\*** Correspondence: pavlos.nikolaidis@cut.ac.cy; Tel.: +357-25-002-041; Fax: +357-25-002-635

**Abstract:** The intermittent and uncontrollable power output of the ever-increasing renewable energy sources requires large amounts of operating reserves to retain the system frequency within its nominal range. Based on day-ahead load forecasts, many research works have proposed conventional and stochastic approaches to define their optimum margins for reliability enhancement at reasonable production cost. In this work, we aim at delivering real-time load forecasting to lower the operating-reserve requirements based on intra-hour weather update predictors. Based on critical predictors and their historical data, we train an artificial model that is able to forecast the load ahead with great accuracy. This is a feed-forward neural network with two hidden layers, which performs real-time forecasts with the aid of a model predictive control developed to update the recommendations intra-hourly; by assessing the impact of the updates and its significance on the output target, it corrects the imposed deviations. Performing daily simulations for an annual time horizon, we observe significant improvements in terms of decreased operating reserve requirements to regulate the violated frequency. In fact, these improvements can exceed 80% during specific months of winter when compared with robust formulations in isolated power systems.

**Keywords:** renewable energy sources; load forecasting; frequency regulation; artificial neural network; model predictive control

### **1. Introduction**

The power generation sector has seen rapid growth, mainly due to increasing industrialization, domestic appliances and transportation demand [1]. The global challenge for modern power systems is to satisfy the growing electricity demand, whilst supplying uninterruptible and high-quality services. For several years now, this requirement has been fulfilled mostly by using fossil fuels because of their concentrated energy, which makes their output dispatchable and easy to adjust according to the load needs [2]. Based on well-known load curves, the system operators could plan ahead adequate operating reserves to allow for deviation corrections between the expected and actual load demand. However, the continuous burning of fossil fuels poses a serious threat to the global environment and drives climate change, calling for emission-free and renewable energy sources in the forthcoming years.

On the other hand, the introduction of renewable power generation produces a number of critical changes in the unit commitment and economic dispatch problem formulation. The intermittent and volatile behavior of renewable resources imposes further variations on net demand and thus, the operating reserves must be carefully scheduled. In addition, their uncontrollable and unpredictable power output increases the reserve requirements, and probable deficits are reflected as frequency deviations from the nominal value. Consequently, the simultaneous increase in electricity demand and reduction in contributions of conventional sources create a lot of power integration and fluctuation issues, which undoubtedly disturb the overall system security, stability and reliability. Since the renewable energy sources do not contribute to flexibility, at a relatively low penetration level, they are commonly treated as negative loads providing fluctuations comparable with the existing net load fluctuations. As their penetration level grows, the conventional generating units become inadequate for load following [3]. Over the last decade, researchers have extensively applied conventional and stochastic optimization techniques to define the optimal operating reserve margins and enhance the overall system reliability at reasonable costs. Based on predefined load curves, the various approaches broadly used can be divided into robust, deterministic and stochastic. The deterministic formulations recommend constant shares to represent the forecast errors in load demand. Without investigating the comparative performance of different risk considerations, the deterministic approaches rely solely on a set of uncertain parameters, offering poor reliability/cost trade-offs. To strengthen the robustness, a conservative formulation may propose a 5% upward and downward deviation space, while more robust approaches involve up to 10% margins for islanded systems [4]. More recently, a variety of solutions have relied on stochastic mechanisms, distinguishing the formulations into random scenario reduction, distributionally robust and uncertainty-set classifications [5,6]. Aiming at the minimization of the expected cost over a probability distribution that is represented by scenarios, these frameworks are versatile [7,8]. However, they require significant computational effort, and it is difficult to retrieve temporal and spatial correlations within scenario trees [9].

**Citation:** Nikolaidis, P.; Partaourides, H. A Model Predictive Control for the Dynamical Forecast of Operating Reserves in Frequency Regulation Services. *Forecasting* **2021**, *3*, 228–241. https://doi.org/10.3390/forecast3010014

Received: 25 February 2021 Accepted: 16 March 2021 Published: 17 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The vast majority of the literature in related fields concentrates on household or small-area level load forecasting (i.e., distribution transformer) due to the significantly limited availability of regular patterns. In their effort to address the imposed uncertainty, the existing methods can be divided into three main categories. The methods of the first category make use of clustering or classification techniques to correlate similar customers, day types or weather conditions, targeting the reduction of uncertainty variance [10]. A second category focuses on the elimination of the imposed uncertainties at the meter level by utilizing aggregated smart-meter data [11], whereas the rest of the methods fall in the last category and refer to uncertainty separation within the regular patterns, relying on spectral analysis such as Fourier transformation, wavelet and empirical mode decomposition [12]. Beyond the aggregated level, load forecasting methods are based on sophisticated mechanisms and machine learning techniques. A tutorial review of probabilistic electric load forecasting is provided in [13]. The authors in [14] presented a comparison between hybrid and artificial intelligence models including support vector machines, expert systems, fuzzy logic, regression trees and artificial neural networks, while the notable time series models of long short-term memory (LSTM) systems, recurrent (RNN) and convolutional (CNN) neural networks combined with different regression techniques are discussed in [15]. Being highly flexible and effective, RNN-based approaches outperform traditional forecasting models in terms of root mean square error (RMSE) and mean absolute percentage error (MAPE) [16,17].

The existing methods aim at day-ahead forecasts or make use of RNN systems only to minimize the forecast error against the actual load. To the best of our knowledge, there has not yet been a comprehensive solution that targets real-time forecasts to improve the performance using updated input values. Most approaches utilize temperature as the only weather-dependent variable, and no research work has targeted the real-time estimation of reserve margins. In this work, we propose a radically different framework to determine the operating reserves based on a real-time load forecast. Identifying their vital role in day-ahead power optimization tasks, we aim at the dynamical update of the predefined daily demand based on a model predictive control. Specifically, we make use of independent input predictors to achieve the dependent target, namely the daily load. Based on annual data with respect to some selected predictors, we train a neural network via non-linear regression. During the particular day, the updated values of the predictors are assigned to the model, which assesses their impact and its significance on the output target and re-uses them to estimate the new demand ahead. Together with the power balance, they constitute a system-wide constraint that affects the overall system security and total achieved production cost. The obtained results show that significant improvements exist in terms of decreased operating reserve requirements. Considering the performance of the trained neural network, the determined operating reserves account for the mean squared error (MSE) and the actual deviation of the selected predictors. Based on real-time updates, the load forecasting can achieve lower costs, while the system security is preserved.

The rest of the paper is organized as follows. The following section includes the problem formulation and the importance of accurate reserve definition. Section 3 deals with the methodology followed to develop the proposed, real-time load forecast model. All precise descriptions in relation with the different models used are included. In addition, the considered test system is presented along with the main parameters used for predictions. In Section 4, the realizations of our solution are presented and their findings are discussed in detail, while the obtained improvements are listed by their relevance. Finally, the conclusions are drawn in Section 5.

### **2. Problem Formulation**

In order to achieve a comprehensive view regarding the impact of operating reserves on total generation cost, we first define the generic objective function of unit commitment task with the aid of Equation (1).

$$f = \min \sum_{t=1}^{T} \sum_{i=1}^{N} \left[ F\left(P_i^t\right) + \left(1 - U_i^{t-1}\right) SU_i \right] U_i^t \tag{1}$$

Denoting the total number of time intervals by *T* and the total number of available generating units by *N*, the power contribution of a generator *i* during the time slot *t* is expressed via *P<sub>i</sub><sup>t</sup>*. *U<sub>i</sub><sup>t</sup>* defines whether a generator is "on" or "off" during that interval, whereas the start-up cost is represented by *SU<sub>i</sub>*. The power balance constraint is provided in Equation (2). In general, the summed power of the committed units must satisfy the load demand *P<sub>d</sub>* [18]. Each deviation from the exact power balance (zero mismatch) violates the nominal frequency (50 or 60 Hz) of the system according to Newton's second law, Equation (3).

$$\sum_{i=1}^{N} U_i^t \cdot P_i^t = P_d^t \tag{2}$$

$$T_m - T_e = J \frac{d\omega}{dt} \tag{3}$$

In case of an imbalance between the mechanical torque *Tm* and electrical torque *Te*, the rotating mass will experience an angular acceleration or deceleration *dω*/*dt*, which is reflected as a change in frequency. It is noted that the frequency change is smaller for a system with high inertia (*J*) compared to a system with low inertia [19]. To guarantee the system stability, different reserve types are needed according to their time of response. For clarification purposes, we express the equation of motion (4) in power terms so that *P* = *T* · *ω* is preserved.

$$P_m - P_e = M \frac{d\omega}{dt} \tag{4}$$

where *M* = *J* · *ω* is the angular momentum of the rotating system. Turning to the minimum technical and operational characteristics with which each user connected to the Transmission System must comply, the frequency range during normal conditions is 49.8–50.2 Hz, and it can be extended to 47–52 Hz during disturbances. A disturbance event is defined as an incident that causes deviations equal to or greater than 0.5 Hz from the nominal *f<sub>o</sub>*. The operating reserves are separated into spinning and non-spinning. Spinning reserves act first and are derived from the units synchronized to the system [20]. They include the restraint and recovery reserves, which are available within 3 and 20 s and operable for 20 s and 20 min, respectively. Following are the supplemental and replacement reserves, which need to be available for 6 h. A last category involves the contingency reserves, which are operable within 6–24 h. These categories fall in the non-spinning reserve classification. Day-ahead schedules must satisfy a further system-wide coupling constraint, namely the spinning reserve margins *SR<sup>t</sup>*. The formulation of such inequality constraints (both upward *SR<sub>up</sub>* and downward *SR<sub>down</sub>*) is expressed via the following respective equations:

$$\sum_{i=1}^{N} U_i^t \cdot P_{i,\max}^t \ge P_d^t + SR_{up}^t \tag{5}$$

$$\sum_{i=1}^{N} U_i^t \cdot P_{i,\min}^t \le P_d^t - SR_{down}^t \tag{6}$$

where *Pi*,*min* and *Pi*,*max* denote the minimum and maximum capacity limits of each generator *i*. Assuming a robust formulation with SR margins in the order of 10% of the instant load, it is worth noting that this expensive requirement forces more generators to start-up, leading to sub-optimal unit commitment schedules and uneconomic power dispatch.
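To make the effect of the reserve margin on commitment concrete, the check below evaluates inequalities (5) and (6) for a single time slot. It is a minimal sketch: the three-unit system, its capacity limits and the 230 MW load are hypothetical values chosen for illustration and are not taken from the test system.

```python
def reserve_margins_ok(committed, p_min, p_max, load, sr_up, sr_down):
    """Check inequality constraints (5) and (6) for a single time slot.

    committed: 0/1 commitment flags U_i^t, one per generator.
    p_min, p_max: capacity limits of each generator in MW.
    """
    online_max = sum(u * p for u, p in zip(committed, p_max))
    online_min = sum(u * p for u, p in zip(committed, p_min))
    upward_ok = online_max >= load + sr_up        # Equation (5)
    downward_ok = online_min <= load - sr_down    # Equation (6)
    return upward_ok and downward_ok


# Hypothetical three-unit system; all figures are illustrative, in MW.
P_MIN = [50.0, 40.0, 20.0]
P_MAX = [130.0, 120.0, 60.0]
LOAD = 230.0

for share in (0.05, 0.10):
    margin = share * LOAD
    two_units = reserve_margins_ok([1, 1, 0], P_MIN, P_MAX, LOAD, margin, margin)
    three_units = reserve_margins_ok([1, 1, 1], P_MIN, P_MAX, LOAD, margin, margin)
    print(f"{share:.0%} margin: two units ok={two_units}, three units ok={three_units}")
```

With these numbers, two committed units satisfy a 5% margin, whereas a 10% margin can only be met by starting the third unit, which is the mechanism behind the additional start-up cost described above.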

To lower the expensive spinning-reserve requirements, we propose the intra-daily forecast of load demand. In contrast to day-ahead estimations, which may deviate from real-time values, an intra-daily forecast with 15 min updates of selected predictors may improve the accuracy and, consequently, reduce the required reserves. Electricity load follows daily patterns, which are repeated according to human activity and weather conditions. In this regard, we exploit an accurate hours-ahead system for load forecast using neural networks. Our purpose is to enhance the system security and reliability, whilst minimizing the SR requirements by making use of a model predictive control, which performs updates every 15 min to supply the neural network. In more detail, a number of predictors *x* are imported in the feed-forward network along with the target *y* to form our data set {(*x<sub>i</sub>*, *y<sub>i</sub>*) | *i* = 1, . . . , *n*}. The model is trained using the largest share of the historical data, while the rest is equally distributed between validation and testing. The developed model exploits a two-hidden-layer neural network, employed as follows:

$$h_1 = \sigma\left(\sum_{k=1}^{K} w_k x_k + \beta_1\right) \tag{7}$$

$$h_2 = \sigma\left(\sum_{l=1}^{L} w_l h_{1,l} + \beta_2\right) \tag{8}$$

$$y = \sum_{m=1}^{M} w_m h_{2,m} + \beta_y \tag{9}$$

where *σ*(·) is the sigmoid activation function and *h* the output of the hidden layers. *K*, *L*, *M* are the number of predictors, neurons at the first and second hidden layer, respectively [21]. Figure 1 depicts a graphical representation of the proposed network.
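The forward pass of Equations (7)–(9) can be sketched in plain Python as below. The sketch assumes the usual reading in which every hidden neuron has its own weight vector and bias (the equations show a single neuron per layer); the names `dense`, `forward` and `random_net`, as well as the random initial weights, are illustrative and not the authors' implementation. The sizes of eight predictors and 20 neurons per hidden layer follow the settings reported later in the paper.

```python
import math
import random


def sigmoid(z):
    """Sigmoid activation used in both hidden layers."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(weights, biases, inputs, activation=None):
    """One fully connected layer: a weighted sum plus bias for each neuron."""
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * v for w, v in zip(row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs


def forward(x, net):
    """Two sigmoid hidden layers followed by a linear output, Equations (7)-(9)."""
    h1 = dense(net["W1"], net["b1"], x, sigmoid)
    h2 = dense(net["W2"], net["b2"], h1, sigmoid)
    return dense([net["w_y"]], [net["b_y"]], h2)[0]


def random_net(k, l, m, seed=0):
    """Small random weights for K inputs, L and M hidden neurons (untrained)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(l, k), "b1": [0.0] * l,
        "W2": matrix(m, l), "b2": [0.0] * m,
        "w_y": matrix(1, m)[0], "b_y": 0.0,
    }


net = random_net(k=8, l=20, m=20)
print(forward([0.5] * 8, net))  # output of an untrained network, for shape checking only
```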

During the realization of power dispatch, the selected predictors *x*˙(*t*) re-enter the forecast model at *t* and the remaining *T* − *t* sequence is updated based on the model predictive control explained as follows:

$$I = \sum_{j=t}^{T} w_{x_p} \left[ r_p(j) - x_p(j) \right]^2 + \sum_{j=t}^{T} w_y \left[ \Delta y(j) \right]^2 \tag{10}$$

The predicted parameters *r<sub>j</sub>* constitute the reference of the model, and each deviation from the actual values is recursively corrected to minimize *I*. Δ*y* indicates the impact of the actual deviation on the new, forecasted values when *x<sub>j</sub>* are reused for load forecast. The significance of Δ*y* is regulated by penalizing with *w<sub>y</sub>*, while *w<sub>x</sub>* reflects the importance of each selected predictor *p*. Finally, the equality constraint Σ*w* = 1 must be preserved [22,23].

**Figure 1.** Proposed Neural Network.

### **3. Test System and Methodology**

The considered system concerns the isolated power community of the island of Cyprus. This is a representative, small-to-medium scale network consisting of 20 generators to supply a 1100 MW peak demand (usually occurring in July) with an annual load factor of 56% [24]. Due to its isolation, small area and remoteness, electricity supply for the more than 875 thousand inhabitants of the island mainly relies on imported fossil fuels, the price of which is 3–4 times higher than that in the mainland [4]. As a result, the extremely high SR requirements of up to 10% of the hourly load critically increase the total production cost. To decide which predictors to include in our forecaster, we first tried to extract a physical relationship between them and our target, namely the load demand. Based on actual data obtained from the Cyprus Energy Regulatory Authority (CERA), we demonstrate the hourly load for a representative week for each season in Figure 2.
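As a rough orientation, the stated peak demand and load factor imply the following average load and annual energy. The calculation assumes the standard definition of load factor (average load divided by peak load) and a non-leap year; the derived values are back-of-the-envelope figures, not numbers reported by the authors.

```python
HOURS_PER_YEAR = 8760  # non-leap year

peak_mw = 1100.0     # peak demand stated for the test system
load_factor = 0.56   # annual load factor stated for the test system

# Load factor = average load / peak load (assumed standard definition).
average_mw = load_factor * peak_mw
annual_energy_gwh = average_mw * HOURS_PER_YEAR / 1000.0

# A robust 10% spinning-reserve margin evaluated at the peak hour.
robust_reserve_at_peak_mw = 0.10 * peak_mw

print(round(average_mw, 1), round(annual_energy_gwh, 1), round(robust_reserve_at_peak_mw, 1))
```

Under these assumptions the average load is about 616 MW, and a 10% margin at the peak hour ties up about 110 MW of capacity.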

**Figure 2.** Weekly load demand per season.

Apart from the seasonality and human activity, similar patterns have been observed within the same periods of different years. Accordingly, we choose to express the seasonality by the hour and date, whereas the human activity is represented through the day-type. The repetition of this activity is captured with the aid of three further predictors, namely the daily load of the previous day, week and year. These six predictors form our constant parameters. In Figures 3–5, we provide the fluctuation of temperature and relative humidity, which are our two further, variable predictors. Figures 3 and 4 show an hourly histogram relating to the year 2019, while their seasonal values are offered in Figure 5. As can be seen, they both present non-linear relations with time and, in order to make easy and accurate predictions, a better resolution is needed. This can be achieved by performing week-to-week comparisons of their hourly variation during different seasons.

**Figure 3.** Annual variation of relative humidity.

**Figure 4.** Annual variation in temperature.

**Figure 5.** Seasonal variation of (**a**) relative humidity; (**b**) temperature.

Undoubtedly, ambient temperature affects human comfort and overall activity. However, relative humidity is the parameter that ultimately determines the rate at which heat is drawn away from the body and thus how the absolute temperature "feels" to humans [25,26]. Figure 6 offers the most important values of temperature and relative humidity for the most energy-intensive weeks in 2019's winter and summer.

**Figure 6.** Winter and summer comparisons of hourly load demand and (**a**) relative humidity; (**b**) temperature.

The relative humidity possesses higher values, which tend to decrease during the daylight. On the other hand, the temperature shows the opposite trend, which during the summer shows a linear relationship with load, whereas during winter it is inversely related to the load demand. Therefore, both variables introduce fluctuations into the load forecast and, consequently, they must be updated during the realization of power dispatch. Utilizing actual data from 2010–2019, we train a neural network based on non-linear regression between the target of actual load demand and the following predictors: (1) day (or date), (2) hour, (3) day-type (weekday = 0, weekend = 1, holiday = 2), (4) previous day load, (5) previous week 24 h load, (6) previous year 24 h load, (7) relative humidity and (8) temperature. The respective settings of our network include 20 neurons per hidden layer. The forecasting model exploits 70% of the historical data for training, 15% for validation and 15% for testing.
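The eight-predictor input row and the 70/15/15 partition described above can be sketched as follows. The paper does not state how samples are assigned to the three subsets, so the random shuffle used here is an assumption; the helper names and the example values are hypothetical.

```python
import random

# Day-type encoding as listed in the text.
DAY_TYPE = {"weekday": 0, "weekend": 1, "holiday": 2}


def make_predictors(day, hour, day_type, prev_day, prev_week, prev_year,
                    humidity, temperature):
    """Assemble one input row in the predictor order (1)-(8) of the text."""
    return [day, hour, DAY_TYPE[day_type], prev_day, prev_week, prev_year,
            humidity, temperature]


def split_indices(n, train=0.70, val=0.15, seed=0):
    """Randomly partition n sample indices into train/validation/test subsets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # assumed random assignment
    n_train = round(train * n)
    n_val = round(val * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# Illustrative values only: day of year, hour, day-type, three lagged loads (MW),
# relative humidity (%) and temperature (degrees Celsius).
row = make_predictors(200, 14, "weekday", 850.0, 830.0, 810.0, 45.0, 36.0)
train_idx, val_idx, test_idx = split_indices(1000)
print(len(row), len(train_idx), len(val_idx), len(test_idx))
```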

Regarding the model used for predictive control, the selected predictors refer to the updated temperature and relative humidity forecasts for the intra-hour periods of 15 min, each weighted by 25%. The remaining 50% is given to the change in the manipulated, dependent variable Δ*y*. In contrast to traditional models that regulate their inputs to approximate the referenced values and minimize their impact, in our realization, we set the updated values as the predicted (reference) ones and we regulate the controlled temperature and humidity to estimate their impact through the forecaster. Then, the model is updated with the new values and dynamically accepts the updates to perform the next cycle until the end of the assessed day. We illustrate our proposed configuration in Figure 7.
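With the weights stated above (25% for temperature, 25% for relative humidity and 50% for Δ*y*), the cost of Equation (10) can be evaluated as in the sketch below. Summing the tracking term over both predictors is our reading of the equation, and the function name, the two-step horizon and the numerical values are hypothetical.

```python
def mpc_cost(reference, actual, delta_y, w_x, w_y):
    """Evaluate the cost of Equation (10) over the remaining horizon.

    reference, actual: dict mapping predictor name to values for j = t..T.
    delta_y: forecast changes for j = t..T.
    w_x: dict mapping predictor name to weight; w_y: weight of the forecast change.
    """
    if abs(sum(w_x.values()) + w_y - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    tracking = sum(
        w_x[name] * (r - x) ** 2
        for name in w_x
        for r, x in zip(reference[name], actual[name])
    )
    move = sum(w_y * dy ** 2 for dy in delta_y)
    return tracking + move


WEIGHTS_X = {"temperature": 0.25, "humidity": 0.25}
W_Y = 0.50

# Two remaining 15 min steps with illustrative values.
reference = {"temperature": [20.0, 21.0], "humidity": [60.0, 62.0]}
actual = {"temperature": [19.0, 21.0], "humidity": [60.0, 60.0]}
delta_y = [2.0, 0.0]

cost = mpc_cost(reference, actual, delta_y, WEIGHTS_X, W_Y)
print(cost)
```

The weight check mirrors the equality constraint Σ*w* = 1; in practice the predictors and Δ*y* would need to be scaled to comparable ranges before their squared deviations are added.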

**Figure 7.** The proposed real-time load forecast model.

### **4. Results and Discussion**

Aiming at the minimization of expensive SR margins for frequency regulation, we apply our proposed solution to the actual data obtained from CERA. We make use of a feed-forward neural network with two hidden layers of 20 neurons and a Levenberg–Marquardt algorithm for the curve fitting. This algorithm minimizes the sum of squared residuals with respect to the model parameters *β* [27]. For a given set of *n* empirical pairs (*x<sub>i</sub>*, *y<sub>i</sub>*), this problem can be formulated as follows:

$$\beta = \underset{\beta}{\text{argmin}} \sum_{i=1}^{n} \left[ y_i - f(x_i, \beta) \right]^2 \tag{11}$$
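The principle behind Equation (11) can be illustrated on a toy curve-fitting problem. The sketch below is a hand-written, simplified Levenberg–Marquardt loop for a hypothetical two-parameter model *f*(*x*, *β*) = *β*<sub>1</sub>·exp(*β*<sub>2</sub>·*x*); it is not the training routine used for the neural network, and the model, the damping schedule and the synthetic data are all assumptions made for illustration.

```python
import math


def model(x, beta):
    """Hypothetical two-parameter model f(x, beta) = a * exp(b * x)."""
    a, b = beta
    return a * math.exp(b * x)


def sse(beta, xs, ys):
    """Sum of squared residuals, the objective of Equation (11)."""
    return sum((y - model(x, beta)) ** 2 for x, y in zip(xs, ys))


def lm_fit(xs, ys, beta, lam=1e-3, iters=50):
    """Simplified Levenberg-Marquardt iteration for the two-parameter model."""
    for _ in range(iters):
        a, b = beta
        # Accumulate J^T J (2x2) and J^T r (2,) for the residual r = y - f.
        jtj = [[0.0, 0.0], [0.0, 0.0]]
        jtr = [0.0, 0.0]
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            grad = (e, a * x * e)  # df/da, df/db
            r = y - a * e
            for i in range(2):
                jtr[i] += grad[i] * r
                for j in range(2):
                    jtj[i][j] += grad[i] * grad[j]
        # Damped normal equations: (J^T J + lam * diag(J^T J)) step = J^T r.
        m00 = jtj[0][0] * (1.0 + lam)
        m11 = jtj[1][1] * (1.0 + lam)
        m01 = jtj[0][1]
        det = m00 * m11 - m01 * m01
        if det == 0.0:
            break
        step = ((m11 * jtr[0] - m01 * jtr[1]) / det,
                (m00 * jtr[1] - m01 * jtr[0]) / det)
        if max(abs(s) for s in step) < 1e-12:
            break  # converged
        trial = (a + step[0], b + step[1])
        if sse(trial, xs, ys) < sse(beta, xs, ys):
            beta, lam = trial, lam / 10.0  # accept: behave more like Gauss-Newton
        else:
            lam *= 10.0                    # reject: behave more like gradient descent
    return beta


# Noise-free synthetic data generated from known parameters.
xs = [i / 49.0 for i in range(50)]
ys = [model(x, (2.0, -1.5)) for x in xs]
beta_hat = lm_fit(xs, ys, (1.0, 0.0))
print(beta_hat)
```

The damping factor interpolates between a Gauss–Newton step (small values) and a short gradient-descent step (large values), which is what distinguishes this optimizer from plain gradient descent.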

After the introduction of the predictor matrix *x* (of *n* × *p* dimensions) and the dependent target *y* into the model, the achieved performance of the forecaster is calculated in terms of MSE and presented in Figure 8.

$$MSE = \frac{1}{T} \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 \tag{12}$$

As can be observed, the forecasting model shows high performance, with R-values above 97.5% in each case and an estimated MSE in the order of 2.388%. The regression plots show that the network outputs for the training, validation, and test sets fall along the 45-degree line, where the network outputs are equal to the targets. This supports our view that lower SR requirements are achievable. For further verification of the network performance, we illustrate the error histogram in Figure 9.

The error histogram shows that most errors fall between −75 and +75. The respective training, validation and test errors appear in Figure 10. Since the test set error presents similar characteristics to the validation set error, and the final mean squared error is small, the obtained result is quite reasonable.

**Figure 8.** Performance of the trained model for load forecasting.

**Figure 9.** The error histogram of the load forecast model.

**Figure 10.** A graphical representation of the training errors, validation errors, and test errors.

To gain a broader overview of the efficacy of our approach, we compare our proposed solution with a benchmark optimizer, namely Gradient Descent. Based on Equations (13) and (14), the achieved RMSE and MAPE are 10.6227 and 0.0105, respectively, when Levenberg–Marquardt is used, against Gradient Descent, which accounts for 168.4502 RMSE and 0.2875 MAPE. Figure 11 demonstrates the load forecast recommendation for the considered optimizers. Selecting Levenberg–Marquardt as the optimizer for curve fitting, we illustrate the performance of the proposed neural network over the alternative regression trees in Figure 12. Although the proposed solution almost perfectly fits the actual load demand, the alternative regression tree-based approach deviates considerably, providing the respective 68.8261 and 0.0907 RMSE and MAPE.

$$\text{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} (y_t - \hat{y}_t)^2} \tag{13}$$

$$\text{MAPE} = \frac{1}{T} \sum_{t=1}^{T} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \tag{14}$$
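Equations (13) and (14) translate directly into code. In the sketch below, MAPE is returned as a fraction rather than a percentage, which is consistent with the magnitudes reported above (e.g., 0.0105); the three-point load series is purely illustrative.

```python
import math


def rmse(actual, forecast):
    """Root mean square error, Equation (13)."""
    n = len(actual)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(actual, forecast)) / n)


def mape(actual, forecast):
    """Mean absolute percentage error as a fraction, Equation (14).

    Assumes no actual value is zero.
    """
    n = len(actual)
    return sum(abs((y - f) / y) for y, f in zip(actual, forecast)) / n


# Illustrative hourly loads in MW.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 400.0]
print(rmse(actual, forecast), mape(actual, forecast))
```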

Applying daily simulations for the entire year of 2020, we estimate the deviation errors between the day-ahead forecasted load and the actual, real-time values during the assessed dates. The input of the model predictive control is updated using intra-hour data (15 min sampling rate) regarding the forecasted ambient temperature and relative humidity. The worst deviations are found to occur during summer, and their actual representation is shown in Figure 13. Note that 24 updates are imposed, each representing the most prevalent of the four intra-hour ones. We depict the most relevant deviations, which account for over 2% error.

**Figure 11.** Implications of different optimizers on the feed-forward neural network performance.

**Figure 12.** Performance of neural network against regression tree with best-fit optimizer.

**Figure 13.** Real-time deviations from the day-ahead forecast of (**a**) relative humidity; (**b**) temperature.

These deviations have a daily impact on the forecasted load, which is reflected as frequency violations. To correct the deviations, more generators are required to serve the varying demand or spinning reserves are called upon. Any generation deficits may lead to load interruptions, while excess generation can cause active power curtailment. In any case, the unexpected deviations increase the total production cost and force the system operators to plan ahead bulk operating reserves to appropriately regulate the system frequency. In our paradigm, the SR minimization relies on the high-performance neural network and the real-time corrections based on the updated forecasts of temperature and humidity. In contrast to traditional alternatives, which associate the SR requirements solely with the forecaster performance, our real-time, intra-hour load forecast reasonably mitigates these requirements.

We provide a realization of our proposed solution for an energy-intensive winter day in Figure 14. In this case, one can observe how the negative temperature deviations between 10:00 and 16:00 affect the hourly load forecast. Considering that *E* = *P* · *t*, this deviation corresponds to a daily energy of 146.867 MWh, or an instantaneous power equivalent of 35.864 MW in the worst case. To recover this imbalance, a spinning reserve of up to 4.67% would be adequate if planned ahead.

**Figure 14.** A realization of the real-time load forecast model for an energy-intensive winter day.

Finally, we depict similar configurations for the more mitigated load curves in spring and autumn, together with the most energy-intensive day in summer, in Figure 15. For completeness, we list the comparative results with respect to the achieved SR capacity per month in Table 1, considering the real-time weather impact and overall performance of our load forecasting model.

**Figure 15.** Real-time deviations from the day-ahead forecast concerning specific, energy-intensive days in spring, summer and autumn.


**Table 1.** Spinning reserve comparisons pertaining to our proposed solution and robust alternatives.

### **5. Conclusions**

The continuous increase in the renewable energy contribution deteriorates the flexibility and stability of modern power systems, calling for bulk spinning reserve margins. In this work, we proposed a dynamical forecaster to ameliorate the expensive requirements of spinning reserves based on real-time updates. Utilizing neural networks, we trained artificial models to forecast the load ahead with great accuracy, based on critical predictors and without using any model development structure to individuate and select the appropriate input parameters. Instead, we exploited eight predictors and distinguished them into constant and variable inputs by making use of a model predictive control. Apart from the most actively used data for historical load, seasonality and human activity, we also considered relative humidity as one of our main variable inputs. We performed real-time applications with the aid of a model predictive control, developed to update the recommendations intra-hourly and further correct the imposed deviations. Exploiting actual data regarding an isolated power system, the experimental results show that improvements exist in terms of decreased spinning reserve requirements to regulate the violated frequency. These findings strongly corroborate our claims and strengthen the arsenal of independent system operators with an effective tool for real-time load forecasting and total generation cost minimization.

As for future directions of research, we highlight the consolidation of more predictors correlated with renewable generation, such as wind and solar. This way, a global forecaster could recommend the residual load target by making use of multi-input/multi-output neural networks. In addition, the fuel-dependent electricity price may also be included as a real-time update, affecting both the human activity and hourly load demand.

**Author Contributions:** Conceptualization, P.N.; methodology, P.N.; software, P.N.; validation, H.P.; writing—original draft preparation, P.N. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Acknowledgments:** The authors would like to thank the Cyprus Energy Regulatory Authority for the provision of the needed annual data.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


### *Article* **The Wisdom of the Data: Getting the Most Out of Univariate Time Series Forecasting**

**Fotios Petropoulos <sup>1,</sup>\* and Evangelos Spiliotis <sup>2</sup>**


**\*** Correspondence: f.petropoulos@bath.ac.uk

**Abstract:** Forecasting is a challenging task that typically requires making assumptions about the observed data but also the future conditions. Inevitably, any forecasting process will result in some degree of inaccuracy. The forecasting performance will further deteriorate as the uncertainty increases. In this article, we focus on univariate time series forecasting and we review five approaches that one can use to enhance the performance of standard extrapolation methods. Much has been written about the "wisdom of the crowds" and how collective opinions will outperform individual ones. We present the concept of the "wisdom of the data" and how data manipulation can result in information extraction which, in turn, translates to improved forecast accuracy by aggregating (combining) forecasts computed on different perspectives of the same data. We describe and discuss approaches that are based on the manipulation of local curvatures (theta method), temporal aggregation, bootstrapping, sub-seasonal and incomplete time series. We compare these approaches with regards to how they extract information from the data, their computational cost, and their performance.

**Keywords:** information; combination; uncertainty; theta; temporal aggregation; bagging; subseasonal series

### **1. Introduction**

Univariate time series forecasting is the creation of extrapolations for a single variable based on past, time-ordered observations of the same variable. Despite the geometric increase in data availability, univariate forecasts are even today the basis for the decision making in many organisations. Improvements in the performance of such forecasts are crucial for reducing costs associated with operational, tactical, and strategic planning [1].

Nowadays, automatic time series forecasting can be easily achieved using dedicated forecasting software or open source packages. Examples include ForecastPro®, SAS Forecasting Server®, and the *forecast* package for R statistical software. Such software and packages offer tools for batch and automatic forecasting with minimal to zero manual input. They integrate families of models, like exponential smoothing [2] and autoregressive integrated moving average, ARIMA [3], that can capture a wide range of data patterns and produce extrapolations with ease. However, such families of models rely on assumptions that are barely met in practice, and struggle to select the most appropriate model for a given time series due to the uncertainties involved: identifying the optimal model form, estimating the optimal set of parameters, and dealing with the inherent uncertainty in the data [4].

**Citation:** Petropoulos, F.; Spiliotis, E. The Wisdom of the Data: Getting the Most Out of Univariate Time Series Forecasting. *Forecasting* **2021**, *3*, 478–497. https://doi.org/10.3390/forecast3030029

Academic Editor: Sonia Leva

Received: 17 May 2021 Accepted: 21 June 2021 Published: 23 June 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The purpose of this article is to provide an overview of approaches that can be used to enhance the performance of univariate forecasting methods. There are four common characteristics that govern the approaches covered in this article. First, all approaches attempt to distil as much information from the original time series data as possible by exploring them through alternative lenses. This is achieved through amplification of specific time series features and transformation of the original time series. Second, the approaches build on the success of forecast combinations to offer improved forecasting performance while tackling uncertainties regarding model form and parameter specification. Third, all approaches are model-free in the sense that they do not rely on a particular family or pool of models. Fourth, each of the approaches manages to handle at least one of the uncertainties associated with fitting forecasting models: model form, parameter, and data.

In summary, we consider, present, and discuss the following five approaches:


We should clarify that the literature includes several other univariate approaches, in addition to the aforementioned ones, for extracting more information from the original data and mitigating the drawbacks of forecasts produced by single forecasting methods. These are not considered in the present study as they are not characterised by the four key attributes discussed earlier. For instance, when the forecast errors of a method display strong auto-correlations (e.g., because the method fails to fully capture seasonality or trend), a common approach is to adjust the forecasts originally produced according to their expected error, specified using a second univariate forecasting method on the residuals of the first one. TBATS [20], an exponential smoothing state space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components, and Theta with ARMA errors [21] are just some examples of this approach which, although it enhances forecasting performance, does not rely on combinations. Similarly, decomposition techniques that allow for complex, multiple seasonal patterns to be captured [22] can be regarded as "wisdom of the data" approaches, but they do not involve combinations and also depend on particular models and, in many cases, explanatory variables.

The next five sections expand on each of the above approaches: We offer a summary of the related research studies, we describe how these approaches handle and manipulate the original time series data, and we discuss the advantages gained from their application. Section 7 offers a cross-comparison of the approaches, with an emphasis on the uncertainties that each handles, as well as their computational cost. Finally, Section 8 offers our conclusions and insights for future research.

### **2. Theta Method**

The theta method was the top-performing submission in the M3 forecasting competition [23]. Its name originates from the first letter of the Greek word for "temperature", *θ*. Similarly to how a decrease or increase in the temperature would result in contraction or expansion, the theta method amplifies or smooths the local curvatures of a time series, i.e., the distances between the points of the series with those of a simple linear regression line, computed over its observations against time. The result of this local-curvatures manipulation process is the creation of new series that are called "theta lines". The degree of amplification or reduction in the local variations is controlled by a parameter, *θ*, where a value of 1 corresponds to the original data with the original local curvatures. If *θ* > 1, then the local variations are amplified; if *θ* < 1, then the resulting theta line is smoother than the original data.

In its simplest form, the theta method decomposes the original data into two theta lines with parameters *θ* = 0 and *θ* = 2 [5]. The theta line with *θ* = 0 corresponds to linear regression on a time-trend indicator. This is a straight line that captures the long term trend of the data and has no local variations. The theta line with *θ* = 2 displays double the curvatures of the data. It is argued that this second theta line is able to better capture the short term variations in the data. Each of these two theta lines is extrapolated separately. Assimakopoulos and Nikolopoulos [5] used the forecasts of the linear regression on trend to extrapolate the theta line with *θ* = 0 and the simple exponential smoothing (SES) method to produce forecasts for the other theta line (*θ* = 2). Once the forecasts from the two theta lines have been produced, they are combined with equal weights to form the forecast for the theta line with *θ* = 1, which corresponds to the data with the original curvatures.
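The two-line decomposition described above can be sketched in a few lines of code. The sketch below is a minimal illustration rather than a reference implementation: it builds a theta line as *θ* times the data plus (1 − *θ*) times the fitted regression line, which reproduces the regression line for *θ* = 0 and the original data for *θ* = 1. The function names, the time index 0, 1, ..., *n* − 1, the fixed smoothing parameter `alpha=0.5`, and the SES initialisation at the first observation are all assumptions made for the example; in practice the SES parameter would be estimated from the data, and seasonal data would be adjusted first.

```python
def linear_trend(y):
    """Least squares fit of y on t = 0, ..., n-1; returns (intercept, slope)."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


def theta_line(y, theta):
    """Scale the distances of y from its regression line by theta."""
    a, b = linear_trend(y)
    return [theta * v + (1 - theta) * (a + b * t) for t, v in enumerate(y)]


def ses_level(y, alpha):
    """Simple exponential smoothing; returns the final level (a flat forecast)."""
    level = y[0]  # assumption: initialise the level at the first observation
    for v in y[1:]:
        level = alpha * v + (1 - alpha) * level
    return level


def theta_forecast(y, h, alpha=0.5):
    """Equal-weight combination of the extrapolated theta = 0 and theta = 2 lines."""
    a, b = linear_trend(y)
    n = len(y)
    line0 = [a + b * (n + k) for k in range(h)]    # regression line, continued
    line2 = ses_level(theta_line(y, 2), alpha)      # SES forecast of the theta = 2 line
    return [0.5 * f0 + 0.5 * line2 for f0 in line0]


# Hypothetical non-seasonal series, for illustration only.
series = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
print(theta_forecast(series, h=3))
```

By construction, averaging the theta lines with *θ* = 0 and *θ* = 2 point by point returns the original data, which is why combining their forecasts with equal weights yields a forecast for the line with *θ* = 1.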

The above process works directly on data that do not exhibit seasonality. However, if the original data are seasonal, then they need to be adjusted for seasonality before the application of the theta decomposition. Assimakopoulos and Nikolopoulos [5] proposed the use of the classical decomposition method with the assumption that the seasonal pattern is multiplicative in nature; a not unreasonable assumption for real life applications. As an alternative, Spiliotis et al. [9] proposed using shrinkage estimators of time series seasonal indices to avoid cases where their values are exaggerated. In both cases, a simple statistical test based on the autocorrelation coefficient with a lag that matches the periodicity of the data is used to decide on the existence of a (sufficiently strong) seasonal pattern, typically considering a confidence level of 90%. This test is described in detail in [8]. If the theta decomposition is applied on the seasonally adjusted data, then the resulting forecasts are not seasonal, and a seasonal re-adjustment is needed. This is simply done by multiplying the combined forecasts with the respective seasonal indices computed earlier by the decomposition method. A visual example of producing theta lines from seasonal time series data is presented in Figure 1.

**Figure 1.** An illustrative example of producing theta lines for the theta method. The original data (black line) are de-seasonalised (red line). Then, a linear regression on trend produces the theta line with *θ* = 0 (blue line). The theta line with *θ* = 2 (green line) has double the curvatures of the seasonally adjusted data.
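The seasonality check mentioned above can be illustrated with a short sketch. The exact test is the one described in [8]; the form below, which compares the autocorrelation at the seasonal lag against a limit that widens with the lower-lag autocorrelations, is one commonly used variant and should be read as an assumption of this example. The function names, the critical value `z=1.645` for a 90% level, and the requirement of at least three full cycles are likewise illustrative choices.

```python
import math


def acf(y, lag):
    """Sample autocorrelation coefficient of y at the given lag."""
    n = len(y)
    mean = sum(y) / n
    denom = sum((v - mean) ** 2 for v in y)
    num = sum((y[t] - mean) * (y[t - lag] - mean) for t in range(lag, n))
    return num / denom


def is_seasonal(y, period, z=1.645):
    """Flag seasonality when |r_period| exceeds a limit based on lags 1..period-1."""
    if len(y) < 3 * period:  # assumption: require at least three full cycles
        return False
    s = sum(acf(y, k) ** 2 for k in range(1, period))
    limit = z * math.sqrt((1 + 2 * s) / len(y))
    return abs(acf(y, period)) > limit


# Hypothetical quarterly series with a repeating pattern, for illustration only.
quarterly = [10, 20, 30, 40] * 6
print(is_seasonal(quarterly, period=4))
```

If the test flags seasonality, the series would be seasonally adjusted before the theta decomposition and the forecasts re-seasonalised afterwards, as described in the text.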

When theta is restricted to the simple form of two theta lines (0 and 2) that are extrapolated by the linear regression line and SES, then its application on some seasonally adjusted data is mathematically equivalent to SES with drift [6]. However, it would be more appropriate if theta is seen as a decomposition framework rather than a forecasting method. One can decide on the number of theta lines, their theta parameters, the forecasting methods to be applied on each of them, and the combination weights, among other modelling choices. In fact, as explained by Spiliotis et al. [9], "*the advantage of theta derives exactly from (its) "divide and conquer" property: There is no single forecasting model capable of effectively capturing all possible time series patterns. Yet, if the series is decomposed into multiple lines of a reduced amount of information, improvements in forecasting accuracy are possible even for the case of conventional models*".

Several studies have worked on expanding the theta method in the aforementioned directions. Petropoulos and Nikolopoulos [24] examine the use of equal and unequal weights for the combinations of the forecasts of the two theta lines, and conclude that optimally choosing the combination weights per series may result in performance benefits. Petropoulos [25] proposes the addition of a third theta line with *θ* = 1 that is extrapolated by the damped-trend exponential smoothing method. He also suggests the addition of a second short term trend-line that is fitted on the most recent observations, which is closely connected with the concept of multiple starting points (see Section 6). Fioruci et al. [26] and Fiorucci et al. [8] offer generalised rolling origin evaluation methods and state space models for optimising the theta parameter of the second theta line, showcasing the benefits in the out-of-sample accuracy of the method. Thomakos and Nikolopoulos [27] expand the application of the theta method in a multivariate setting and show the conditions under which this is expected to work better than its standard, univariate implementation.

Two recent extensions on the theta method are particularly interesting. Following the work of Spiliotis et al. [9], Spiliotis et al. [10] offer a taxonomy of theta models that can capture several forecasting profiles regarding the type of trend (additive or multiplicative) and seasonality (none, additive, or multiplicative). This is a significant advancement since the original theta method was designed on the assumptions of a linear trend and multiplicative seasonality. The authors propose non-linear trends, but also alternative seasonal profiles in a framework that resembles that of the exponential smoothing family of models [6]. Moreover, they define a process for selecting an optimal theta method and offer a simple way to empirically estimate the prediction intervals. Their "AutoTheta" method shows improved performance over the standard theta method in terms of both point forecast accuracy and the estimation of uncertainty. Legaki and Koutsouri [28] deal with non-linear trends in an alternative fashion. They apply a Box-Cox transformation on the seasonally adjusted data prior to the theta decomposition and extrapolation. The value of the Box-Cox transformation parameter, *λ*, is selected so that the profile log-likelihood of a linear model fitted to the seasonally adjusted series is maximised, with the choice of *λ* being restricted to [0, 1]. A Box-Cox transformation allows the application of theta on data with non-linear trends but also results in stabilisation of the variance. The "Box-Cox Theta" was one of the solutions submitted in the M4 forecasting competition [29], resulting in very good point forecast accuracy with very low computational cost [30].
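For reference, the Box-Cox transformation is commonly written as follows, where *y<sub>t</sub>* denotes a positive, seasonally adjusted observation and *λ* the transformation parameter. The notation is introduced here for illustration and is not taken from [28].

$$y_t^{(\lambda)} = \begin{cases} \dfrac{y_t^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\ \ln y_t, & \lambda = 0. \end{cases}$$

With *λ* = 1 the series is merely shifted by one unit, so a linear trend stays linear, whereas *λ* = 0 gives the logarithmic transformation, which turns an exponential trend into a linear one. Values in between interpolate between these two extremes, and the forecasts must be back-transformed to the original scale.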

The theta method has performed well in a variety of settings that involve financial [31], tourism [32], and inventory forecasting [33]. It is not a surprise that nowadays it is considered to be one of the default time series forecasting benchmarks along with the automatic implementations of exponential smoothing and ARIMA [34], as showcased by the M4 forecasting competition [29]. The theta method is attractive for its simplicity, robust performance, and computational efficiency. The book of Nikolopoulos and Thomakos [35] exclusively focuses on the theory and applications of the theta method, highlighting the conditions under which it will outperform other forecasting methods. Several open source implementations of the theta method exist. We would like to spotlight the *forecTheta* package for R statistical software, as well as the functions thetaf() and theta() of the packages *forecast* and *tsutils*, respectively. Finally, Petropoulos and Nikolopoulos [36] offer a step-by-step tutorial of the standard theta method coupled with an implementation in just 10 lines of R code.

### **3. Multiple Temporal Aggregation**

The theta method extracts more information from the data by amplifying or deflating the local curvatures. In other words, the theta method manipulates the data on the vertical axis of a standard time series plot. The next approach we explore manipulates the data on the horizontal axis, i.e., the time. Temporal aggregation refers to a time series transformation where a higher frequency series is translated into a series of lower frequency (see Section 2.9.2 in [37]). For example, a time series on the daily frequency can be converted into a weekly-frequency series when considering non-overlapping time buckets of 7 days each. Different levels of temporal aggregation result in new, shorter series where the high frequency components (seasonality and noise) are filtered out while level and trends are made easier to discern and model. Moreover, when temporal aggregation is applied on very granular, intermittent data, then we observe a decrease in the degree of intermittence, i.e., the number of zero observations included in the series, thus facilitating the overall forecasting process. An example of the temporal aggregation process applied on fast moving data is presented in Figure 2.

**Figure 2.** A visual example where multiple new temporally aggregated time series are created based on the original data. The monthly data (black line) are temporally aggregated to quarterly (red line), semesterly (blue line), and yearly (green line) data.
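Non-overlapping temporal aggregation is straightforward to express in code. The following minimal sketch sums consecutive buckets of `k` observations. The function name is hypothetical, and dropping the oldest observations when the series length is not a multiple of `k` (so that the most recent bucket is always complete) is a convention assumed for this example.

```python
def temporally_aggregate(y, k):
    """Sum non-overlapping buckets of k observations, aligned to the end of y."""
    start = len(y) % k  # drop the oldest observations that do not fill a bucket
    return [sum(y[i:i + k]) for i in range(start, len(y), k)]


# Hypothetical monthly series of 14 observations, for illustration only.
monthly = list(range(1, 15))
quarterly = temporally_aggregate(monthly, 3)
yearly = temporally_aggregate(monthly, 12)

# Hypothetical intermittent demand: aggregation reduces the share of zeros.
demand = [0, 0, 5, 0, 0, 0, 3, 0, 0, 2, 0, 0]
print(quarterly, yearly, temporally_aggregate(demand, 3))
```

In the intermittent example, nine of the twelve original observations are zero, while only one of the four aggregated observations is, which illustrates the decrease in intermittence described above.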

Although it is possible that one focuses on modelling a single aggregation level, even if this is not the original level on which the data are recorded [38–40], more benefits will usually arise from modelling multiple temporal aggregation (MTA) levels and combining the resulting forecasts. Kourentzes et al. [11] offer one of the first systematic studies to explore the beneficial effects of MTA. Focusing on exponential smoothing models [2], they propose that model selection should be applied on each temporally aggregated series separately. The exponential smoothing model components (level, trend, and seasonality) are estimated per aggregation level and their additive-transformed estimates are averaged across levels. The summation of the three average components is the final forecast. This approach is known as the "multiple aggregation prediction algorithm" (MAPA). The need for averaging at a component level rather than at a forecast level was driven by the fact that seasonality may not be possible to estimate in some levels (consider, for instance, monthly data and an aggregation level of five periods). Combining at a component level avoids the excessive shrinkage of the seasonal pattern [41,42]. MAPA showcased improved performance over the exponential smoothing benchmark that was applied on the original data only [11]. The improvements of MAPA over the benchmarks were more obvious on the longer forecasting horizons.

MAPA, as introduced by Kourentzes et al. [11] and implemented with exponential smoothing, is a great solution for amplifying and smoothing data patterns for fast moving series. However, when the series become intermittent, with the presence of many zeros among the non-zero demand observations, then the toolbox of forecasting models applied across the various aggregation levels can be updated to include specialised methods for intermittent demands. Such methods include Croston's method [43] and the Syntetos-Boylan approximation (SBA) [44]. Petropoulos and Kourentzes [13] suggest the use of multiple temporal aggregation levels for the case of slow-moving demand series, where a selection between Croston's method and the SBA is made based on the degree of intermittence and the variability of the non-zero values [45]. Finally, if the average inter-demand interval becomes equal to unity (i.e., the intermittent data are sufficiently temporally aggregated to become non-intermittent), then Petropoulos and Kourentzes [13] suggest replacing specialised methods for intermittent demand with SES. The empirical results from the application of the MAPA version for intermittent demand data showed superior forecasting performance on a variety of metrics that included proxies for the inventory performance.
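To make the two specialised methods concrete, the sketch below gives a minimal version of Croston's method and of the SBA. Croston's method smooths the non-zero demand sizes and the inter-demand intervals separately and forecasts their ratio; the SBA multiplies that ratio by (1 − *α*/2). The function name, the use of a single smoothing parameter `alpha` for both sizes and intervals, and the initialisation at the first non-zero demand and its interval are assumptions of this example.

```python
def croston(demand, alpha=0.1, sba=False):
    """Forecast mean demand per period with Croston's method or the SBA."""
    size = interval = None
    periods_since = 1  # periods elapsed since the previous non-zero demand
    for d in demand:
        if d > 0:
            if size is None:
                # assumption: initialise with the first non-zero demand and interval
                size, interval = float(d), float(periods_since)
            else:
                size = alpha * d + (1 - alpha) * size
                interval = alpha * periods_since + (1 - alpha) * interval
            periods_since = 1
        else:
            periods_since += 1
    if size is None:
        return 0.0  # no demand observed at all
    rate = size / interval
    return (1 - alpha / 2) * rate if sba else rate


# Hypothetical slow-moving demand series, for illustration only.
demand = [0, 0, 4, 0, 0, 4, 0, 0, 4]
print(croston(demand), croston(demand, sba=True))
```

When every period has non-zero demand, all intervals equal one and the forecast reduces to smoothing the demand sizes, which is consistent with switching to SES once the average inter-demand interval reaches unity.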

Another extension to MAPA was introduced by Kourentzes and Petropoulos [46] to allow the algorithm to handle exogenous variables, which are estimated as additional components. The concept is similar to how exponential smoothing models (ETS) are extended to include exogenous variables (ETSx). However, the multivariate version of MAPA (MAPAx) performs temporal transformation on the exogenous variables too. This, in turn, tackles the uncertainty associated with estimating not only the effects of such predictors, but also their timing, i.e., leading and lagging effects. Applied on demand volumes affected by promotions, MAPAx offered a performance that was better than either ETSx or ARIMAx (ARIMA models with exogenous variables), both in terms of accuracy and bias, across multiple planning horizons.

One of the most important milestones in the development of MTA has been its conceptualisation in a hierarchical fashion. This enabled the direct application of the advances of the rich hierarchical literature [47–50] to MTA, including the estimation of coherent forecasts from the base forecasts of each hierarchical node. In essence, each hierarchy consists of observations at the most granular frequency at the bottom level, which are then added up to higher hierarchical levels, with the top level usually being a full periodic cycle. For example, monthly observations are added to bi-monthly, quarterly, four-monthly, semesterly, and yearly. Temporal hierarchies were first proposed by Athanasopoulos et al. [14], who showed that such structures allow MTA to be applied to a wide range of forecasts that are not limited to exponential smoothing ones and could even include judgment. The authors performed a large simulation study to better understand why temporal hierarchies work better than simply modelling the original data. Finally, they discussed the managerial implications of MTA through a case study of accident and emergency demand data.
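The structure of a temporal hierarchy can be sketched as a summing matrix whose rows map the bottom-level observations of one cycle to every node of the hierarchy. The minimal example below builds that matrix and reconciles a vector of base forecasts by ordinary least squares (OLS). OLS is used here only because it is the simplest reconciliation option; it is an assumption of this sketch and not necessarily the estimator used in [14]. All function names are hypothetical, and the small linear solver is included only to keep the example free of external libraries.

```python
def aggregation_levels(m):
    """All factors of the cycle length m, from the full cycle down to 1."""
    return [k for k in range(m, 0, -1) if m % k == 0]


def summing_matrix(m):
    """One row per hierarchy node; each row sums the bottom-level periods it covers."""
    rows = []
    for k in aggregation_levels(m):
        for start in range(0, m, k):
            rows.append([1.0 if start <= j < start + k else 0.0 for j in range(m)])
    return rows


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def reconcile_ols(base, m):
    """Project base forecasts (ordered as the rows of S) onto coherent forecasts."""
    s = summing_matrix(m)
    sts = [[sum(row[i] * row[j] for row in s) for j in range(m)] for i in range(m)]
    stb = [sum(row[i] * f for row, f in zip(s, base)) for i in range(m)]
    bottom = solve(sts, stb)  # reconciled bottom-level forecasts
    return [sum(c * v for c, v in zip(row, bottom)) for row in s]


# Hypothetical quarterly hierarchy: 1 annual, 2 semi-annual, 4 quarterly base forecasts.
base = [12.0, 4.0, 6.0, 1.0, 3.0, 2.0, 4.0]
print(aggregation_levels(12), reconcile_ols(base, 4))
```

For monthly data, `aggregation_levels(12)` returns the yearly, semesterly, four-monthly, quarterly, bi-monthly, and monthly levels listed in the text. After reconciliation, the annual forecast equals the sum of the semi-annual forecasts, which in turn equal the sums of their quarters.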

Since the work of Athanasopoulos et al. [14], there has been a surge of research studies around forecasting with temporal hierarchies (THIEF). We now provide some highlights. Spiliotis et al. [41] proposed three simple ways to improve performance of temporal hierarchies: (i) applying model combinations to the base forecasts prior to reconciliation, (ii) additive and multiplicative bias adjustments to the base forecasts, and (iii) a selective application of temporal hierarchies so that unnecessary seasonal shrinkage is avoided for the time series that exhibit strong seasonality, closely related also to the work of Kourentzes et al. [51] on optimal selection of temporal aggregation levels. Jeon et al. [52] expanded temporal hierarchical forecasting from point forecast reconciliation to probabilistic coherent forecasts, showcasing its benefits on high frequency wind power production and electricity load data. Additionally, focusing on short term electricity load data, Nystrup et al. [53] showed that temporal hierarchical forecasting can be significantly improved when auto- and cross-correlations are taken into account in the reconciliation stage of the base forecasts. Finally, Kourentzes and Athanasopoulos [54] applied temporal hierarchies on intermittent demand data, arguing that some data patterns (trend and seasonality) are difficult to discern on low levels of aggregation where the degree of intermittence is high. They selectively used the Teunter-Syntetos-Babai (TSB) method [55] for intermittent demand or ETS based on an intermittence threshold, which acts as a hyperparameter. Generally, the accuracy improvements were higher for lower intermittence thresholds, i.e., TSB switches to ETS when the intermittence is low, as investigated on 5000 time series depicting the demand of aerospace spare parts.

A fertile field for research is the integration of temporal aggregation forecasting with the more traditional cross-sectional one, towards what is dubbed "cross-temporal forecasting". To the best of our knowledge, Spiliotis et al. [42] were the first to investigate this issue, focusing on hourly electricity consumption data from a bank, disaggregated into branches and further disaggregated into energy uses. They proposed a sequential process where a simplified version of MAPA is first applied on the seasonally adjusted data, followed by reseasonalisation of the temporally combined forecasts and subsequent application of cross-sectional hierarchical forecasting for the production of coherent forecasts across all cross-sectional levels. Kourentzes and Athanasopoulos [56] approached cross-temporal aggregation from a hierarchical perspective, instead of using MAPA. Although they defined full cross-sectional hierarchies, they still used a sequential approach where they first applied temporal hierarchies for each cross-sectional node, followed by cross-sectional reconciliation at each aggregation level, with the resulting forecasts being combined using equal weights towards a "consensus reconciliation matrix". The authors showed that this approach resulted in improvements when applied on Australian tourism data. Yagli et al. [57] explored further the sequential implementation of cross-temporal hierarchies by comparing the appropriate order of application, i.e., spatial then temporal, or temporal then spatial. Using photovoltaic power generation data, they showed greater benefits when temporal aggregation is applied prior to cross-sectional (spatial), while they also provided evidence that temporal aggregation may not be needed at all levels of the cross-sectional hierarchy.

Overall, we can see a large number of studies over the last few years that focus on issues surrounding MTA. MTA is attractive as it offers significant performance improvements that are coupled with aligned decision making [14]. Forecasts are produced at different frequencies and are then reconciled, rendering them suitable for use in several functions within companies and organisations, including operational, tactical, and strategic planning. Although, normally, different teams and departments within organisations would produce their own sets of forecasts, MTA brings us one step closer to the concept of "one number forecast", where the same sets of forecasts can be used for logistics, manufacturing, scheduling, budgeting, etc. Various implementations of MTA are available in open source forecasting packages that include the mapa() function of the *MAPA* package (MAPA and MAPAx), the imapa() function of the *tsintermittent* package (MAPA for intermittent demand data), and the thief() function of the *thief* package (temporal hierarchies) for R statistical software.

### **4. Bagging**

The next approach that we investigate is called "bagging", which is short for "bootstrapping and aggregation". In brief, bagging is based on the resampling of the random component of a series towards the creation of new series with the same underlying patterns (trend and seasonality) but different remainder. Multiple forecasts are produced using the original and the bootstrapped series which are then aggregated (combined) to form the final forecast. In more detail, the steps for the bagging approach are as follows:


series has no periodicity (e.g., yearly data), then a Loess decomposition is applied to separate the series into two components: trend and remainder;


**Figure 3.** The original data (black line) together with 30 bootstrapped series (blue lines).

Effectively, bagging should be seen as a data augmentation (or oversampling) approach applied in univariate settings, in the sense that the modified series added to the existing one for training the forecasting methods and producing the final forecasts are solely based on the series being predicted. This is a key difference compared to the multivariate data augmentation approaches used in the literature for successfully implementing "cross-learning" (or "global") forecasting methods [60], where the synthetic data share the underlying patterns of multiple series found in a broad set of series.

Bagging was first proposed by Bergmeir et al. [15], who applied it to improve the performance of exponential smoothing. They used the moving block bootstrapping (MBB) algorithm [61] to produce bootstrapped vectors of the remainder, and produced 99 bootstrapped series. The best ETS model was fitted on the original series and on each of the 99 bootstrapped series in order to produce point forecasts. The final forecasts were obtained using the median operator, while the authors discuss that they also tried mean and trimmed means. Bagging on ETS offered improved performance over ETS simply applied on the original data. The authors also tried replacing Box-Cox and Loess decomposition with decomposition based on the components of the best ETS model fitted on the original data. The authors also explored replacing MBB with the sieve bootstrap method [62]. However, both these modifications resulted in, overall, inferior results.
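The resampling and combination steps of bagging can be sketched as follows. The example assumes that the series has already been decomposed into a structural part (trend and, if present, seasonality) and a remainder, so the Box-Cox transformation and the Loess-based decomposition are not shown. The function names, the default block size, the random trimming of the concatenated blocks, and the one-step-ahead `forecaster` argument are all assumptions made for illustration.

```python
import random
import statistics


def moving_block_bootstrap(remainder, block_size, rng):
    """Resample the remainder by concatenating randomly chosen overlapping blocks."""
    n = len(remainder)
    n_blocks = n // block_size + 2  # enough blocks to allow trimming to length n
    draws = []
    for _ in range(n_blocks):
        start = rng.randrange(n - block_size + 1)
        draws.extend(remainder[start:start + block_size])
    offset = rng.randrange(block_size)  # random trim so blocks are not always aligned
    return draws[offset:offset + n]


def bagged_forecast(structure, remainder, forecaster, n_boot=99, block_size=8, seed=0):
    """Median of the forecasts from the original and the bootstrapped series."""
    rng = random.Random(seed)
    series = [[s + r for s, r in zip(structure, remainder)]]  # the original series
    for _ in range(n_boot):
        boot = moving_block_bootstrap(remainder, block_size, rng)
        series.append([s + r for s, r in zip(structure, boot)])
    return statistics.median(forecaster(y) for y in series)


# Hypothetical decomposition of a short series, for illustration only.
structure = [10.0 + 0.5 * t for t in range(24)]
remainder = [(-1.0) ** t * 0.3 for t in range(24)]
naive = lambda y: y[-1]  # stand-in for a fitted model's one-step-ahead forecast
print(bagged_forecast(structure, remainder, naive))
```

In an actual application the `forecaster` would fit and extrapolate a model such as ETS on each series, which is where the model form and parameter uncertainties discussed below come into play.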

In a follow-up study, Petropoulos et al. [4] set out to explore the reasons behind the good performance of the bagging approach. They argued that bagging succeeds in tackling, at the same time, three sources of forecast uncertainty: (*i*) the uncertainty in selecting the correct model form, (*ii*) the uncertainty in estimating the model's parameters, and (*iii*) the inherent uncertainty of the data. They devised three simple experiments to disentangle the benefits of bagging:


Using the data from the M and M3 forecasting competitions, the results of Petropoulos et al. [4] showed that, on average, tackling model uncertainty alone through bootstrap model combination offers benefits that are higher than bagging itself. Simply addressing the uncertainty in estimating the parameters of the applied model is overall worse than either bagging or bootstrap model combination but still slightly better than forecasting without bootstraps. Tackling only the data uncertainty does not offer notable gains. The authors went one step further towards generalising bagging by replacing the estimator (ETS) with ARIMA. The results were consistent, with bootstrap model combination being the best approach overall. Finally, they replaced MBB with two other bootstrapping approaches, the circular block bootstrap (CBB) [63] and the linear process bootstrap (LPB) [64], showing that the relative average ranks of the various approaches would not significantly change.

Although the last two studies focused on the performance of bagging when applied on families of models (ETS and ARIMA), bagging can also lead to improvements in the forecasting performance when applied on single methods. Dantas et al. [65] showed that bagging with the Holt-Winters method, an exponential smoothing method that is able to capture the trend and the seasonality in the data, results in better performance than ETS, ARIMA, or bagged ETS when forecasting the demand for air transportation.

To control the effect of the covariance on the combination step of the bagging approach, Dantas and Cyrino Oliveira [16] proposed the use of clusters of similar forecasts. Instead of aggregating across all forecasts, a diverse set of forecasts is selected from each cluster and then these selected forecasts are combined across clusters. This simple trick leads to reduced variance of the forecasts and, as a result, to reduced forecast error. They tested their cluster-modified bagging approach using ETS and data from the M3 forecasting competition and showed improvements in the point forecast accuracy.

Meira et al. [66] extended the previous works towards allowing the various bagging approaches to produce robust prediction intervals. They proposed "treating and pruning" strategies to selectively exclude models from the pool of candidate models such that models with explosive or outlying prediction interval values are not considered. This not only improved the performance of bagging and its variations, but also offered improvements upon the standard ETS. Overall, the authors demonstrated that bootstrap model combination offered very competitive performance compared to other bagging variations, in terms of both point forecast accuracy and uncertainty estimation.

Research around bagging is much more scarce compared to theta or MTA. However, it is a robust alternative to deal with the various sources of uncertainty; arguably, though, an expensive one. Most published studies use between 50 and 100 bootstraps per series, with the computational cost needed to fit all models and produce forecasts increasing at the same rate. Open source implementations of the bagging approach include the functions baggedETS() and baggedClusterETS() of the R packages *forecast* and *tshacks*, respectively.

### **5. Sub-Seasonal Series**

Instead of transforming a series to another of lower frequency through temporal aggregation using all observations, the next approach we review applies sub-sampling such that the resulting series includes only some of the periods within a periodic cycle. Consider, for instance, the case of daily data and focus on the weekly periodic pattern (weekly seasonality) of length 7. One (the traditional) option would be to consider all observations and model a series with a seasonal cycle equal to 7 periods. However, we could also consider only the values for a specific day of the week (such as Monday) and create a new time series which will not be seasonal and model it independently; and we could repeat this for every single day. Expanding this idea, we could also consider pairs, triplets, quadruplets, etc., of adjacent days (such as Monday-Tuesday or Monday-Tuesday-Wednesday, etc.) and form even more series of varying degrees of periodicity. In other words, we do not do any transformation per se, but systematically remove (through subsampling) specific periods of the series to create new ones of lower periodicity. Figure 4 shows an example of this sub-sampling process assuming some data originally recorded in the monthly frequency.

Forecasting with sub-seasonal series (FOSS) allows for simplified modelling of the patterns in the original series as different seasons are excluded every time [17]. This offers a more robust estimation of the trends but also the seasonal patterns in the data, with FOSS serving as a "magnifying glass" to the forecasting models used for their extraction. FOSS uses combination, and its welcome side effects, to aggregate the forecasts produced using the sub-seasonal series. Assuming a time series with periodicity *s* (*s* = 7 for daily data; *s* = 12 for monthly data), FOSS entails the creation and modelling of *s*<sup>2</sup> − *s* + 1 series. However, most of these series have periodicity that is much lower than *s* and are relatively short, so the increase in the computational cost is not linearly associated with the increase in the number of models to fit. Each set of series produced by FOSS that has the same periodicity is referred to as "level of information". In its simplest form, FOSS models all such levels of information and combines the forecasts with equal weights.
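The count of *s*<sup>2</sup> − *s* + 1 series can be checked with a short sketch that enumerates the sets of adjacent periods behind each sub-seasonal series. This is an illustration rather than the *foss* package itself: the function names are hypothetical, and the sketch assumes that adjacency wraps around the cycle (so that, for daily data, Sunday-Monday is a valid pair), which is what yields *s* series per level of information.

```python
def sub_seasonal_period_sets(s):
    """List the period subsets behind each sub-seasonal series.

    Assumption: adjacent periods wrap around the cycle (e.g., Sunday-Monday).
    """
    subsets = []
    for width in range(1, s):        # levels of information 1 .. s - 1
        for first in range(s):       # one series per starting period
            subsets.append(tuple((first + k) % s for k in range(width)))
    subsets.append(tuple(range(s)))  # the original series, all periods kept
    return subsets


def sub_sample(series, periods, s):
    """Keep only the observations whose position in the cycle is in `periods`."""
    keep = set(periods)
    return [y for t, y in enumerate(series) if t % s in keep]


s = 7
subsets = sub_seasonal_period_sets(s)
daily = list(range(21))  # three weeks of toy data; index 0 is the first day of the cycle
first_day_only = sub_sample(daily, (0,), s)
first_two_days = sub_sample(daily, (0, 1), s)
```

Each subset would then be used to sub-sample the original series, a model would be fitted to every resulting series, and the forecasts would be combined across levels of information.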

Li et al. [17] offer a large empirical evaluation of FOSS using data from the M3 and M4 forecasting competitions. They showed that FOSS acts as a self-improving approach for the state-of-the-art batch forecasting benchmarks ETS and ARIMA. The improvements achieved are amplified when the periodicity of the original series is higher but also when the forecast horizon increases, i.e., when forecasting becomes more challenging. In addition, the authors applied FOSS on double seasonality, high frequency load data, and showed that FOSS is also a useful tool in the presence of complex seasonal patterns.

FOSS is publicly available through the *foss* package for R. However, research in this area is still at an early stage. We can see several avenues for future exploration that include the selective use of levels of information, the use of unequal combination weights, and the creation of series using non-adjacent periods.

**Figure 4.** An illustrative example of producing sub-seasonal series by sub-sampling the original monthly data (first panel). In the second panel, we have produced a non-seasonal series that consists only of the periods in July of each year. The third and fourth panels show two more sub-sampled series with periodicity 2 and 3, respectively. Note that by considering particular subsamples, the level as well as other patterns change significantly.

### **6. Multiple Starting Points**

In the era of big data, retaining long histories of time series values is quite inexpensive. However, would using as many data as possible for producing forecasts warrant the best performance? Although increasing the number of the available observations is expected to lead to better accuracy, such a result is subject to a certain degree of determinism in the data. If the data exhibit structural changes (level shifts, changes in the trend and seasonal patterns, etc.) or contain outlying values, then it may be better to use the most recent window of the data that would not be subject to such data irregularities [67]. Another extreme way to handle changes in the structure of the data would be to only retain the most recent window that contains enough observations that are necessary to produce forecasts. For example, the "Demand Planning" functionality of the SAP APO retains only three years of monthly data, discarding the least recent history.

Determining the optimal window of data on which forecasting models are fitted is not a straightforward exercise. Instead, one can consider multiple windows. Assume that a time series consists of *n* observations. A first set of forecasts can be produced using all *n* observations. A second set of forecasts can be produced using the most recent *n* − 1 observations. This process can be repeated *m* + 1 times, such that *n* − *m* would still be enough data points for producing forecasts, i.e., at least two seasonal cycles for periodic data. Finally, the multiple sets of forecasts can be combined to obtain the final forecasts. This approach neither transforms nor manipulates the original series, but simply trims the beginning of the data to produce multiple overlapping in-sample windows of different lengths based on which forecasts are produced. This approach is known as "forecasting using multiple starting points" (MSP). Figure 5 demonstrates the process of trimming the original series to create new series from multiple starting points.

**Figure 5.** An illustrative example of producing series from multiple starting points. The original data (first panel) are trimmed so that the periods from only the last two years (second panel), the last three years (third panel), or the last four years (fourth panel) are considered.
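The trimming and combination steps of MSP can be sketched as follows. No open source implementation is referenced here, so all names are hypothetical; in particular, `last_value_drift` is a placeholder forecaster standing in for ETS, ARIMA, or any other method, and the toy series is chosen only to contain a level shift.

```python
import statistics


def multiple_starting_points(series, m):
    """Return the m + 1 trimmed series of lengths n, n - 1, ..., n - m."""
    if not 0 <= m < len(series):
        raise ValueError("m must satisfy 0 <= m < len(series)")
    return [series[start:] for start in range(m + 1)]


def combine(forecast_sets, operator=statistics.mean):
    """Combine forecast paths horizon by horizon with the chosen operator."""
    return [operator(step) for step in zip(*forecast_sets)]


def last_value_drift(window, horizon):
    """Hypothetical placeholder forecaster: extrapolate the average historical change."""
    drift = (window[-1] - window[0]) / (len(window) - 1)
    return [window[-1] + drift * h for h in range(1, horizon + 1)]


series = [10, 12, 11, 13, 30, 31, 33, 32, 34, 35]  # toy data with a level shift
windows = multiple_starting_points(series, m=4)
forecasts = [last_value_drift(w, horizon=2) for w in windows]
final = combine(forecasts, statistics.median)
```

Windows that start after the level shift produce a much smaller drift than those that span it, and the median keeps the final forecast from being dominated by either extreme.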

Research in this stream is limited. To our knowledge, Disney and Petropoulos [18] were the first to empirically examine the approach based on multiple starting points. They applied it on data from the M3 forecasting competition using simple averaging operators (mean, median, and mode), which resulted in improved forecasting performance especially for the yearly frequency. They showed that the improvements generally increase as the number of starting points also increases. They also presented a case study based on the demand of 23 different types of spare parts, showing that forecasting from multiple starting points improves the accuracy in about three-fourths of the cases, with average improvements of about 10%. Bai et al. [19] also empirically investigated this approach, comparing equal versus optimal weights when combining across the forecasts but also considering non-consecutive starting points for their in-sample windows.

We believe that there is scope for more research in this area. Future studies could focus on applying formal techniques for detecting structural changes, which then can be used to select the starting points in a more systematic manner. Another possibility for future investigation could be the application of the concept of multiple starting points within cross-sectional hierarchical structures, where it is usually assumed that every node in the hierarchy has the same number of historical observations. Finally, understanding the circumstances under which forecasting from multiple starting points works best is vital towards implementing it in practice. To our knowledge, there does not exist an open source implementation for forecasting from multiple starting points.

### **7. Cross-Comparison**

The five approaches described in the previous five sections attempt to extract more information from the original time series by performing various forms of data modifications, adjustments, manipulations, and transformations. These can be summarised in three larger categories: *random component*, *frequency*, and *length*. Table 1 summarises how the extraction of information works for each of these five approaches. The theta method retains the frequency and length of the data, but amplifies the local curvatures, which are represented as the residuals of a linear regression on trend. MTA transforms the original series through temporal aggregation to new shorter series of lower frequency; inevitably, the aggregation also results in lower noise [40]. Bagging is based on the bootstrapped series that are produced through re-sampling of the remainder from a decomposition process. FOSS focuses on the subsampling of the original series resulting, similar to MAPA, in new series that are shorter and have lower periodicity. Finally, forecasting from multiple starting points is based on trimming the original series by removing the least recent values, retaining the frequency and random component intact.

| **Approach** | **Random Component** | **Frequency** | **Length** |
|---|---|---|---|
| Theta | | | |
| MTA | | | |
| Bagging | | | |
| FOSS | | | |
| MSP | | | |

**Table 1.** How does extraction of the information work?

In Table 2, we map the five approaches with regard to how they handle the three sources of uncertainty: data uncertainty, model form uncertainty, and model parameter uncertainty. Our mapping involves two levels: full account of a given type of uncertainty and partial account. The theta method handles the uncertainty in the data in the sense that the local curvatures are amplified or reduced to better identify short and long term movements in the data. MTA also handles data uncertainty as temporal aggregation results in smoothing the noise in the data [40]. However, MTA also addresses the uncertainty in the model form, as different models may be identified as optimal at different temporal levels: a dominating seasonal pattern may lead to the selection of a seasonal-only model at the lowest aggregation level. However, as seasonality is smoothed out by temporal aggregation, a trend pattern may become apparent in a higher aggregation level [11]. Even if the same models are identified as optimal in various temporal levels, MTA is still likely to help by partially addressing parameters' uncertainty.


**Table 2.** How do the five approaches handle the sources of uncertainty?

Bagging is the only approach that is able to tackle all three types of uncertainty, something that was extensively discussed by Petropoulos et al. [4]. However, some bagging variations focus on particular sources of uncertainty, as discussed in Section 4. FOSS is the only approach that does not explicitly handle the data uncertainty, but directly focuses on the model form uncertainty (and the model parameters). Finally, MSP tackles data uncertainty in the sense that, by trimming series, outliers or structural changes are removed. However, the new (shorter) series might also result in alternative model forms and sets of parameters.

Next, we consider the computational cost required by each of the approaches to produce forecasts. For simplification, instead of recording computational time per se (as this would depend on the length of the series, among others) we compare the various approaches in terms of the number of models required to be fitted. As a benchmark, it is noteworthy that the ets() function of the *forecast* package for R statistical software fits 19 models (8 for non-seasonal data) before a final model is selected and its forecasts are produced. The theta method is arguably one of the most inexpensive robust time series forecasting methods. In its standard implementation, it requires the fitting of just 2 models, one for each theta line (a simple linear regression model and SES). Even for theta variations that consider more than two theta lines, the number of models required is small. The robust implementation by Legaki and Koutsouri [28] that uses a Box-Cox transformation offered, arguably, one of the best trade-offs in performance versus cost in the M4 competition [30].

Compared to theta, all other approaches are more costly. MTA requires forecasts for each aggregation level: 12 for monthly data; 4 for quarterly data. However, this could be slightly reduced when one uses temporal hierarchies (6 for monthly; 3 for quarterly). It is common that in each level an automatic algorithm, like ETS or ARIMA, is used. This means that the number of models required to be fitted increases considerably. Using temporal hierarchies with ETS results in fitting 103 exponential smoothing models for a monthly time series (5 seasonal levels × 19 models + 1 non-seasonal level × 8 models). Empirical evidence (https://kourentzes.com/forecasting/2014/10/31/guest-post-on-the-robustnessof-bagging-exponential-smoothing/, accessed on 1 June 2021) has shown that Bagging's performance converges when at least 50 bootstrap series are aggregated, while most of the studies consider 100 bootstrap series. This means that Bagging with ETS requires fitting at least 950 models (50 bootstraps × 19 models) for a single seasonal series and 400 models for a non-seasonal series, rendering it one of the most expensive approaches in this review study. Forecasting with sub-seasonal series is also very costly. From the *s*<sup>2</sup> − *s* + 1 series created, *s* have a periodicity of 1, with the rest potentially displaying seasonal patterns. Again assuming ETS, FOSS entails fitting and parametrising 165 models when modelling a series on the quarterly frequency ((*s*<sup>2</sup> − 2*s* + 1) × 19 models for the sub-series with periodicity greater than 1, plus *s* × 8 models for the rest), rising to 2395 models for a monthly time series. The cost of forecasting from multiple starting points heavily depends on the length of the series. Assuming a monthly time series (*s* = 12) with length *n* = 50, we would require at least 2*s* = 24 periods to produce forecasts, which allows us to consider at most 27 starting points, translating to fitting 513 models when using ETS.
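The model counts quoted for monthly data follow from simple arithmetic, reproduced in the sketch below. The function names are hypothetical; the constants 19 and 8 are the numbers of ETS models per seasonal and non-seasonal series stated in the text.

```python
ETS_SEASONAL = 19      # models tried by ets() for seasonal data (from the text)
ETS_NON_SEASONAL = 8   # models tried by ets() for non-seasonal data (from the text)


def temporal_hierarchy_models(seasonal_levels, non_seasonal_levels=1):
    """Models fitted when ETS is applied at every level of a temporal hierarchy."""
    return seasonal_levels * ETS_SEASONAL + non_seasonal_levels * ETS_NON_SEASONAL


def bagging_models(bootstraps, seasonal=True):
    """Models fitted when ETS is applied to every bootstrapped series."""
    return bootstraps * (ETS_SEASONAL if seasonal else ETS_NON_SEASONAL)


def foss_models(s):
    """Models fitted by FOSS with ETS, using the formula given in the text."""
    seasonal_series = s * s - 2 * s + 1   # sub-series with periodicity above 1
    return seasonal_series * ETS_SEASONAL + s * ETS_NON_SEASONAL


def msp_models(n, s):
    """Models fitted by MSP with ETS when every admissible starting point is used."""
    starting_points = n - 2 * s + 1       # keep at least two seasonal cycles
    return starting_points * ETS_SEASONAL


monthly_costs = {
    "temporal hierarchies": temporal_hierarchy_models(seasonal_levels=5),
    "bagging (50 bootstraps)": bagging_models(50),
    "FOSS": foss_models(12),
    "MSP (n = 50)": msp_models(50, 12),
}
```

For comparison, the standard theta method requires only two models per series, regardless of the frequency.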

Lastly, we consider the performance of the various approaches as published in various studies so far. We focus on the data used in two forecasting competitions, M3 [23] and M4 [29], and particularly the yearly, quarterly, and monthly frequencies. It is important to note that our summary results, presented in Table 3, are based on the empirical evidence presented in other studies, which are identified next to each numerical result. We also limit our results to the values of the symmetric mean absolute percentage error (sMAPE) as reporting the mean absolute scaled error (MASE) was not possible (different researchers apply the scaling differently). For studies that only provided relative improvements over a benchmark, such as [14], did not differentiate between the results of each competition, such as [41,42], or were limited to one of the two competitions considered, such as [28], we have reproduced the results using the code provided by the corresponding authors. Overall, we observe that some of these approaches are better suited to forecasting non-seasonal patterns (see, for instance, the very good performance of the Box-Cox Theta on the yearly frequency), while others are better when the series are periodic (see, for instance, FOSS and MTA).
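For readers less familiar with the accuracy measure, the sketch below computes sMAPE for a single series, assuming the definition commonly used in the M competitions (absolute error divided by the average of the absolute actual and absolute forecast, expressed as a percentage). The function name and the example values are illustrative only.

```python
def smape(actuals, forecasts):
    """Symmetric MAPE in percent, averaged over the forecast horizon."""
    if len(actuals) != len(forecasts) or not actuals:
        raise ValueError("inputs must be non-empty and of equal length")
    # Each term is undefined when both the actual and the forecast are zero.
    terms = [
        200.0 * abs(y - f) / (abs(y) + abs(f))
        for y, f in zip(actuals, forecasts)
    ]
    return sum(terms) / len(terms)


example = smape([100.0, 200.0], [110.0, 180.0])
```

Reported competition results are typically averages of this quantity across all series of a given frequency.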


**Table 3.** The published average performance of the five approaches on the monthly data from the M3 and M4 competitions.

THIEF is applied using the "structural" reconciliation approach, while MAPA using the "hybrid" approach with a mean combination operator for aggregating the ETS components at different temporal aggregation levels. Optimised *θ* refers to the "Dynamic Optimised Theta Model". Results are replicated, where required, using the "thief", "MAPA", "forecast", and "forecTheta" packages for R, of versions 0.3, 2.0.4, 8.14, and 2.2, respectively.

Given the high representativeness of the data in the M3 and M4 datasets [68], we believe that the results can be safely generalised to other settings and contexts, where the presented approaches are expected to work well. However, we will highlight here some particular applications in different contexts. Nikolopoulos et al. [31] apply the theta method on finance data, demonstrating its good performance over other benchmarks. Athanasopoulos et al. [14] offer a case study for the application of MTA (in the form of temporal hierarchies) for forecasting the demand of the Accident and Emergency departments in the UK. Additionally, working with MTA, Yagli et al. [57] improved the performance of solar forecasts. De Oliveira and Cyrino Oliveira [69] demonstrate the effectiveness of the bagging approach on energy consumption data. Finally, the case study of Li et al. [17] also involves high-frequency energy consumption data and shows the good performance of FOSS when complex patterns exist. The application of MSP in different contexts is limited, as this approach has not, to our knowledge, been widely applied yet.

### **8. Conclusions and a Look to the Future**

Univariate time series forecasting can be challenging, especially since real life data do not comply with the assumptions and do not follow the data generating processes usually assumed by the models implemented in forecasting support systems. At the same time, improving forecast accuracy can be crucial, as even a small decrease in the forecast error may translate to significant gains in terms of the utility of the forecasts (see, for example, references [33,70], which discuss the case of forecasting for inventories). In this paper, we reviewed five approaches that can enhance the performance of univariate time series forecasting methods. These approaches are based on two basic principles: (*i*) manipulation of the original data to extract as much information from them as possible, and (*ii*) forecast combination, which has been proved to be extremely beneficial in the forecasting field (see, for example, references [71,72]).

The five approaches that we presented can be applied on top of established time series forecasting models, such as ETS or ARIMA. In fact, we can argue that all these five approaches work as self-improving mechanisms to boost the performance of the underlying forecasting methods. Although the term "self-improving mechanism" was originally used by Nikolopoulos et al. [38] to describe the performance gained by applying temporal aggregation, we argue that this is a good descriptor for all the approaches discussed in our study. It is important to highlight that the improvements achieved by the application of these approaches do not entail the collection of additional data, such as explanatory variables, that usually come with an additional cost, as well as uncertainty in the sense that, in most cases, the future values of these variables must also be predicted for supporting forecasting methods in a regression fashion. The input for all approaches described is simply the past values of the dependent variable of interest.

When large amounts of data are available, empirical evidence from the latest forecasting competitions [29,73] shows that meta-learning and cross-learning approaches can be used to improve time series forecasting performance. Such "global" approaches are often based on time series features [74] or patterns [75] that may be prevalent and common across many time series. As a result, meta-learning and cross-learning approaches are relevant for companies that need to produce forecasts for myriads of series [76]. Large retailers, such as Walmart, Target, and Carrefour, are representative examples. However, many more companies and organisations are interested in forecasting only a few tens or hundreds of time series to support their operations, marketing, and other functions. As such, "local" solutions, like the ones covered in this study, that use information from singular time series only, are still very useful in practice. More importantly, if one needs to forecast only a small number of series, then it would make sense to invest in the additional computational resources required to handle the most demanding of the approaches (Bagging and FOSS). Regardless, we believe that analysts who wish to apply the approaches presented in this paper should decide based on their added value across different sampling frequencies (see also the discussions in Section 7) balanced against their relative computational cost.

The various approaches that we presented in this paper have so far been studied in isolation. Although applying these approaches in a sequential fashion is entirely feasible, as is the case with an MTA implementation (the thief() function) which offers theta as one of the methods to produce base forecasts, it would be even more interesting to see future studies that focus on the integration of the approaches described here. The only exception that we are aware of is the study by Wang et al. [77] that attempts to structurally integrate the concepts surrounding the theta method (and the manipulation of the local curvatures) with aspects of non-overlapping temporal aggregation. We believe that there is much scope for further research in integrating "wisdom of the data" approaches. For instance, one could consider defining a temporal hierarchy approach in which the base forecasts for the nodes of a certain aggregation level are not produced by considering the entire series consisting of all information at the same aggregation level, but each node is extrapolated separately using sub-seasonal series (FOSS). Another example would be the integration of the bagging and multiple starting points approaches, since each of them focuses on a different way of extracting information from the data.

Another interesting path for future investigation would be to explore how these approaches can better support forecasting in practice. For example, consider the extension of these univariate-oriented approaches to fit within a hierarchical framework which contains several series that are cross-sectionally aggregated. Temporal hierarchies naturally extend to cross-temporal hierarchies (see [56]); however, this is not the case with all other approaches described here. For instance, when using bagging on a particular node of the hierarchy, the bootstrapping of the remainder could be informed by the remainder of the other nodes. Even more interestingly, a bootstrap model combination approach could be based on the models selected as optimal across hierarchical aggregation levels.

In conclusion, univariate time series forecasting benefits from looking at the available data through different lenses, attempting to understand them better and model them more efficiently. This is achieved by tackling the uncertainties associated with the data itself and easing the identification of an 'optimal' model and its parameters. As such, we are looking forward to seeing more approaches that consider the "wisdom of the data" towards enhancing the forecasting performance.

**Author Contributions:** Conceptualisation: F.P.; methodology: F.P. and E.S.; software: E.S.; validation: E.S.; visualisation: F.P. and E.S.; formal analysis: F.P. and E.S.; investigation: F.P. and E.S.; resources: F.P. and E.S.; data curation: E.S.; writing—original draft preparation: F.P.; writing—review and editing: E.S.; supervision: F.P.; project administration: F.P.; funding acquisition: F.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data of the M3 and M4 forecasting competitions are publicly available and can be found at https://forecasters.org/resources/time-series-data/, accessed on 1 June 2021.

**Acknowledgments:** The first author thanks Evelyn and Freddy for their support.

**Conflicts of Interest:** The authors declare no conflicts of interest.

### **Abbreviations**

The following abbreviations are used in this manuscript:


### **References**

